Arrow Research search

Author name cluster

Vitor Guizilini

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers
2 author rows

Possible papers (23)

ICLR Conference 2025 Conference Paper

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

  • Wei Chow
  • Jiageng Mao
  • Boyi Li 0001
  • Daniel Seita
  • Vitor Guizilini
  • Yue Wang 0041

Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world—likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle this shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding. Project page: https://physbench.github.io/

IROS Conference 2025 Conference Paper

Self-Supervised Geometry-Guided Initialization for Robust Monocular Visual Odometry

  • Takayuki Kanai
  • Igor Vasiljevic
  • Vitor Guizilini
  • Kazuhiro Shintani

Monocular visual odometry is a key technology in various autonomous systems. Traditional feature-based methods suffer from failures due to poor lighting, insufficient texture, and large motions. In contrast, recent learning-based dense SLAM methods exploit iterative dense bundle adjustment to address such failure cases, and achieve robust and accurate localization in a wide variety of real environments, without depending on domain-specific supervision. However, despite its potential, the methods still struggle with scenarios involving large motion and object dynamics. In this study, we diagnose key weaknesses in a popular learning-based dense SLAM model (DROID-SLAM) by analyzing major failure cases on outdoor benchmarks and exposing various shortcomings of its optimization process. We then propose the use of self-supervised priors leveraging a frozen large-scale pre-trained monocular depth estimator to initialize the dense bundle adjustment process, leading to robust visual odometry without the need to fine-tune the SLAM backbone. Despite its simplicity, the proposed method demonstrates significant improvements on KITTI odometry, as well as the challenging DDAD benchmark. The project page: https://toyotafrc.github.io/SGInit-Proj/

NeurIPS Conference 2024 Conference Paper

$SE(3)$ Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation

  • Yinshuang Xu
  • Dian Chen
  • Katherine Liu
  • Sergey Zakharov
  • Rares Ambrus
  • Kostas Daniilidis
  • Vitor Guizilini

Incorporating inductive bias by embedding geometric entities (such as rays) as input has proven successful in multi-view learning. However, the methods adopting this technique typically lack equivariance, which is crucial for effective 3D learning. Equivariance serves as a valuable inductive prior, aiding in the generation of robust multi-view features for 3D scene understanding. In this paper, we explore the application of equivariant multi-view learning to depth estimation, not only recognizing its significance for computer vision and robotics but also addressing the limitations of previous research. Most prior studies have either overlooked equivariance in this setting or achieved only approximate equivariance through data augmentation, which often leads to inconsistencies across different reference frames. To address this issue, we propose to embed $SE(3)$ equivariance into the Perceiver IO architecture. We employ Spherical Harmonics for positional encoding to ensure 3D rotation equivariance, and develop a specialized equivariant encoder and decoder within the Perceiver IO architecture. To validate our model, we applied it to the task of stereo depth estimation, achieving state-of-the-art results on real-world datasets without explicit geometric constraints or extensive data augmentation.
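
A minimal sketch of the kind of spherical-harmonics positional encoding the abstract refers to, not the paper's Perceiver IO architecture: a unit ray direction is expanded into real spherical-harmonic coefficients up to a chosen degree, a representation that transforms predictably under 3D rotations. The degree cutoff and the real-basis construction here are illustrative assumptions.

```python
# Hypothetical illustration: encode a ray direction with real spherical harmonics.
import numpy as np
from scipy.special import sph_harm

def real_sph_harm_encoding(direction, max_degree=3):
    """Encode a unit 3D direction with real spherical harmonics Y_l^m, l <= max_degree."""
    x, y, z = direction / np.linalg.norm(direction)
    theta = np.arctan2(y, x) % (2 * np.pi)   # azimuth in [0, 2pi)
    phi = np.arccos(np.clip(z, -1.0, 1.0))   # polar angle in [0, pi]
    feats = []
    for l in range(max_degree + 1):
        for m in range(-l, l + 1):
            Y = sph_harm(abs(m), l, theta, phi)  # scipy order: (order m, degree l, azimuth, polar)
            if m > 0:
                feats.append(np.sqrt(2) * (-1) ** m * Y.real)
            elif m < 0:
                feats.append(np.sqrt(2) * (-1) ** m * Y.imag)
            else:
                feats.append(Y.real)
    return np.array(feats)

ray = np.array([0.2, -0.5, 0.84])
print(real_sph_harm_encoding(ray).shape)  # (max_degree + 1)**2 = 16 coefficients
```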

IROS Conference 2024 Conference Paper

Transcrib3D: 3D Referring Expression Resolution through Large Language Models

  • Jiading Fang
  • Xiangshan Tan
  • Shengjie Lin
  • Igor Vasiljevic
  • Vitor Guizilini
  • Hongyuan Mei
  • Rares Ambrus
  • Gregory Shakhnarovich

If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging—it requires the ability to both parse the 3D structure of the scene and correctly ground free-form language in the presence of distraction and clutter. We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models (LLMs). Transcrib3D uses text as the unifying medium, which allows us to sidestep the need to learn shared representations connecting multi-modal inputs, which would require massive amounts of annotated 3D data. As a demonstration of its effectiveness, Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks, with a great leap in performance from previous multi-modality baselines. To improve upon zero-shot performance and facilitate local deployment on edge computers and robots, we propose self-correction for fine-tuning that trains smaller models, resulting in performance close to that of large models. We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions. Code will be available at https://ripl.github.io/Transcrib3D.

ICRA Conference 2023 Conference Paper

Depth Is All You Need for Monocular 3D Detection

  • Dennis Park
  • Jie Li 0031
  • Dian Chen 0005
  • Vitor Guizilini
  • Adrien Gaidon

A key contributor to recent progress in 3D detection from single images is monocular depth estimation. Existing methods focus on how to leverage depth explicitly, by generating pseudo-pointclouds or providing attention cues for image features. More recent works leverage depth prediction as a pretraining task and fine-tune the depth representation while training it for 3D detection. However, the adaptation is limited in scale by manual labels. In this work, we propose further aligning the depth representation with the target domain in an unsupervised fashion. Our methods leverage commonly available LiDAR or RGB videos during training time to fine-tune the depth representation, which leads to improved 3D detectors. Especially when using RGB videos, we show that our two-stage training by first generating depth pseudo-labels is critical, because of the inconsistency in loss distribution between the two tasks. With either type of reference data, our multi-task learning approach improves over the state of the art on both KITTI and NuScenes, while matching the test-time complexity of its single-task sub-network. Source code and pretrained models are available on https://github.com/TRI-ML/DD3D.

IROS Conference 2023 Conference Paper

Robust Self-Supervised Extrinsic Self-Calibration

  • Takayuki Kanai
  • Igor Vasiljevic
  • Vitor Guizilini
  • Adrien Gaidon
  • Rares Ambrus

Autonomous vehicles and robots need to operate over a wide variety of scenarios in order to complete tasks efficiently and safely. Multi-camera self-supervised monocular depth estimation from videos is a promising way to reason about the environment, as it generates metrically scaled geometric predictions from visual data without requiring additional sensors. However, most works assume well-calibrated extrinsics to fully leverage this multi-camera setup, even though accurate and efficient calibration is still a challenging problem. In this work, we introduce a novel method for extrinsic calibration that builds upon the principles of self-supervised monocular depth and ego-motion learning. Our proposed curriculum learning strategy uses monocular depth and pose estimators with velocity supervision to estimate extrinsics, and then jointly learns extrinsic calibration along with depth and pose for a set of overlapping cameras rigidly attached to a moving vehicle. Experiments on a benchmark multi-camera dataset (DDAD) demonstrate that our method enables self-calibration in various scenes robustly and efficiently compared to a traditional vision-based pose estimation pipeline. Furthermore, we demonstrate the benefits of extrinsics self-calibration as a way to improve depth prediction via joint optimization. The project page: https://sites.google.com/tri.global/tri-sesc

ICRA Conference 2022 Conference Paper

Self-Supervised Camera Self-Calibration from Video

  • Jiading Fang
  • Igor Vasiljevic
  • Vitor Guizilini
  • Rares Ambrus
  • Gregory Shakhnarovich
  • Adrien Gaidon
  • Matthew R. Walter

Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view-synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods. The project page: https://sites.google.com/ttic.edu/self-sup-self-calib
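
A toy, hypothetical sketch of the core idea, not the authors' code or their general camera model family: pinhole intrinsics are treated as learnable parameters and optimized through a differentiable view-synthesis (photometric) objective, here with a plain pinhole projection and random data just to show the optimization loop.

```python
import torch
import torch.nn.functional as F

class LearnableIntrinsics(torch.nn.Module):
    def __init__(self, H, W):
        super().__init__()
        # Rough initial guess: focal ~ image width, principal point at the centre.
        self.f = torch.nn.Parameter(torch.tensor(float(W)))
        self.cx = torch.nn.Parameter(torch.tensor(W / 2.0))
        self.cy = torch.nn.Parameter(torch.tensor(H / 2.0))

def photometric_loss(intr, img_t, img_s, depth_t, T_t2s):
    """Warp source image img_s into the target frame using depth_t and pose T_t2s,
    then compare against img_t. Shapes: img (B,3,H,W), depth (B,1,H,W), pose (B,4,4)."""
    B, _, H, W = img_t.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    # Back-project target pixels to 3D using the current intrinsics and depth.
    x = (u - intr.cx) / intr.f * depth_t[:, 0]
    y = (v - intr.cy) / intr.f * depth_t[:, 0]
    pts = torch.stack([x, y, depth_t[:, 0], torch.ones_like(x)], dim=1)   # (B,4,H,W)
    pts_s = torch.einsum("bij,bjhw->bihw", T_t2s, pts)                    # into source frame
    # Project into the source image and sample it at the resulting pixel locations.
    u_s = intr.f * pts_s[:, 0] / pts_s[:, 2].clamp(min=1e-6) + intr.cx
    v_s = intr.f * pts_s[:, 1] / pts_s[:, 2].clamp(min=1e-6) + intr.cy
    grid = torch.stack([2 * u_s / (W - 1) - 1, 2 * v_s / (H - 1) - 1], dim=-1)
    warped = F.grid_sample(img_s, grid, align_corners=True)
    return (warped - img_t).abs().mean()

# Toy usage: random frames/depth/pose stand in for a real video sequence.
H, W = 64, 96
intr = LearnableIntrinsics(H, W)
opt = torch.optim.Adam(intr.parameters(), lr=1e-2)
img_t, img_s = torch.rand(1, 3, H, W), torch.rand(1, 3, H, W)
depth_t, T_t2s = torch.rand(1, 1, H, W) + 1.0, torch.eye(4).unsqueeze(0)
loss = photometric_loss(intr, img_t, img_s, depth_t, T_t2s)
loss.backward(); opt.step()
```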

NeurIPS Conference 2021 Conference Paper

MarioNette: Self-Supervised Sprite Learning

  • Dmitriy Smirnov
  • Michael Gharbi
  • Matthew Fisher
  • Vitor Guizilini
  • Alexei Efros
  • Justin M. Solomon

Artists and video game designers often construct 2D animations using libraries of sprites---textured patches of objects and characters. We propose a deep learning approach that decomposes sprite-based video animations into a disentangled representation of recurring graphic elements in a self-supervised manner. By jointly learning a dictionary of possibly transparent patches and training a network that places them onto a canvas, we deconstruct sprite-based content into a sparse, consistent, and explicit representation that can be easily used in downstream tasks, like editing or analysis. Our framework offers a promising approach for discovering recurring visual patterns in image collections without supervision.

ICLR Conference 2020 Conference Paper

Neural Outlier Rejection for Self-Supervised Keypoint Learning

  • Jiexiong Tang
  • Hanme Kim
  • Vitor Guizilini
  • Sudeep Pillai
  • Rares Ambrus

Identifying salient points in images is a crucial component for visual odometry, Structure-from-Motion or SLAM algorithms. Recently, several learned keypoint methods have demonstrated compelling performance on challenging benchmarks. However, generating consistent and accurate training data for interest-point detection in natural images still remains challenging, especially for human annotators. We introduce IO-Net (i.e. InlierOutlierNet), a novel proxy task for the self-supervision of keypoint detection, description and matching. By making the sampling of inlier-outlier sets from point-pair correspondences fully differentiable within the keypoint learning framework, we show that we are able to simultaneously self-supervise keypoint description and improve keypoint matching. Second, we introduce KeyPointNet, a keypoint-network architecture that is especially amenable to robust keypoint detection and description. We design the network to allow local keypoint aggregation to avoid artifacts due to spatial discretizations commonly used for this task, and we improve fine-grained keypoint descriptor performance by taking advantage of efficient sub-pixel convolutions to upsample the descriptor feature-maps to a higher operating resolution. Through extensive experiments and ablative analysis, we show that the proposed self-supervised keypoint learning method greatly improves the quality of feature matching and homography estimation on challenging benchmarks over the state-of-the-art.

ICLR Conference 2020 Conference Paper

Semantically-Guided Representation Learning for Self-Supervised Monocular Depth

  • Vitor Guizilini
  • Rui Hou 0007
  • Jie Li 0031
  • Rares Ambrus
  • Adrien Gaidon

Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to more directly leverage this semantic structure to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. Furthermore, we propose a two-stage training process to overcome a common semantic bias on dynamic objects via resampling. Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories.
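
A simplified sketch in the spirit of the pixel-adaptive convolution mechanism the abstract mentions, not the authors' full architecture: a shared 3x3 convolution whose per-pixel contributions are modulated by a Gaussian kernel on guidance-feature differences, so that (for example) frozen semantic features can steer how geometric features are aggregated. The kernel size and Gaussian form here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_adaptive_conv(x, guidance, weight, k=3):
    """x: (B,Cin,H,W) features; guidance: (B,Cg,H,W), e.g. frozen semantic features;
    weight: (Cout,Cin,k,k) ordinary shared conv weights."""
    B, Cin, H, W = x.shape
    pad = k // 2
    # Unfold local neighbourhoods of both the input and the guidance signal.
    x_unf = F.unfold(x, k, padding=pad).view(B, Cin, k * k, H, W)
    g_unf = F.unfold(guidance, k, padding=pad).view(B, guidance.shape[1], k * k, H, W)
    g_center = guidance.unsqueeze(2)                                   # (B,Cg,1,H,W)
    # Gaussian adaptation term: large when neighbour and centre guidance agree.
    adapt = torch.exp(-0.5 * ((g_unf - g_center) ** 2).sum(1))         # (B,k*k,H,W)
    x_adapted = x_unf * adapt.unsqueeze(1)                             # modulate neighbours
    # Apply the shared conv weights to the adapted neighbourhoods.
    w = weight.view(weight.shape[0], -1)                               # (Cout, Cin*k*k)
    out = torch.einsum("oc,bchw->bohw", w, x_adapted.reshape(B, Cin * k * k, H, W))
    return out

x = torch.randn(2, 8, 16, 16)
sem = torch.randn(2, 4, 16, 16)              # stand-in for pretrained semantic features
w = torch.randn(16, 8, 3, 3)
print(pixel_adaptive_conv(x, sem, w).shape)  # torch.Size([2, 16, 16, 16])
```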

ICRA Conference 2019 Conference Paper

Dynamic Hilbert Maps: Real-Time Occupancy Predictions in Changing Environments

  • Vitor Guizilini
  • Ransalu Senanayake
  • Fabio Ramos 0001

This paper addresses the problem of learning instantaneous occupancy levels of dynamic environments and predicting future occupancy levels. Due to the complexity of most real environments, such as urban streets or crowded areas, the efficient and robust incorporation of temporal dependencies into otherwise static occupancy models remains a challenge. We propose a method to capture the uncertainty of moving objects and incorporate this uncertainty information into a continuous occupancy map represented in a rich high-dimensional feature space. This data-efficient model not only allows us to learn the occupancy states incrementally, but also makes predictions about what the future occupancy states will be. Experiments performed using 2D and 3D laser data collected from crowded unstructured outdoor environments show that the proposed methodology can accurately predict occupancy states for areas of around 1000 m² at 10 Hz, making the proposed framework ideal for online applications under real-time constraints.
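
A toy static 2D Hilbert map, the continuous-occupancy representation this line of work builds on, not the dynamic variant from the paper: laser endpoints are projected into RBF features anchored at fixed inducing points and a logistic model is fit on occupied/free labels, so occupancy can then be queried at arbitrary resolution. The inducing-point layout, length-scale, and synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
inducing = rng.uniform(-10, 10, size=(200, 2))       # fixed feature locations
gamma = 0.5                                          # RBF length-scale parameter

def features(points):
    """Map 2D points to RBF features centred at the inducing points."""
    d2 = ((points[:, None, :] - inducing[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)                       # (N, 200)

# Fake training data: points near a wall at x = 5 are occupied, free elsewhere.
pts = rng.uniform(-10, 10, size=(2000, 2))
occ = (np.abs(pts[:, 0] - 5.0) < 0.5).astype(int)
model = LogisticRegression(max_iter=1000).fit(features(pts), occ)

# Query the continuous map on a dense grid (any resolution works).
grid = np.stack(np.meshgrid(np.linspace(-10, 10, 50),
                            np.linspace(-10, 10, 50)), -1).reshape(-1, 2)
p_occ = model.predict_proba(features(grid))[:, 1]
print(p_occ.min(), p_occ.max())                      # low in free space, high near the wall
```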

AAAI Conference 2018 Conference Paper

Iterative Continuous Convolution for 3D Template Matching and Global Localization

  • Vitor Guizilini
  • Fabio Ramos

This paper introduces a novel methodology for 3D template matching that is scalable to higher-dimensional spaces and larger kernel sizes. It uses the Hilbert Maps framework to model raw pointcloud information as a continuous occupancy function, and we derive a closed-form solution to the convolution operation that takes place directly in the Reproducing Kernel Hilbert Space defining these functions. The result is a third function modeling activation values, which can be queried at arbitrary resolutions with logarithmic complexity, and by iteratively searching for high-similarity areas we can determine matching candidates. Experimental results show substantial speed gains over standard discrete convolution techniques, such as sliding window and fast Fourier transform, along with a significant decrease in memory requirements, without accuracy loss. This efficiency allows the proposed methodology to be used in areas where discrete convolution is currently infeasible. As a practical example we explore the key problem in robotics of global localization, in which a vehicle must be positioned on a map using only its current sensor information, and provide comparisons with other state-of-the-art techniques in terms of computational speed and accuracy.
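
A hedged sketch of the closed-form flavour described above, under simplifying assumptions (isotropic Gaussian features and weights from two already-trained Hilbert-map-style models, constant normalization factors omitted): the convolution of two RBF expansions is itself an RBF expansion with summed variances, so activation values can be evaluated at any query point without discretizing the space.

```python
import numpy as np

def rbf_mixture_convolution(query, centers_a, w_a, centers_b, w_b, s2_a, s2_b):
    """Evaluate (f_a * f_b)(query) for f(x) = sum_i w_i exp(-||x - c_i||^2 / (2 s^2)),
    using the identity that the convolution of two Gaussians is a Gaussian with
    variance s2_a + s2_b centred at c_i + c_j (up to a constant factor)."""
    s2 = s2_a + s2_b
    acts = []
    for q in np.atleast_2d(query):
        d2 = ((q - (centers_a[:, None, :] + centers_b[None, :, :])) ** 2).sum(-1)
        acts.append((w_a[:, None] * w_b[None, :] * np.exp(-d2 / (2 * s2))).sum())
    return np.array(acts)

# Toy example: a small "template" model and a "scene" model in 3D.
rng = np.random.default_rng(1)
ca, wa = rng.normal(size=(30, 3)), rng.normal(size=30)
cb, wb = rng.normal(size=(50, 3)), rng.normal(size=50)
print(rbf_mixture_convolution(np.zeros((1, 3)), ca, wa, cb, wb, 0.2, 0.2))
```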

ICRA Conference 2018 Conference Paper

Learning to Race Through Coordinate Descent Bayesian Optimisation

  • Rafael Oliveira 0001
  • Fernando H. M. Rocha
  • Lionel Ott
  • Vitor Guizilini
  • Fabio Ramos 0001
  • Valdir Grassi Jr.

In the automation of many kinds of processes, the observable outcome can often be described as the combined effect of an entire sequence of actions, or controls, applied throughout the process execution. In these cases, strategies to optimise control policies for individual stages of the process are not applicable, and instead the whole policy needs to be optimised at once. On the other hand, the cost to evaluate the policy's performance might also be high, making it desirable that a solution can be found with as few interactions with the real system as possible. We consider the problem of optimising control policies to allow a robot to complete a given race track within a minimum amount of time. We assume that the robot has no prior information about the track or its own dynamical model, just an initial valid driving example. Localisation is only applied to monitor the robot and to provide an indication of its position along the track's centre axis. With that in mind, we propose a method for finding a policy that minimises the time per lap while keeping the vehicle on the track using a Bayesian optimisation (BO) approach over a reproducing kernel Hilbert space. We apply an algorithm to search more efficiently over high-dimensional policy-parameter spaces with BO, by iterating over each dimension individually, in a sequential coordinate descent-like scheme. Experiments demonstrate the performance of the algorithm against other methods in a simulated car racing environment.
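
A minimal sketch of the coordinate-descent flavour of Bayesian optimisation described above, generic rather than the authors' racing setup: a GP surrogate with a lower-confidence-bound acquisition is optimised over one policy dimension at a time while the remaining dimensions stay fixed. The objective, acquisition constant, and candidate grid are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def lap_time(policy):                       # stand-in for the expensive real rollout
    return np.sum((policy - 0.3) ** 2) + 0.01 * np.random.randn()

dim, iters = 8, 40
x = np.random.rand(dim)                     # initial valid policy (e.g. a driving demo)
X, y = [x.copy()], [lap_time(x)]

for t in range(iters):
    d = t % dim                             # sweep coordinates cyclically
    gp = GaussianProcessRegressor(kernel=RBF(0.2), normalize_y=True).fit(np.array(X), y)
    # Candidate policies that only vary along coordinate d.
    cand = np.tile(x, (100, 1))
    cand[:, d] = np.linspace(0, 1, 100)
    mu, sd = gp.predict(cand, return_std=True)
    x = cand[np.argmin(mu - 2.0 * sd)]      # lower-confidence bound (minimising lap time)
    X.append(x.copy()); y.append(lap_time(x))

print("best lap time found:", min(y))
```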

IROS Conference 2017 Conference Paper

Markovian jump linear systems-based filtering for visual and GPS aided inertial navigation system

  • Roberto S. Inoue
  • Vitor Guizilini
  • Marco H. Terra
  • Fabio Ramos 0001

Visual-inertial SLAM methods have become an important technology for several applications in robotics. This kind of approach is usually built around sensors such as rate gyros, accelerometers, and monocular cameras. Magnetometers and GPS modules, generally used outdoors, are often left out of the SLAM observation model, since magnetometer measurements deteriorate in the presence of ferromagnetic materials and GPS signals are unavailable indoors or in urban environments. In order to make use of all these sensors, we propose Markovian jump linear systems (MJLS) to model the modes of operation of the navigation system based on the available sensors and their reliability. An extended Kalman filter for MJLS fuses the sensor data and estimates the motion using the best mode of operation at each time instant. Experimental results show the effectiveness of the proposed method in situations that would pose a challenge for standard data fusion techniques.

AAAI Conference 2017 Conference Paper

Unsupervised Feature Learning for 3D Scene Reconstruction with Occupancy Maps

  • Vitor Guizilini
  • Fabio Ramos

This paper addresses the task of unsupervised feature learning for three-dimensional occupancy mapping, as a way to segment higher-level structures based on raw unorganized point cloud data. In particular, we focus on detecting planar surfaces, which are common in most structured or semistructured environments. This segmentation is then used to minimize the amount of parameters necessary to properly create a 3D occupancy model of the surveyed space, thus increasing computational speed and decreasing memory requirements. As the 3D modeling tool, an extension to Hilbert Maps (Ramos and Ott 2015) recently proposed in (Guizilini and Ramos 2016) was selected, since it naturally uses a feature-based representation of the environment to achieve real-time performance. Experiments conducted in simulated and real large-scale datasets show a substantial gain in performance, while decreasing the amount of stored information by orders of magnitude without sacrificing accuracy.

IROS Conference 2016 Conference Paper

Large-scale 3D scene reconstruction with Hilbert Maps

  • Vitor Guizilini
  • Fabio Ramos 0001

3D scene reconstruction involves the volumetric modeling of space, and it is a fundamental step in a wide variety of robotic applications, including grasping, obstacle avoidance, path planning, mapping and many others. Nowadays, sensors are able to quickly collect vast amounts of data, and the challenge has become one of storing and processing all this information in a timely manner, especially if real-time performance is required. Recently, a novel technique for the stochastic learning of discriminative models through continuous occupancy maps was proposed: Hilbert Maps [18], which is able to represent the input space at an arbitrary resolution while capturing statistical relationships between measurements. The original framework was proposed for 2D environments, and here we extend it to higher-dimensional spaces, addressing some of the challenges brought by the curse of dimensionality. Namely, we propose a method for the automatic selection of feature coordinate locations, and introduce the concept of localized automatic relevance determination (LARD) to the Hilbert Maps framework, in which different dimensions in the projected Hilbert space operate within independent length-scale values. The proposed technique was tested against other state-of-the-art 3D scene reconstruction tools in three different datasets: a simulated indoors environment, RIEGL laser scans and dense LSD-SLAM pointclouds. The results testify to the proposed framework's ability to model complex structures and correctly interpolate over unobserved areas of the input space while achieving real-time training and querying performances.

ICRA Conference 2016 Conference Paper

Route planning for active classification with UAVs

  • Kelen Vivaldini
  • Vitor Guizilini
  • Matheus Della Croce Oliveira
  • Thiago H. Martinelli
  • Denis Fernando Wolf
  • Fabio Ramos 0001

Mapping agricultural crops from images captured by UAVs enables fast environmental monitoring and diagnosis over large areas. Airborne monitoring in agriculture can substantially impact the identification of diseases and produce accurate information on affected areas. The problem can be formulated as a classification task on aerial images, with significant opportunities to impact other fields. This paper presents an active learning method based on route planning that improves knowledge of visited areas and minimizes uncertainty about the classification of diseases in crops. Binary logistic regression and a Gaussian Process were used for the detection of pathologies and for map interpolation, respectively. A Bayesian optimization strategy is also proposed for planning an informative trajectory, maximizing the search for affected areas in an initially unknown environment.

AAAI Conference 2015 Conference Paper

A Nonparametric Online Model for Air Quality Prediction

  • Vitor Guizilini
  • Fabio Ramos

We introduce a novel method for the continuous online prediction of particulate matter in the air (more specifically, PM10 and PM2.5) given sparse sensor information. A nonparametric model is developed using Gaussian Processes, which eschews the need for an explicit formulation of internal – and usually very complex – dependencies between meteorological variables. Instead, it uses historical data to extrapolate pollutant values both spatially (in areas with no sensor information) and temporally (the near future). Each prediction also contains a respective variance, indicating its uncertainty level and thus allowing a probabilistic treatment of results. A novel training methodology (Structural Cross-Validation) is presented, which preserves the spatiotemporal structure of available data during the hyperparameter optimization process. Tests were conducted using a real-time feed from a sensor network in an area of roughly 50 × 80 km, alongside comparisons with other techniques for air pollution prediction. The promising results motivated the development of a smartphone application and a website, currently in use to increase the efficiency of air quality monitoring and control in the area.
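
A hedged illustration of the modelling idea, not the paper's exact kernel or its Structural Cross-Validation procedure: a Gaussian Process over (x, y, time) inputs interpolates sparse pollutant readings and reports a predictive variance for every query. The kernel choice, length-scales, and synthetic data are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Fake sensor data: (x_km, y_km, hour) -> PM10 concentration.
X = np.column_stack([rng.uniform(0, 50, 300), rng.uniform(0, 80, 300), rng.uniform(0, 48, 300)])
y = 30 + 10 * np.sin(X[:, 2] / 6) + 0.1 * X[:, 0] + rng.normal(0, 2, 300)

kernel = RBF(length_scale=[10.0, 10.0, 6.0]) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Query an unobserved location a few hours into the future.
mean, std = gp.predict(np.array([[25.0, 40.0, 50.0]]), return_std=True)
print(f"predicted PM10: {mean[0]:.1f} +/- {std[0]:.1f}")
```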

ICRA Conference 2015 Conference Paper

Automatic detection of Ceratocystis wilt in Eucalyptus crops from aerial images

  • Jefferson R. Souza
  • Caio César Teodoro Mendes
  • Vitor Guizilini
  • Kelen Vivaldini
  • Adimara Colturato
  • Fabio Ramos 0001
  • Denis Fernando Wolf

One of the challenges in precision agriculture is the detection of diseased crops in agricultural environments. This paper presents a methodology to detect the Ceratocystis wilt disease in Eucalyptus crops. An unmanned aerial vehicle is used to obtain high-resolution RGB images of a predefined area. The methodology enables the extraction of visual features from image regions and uses several supervised machine learning (ML) techniques to classify regions into three classes: ground, healthy and diseased plants. Several learning techniques were compared using data obtained from a commercial Eucalyptus plantation. Experimental results show that the GP learning model is more reliable than the other learning methods for accurately identifying diseased trees.

ICRA Conference 2014 Conference Paper

Online self-supervised multi-instance segmentation of dynamic objects

  • Alex Bewley
  • Vitor Guizilini
  • Fabio Ramos 0001
  • Ben Upcroft

This paper presents a method for the continuous segmentation of dynamic objects using only a vehicle mounted monocular camera without any prior knowledge of the object's appearance. Prior work in online static/dynamic segmentation [1] is extended to identify multiple instances of dynamic objects by introducing an unsupervised motion clustering step. These clusters are then used to update a multi-class classifier within a self-supervised framework. In contrast to many tracking-by-detection based methods, our system is able to detect dynamic objects without any prior knowledge of their visual appearance, shape, or location. Furthermore, the classifier is used to propagate labels of the same object in previous frames, which facilitates the continuous tracking of individual objects based on motion. The proposed system is evaluated using recall and false alarm metrics in addition to a new multi-instance labelled dataset to measure the performance of segmenting multiple instances of objects.

ICRA Conference 2013 Conference Paper

Online self-supervised segmentation of dynamic objects

  • Vitor Guizilini
  • Fabio Ramos 0001

We address the problem of automatically segmenting dynamic objects in an urban environment from a moving camera without manual labelling, in an online, self-supervised learning manner. We use input images obtained from a single uncalibrated camera placed on top of a moving vehicle, extracting and matching pairs of sparse features that represent the optical flow information between frames. This optical flow information is initially divided into two classes, static or dynamic, where the static class represents features that comply with the constraints provided by the camera motion and the dynamic class represents the ones that do not. This initial classification is used to incrementally train a Gaussian Process (GP) classifier to segment dynamic objects in new images. The hyperparameters of the GP covariance function are optimized online during navigation, and the available self-supervised dataset is updated as new relevant data is added and redundant data is removed, resulting in a near-constant computing time even after long periods of navigation. The output is a vector containing the probability that each pixel in the image belongs to either the static or dynamic class (ranging from 0 to 1), along with the corresponding uncertainty estimate of the classification. Experiments conducted in an urban environment, with cars and pedestrians as dynamic objects and no prior knowledge or additional sensors, show promising results even when the vehicle is moving at considerable speeds (up to 50 km/h). This scenario produces a large quantity of featureless regions and false matches that is very challenging for conventional approaches. Results obtained using a portable camera device also testify to our algorithm's ability to generalize over different environments and configurations without any fine-tuning of parameters.
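
A compact stand-in for the self-supervised pipeline described above, with an assumed feature construction rather than the paper's: optical-flow-derived feature vectors are first labelled automatically by whether they agree with the estimated camera motion, and a GP classifier is then trained on those labels to score new features as static or dynamic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

rng = np.random.default_rng(0)
# Toy features per matched point: (flow_x, flow_y, ego-motion reprojection residual).
flow = rng.normal(0, 1, size=(500, 2))
residual = rng.exponential(0.5, size=(500, 1))
features = np.hstack([flow, residual])

# Self-supervised labels: points whose residual fits the ego-motion model are "static".
labels = (residual[:, 0] < 1.0).astype(int)     # 1 = static, 0 = dynamic

gpc = GaussianProcessClassifier().fit(features, labels)
p_static = gpc.predict_proba(features[:5])[:, 1]
print(p_static)    # per-point probability of belonging to the static class
```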

ICRA Conference 2012 Conference Paper

Semi-parametric models for visual odometry

  • Vitor Guizilini
  • Fabio Ramos 0001

This paper introduces a novel framework for estimating the motion of a robotic car from image information, a scenario widely known as visual odometry. Most current monocular visual odometry algorithms rely on a calibrated camera model and recover relative rotation and translation by tracking image features and applying geometrical constraints. This approach has some drawbacks: translation is recovered up to a scale, it requires camera calibration which can be tricky under certain conditions, and uncertainty estimates are not directly obtained. We propose an alternative approach that involves the use of semi-parametric statistical models as a means to recover scale, infer camera parameters and provide uncertainty estimates given a training dataset. As opposed to conventional non-parametric machine learning procedures, where standard models for egomotion would be neglected, we present a novel framework in which the existing parametric models and powerful non-parametric Bayesian learning procedures are combined. We devise a multiple output Gaussian Process (GP) procedure, named Coupled GP, that uses a parametric model as the mean function and a non-stationary covariance function to map image features directly into vehicle motion. Additionally, this procedure is also able to infer joint uncertainty estimates (full covariance matrices) for rotation and translation. Experiments performed using data collected from a single camera under challenging conditions show that this technique outperforms traditional methods in trajectories of several kilometers.
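
One common way to realise a "parametric mean plus non-parametric GP" model of the kind described above, as a simplified single-output sketch rather than the authors' multi-output Coupled GP: fit the parametric motion model first, then fit a GP to its residuals, so predictions combine both components and inherit an uncertainty estimate from the GP. The linear motion model and synthetic data are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))                         # stand-in for image-feature statistics
true_w = rng.normal(size=6)
y = X @ true_w + 0.3 * np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=400)   # e.g. forward speed

# 1) Parametric mean: a simple linear egomotion model fit by least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w

# 2) Non-parametric part: a GP absorbs what the parametric model cannot explain.
gp = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(0.01), normalize_y=True).fit(X, residuals)

X_test = rng.normal(size=(5, 6))
res_mean, res_std = gp.predict(X_test, return_std=True)
pred = X_test @ w + res_mean                          # semi-parametric prediction
print(np.column_stack([pred, res_std]))               # motion estimate + uncertainty
```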

ICRA Conference 2011 Conference Paper

Visual odometry learning for unmanned aerial vehicles

  • Vitor Guizilini
  • Fabio Ramos 0001

This paper addresses the problem of using visual information to estimate vehicle motion (a.k.a. visual odometry) from a machine learning perspective. The vast majority of current visual odometry algorithms are heavily based on geometry, using a calibrated camera model to recover relative translation (up to scale) and rotation by tracking image features over time. Our method eliminates the need for a parametric model by jointly learning how image structure and vehicle dynamics affect camera motion. This is achieved with a Gaussian Process extension, called Coupled GP, which is trained in a supervised manner to infer the underlying function mapping optical flow to relative translation and rotation. Matched image feature parameters are used as inputs and linear and angular velocities are the outputs in our non-linear multi-task regression problem. We show here that it is possible, using a single uncalibrated camera and establishing a first-order temporal dependency between frames, to jointly estimate not only a full 6 DoF motion (along with a full covariance matrix) but also relative scale, a non-trivial problem in monocular configurations. Experiments were performed with imagery collected with an unmanned aerial vehicle (UAV) flying over a deserted area at speeds of 100–120 km/h and altitudes of 80–100 m, a scenario that constitutes a challenge for traditional visual odometry estimators.