Author name cluster

Jongwoo Lim

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers

2 author rows

NeurIPS Conference 2024 Conference Paper

4D Gaussian Splatting in the Wild with Uncertainty-Aware Regularization

Mijeong Kim
Jongwoo Lim
Bohyung Han

Novel view synthesis of dynamic scenes is becoming important in various applications, including augmented and virtual reality. We propose a novel 4D Gaussian Splatting (4DGS) algorithm for dynamic scenes from casually recorded monocular videos. To overcome the overfitting problem of existing work for these real-world videos, we introduce an uncertainty-aware regularization that identifies uncertain regions with few observations and selectively imposes additional priors based on diffusion models and depth smoothness on such regions. This approach improves both the performance of novel view synthesis and the quality of training image reconstruction. We also identify the initialization problem of 4DGS in fast-moving dynamic regions, where the Structure from Motion (SfM) algorithm fails to provide reliable 3D landmarks. To initialize Gaussian primitives in such regions, we present a dynamic region densification method using the estimated depth maps and scene flow. Our experiments show that the proposed method improves the performance of 4DGS reconstruction from a video captured by a handheld monocular camera and also exhibits promising results in few-shot static scene reconstruction.

IROS Conference 2024 Conference Paper

Fast Spatial Reasoning of Implicit 3D Maps through Explicit Near-Far Sampling Range Prediction

Chaerin Min
Sehyun Cha
Changhee Won
Jongwoo Lim

3D mapping is critical for many robotics applications, such as autonomous navigation and object manipulation. Recently, deep implicit mapping approaches have received much attention for their compactness and ability to represent fine-grained details. However, without explicit guidance, such implicit representations are often cumbersome for searching the full range on the rays to find the object surfaces. As a result, several approaches, including hierarchical sampling, occupancy grids, and zero-level set baking, have been proposed to improve sampling where costly forward passes of the neural network should be performed. However, hierarchical sampling is still suboptimal in that it requires uniform coarse samples. Discrete occupancy grids of Instant NGP and zero-level sets of various baking methods are less suitable for large and noisy real scenes. In this paper, we present a novel framework for adaptively predicting the near-far range for sampling the query positions of the deep implicit map. For this purpose, the truncated signed distance grid for the map is pre-constructed and used to provide hints for near-far prediction during rendering. In addition, our recovery algorithm automatically detects failed near-far predictions and recovers only those rays by directly using the implicit map. We conduct extensive experiments on a synthetic dataset, a public real dataset, and a real dataset captured by our multi-camera robot system. The experimental results show that our algorithm achieves the same rendering quality with surprisingly fewer samples compared to the existing methods, which means that the robot can reason about the image and depth properties of the scene much faster. Finally, a thorough analysis of the sample distribution along the rays is provided to give a better understanding of our method’s strong efficiency, adaptability, and robustness. https://chaerinmin.github.io/TSDF-sampling/

IROS Conference 2022 Conference Paper

Online Extrinsic Correction of Multi-Camera Systems by Low-Dimensional Parameterization of Physical Deformation

Sangheon Yang
Jongwoo Lim

In this paper, we propose an online extrinsic correction method that effectively optimizes the extrinsic parameters of multi-camera systems used in visual SLAM. In typical visual SLAM systems that use multi-camera settings, the intrinsic and extrinsic parameters of the cameras are calculated through offline calibration and are used as fixed constraints in online execution. However, the camera rig can be physically deformed by shock or vibration, and the deviation from the offline calibration parameters can adversely affect the accuracy of triangulation and pose estimation. Therefore, it is crucial to maintain accurate calibration of the camera rig continuously throughout execution. Previous online calibration methods optimize the extrinsic camera parameters in full degrees of freedom (DoF) by minimizing the reprojection error, but the limited visual information available online may bias the resulting camera poses. Observing that the cameras are mounted on a physical body and that the ways in which the body can deform are restricted, we propose to model the pattern of physical rig deformation by external forces in advance, and then use the pre-trained low-dimensional deformation model to robustly and accurately estimate the changed camera poses in real time. The proposed method consists of two steps. First, the physical model of the camera system is constructed in a simulator, the actual deformations by various external disturbances are recorded, and the deformation patterns are modeled by a PCA algorithm to build a low-dimensional model. Then, in online execution, the camera poses are updated by minimizing the reprojection errors of visual features within the pre-trained low-dimensional parameterization, instead of optimizing all camera poses independently. In experiments in synthetic environments, the proposed online extrinsic correction method produces more accurate and robust camera pose estimates than the existing method, even when inaccurate 3D-2D correspondences exist or 2D feature positions are noisy.

ICLR Conference 2020 Conference Paper

Generalized Convolutional Forest Networks for Domain Generalization and Visual Recognition

Jongbin Ryu
Gitaek Kwon
Ming-Hsuan Yang
Jongwoo Lim

When constructing random forests, it is of prime importance to ensure high accuracy and low correlation of individual tree classifiers for good performance. Nevertheless, it is typically difficult for existing random forest methods to strike a good balance between these conflicting factors. In this work, we propose generalized convolutional forest networks that learn a feature space to maximize the strength of individual tree classifiers while minimizing their correlation. The feature space is iteratively constructed by a probabilistic triplet sampling method based on the distribution obtained from the splits of the random forest. The sampling process is designed to pull the data of the same label together for higher strength and push away the data frequently falling to the same leaf nodes. We perform extensive experiments on five image classification and two domain generalization datasets with ResNet-50 and DenseNet-161 backbone networks. Experimental results show that the proposed algorithm performs favorably against state-of-the-art methods.

ICRA Conference 2020 Conference Paper

OmniSLAM: Omnidirectional Localization and Dense Mapping for Wide-baseline Multi-camera Systems

Changhee Won
Hochang Seok
Zhaopeng Cui
Marc Pollefeys
Jongwoo Lim

In this paper, we present an omnidirectional localization and dense mapping system for a wide-baseline multiview stereo setup with ultra-wide field-of-view (FOV) fisheye cameras, which has a 360° coverage of stereo observations of the environment. For more practical and accurate reconstruction, we first introduce improved, lightweight deep neural networks for omnidirectional depth estimation, which are faster and more accurate than the existing networks. Second, we integrate our omnidirectional depth estimates into the visual odometry (VO) and add a loop closing module for global consistency. Using the estimated depth map, we reproject keypoints onto each other view, which leads to a better and more efficient feature matching process. Finally, we fuse the omnidirectional depth maps and the estimated rig poses into the truncated signed distance function (TSDF) volume to acquire a 3D map. We evaluate our method on synthetic datasets with ground-truth and real-world sequences of challenging environments, and the extensive experiments show that the proposed system generates excellent reconstruction results in both synthetic and real-world environments.

IROS Conference 2020 Conference Paper

Unified Calibration for Multi-camera Multi-LiDAR Systems using a Single Checkerboard

Wonmyung Lee
Changhee Won
Jongwoo Lim

In this paper, we propose a unified calibration method for multi-camera multi-LiDAR systems. Using only a single planar checkerboard, the checkerboard frames captured by each sensor are classified as either global frames, if they are observed by at least two sensors, or local frames, if observed by a single camera. Both global and local frames of each camera are used to estimate its intrinsic parameters, whereas the global frames between sensors are used for computing their relative poses. In contrast to the previous methods that simply combine the pairwise poses (e.g., camera-to-camera or camera-to-LiDAR) that are separately estimated, we further optimize the sensor poses in the system globally using all observations as the constraints in the optimization problem. We find that the point-to-plane distances are effective as camera-to-LiDAR constraints where the points are 3D positions of the checkerboard corners and the planes are estimated from the LiDAR point-cloud. Also, abundant corner observations in the local frames enable the joint optimization of intrinsic and extrinsic parameters in a unified framework. The proposed calibration method utilizes all observations in a unified global optimization framework, and it significantly reduces the error caused by a simple composition of the relative sensor poses. We extensively evaluate the proposed algorithm qualitatively and quantitatively using real and synthetic datasets. We plan to make the implementation open to the public with the paper publication.
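
The point-to-plane constraint mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the plane parameterization `n . x + d = 0`, and the use of nested lists for the rotation are all assumptions made for illustration. Each checkerboard corner, expressed in the camera frame, is mapped into the LiDAR frame by a candidate camera-to-LiDAR pose, and its signed distance to the LiDAR-estimated plane is the residual an optimizer would drive toward zero.

```python
import math


def apply_rigid(R, t, p):
    """Return R * p + t for a 3x3 rotation R (nested lists) and translation t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def point_to_plane_residuals(corners_cam, R, t, normal, offset):
    """Signed distances of transformed corners to the plane normal . x + offset = 0.

    corners_cam: checkerboard corners in the camera frame (hypothetical input).
    R, t:        candidate camera-to-LiDAR pose (hypothetical parameterization).
    normal, offset: plane fitted to the LiDAR points of the checkerboard.
    """
    norm = math.sqrt(sum(c * c for c in normal))
    residuals = []
    for p in corners_cam:
        q = apply_rigid(R, t, p)
        residuals.append((sum(normal[i] * q[i] for i in range(3)) + offset) / norm)
    return residuals


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    corners = [[0.0, 0.0, 1.0], [1.0, 2.0, 1.0]]
    # Plane z = 1 in the LiDAR frame; a 0.5 m offset along z shows up in every residual.
    print(point_to_plane_residuals(corners, identity, [0.0, 0.0, 0.5], [0.0, 0.0, 1.0], -1.0))
```

In a full calibration these residuals would be stacked with the camera reprojection terms and minimized jointly over all sensor poses, which is the global optimization the abstract describes.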

ICRA Conference 2019 Conference Paper

ROVO: Robust Omnidirectional Visual Odometry for Wide-baseline Wide-FOV Camera Systems

Hochang Seok
Jongwoo Lim

In this paper we propose a robust visual odometry system for a wide-baseline camera rig with wide field-of-view (FOV) fisheye lenses, which provides full omnidirectional stereo observations of the environment. For more robust and accurate ego-motion estimation we add three components to the standard VO pipeline: 1) the hybrid projection model for improved feature matching, 2) multi-view P3P RANSAC algorithm for pose estimation, and 3) online update of rig extrinsic parameters. The hybrid projection model combines the perspective and cylindrical projection to maximize the overlap between views and minimize the image distortion that degrades feature matching performance. The multi-view P3P RANSAC algorithm extends the conventional P3P RANSAC to multi-view images so that all feature matches in all views are considered in the inlier counting for robust pose estimation. Finally, the online extrinsic calibration is seamlessly integrated in the backend optimization framework so that the changes in camera poses due to shocks or vibrations can be corrected automatically. The proposed system is extensively evaluated with synthetic datasets with ground-truth and real sequences of highly dynamic environments, and its superior performance is demonstrated.

ICRA Conference 2019 Conference Paper

SweepNet: Wide-baseline Omnidirectional Depth Estimation

Changhee Won
Jongbin Ryu
Jongwoo Lim

Omnidirectional depth sensing has an advantage over conventional stereo systems since it enables us to recognize the objects of interest in all directions without any blind regions. In this paper, we propose a novel wide-baseline omnidirectional stereo algorithm which computes the dense depth estimate from the fisheye images using a deep convolutional neural network. The capture system consists of multiple cameras mounted on a wide-baseline rig with ultra-wide field of view (FOV) lenses, and we present the calibration algorithm for the extrinsic parameters based on the bundle adjustment. Instead of estimating depth maps from multiple sets of rectified images and stitching them, our approach directly generates one dense omnidirectional depth map with full 360° coverage at the rig global coordinate system. To this end, the proposed neural network is designed to output the cost volume from the warped images in the sphere sweeping method, and the final depth map is estimated by taking the minimum cost indices of the aggregated cost volume by SGM. For training the deep neural network and testing the entire system, realistic synthetic urban datasets are rendered using Blender. The experiments using the synthetic and real-world datasets show that our algorithm outperforms the conventional depth estimation methods and generates highly accurate depth maps.

IROS Conference 2017 Conference Paper

Unified image retrieval and keypoint matching by local geometric consistency and non-linear diffusion

Sehyung Lee
Jongwoo Lim
Il Hong Suh

Feature-based image retrieval and feature matching have been used together in many applications, but they have been treated as two separate problems. We propose a unified approach which, for a query image, finds a set of candidate images together with feature matching results. By considering the local geometric consistency of neighboring features, we can find more and better feature matches even in challenging situations. Since the proposed forward/backward matching and non-linear diffusion run very efficiently, they can be used in the candidate image selection and improve the image retrieval performance significantly. Through quantitative comparisons we show that the proposed approach performs better than the recent state-of-the-art feature matching algorithms and image retrieval algorithms.

IROS Conference 2017 Conference Paper

Visual inertial odometry using coupled nonlinear optimization

Euntae Hong
Jongwoo Lim

Visual inertial odometry (VIO) has recently gained much interest for efficient and accurate ego-motion estimation of robots and automobiles. With a monocular camera and an inertial measurement unit (IMU) rigidly attached, VIO aims to estimate the 3D pose trajectory of the device in a global metric space. We propose a novel visual inertial odometry algorithm which directly optimizes the camera poses with noisy IMU data and visual feature locations. Instead of running separate filters for IMU and visual data, we put them into a unified non-linear optimization framework in which the perspective reprojection costs of visual features and the motion costs on the acceleration and angular velocity from the IMU and pose trajectory are jointly optimized. The proposed system is tested on the EuRoC dataset for quantitative comparison with the state-of-the-art in visual-inertial odometry and on mobile phone data as a real-world application. The proposed algorithm is conceptually very clear and simple, achieves good accuracy, and can be easily implemented using publicly available non-linear optimization toolkits.

IROS Conference 2016 Conference Paper

Anytime RRBT for handling uncertainty and dynamic objects

Hyunchul Yang
Jongwoo Lim
Sung-Eui Yoon

We present an efficient anytime motion planner for mobile robots that considers both other dynamic obstacles and uncertainty caused by various sensors and low-level controllers. Our planning algorithm, which is an anytime extension of the Rapidly-exploring Random Belief Tree (RRBT), maintains the best possible path throughout the robot execution, and the generated path gets closer to the optimal one as more computation resources are allocated. We propose a branch-and-bound method to cull out unpromising areas by considering path lengths and uncertainty. We also propose an uncertainty-aware velocity obstacle as a simple local analysis to avoid dynamic obstacles efficiently by finding a collision-free velocity. We have tested our method with three benchmarks that have non-linear measurement regions or potential collisions with dynamic obstacles. By using the proposed methods, we achieve up to five times faster performance given a fixed path cost.

IROS Conference 2016 Conference Paper

Keyframe-based online object learning and detection

Sehyung Lee
Jongwoo Lim
Il Hong Suh

In this paper, we propose a keyframe-based online object learning and detection method. To manage appearance changes of target objects, the proposed method incrementally updates an object database using detection results. One of the major problems in updating the appearance model is that the object model can gradually be degraded by accumulated errors and biased to specific views. To solve this problem, our object model is updated according to the selected keyframes, which not only help memorize important views of target objects, but also prevent the holistic appearance model from overfitting. The database is represented as a graph of the registered images, and the importance of the database images is measured by analyzing the constructed graph. Then, the redundant or less important images are discarded from the database. As a result, the database is efficiently maintained while new views of the objects are gradually added. The experimental results demonstrate that the proposed algorithm efficiently maintains the object database and improves the detection performance compared to previous incremental object learning and detection algorithms.

IROS Conference 2015 Conference Paper

Incremental learning from a single seed image for object detection

Sehyung Lee
Jongwoo Lim
Il Hong Suh

In this paper, we propose a novel online multiobject learning and detection algorithm. From single seed images of the target objects, our algorithm detects these objects in the input sequence, and incrementally updates the databases with the detection results. Reasonably sized databases are maintained as graphs of the registered images, while new views of the objects are added as the detection proceeds. The importance of the registered images is computed using our ranking algorithm, and redundant images are pruned from the database. The proposed method fully utilizes graphical representation to detect and recognize objects. A 3D model of a candidate object is built on-the-fly using the retrieved images, and initially undetected features are hallucinated for further matching and verification. This process improves the detection performance compared to the baseline algorithm. Object/background feature classification and object-likelihood maps effectively keep noisy background features from being added to the databases. The experimental results demonstrate that the proposed algorithm efficiently maintains the object databases and achieves better performance.

ICRA Conference 2014 Conference Paper

Outdoor place recognition in urban environments using straight lines

Jin Han Lee
Sehyung Lee
Guoxuan Zhang
Jongwoo Lim
Wan Kyun Chung
Il Hong Suh

In this paper, we propose a visual place recognition algorithm which uses only straight line features in challenging outdoor environments. Compared to point features used in most existing place recognition methods, line features are easily found in man-made environments and more robust to environmental changes such as illumination, viewing direction, or occlusion because they are more likely to be extracted from structures. Candidate matches are found using a vocabulary tree and their geometric consistency is verified by a motion estimation algorithm using line segments. The proposed algorithm operates in real-time, and it is tested with a challenging real-world dataset with more than 10,000 database images acquired in urban driving scenarios.

ICRA Conference 2014 Conference Paper

Real-time 6-DOF monocular visual SLAM in a large-scale environment

Hyon Lim
Jongwoo Lim
H. Jin Kim

A real-time approach for monocular visual simultaneous localization and mapping (SLAM) within a large-scale environment is proposed. From a monocular video sequence, the proposed method continuously computes the current 6-DOF camera pose and 3D landmark positions. The proposed method successfully builds consistent maps from challenging outdoor sequences using a monocular camera as the only sensor, while existing approaches have utilized additional structural information such as camera height from the ground. By using a binary descriptor and metric-topological mapping, the system demonstrates real-time performance on a large-scale outdoor environment without utilizing GPUs or reducing input image size. The effectiveness of the proposed method is demonstrated on various challenging video sequences including the KITTI dataset and indoor video captured on a micro aerial vehicle.

ICRA Conference 2013 Conference Paper

Place recognition using straight lines for vision-based SLAM

Jin Han Lee
Guoxuan Zhang
Jongwoo Lim
Il Hong Suh

Most visual simultaneous localization and mapping systems use point features as their landmarks and adopt point-based feature descriptors to recognize them. Compared to point landmarks, however, lines better convey the structural information of the environment. Despite this benefit, they have not been widely used because lines are more difficult to detect, track, and recognize, which has delayed their use as landmarks. In this paper, we propose a place recognition algorithm using straight line features, which enables reliable loop closure detections in large complex environments under significant illumination changes. A vocabulary tree trained with the mean standard-deviation line descriptor is used in finding the candidate matches between keyframes, and a Bayesian filtering framework enables reliable keyframe matching for large-scale loop closures. The proposed algorithm is compared with state-of-the-art point-based methods using scale-invariant feature transform or speeded up robust features. The experimental results show that the proposed method outperforms the others in challenging indoor environments.

IROS Conference 2011 Conference Paper

Stereo depth map fusion for robot navigation

Christian Häne
Christopher Zach
Jongwoo Lim
Ananth Ranganathan
Marc Pollefeys

We present a method to reconstruct indoor environments from stereo image pairs, suitable for the navigation of robots. To enable a robot to navigate solely using visual cues it receives from a stereo camera, the depth information needs to be extracted from the image pairs and combined into a common representation. The initially determined raw depthmaps are fused into a two-level heightmap representation which contains a floor and a ceiling height level. To reduce the noise in the height maps we employ a total variation regularized energy functional. With this 2.5D representation of the scene the computational complexity of the energy optimization is reduced by one dimension in contrast to other fusion techniques that work on the full 3D space such as volumetric fusion. While we show only results for indoor environments the approach can be extended to generate heightmaps for outdoor environments.

IROS Conference 2011 Conference Paper

Visual place categorization in maps

Ananth Ranganathan
Jongwoo Lim

Categorizing areas such as rooms and corridors using a discrete set of labels has been of long-standing interest to the robotics community. A map with labels such as kitchen, lab, copy room, etc. provides a basic amount of semantic information that can enable a robot to perform a number of tasks specified in human-centric terms rather than just map coordinates. In this work, we propose a method to label areas in a pre-built map using information from camera images. In contrast to most existing approaches, our method labels the area that is viewed in the camera image rather than just the current robot location. Place labels are generated from the image input using the PLISS system [14]. The label information on the viewed areas is integrated in a Conditional Random Field (CRF) that also considers higher level semantics such as adjacency and place boundaries. We demonstrate our technique on maps built using laser and visual SLAM systems. We obtain the correct place categorization of a very high percentage of the map areas even when the place categorization system is trained using images only from the internet.

IROS Conference 2010 Conference Paper

Parallel, real-time visual SLAM

Brian Clipp
Jongwoo Lim
Jan-Michael Frahm
Marc Pollefeys

In this paper we present a novel system for real-time, six degree of freedom visual simultaneous localization and mapping using a stereo camera as the only sensor. The system makes extensive use of parallelism both on the graphics processor and through multiple CPU threads. Working together these threads achieve real-time feature tracking, visual odometry, loop detection and global map correction using bundle adjustment. The resulting corrections are fed back into the visual odometry system to limit its drift over long sequences. We demonstrate our system on a series of videos from challenging indoor environments with moving occluders, visually homogeneous regions with few features, scene parts with large changes in lighting and fast camera motion. The total system performs its task of global map building in real time including loop detection and bundle adjustment on typical office building scale scenes.

ICRA Conference 2009 Conference Paper

Optimized projection pattern supplementing stereo systems

Jongwoo Lim

Stereo camera systems are widely used in many real applications including indoor and outdoor robotics. They are very easy to use and provide accurate depth estimates on well-textured scenes, but often fail when the scene does not have enough texture. It is possible to help the system work better in this situation by actively projecting certain light patterns onto the scene to create artificial texture on the scene surface. The question we try to answer in this paper is what would be the best pattern(s) to project. This paper introduces optimized projection patterns based on a novel concept of (symmetric) non-recurring De Bruijn sequences, and describes algorithms to generate such sequences. A projected pattern creates an artificial texture which does not contain any duplicate patterns over epipolar lines within a certain range, thus it makes the correspondence match simple and unique. The proposed patterns are compatible with most existing stereo algorithms, meaning that they can be used without any changes in the stereo algorithm and one can immediately get much denser depth estimates without any additional computational cost. It is also argued that the proposed patterns are optimal binary patterns, and finally a few experimental results using stereo and space-time stereo algorithms are presented.
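
To illustrate why De Bruijn-style sequences give unambiguous matches, here is a minimal Python sketch of the classical binary De Bruijn construction. It is not the paper's (symmetric) non-recurring variant, whose definition is not given in this abstract; the function names `de_bruijn` and `windows_are_unique` are illustrative. In a classical De Bruijn sequence of order n, every length-n window appears exactly once when the sequence is read cyclically, so a window observed along a line identifies its position uniquely.

```python
def de_bruijn(k, n):
    """Classical De Bruijn sequence over alphabet {0..k-1} with window length n.

    Built by concatenating Lyndon words whose length divides n
    (the standard recursive construction).
    """
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence


def windows_are_unique(seq, n):
    """Check that all cyclic length-n windows of seq are distinct."""
    m = len(seq)
    windows = {tuple(seq[(i + j) % m] for j in range(n)) for i in range(m)}
    return len(windows) == m


if __name__ == "__main__":
    pattern = de_bruijn(2, 4)  # binary pattern, 16 stripes
    print(pattern, windows_are_unique(pattern, 4))
```

Treating each bit as a dark or bright stripe gives a binary texture in which any four consecutive stripes are distinct, which is the kind of uniqueness along epipolar lines that the abstract relies on.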

IROS Conference 2008 Conference Paper

The memory game: Creating a human-robot interactive scenario for ASIMO

Victor Ng-Thow-Hing
Jongwoo Lim
Joel Wormer
Ravi Kiran Sarvadevabhatla
Carlos Rocha
Kikuo Fujimura
Yoshiaki Sakagami

We present a human-robot interactive scenario consisting of a memory card game between Honda’s humanoid robot ASIMO and a human player. The game features perception exclusively through ASIMO’s on-board cameras and both reactive and proactive behaviors specific to different situational contexts in the memory game. ASIMO is able to build a dynamic environmental map of relevant objects in the game such as the table and card layout as well as understand activities from the player such as pointing at cards, flipping cards and removing them from the table. Our system architecture, called the Cognitive Map, treats the memory game as a multi-agent system, with modules acting independently and communicating with each other via messages through a shared blackboard system. The game behavior module can model game state and contextual information to make decisions based on different pattern recognition modules. Behavior is then sent through high-level command interfaces to be resolved into actual physical actions by the robot via a multi-modal communication module. The experience gained in modeling this interactive scenario will allow us to reuse the architecture to create new scenarios and explore new research directions in learning how to respond to new interactive situations.

NeurIPS Conference 2004 Conference Paper

Adaptive Discriminative Generative Model and Its Applications

Ruei-sung Lin
David Ross
Jongwoo Lim
Ming-Hsuan Yang

This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.

NeurIPS Conference 2004 Conference Paper

Incremental Learning for Visual Tracking

Jongwoo Lim
David Ross
Ruei-sung Lin
Ming-Hsuan Yang

Most existing tracking algorithms construct a representation of a target object prior to the tracking task starts, and utilize invariant features to handle appearance variation of the target caused by lighting, pose, and view angle change. In this paper, we present an efficient and effec- tive online algorithm that incrementally learns and adapts a low dimen- sional eigenspace representation to reflect appearance changes of the tar- get, thereby facilitating the tracking task. Furthermore, our incremental method correctly updates the sample mean and the eigenbasis, whereas existing incremental subspace update methods ignore the fact the sample mean varies over time. The tracking problem is formulated as a state inference problem within a Markov Chain Monte Carlo framework and a particle filter is incorporated for propagating sample distributions over time. Numerous experiments demonstrate the effectiveness of the pro- posed tracking algorithm in indoor and outdoor environments where the target objects undergo large pose and lighting changes. 1 Introduction The main challenges of visual tracking can be attributed to the difficulty in handling appear- ance variability of a target object. Intrinsic appearance variabilities include pose variation and shape deformation of a target object, whereas extrinsic illumination change, camera motion, camera viewpoint, and occlusions inevitably cause large appearance variation. Due to the nature of the tracking problem, it is imperative for a tracking algorithm to model such appearance variation. Here we developed a method that, during visual tracking, constantly and efficiently up- dates a low dimensional eigenspace representation of the appearance of the target object. The advantages of this adaptive subspace representation are several folds. The eigenspace representation provides a compact notion of the "thing" being tracked rather than treating the target as a set of independent pixels, i. e. , "stuff" [1]. 
The use of an incremental method continually updates the eigenspace to reflect the appearance change caused by intrinsic and extrinsic factors, thereby facilitating the tracking process. To estimate the locations of the target objects in consecutive frames, we used a sampling algorithm with likelihood estimates, in direct contrast to other tracking methods that usually solve complex optimization problems using gradient-descent approaches.

The proposed method differs from our prior work [14] in several aspects. First, the proposed algorithm does not require any training images of the target object before the tracking task starts. That is, our tracker learns a low dimensional eigenspace representation on-line and incrementally updates it as time progresses (we assume, like most tracking algorithms, that the target region has been initialized in the first frame). Second, we extend our sampling method to incorporate a particle filter so that the sample distributions are propagated over time. Based on the eigenspace model with updates, an effective likelihood estimation function is developed. Third, we extend the R-SVD algorithm [6] so that both the sample mean and eigenbasis are correctly updated as new data arrive. Though there are numerous subspace update algorithms in the literature, only the method by Hall et al. [8] is also able to update the sample mean. However, their method is based on the addition of a single column (single observation) rather than blocks (a number of observations in our case) and thus is less efficient than ours. While our formulation provides an exact solution, their algorithm gives only approximate updates and thus may suffer from numerical instability. Finally, the proposed tracker is extended to use a robust error norm for likelihood estimation in the presence of noisy data or partial occlusions, thereby rendering more accurate and robust tracking results.

2 Previous Work and Motivation

Black et al.
[4] proposed a tracking algorithm using a pre-trained view-based eigenbasis representation and a robust error norm. Instead of relying on the popular brightness constancy working principle, they advocated the use of a subspace constancy assumption for visual tracking. Although their algorithm demonstrated excellent empirical results, it requires building a set of view-based eigenbases before the tracking task starts. Furthermore, their method assumes that certain factors, such as illumination conditions, do not change significantly, because the eigenbasis, once constructed, is not updated.

Hager and Belhumeur [7] presented a tracking algorithm to handle the geometry and illumination variations of target objects. Their method extends a gradient-based optical flow algorithm to incorporate research findings in [2] for object tracking under varying illumination conditions. Before the tracking task starts, a set of illumination bases needs to be constructed at a fixed pose in order to account for appearance variation of the target due to lighting changes. Consequently, it is not clear whether this method is effective if a target object undergoes changes in illumination with arbitrary pose.

In [9] Isard and Blake developed the Condensation algorithm for contour tracking in which multiple plausible interpretations are propagated over time. Though their probabilistic approach has demonstrated success in tracking contours in clutter, the representation scheme is rather primitive, i.e., curves or splines, and is not updated as the appearance of a target varies due to pose or illumination change.

Mixture models have been used to describe appearance change for motion estimation [3] [10]. In Black et al. [3] four possible causes are identified in a mixture model for estimating appearance change in consecutive frames, and thereby more reliable image motion can be obtained. A more elaborate mixture model with an online EM algorithm was recently proposed by Jepson et al.
[10] in which they use three components and wavelet filters to account for appearance changes during tracking. Their method is able to handle variations in pose, illumination and expression. However, their WSL appearance model treats pixels within the target region independently, and therefore has no notion of the "thing" being tracked. This may result in modeling the background rather than the foreground, and in failure to track the target.

In contrast to the eigentracking algorithm [4], our algorithm does not require a training phase but learns the eigenbases on-line during the object tracking process, and constantly updates this representation as the appearance changes due to pose, view angle, and illumination variation. Further, our method uses a particle filter for motion parameter estimation rather than the Gauss-Newton method, which often gets stuck in local minima or is distracted by outliers [4]. Our appearance-based model provides a richer description than simple curves or splines as used in [9], and has a notion of the "thing" being tracked. In addition, the learned representation can be utilized for other tasks such as object recognition. In this work, an eigenspace representation is learned directly from pixel values within a target object in the image space. Experiments show that good tracking results can be obtained with this representation without resorting to wavelets as used in [10], and better performance can potentially be achieved using wavelet filters. Note also that the view-based eigenspace representation has demonstrated its ability to model the appearance of objects at different poses [13], and under different lighting conditions [2].

3 Incremental Learning for Tracking

We present the details of the proposed incremental learning algorithm for object tracking in this section.

3.1 Incremental Update of Eigenbasis and Mean

The appearance of a target object may change drastically due to intrinsic and extrinsic factors as discussed earlier.
Therefore it is important to develop an efficient algorithm to update the eigenspace as the tracking task progresses. Numerous algorithms have been developed to update an eigenbasis from a time-varying covariance matrix as more data arrive [6] [8] [11] [5]. However, most methods assume zero mean in updating the eigenbasis, except the method by Hall et al. [8], which considers the change of the mean when updating the eigenbasis as each new datum arrives. Their update algorithm only handles one datum per update and gives approximate results, while our formulation handles multiple data at the same time and renders exact solutions.

We extend the classic R-SVD method [6] so that the eigenbasis is updated while taking the shift of the sample mean into account. To the best of our knowledge, this formulation with mean update is new in the literature.

Given a $d \times n$ data matrix $A = \{I_1, \ldots, I_n\}$ where each column $I_i$ is an observation (a $d$-dimensional image vector in this paper), we can compute the singular value decomposition (SVD) of $A$, i.e., $A = U \Sigma V^\top$. When a $d \times m$ matrix $E$ of new observations is available, the R-SVD algorithm efficiently computes the SVD of the matrix $A' = (A \mid E) = U' \Sigma' V'^\top$ based on the SVD of $A$ as follows:

1. Apply QR decomposition to obtain an orthonormal basis $\tilde{E}$ of $E$, and let $U' = (U \mid \tilde{E})$.
2. Let $V' = \begin{pmatrix} V & 0 \\ 0 & I_m \end{pmatrix}$ where $I_m$ is an $m \times m$ identity matrix. It follows then that
$$\Sigma' = U'^\top A' V' = \begin{pmatrix} U^\top \\ \tilde{E}^\top \end{pmatrix} (A \mid E) \begin{pmatrix} V & 0 \\ 0 & I_m \end{pmatrix} = \begin{pmatrix} U^\top A V & U^\top E \\ \tilde{E}^\top A V & \tilde{E}^\top E \end{pmatrix} = \begin{pmatrix} \Sigma & U^\top E \\ 0 & \tilde{E}^\top E \end{pmatrix}.$$
The zero block requires $\tilde{E}$ to be orthogonal to the columns of $U$, i.e., to span the component of $E$ that lies outside the span of $U$.
3. Compute the SVD of $\Sigma' = \tilde{U} \tilde{\Sigma} \tilde{V}^\top$; the SVD of $A'$ is then $A' = U' (\tilde{U} \tilde{\Sigma} \tilde{V}^\top) V'^\top = (U' \tilde{U}) \tilde{\Sigma} (\tilde{V}^\top V'^\top)$.

Exploiting the properties of orthonormal bases and block structures, the R-SVD algorithm computes the new eigenbasis efficiently. The computational complexity analysis and more details are described in [6].

One problem with the R-SVD algorithm is that the eigenbasis $U$ is computed from $A A^\top$ with the zero mean assumption. We modify the R-SVD algorithm and compute the eigenbasis with mean update.
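The three R-SVD steps above, and the mean-corrected variant that the remainder of this section derives, can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: the function names are hypothetical, $d \ge k + m$ is assumed, and the QR factorization is applied to $(U \mid E)$ so that $\tilde{E}$ stays orthogonal to $U$ even when the new block is rank-deficient (the text does not say which matrix the QR step is applied to).

```python
import numpy as np


def rsvd_update(U, s, E):
    """Left singular vectors and singular values of (A | E), given the thin SVD
    factors U (d x k) and s (k,) of A. Assumes d >= k + m, with m = E.shape[1]."""
    k = U.shape[1]
    # Step 1: QR of (U | E). The trailing columns of Q form an orthonormal basis
    # for the part of E outside span(U) (an implementation choice, see above).
    Q, _ = np.linalg.qr(np.hstack([U, E]))
    E_tilde = Q[:, k:]
    # Step 2: Sigma' = [[Sigma, U^T E], [0, E_tilde^T E]].
    Sigma_new = np.block([
        [np.diag(s), U.T @ E],
        [np.zeros((E_tilde.shape[1], k)), E_tilde.T @ E],
    ])
    # Step 3: SVD of the small matrix, then rotate the enlarged basis.
    U_small, s_new, _ = np.linalg.svd(Sigma_new)
    return np.hstack([U, E_tilde]) @ U_small, s_new


def mean_corrected_update(U_p, s_p, mean_p, n, I_q):
    """Update the mean and the SVD of the centred data when a block I_q (d x m)
    arrives. U_p, s_p: thin SVD of the n centred old observations; mean_p: (d, 1)."""
    m = I_q.shape[1]
    mean_q = I_q.mean(axis=1, keepdims=True)
    mean_r = (n * mean_p + m * mean_q) / (n + m)
    # Extra column that accounts for the shift of the mean (Proposition 1).
    shift = np.sqrt(n * m / (n + m)) * (mean_p - mean_q)
    E_tilde = np.hstack([I_q - mean_q, shift])
    U_r, s_r = rsvd_update(U_p, s_p, E_tilde)
    return U_r, s_r, mean_r
```

In a tracker one would typically keep only the leading columns of the returned basis (16 in the experiments reported below); truncation is left out of the sketch so that the result can be compared against a batch SVD.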
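The following derivation is based on the scatter matrix, which is the same as the covariance matrix up to a scalar factor.

Proposition 1. Let $I_p = \{I_1, I_2, \ldots, I_n\}$, $I_q = \{I_{n+1}, I_{n+2}, \ldots, I_{n+m}\}$, and $I_r = (I_p \mid I_q)$. Denote the means and scatter matrices of $I_p$, $I_q$, $I_r$ as $\bar{I}_p$, $\bar{I}_q$, $\bar{I}_r$ and $S_p$, $S_q$, $S_r$ respectively. Then
$$S_r = S_p + S_q + \frac{nm}{n+m} (\bar{I}_q - \bar{I}_p)(\bar{I}_q - \bar{I}_p)^\top.$$

Proof: By definition, $\bar{I}_r = \frac{n}{n+m} \bar{I}_p + \frac{m}{n+m} \bar{I}_q$, so that $\bar{I}_p - \bar{I}_r = \frac{m}{n+m} (\bar{I}_p - \bar{I}_q)$ and $\bar{I}_q - \bar{I}_r = \frac{n}{n+m} (\bar{I}_q - \bar{I}_p)$, and
$$\begin{aligned}
S_r &= \sum_{i=1}^{n} (I_i - \bar{I}_r)(I_i - \bar{I}_r)^\top + \sum_{i=n+1}^{n+m} (I_i - \bar{I}_r)(I_i - \bar{I}_r)^\top \\
&= \sum_{i=1}^{n} (I_i - \bar{I}_p + \bar{I}_p - \bar{I}_r)(I_i - \bar{I}_p + \bar{I}_p - \bar{I}_r)^\top + \sum_{i=n+1}^{n+m} (I_i - \bar{I}_q + \bar{I}_q - \bar{I}_r)(I_i - \bar{I}_q + \bar{I}_q - \bar{I}_r)^\top \\
&= S_p + n (\bar{I}_p - \bar{I}_r)(\bar{I}_p - \bar{I}_r)^\top + S_q + m (\bar{I}_q - \bar{I}_r)(\bar{I}_q - \bar{I}_r)^\top \\
&= S_p + \frac{n m^2}{(n+m)^2} (\bar{I}_p - \bar{I}_q)(\bar{I}_p - \bar{I}_q)^\top + S_q + \frac{n^2 m}{(n+m)^2} (\bar{I}_p - \bar{I}_q)(\bar{I}_p - \bar{I}_q)^\top \\
&= S_p + S_q + \frac{nm}{n+m} (\bar{I}_p - \bar{I}_q)(\bar{I}_p - \bar{I}_q)^\top.
\end{aligned}$$
The cross terms vanish in the third line because the deviations from each block's own mean sum to zero.

Let $\hat{I}_p = \{I_1 - \bar{I}_p, \ldots, I_n - \bar{I}_p\}$, $\hat{I}_q = \{I_{n+1} - \bar{I}_q, \ldots, I_{n+m} - \bar{I}_q\}$, and $\hat{I}_r = \{I_1 - \bar{I}_r, \ldots, I_{n+m} - \bar{I}_r\}$, and let the SVD of $\hat{I}_r$ be $U_r \Sigma_r V_r^\top$. Let $\tilde{E} = \left( \hat{I}_q \mid \sqrt{\tfrac{nm}{n+m}} (\bar{I}_p - \bar{I}_q) \right)$. By Proposition 1, $S_r = (\hat{I}_p \mid \tilde{E})(\hat{I}_p \mid \tilde{E})^\top$. Therefore, we compute the SVD of $(\hat{I}_p \mid \tilde{E})$ to get $U_r$. This can be done efficiently by the R-SVD algorithm as described above.

In summary, given the mean $\bar{I}_p$ and the SVD of the existing data $I_p$, i.e., $U_p \Sigma_p V_p^\top$, and new data $I_q$, we can compute the mean $\bar{I}_r$ and the SVD of $I_r$, i.e., $U_r \Sigma_r V_r^\top$, easily:

1. Compute $\bar{I}_r = \frac{n}{n+m} \bar{I}_p + \frac{m}{n+m} \bar{I}_q$, and $\tilde{E} = \left( I_q - \bar{I}_q \mathbf{1}_{(1 \times m)} \mid \sqrt{\tfrac{nm}{n+m}} (\bar{I}_p - \bar{I}_q) \right)$.
2. Compute R-SVD with $(U_p \Sigma_p V_p^\top)$ and $\tilde{E}$ to obtain $(U_r \Sigma_r V_r^\top)$.

In numerous vision problems, we can further exploit the low dimensional approximation of image data and put larger weights on the recent observations, or equivalently downweight the contributions of previous observations. For example, as the appearance of a target object gradually changes, we may want to put more weight on recent observations in updating the eigenbasis, since they are more likely to be similar to the current appearance of the target. The forgetting factor $f$ can be used under this premise as suggested in [11], i.e.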
, $A' = (f A \mid E) = (U (f \Sigma) V^\top \mid E)$ where $A$ and $A'$ are the original and weighted data matrices, respectively.

3.2 Sequential Inference Model

The visual tracking problem is cast as an inference problem with a Markov model and a hidden state variable, where the state variable $X_t$ describes the affine motion parameters (and thereby the location) of the target at time $t$. Given a set of observed images $\mathcal{I}_t = \{I_1, \ldots, I_t\}$, we aim to estimate the value of the hidden state variable $X_t$. Using Bayes' theorem, we have
$$p(X_t \mid \mathcal{I}_t) \propto p(I_t \mid X_t) \int p(X_t \mid X_{t-1}) \, p(X_{t-1} \mid \mathcal{I}_{t-1}) \, dX_{t-1}.$$
The tracking process is governed by the observation model $p(I_t \mid X_t)$, where we estimate the likelihood of $X_t$ observing $I_t$, and the dynamical model between two states, $p(X_t \mid X_{t-1})$. The Condensation algorithm [9], based on factored sampling, approximates an arbitrary distribution of observations with a stochastically generated set of weighted samples. We use a variant of the Condensation algorithm to model the distribution over the object's location as it evolves over time.

3.3 Dynamical and Observation Models

The motion of a target object between two consecutive frames can be approximated by an affine image warping. In this work, we use the six parameters of an affine transform to model the state transition from $X_{t-1}$ to $X_t$ of a target object being tracked. Let $X_t = (x_t, y_t, \theta_t, s_t, \alpha_t, \phi_t)$ where $x_t, y_t, \theta_t, s_t, \alpha_t, \phi_t$ denote $x$, $y$ translation, rotation angle, scale, aspect ratio, and skew direction at time $t$. Each parameter in $X_t$ is modeled independently by a Gaussian distribution around its counterpart in $X_{t-1}$. That is,
$$p(X_t \mid X_{t-1}) = N(X_t; X_{t-1}, \Psi)$$
where $\Psi$ is a diagonal covariance matrix whose elements are the corresponding variances of the affine parameters, i.e., $\sigma_x^2, \sigma_y^2, \sigma_\theta^2, \sigma_s^2, \sigma_\alpha^2, \sigma_\phi^2$.

Since our goal is to use a representation to model the "thing" that we are tracking, we model the image observations using a probabilistic interpretation of principal component analysis [16].
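The Gaussian random-walk dynamics and the weighted-sample representation described above can be sketched as a single resample-and-diffuse step. This is an illustrative NumPy sketch, not the authors' implementation: the function names, the array layout (one row of six affine parameters per particle), and the log-domain normalization of the weights are assumptions made for the example.

```python
import numpy as np


def propagate(particles, weights, sigmas, rng):
    """One factored-sampling step: draw particles in proportion to their weights,
    then perturb each affine parameter with independent Gaussian noise.

    particles: (N, 6) array of states (x, y, rotation, scale, aspect, skew)
    weights:   (N,) non-negative weights summing to 1
    sigmas:    (6,) standard deviations, the square roots of the diagonal of Psi
    """
    n = particles.shape[0]
    idx = rng.choice(n, size=n, p=weights)          # resample by weight
    noise = rng.normal(0.0, sigmas, size=(n, 6))    # diagonal covariance
    return particles[idx] + noise


def normalize_weights(log_likelihoods):
    """Turn per-particle log-likelihoods into weights that sum to 1."""
    w = np.exp(log_likelihoods - np.max(log_likelihoods))  # avoid underflow
    return w / w.sum()
```

After each call to `propagate`, the observation model is evaluated at every particle and `normalize_weights` supplies the weights for the next frame; the state estimate can then be taken, for example, as the highest-weight particle.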
Given an image patch predicted by $X_t$, we assume the observed image $I_t$ was generated from a subspace spanned by $U$ and centered at $\mu$. The probability that a sample was generated from the subspace is inversely proportional to the distance $d$ from the sample to the reference point (i.e., center) of the subspace, which can be decomposed into the distance-to-subspace, $d_t$, and the distance-within-subspace from the projected sample to the subspace center, $d_w$. This distance formulation, based on an orthonormal subspace and its complement space, is similar in spirit to [12].

The probability of a sample generated from a subspace, $p_{d_t}(I_t \mid X_t)$, is governed by a Gaussian distribution:
$$p_{d_t}(I_t \mid X_t) = N(I_t; \mu, U U^\top + \varepsilon I)$$
where $I$ is an identity matrix, $\mu$ is the mean, and the $\varepsilon I$ term corresponds to the additive Gaussian noise in the observation process. It can be shown [15] that the negative exponential distance from $I_t$ to the subspace spanned by $U$, i.e., $\exp(-\|(I_t - \mu) - U U^\top (I_t - \mu)\|^2)$, is proportional to $N(I_t; \mu, U U^\top + \varepsilon I)$ as $\varepsilon \to 0$.

Within a subspace, the likelihood of the projected sample can be modeled by the Mahalanobis distance from the mean as follows:
$$p_{d_w}(I_t \mid X_t) = N(I_t; \mu, U \Sigma^{-2} U^\top)$$
where $\mu$ is the center of the subspace and $\Sigma$ is the matrix of singular values corresponding to the columns of $U$. Put together, the likelihood of a sample being generated from the subspace is governed by
$$p(I_t \mid X_t) = p_{d_t}(I_t \mid X_t) \, p_{d_w}(I_t \mid X_t) = N(I_t; \mu, U U^\top + \varepsilon I) \, N(I_t; \mu, U \Sigma^{-2} U^\top). \quad (1)$$

Given a drawn sample $X_t$ and the corresponding image region $I_t$, we aim to compute $p(I_t \mid X_t)$ using (1). To minimize the effects of noisy pixels, we utilize a robust error norm [4], $\rho(x, \sigma) = \frac{x^2}{\sigma^2 + x^2}$, instead of the Euclidean norm $d(x) = \|x\|^2$, to ignore the "outlier" pixels (i.e., the pixels that are not likely to appear inside the target region given the current eigenspace). We use a method similar to that used in [4] in order to compute $d_t$ and $d_w$.
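The two distances and the robust norm can be illustrated with a short NumPy sketch. Several things here are assumptions rather than the authors' method: the function names are hypothetical, $d_w$ is read as the sum of squared subspace coefficients each divided by its squared singular value (the exact scaling in the paper may differ), and the robust variant simply applies $\rho$ to the residual of an ordinary least-squares projection, which is simpler than the robust coefficient estimation of [4].

```python
import numpy as np


def subspace_distances(patch, mean, U, s):
    """Return (d_t, d_w, residual) for a vectorised image patch.

    U: (d, k) orthonormal eigenbasis, s: (k,) singular values, mean: (d,) centre.
    """
    diff = patch - mean
    z = U.T @ diff                       # coordinates within the subspace
    residual = diff - U @ z              # part of the patch outside the subspace
    d_t = float(residual @ residual)     # squared distance-to-subspace
    d_w = float(np.sum((z / s) ** 2))    # assumed Mahalanobis-style distance
    return d_t, d_w, residual


def robust_norm(x, sigma):
    """rho(x, sigma) = x^2 / (sigma^2 + x^2); saturates at 1 for large |x|."""
    return x ** 2 / (sigma ** 2 + x ** 2)


def robust_distance_to_subspace(residual, sigma):
    """Per-pixel robust replacement for the squared Euclidean residual norm."""
    return float(np.sum(robust_norm(residual, sigma)))
```

Because $\rho$ is bounded by 1, a single badly mismatched pixel contributes at most 1 to the robust distance, whereas it can dominate the squared Euclidean residual.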
This robust error norm is especially helpful when we use a rectangular region to enclose the target (which inevitably contains some noisy background pixels).

4 Experiments

To test the performance of our proposed tracker, we collected a number of videos recorded in indoor and outdoor environments where the targets change pose in different lighting conditions. Each video consists of 320 × 240 gray scale images and is recorded at 15 frames per second unless specified otherwise. For the eigenspace representation, each target image region is resized to a 32 × 32 patch, and the number of eigenvectors used in all experiments is set to 16, though fewer eigenvectors may also work well. Implemented in MATLAB with MEX, our algorithm runs at 4 frames per second on a standard computer with 200 particles. We present some tracking results in this section; more tracking results as well as videos can be found at http://vision.ucsd.edu/~jwlim/ilt/.

4.1 Experimental Results

Figure 1 shows the tracking results using a challenging sequence recorded with a moving digital camera in which a person moves from a dark room toward a bright area while changing his pose, moving underneath spot lights, changing facial expressions and taking off glasses. All the eigenbases are constructed automatically from scratch and constantly updated to model the appearance of the target object while it undergoes appearance changes. Even with the significant camera motion and low frame rate (which makes the motion between frames larger, equivalent to tracking fast-moving objects), our tracker stays stably on the target throughout the sequence.

The second sequence contains an animal doll moving under different pose, scale, and lighting conditions, as shown in Figure 2. Experimental results demonstrate that our tracker is able to follow the target as it undergoes large pose change, cluttered background, and lighting variation.
Notice that the non-convex target object is localized with an enclosing rectangular window, and thus it inevitably contains some background pixels in its appearance representation. The robust error norm enables the tracker to ignore background pixels and estimate the target location correctly. The results also show that our algorithm faithfully models the appearance of the target, as shown in the eigenbases and reconstructed images, in the presence of noisy background pixels.

Figure 1: A person moves from a dark toward a bright area with large lighting and pose changes. The images in the second row show the current sample mean, tracked region, reconstructed image, and the reconstruction error respectively. The third and fourth rows show the 10 largest eigenbases.

Figure 2: An animal doll moving with large pose and lighting variation in a cluttered background.

We recorded a sequence to demonstrate that our tracker performs well in an outdoor environment where lighting conditions change drastically. The video was acquired while a person walked underneath a trellis covered by vines. As shown in Figure 3, the cast shadow changes the appearance of the target face drastically. Furthermore, the combined pose and lighting variation with low frame rate makes the tracking task extremely difficult. Nevertheless, the results show that our tracker follows the target accurately and robustly. Due to heavy shadows and drastic lighting change, other tracking methods based on gradient, contour, or color information are unlikely to perform well in this case.

PDF Details