Arrow Research search

Author name cluster

Liefeng Bo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers
2 author rows

Possible papers (34)

ICML Conference 2025 Conference Paper

ExtPose: Robust and Coherent Pose Estimation by Extending ViTs

  • Rongyu Chen
  • Li'an Zhuo
  • Linlin Yang
  • Qi Wang 0148
  • Liefeng Bo
  • Bang Zhang
  • Angela Yao

Vision Transformers (ViT) are remarkable at 3D pose estimation, yet they still encounter certain challenges. One issue is that the popular ViT architecture for pose estimation is limited to images and lacks temporal information. Another challenge is that the prediction often fails to maintain pixel alignment with the original images. To address these issues, we propose a systematic framework for 3D pose estimation, called ExtPose. ExtPose extends image ViT to the challenging scenario and video setting by taking in additional 2D pose evidence and capturing temporal information in a full attention-based manner. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across modalities and frames, we enhance the consistency between 3D poses and 2D videos without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks with substantial improvements to 34.0mm (-23%) on 3DPW and 4.9mm (-18%) on FreiHAND in PA-MPJPE over the other ViT-based methods respectively.

AAAI Conference 2024 Conference Paper

Evaluate Geometry of Radiance Fields with Low-Frequency Color Prior

  • Qihang Fang
  • Yafei Song
  • Keqiang Li
  • Li Shen
  • Huaiyu Wu
  • Gang Xiong
  • Liefeng Bo

A radiance field is an effective representation of 3D scenes, which has been widely adopted in novel-view synthesis and 3D reconstruction. It is still an open and challenging problem to evaluate the geometry, i.e., the density field, as the ground-truth is almost impossible to obtain. One alternative indirect solution is to transform the density field into a point-cloud and compute its Chamfer Distance with the scanned ground-truth. However, many widely-used datasets have no point-cloud ground-truth since the scanning process along with the equipment is expensive and complicated. To this end, we propose a novel metric, named Inverse Mean Residual Color (IMRC), which can evaluate the geometry only with the observation images. Our key insight is that the better the geometry, the lower-frequency the computed color field. From this insight, given a reconstructed density field and observation images, we design a closed-form method to approximate the color field with low-frequency spherical harmonics, and compute the inverse mean residual color. Then the higher the IMRC, the better the geometry. Qualitative and quantitative experimental results verify the effectiveness of our proposed IMRC metric. We also benchmark several state-of-the-art methods using IMRC to promote future related research. Our code is available at https://github.com/qihangGH/IMRC.
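The core of the IMRC metric can be sketched in a few lines: fit a low-order spherical-harmonic basis to the colors a point shows from different view directions, and score geometry by the inverse of the mean residual. The sketch below is illustrative only, not the paper's implementation; the function name and order-1 basis truncation are assumptions for this example.

```python
import numpy as np

def imrc_sketch(view_dirs, colors, sh_order=1):
    """Toy version of Inverse Mean Residual Color (IMRC).

    Fits a low-frequency spherical-harmonic (SH) basis to the colors
    observed for one point from several view directions, then returns
    the inverse of the mean residual. Better geometry yields colors
    that are well explained by low-frequency SH, hence a higher score.
    """
    d = view_dirs / np.linalg.norm(view_dirs, axis=1, keepdims=True)
    # Real SH basis up to order 1: the constant term plus 3 linear terms.
    basis = [np.ones(len(d))]
    if sh_order >= 1:
        basis += [d[:, 0], d[:, 1], d[:, 2]]
    B = np.stack(basis, axis=1)                      # (n_views, n_basis)
    coef, *_ = np.linalg.lstsq(B, colors, rcond=None)
    residual = np.abs(colors - B @ coef).mean()      # mean residual color
    return 1.0 / (residual + 1e-9)
```

A color that varies smoothly with view direction scores far higher than a view-incoherent one, matching the insight that better geometry implies a lower-frequency color field.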

NeurIPS Conference 2024 Conference Paper

GIC: Gaussian-Informed Continuum for Physical Property Identification and Simulation

  • Junhao Cai
  • Yuji Yang
  • Weihao Yuan
  • Yisheng He
  • Zilong Dong
  • Liefeng Bo
  • Hui Cheng
  • Qifeng Chen

This paper studies the problem of estimating physical properties (system identification) through visual observations. To facilitate geometry-aware guidance in physical property estimation, we introduce a novel hybrid framework that leverages 3D Gaussian representation to not only capture explicit shapes but also enable the simulated continuum to render object masks as 2D shape surrogates during training. We propose a new dynamic 3D Gaussian framework based on motion factorization to recover the object as 3D Gaussian point sets across different time states. Furthermore, we develop a coarse-to-fine filling strategy to generate the density fields of the object from the Gaussian reconstruction, allowing for the extraction of object continuums along with their surfaces and the integration of Gaussian attributes into these continua. In addition to the extracted object surfaces, the Gaussian-informed continuum also enables the rendering of object masks during simulations, serving as 2D-shape guidance for physical property estimation. Extensive experimental evaluations demonstrate that our pipeline achieves state-of-the-art performance across multiple benchmarks and metrics. Additionally, we illustrate the effectiveness of the proposed method through real-world demonstrations, showcasing its practical utility. Our project page is at https://jukgei.github.io/project/gic.

NeurIPS Conference 2024 Conference Paper

MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling

  • Weihao Yuan
  • Yisheng He
  • Weichao Shen
  • Yuan Dong
  • Xiaodong Gu
  • Zilong Dong
  • Liefeng Bo
  • Qixing Huang

Motion generation from discrete quantization offers many advantages over continuous regression, but at the cost of inevitable approximation errors. Previous methods usually quantize the entire body pose into one code, which not only faces the difficulty in encoding all joints within one vector but also loses the spatial relationship between different joints. Differently, in this work we quantize each individual joint into one vector, which i) simplifies the quantization process as the complexity associated with a single joint is markedly lower than that of the entire pose; ii) maintains a spatial-temporal structure that preserves both the spatial relationships among joints and the temporal movement patterns; iii) yields a 2D token map, which enables the application of various 2D operations widely used in 2D images. Grounded in the 2D motion quantization, we build a spatial-temporal modeling framework, where 2D joint VQVAE, temporal-spatial 2D masking technique, and spatial-temporal 2D attention are proposed to take advantage of spatial-temporal signals among the 2D tokens. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, with a $26.6\%$ decrease of FID on HumanML3D and a $29.9\%$ decrease on KIT-ML.
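The per-joint quantization idea can be illustrated with a plain nearest-neighbor quantizer (the paper uses a learned VQVAE; this toy stand-in, with hypothetical names, only shows why quantizing per joint yields a 2D token map):

```python
import numpy as np

def quantize_motion(motion, codebook):
    """Quantize each joint at each frame to its nearest codebook entry.

    motion:   (frames, joints, dim) array of per-joint features
    codebook: (codes, dim) array of code vectors

    Returns a (frames, joints) integer token map: quantizing per joint,
    rather than one code per whole pose, keeps the spatial axis, so the
    result is a 2D grid that 2D operations (masking, attention,
    convolution) can act on directly.
    """
    frames, joints, dim = motion.shape
    flat = motion.reshape(-1, dim)                       # (frames*joints, dim)
    # Squared distance from every joint feature to every code, argmin per row.
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)
    return tokens.reshape(frames, joints)                # 2D token map
```

A whole-pose quantizer would instead flatten the joint axis away and return a 1D token sequence, losing the spatial structure the abstract emphasizes.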

NeurIPS Conference 2023 Conference Paper

Reducing Shape-Radiance Ambiguity in Radiance Fields with a Closed-Form Color Estimation Method

  • Qihang Fang
  • Yafei Song
  • Keqiang Li
  • Liefeng Bo

A neural radiance field (NeRF) enables the synthesis of cutting-edge realistic novel view images of a 3D scene. It includes density and color fields to model the shape and radiance of a scene, respectively. Supervised by the photometric loss in an end-to-end training manner, NeRF inherently suffers from the shape-radiance ambiguity problem, i.e., it can perfectly fit training views but does not guarantee decoupling the two fields correctly. To deal with this issue, existing works have incorporated prior knowledge to provide an independent supervision signal for the density field, including total variation loss, sparsity loss, distortion loss, etc. These losses are based on general assumptions about the density field, e.g., it should be smooth, sparse, or compact, which are not adaptive to a specific scene. In this paper, we propose a more adaptive method to reduce the shape-radiance ambiguity. The key is a rendering method that is only based on the density field. Specifically, we first estimate the color field based on the density field and posed images in a closed form. Then NeRF's rendering process can proceed. We address the problems in estimating the color field, including occlusion and non-uniformly distributed views. Afterwards, it is applied to regularize NeRF's density field. As our regularization is guided by photometric loss, it is more adaptive compared to existing ones. Experimental results show that our method improves the density field of NeRF both qualitatively and quantitatively. Our code is available at https://github.com/qihangGH/Closed-form-color-field.

AAAI Conference 2021 Conference Paper

Graph-Enhanced Multi-Task Learning of Multi-Level Transition Dynamics for Session-based Recommendation

  • Chao Huang
  • Jiahui Chen
  • Lianghao Xia
  • Yong Xu
  • Peng Dai
  • Yanqing Chen
  • Liefeng Bo
  • Jiashu Zhao

Session-based recommendation plays a central role in a wide spectrum of online applications, ranging from e-commerce to online advertising services. However, the majority of existing session-based recommendation techniques (e.g., attention-based recurrent networks or graph neural networks) are not well-designed for capturing the complex transition dynamics exhibited with temporally-ordered and multi-level interdependent relation structures. These methods largely overlook the relation hierarchy of item transitional patterns. In this paper, we propose a multi-task learning framework with Multi-level Transition Dynamics (MTD), which enables the joint learning of intra- and inter-session item transition dynamics in an automatic and hierarchical manner. Towards this end, we first develop a position-aware attention mechanism to learn item transitional regularities within individual sessions. Then, a graph-structured hierarchical relation encoder is proposed to explicitly capture the cross-session item transitions in the form of high-order connectivities by performing embedding propagation with the global graph context. The learning processes of intra- and inter-session transition dynamics are integrated to preserve the underlying low- and high-level item relationships in a common latent space. Extensive experiments on three real-world datasets demonstrate the superiority of MTD as compared to state-of-the-art baselines.

AAAI Conference 2021 Conference Paper

Knowledge-aware Coupled Graph Neural Network for Social Recommendation

  • Chao Huang
  • Huance Xu
  • Yong Xu
  • Peng Dai
  • Lianghao Xia
  • Mengyin Lu
  • Liefeng Bo
  • Hao Xing

The social recommendation task aims to predict users’ preferences over items with the incorporation of social connections among users, so as to alleviate the sparsity issue of collaborative filtering. While many recent efforts show the effectiveness of neural network-based social recommender systems, several important challenges have not been well addressed yet: (i) The majority of models only consider users’ social connections, while ignoring the inter-dependent knowledge across items; (ii) Most existing solutions are designed for a singular type of user-item interactions, making them infeasible to capture the interaction heterogeneity; (iii) The dynamic nature of user-item interactions has been less explored in many social-aware recommendation techniques. To tackle the above challenges, this work proposes a Knowledge-aware Coupled Graph Neural Network (KCGN) that jointly injects the inter-dependent knowledge across items and users into the recommendation framework. KCGN enables high-order user- and item-wise relation encoding by exploiting the mutual information for global graph structure awareness. Additionally, we further augment KCGN with the capability of capturing dynamic multi-typed user-item interactive patterns. Experimental studies on real-world datasets show the effectiveness of our method against many strong baselines in a variety of settings. Source codes are available at: https://github.com/xhcdream/KCGN.

AAAI Conference 2021 Conference Paper

Knowledge-Enhanced Hierarchical Graph Transformer Network for Multi-Behavior Recommendation

  • Lianghao Xia
  • Chao Huang
  • Yong Xu
  • Peng Dai
  • Xiyue Zhang
  • Hongsheng Yang
  • Jian Pei
  • Liefeng Bo

Accurate user and item embedding learning is crucial for modern recommender systems. However, most existing recommendation techniques have thus far focused on modeling users’ preferences over a singular type of user-item interactions. Many practical recommendation scenarios involve multi-typed user interactive behaviors (e.g., page view, add-to-favorite and purchase), which presents unique challenges that cannot be handled by current recommendation solutions. In particular: i) complex inter-dependencies across different types of user behaviors; ii) the incorporation of knowledge-aware item relations into the multi-behavior recommendation framework; iii) dynamic characteristics of multi-typed user-item interactions. To tackle these challenges, this work proposes a Knowledge-Enhanced Hierarchical Graph Transformer Network (KHGT), to investigate multi-typed interactive patterns between users and items in recommender systems. Specifically, KHGT is built upon a graph-structured neural architecture to i) capture type-specific behavior characteristics; ii) explicitly discriminate which types of user-item interactions are more important in assisting the forecasting task on the target behavior. Additionally, we further integrate the graph attention layer with the temporal encoding strategy, to empower the learned embeddings to be reflective of both dedicated multiplex user-item and item-item relations, as well as the underlying interaction dynamics. Extensive experiments conducted on three real-world datasets show that KHGT consistently outperforms many state-of-the-art recommendation methods across various evaluation settings. Our implementation code is available at https://github.com/akaxlh/KHGT.

IJCAI Conference 2021 Conference Paper

Spatial-Temporal Sequential Hypergraph Network for Crime Prediction with Dynamic Multiplex Relation Learning

  • Lianghao Xia
  • Chao Huang
  • Yong Xu
  • Peng Dai
  • Liefeng Bo
  • Xiyue Zhang
  • Tianyi Chen

Crime prediction is crucial for public safety and resource optimization, yet is very challenging due to two aspects: i) the dynamics of criminal patterns across time and space, as crime events are distributed unevenly on both the spatial and temporal domains; ii) time-evolving dependencies between different types of crimes (e.g., Theft, Robbery, Assault, Damage) which reveal fine-grained semantics of crimes. To tackle these challenges, we propose the Spatial-Temporal Sequential Hypergraph Network (ST-SHN) to collectively encode complex crime spatial-temporal patterns as well as the underlying category-wise crime semantic relationships. Specifically, to handle spatial-temporal dynamics under the long-range and global context, we design a graph-structured message passing architecture with the integration of the hypergraph learning paradigm. To capture category-wise crime heterogeneous relations in a dynamic environment, we introduce a multi-channel routing mechanism to learn the time-evolving structural dependency across crime types. We conduct extensive experiments on two real-world datasets, showing that our proposed ST-SHN framework can significantly improve the prediction performance as compared to various state-of-the-art baselines. The source code is available at https://github.com/akaxlh/ST-SHN.

AAAI Conference 2021 Conference Paper

Traffic Flow Forecasting with Spatial-Temporal Graph Diffusion Network

  • Xiyue Zhang
  • Chao Huang
  • Yong Xu
  • Lianghao Xia
  • Peng Dai
  • Liefeng Bo
  • Junbo Zhang
  • Yu Zheng

Accurate forecasting of citywide traffic flow plays a critical role in a variety of spatial-temporal mining applications, such as intelligent traffic control and public risk assessment. While previous work has made significant efforts to learn traffic temporal dynamics and spatial dependencies, two key limitations exist in current models. First, only the neighboring spatial correlations among adjacent regions are considered in most existing methods, and the global inter-region dependency is ignored. Additionally, these methods fail to encode the complex traffic transition regularities, which are time-dependent and multi-resolution in nature. To tackle these challenges, we develop a new traffic prediction framework–Spatial-Temporal Graph Diffusion Network (ST-GDN). In particular, ST-GDN is a hierarchically structured graph neural architecture which learns not only the local region-wise geographical dependencies, but also the spatial semantics from a global perspective. Furthermore, a multi-scale attention network is developed to empower ST-GDN with the capability of capturing multi-level temporal dynamics. Experiments on several real-life traffic datasets demonstrate that ST-GDN outperforms different types of state-of-the-art baselines. Source codes of implementations are available at https://github.com/jill001/ST-GDN.

IJCAI Conference 2020 Conference Paper

Cross-Interaction Hierarchical Attention Networks for Urban Anomaly Prediction

  • Chao Huang
  • Chuxu Zhang
  • Peng Dai
  • Liefeng Bo

Predicting anomalies (e.g., blocked driveways and vehicle collisions) in urban space plays an important role in assisting governments and communities in building smart city applications, ranging from intelligent transportation to public safety. However, predicting urban anomalies is not trivial due to the following two factors: i) The sequential transition regularities of anomaly occurrences are complex, exhibiting high-order and dynamic correlations. ii) The interactions between region, time and anomaly category are multi-dimensional in real-world urban anomaly forecasting scenarios. How to fuse multiple relations from spatial, temporal and categorical dimensions in the predictive framework remains a significant challenge. To address these two challenges, we propose a Cross-Interaction Hierarchical Attention network model (CHAT) which uncovers the dynamic occurrence patterns of time-stamped urban anomaly data. Our CHAT framework can automatically capture the relevance of past anomaly occurrences across different time steps, and discriminate which types of cross-modal interactions are more important for making future predictions. Experimental results demonstrate the superiority of the CHAT framework over state-of-the-art baselines.

ICRA Conference 2020 Conference Paper

Efficient Pig Counting in Crowds with Keypoints Tracking and Spatial-aware Temporal Response Filtering

  • Guang Chen
  • Shiwen Shen
  • Longyin Wen
  • Si Luo
  • Liefeng Bo

Pig counting is a crucial task for large-scale pig farming. Pigs are usually visually counted by humans, but this process is very time-consuming and error-prone. Few studies in the literature have developed automated pig counting methods. Existing works focused only on pig counting using a single image, and their accuracy faced challenges due to pig movement, occlusion and overlapping. In particular, the field of view of a single image is very limited and cannot meet the needs of pig counting for large pig grouping houses. To address these challenges, we presented a real-time automated pig counting system in crowds using only one monocular fisheye camera with an inspection robot. Our system achieved performance superior to humans. Our pipeline began with a novel bottom-up pig detection algorithm to avoid false negatives due to overlapping, occlusion and deformable pig shapes. This detection combined a deep convolutional neural network (CNN) for pig body part keypoint detection with a keypoint association method to identify individual pigs. It then employed an efficient online tracking method to associate pigs across image frames. Finally, pig counts were estimated by a novel spatial-aware temporal response filtering (STRF) method to suppress false positives caused by pig or camera movements or tracking failures. The whole pipeline has been deployed on an edge computing device and demonstrated its effectiveness.

ECAI Conference 2020 Conference Paper

Joint Modeling of Local and Global Behavior Dynamics for Session-Based Recommendation

  • Yong Xu 0007
  • Jiahui Chen
  • Chao Huang 0001
  • Bo Zhang
  • Hao Xing
  • Peng Dai 0001
  • Liefeng Bo

Session-based recommendation is critical in modern recommender systems, which aims to predict the next interested item given anonymous behavior sequences of users. While prior works have made efforts to address the session-based recommendation problem, two significant limitations exist: i) They ignore the fact that items may be correlated with each other across different session units; ii) Existing solutions are also limited by their assumption of a rigidly ordered pattern of intra-session item transitions, which may not hold in practice. To address the above limitations, we propose a Local-Global Session-based Recommendation framework–LGSR, which generalizes the modeling of behavior dynamics from two perspectives: we first design a cross-session item dependency encoder to learn the inter-session item relation structures from a global perspective. Additionally, a dual-stage attentive aggregation module is developed to capture local item transition dynamics, without the restriction of a rigid sequential process, for jointly modeling the user’s current interest and intra-session purpose. With the exploration of both complex intra- and inter-session interest transitional regularities, our LGSR model enables the representation learning of user behavior dynamics via jointly mapping local and global signals into the same latent space. The experimental results on two real-world datasets demonstrate the superiority of the proposed LGSR framework over state-of-the-art methods.

ICRA Conference 2014 Conference Paper

Hierarchical sparse coded surface models

  • Michael Ruhnke
  • Liefeng Bo
  • Dieter Fox
  • Wolfram Burgard

In this paper, we describe a novel approach to construct textured 3D environment models in a hierarchical fashion based on local surface patches. Compared to previous approaches, the hierarchy enables our method to represent the environment with differently sized surface patches. The reconstruction scheme starts at a coarse resolution with large patches and in an iterative fashion uses the reconstruction error to guide the decision as to whether the resolution should be refined. This leads to variable resolution models that represent areas with few variations at low resolution and areas with large variations at high resolution. In addition, we compactly describe local surface attributes via sparse coding based on an overcomplete dictionary. In this way, we additionally exploit similarities in structure and texture, which leads to compact models. We learn the dictionary directly from the input data and independently for every level in the hierarchy in an unsupervised fashion. Practical experiments with large-scale datasets demonstrate that our method compares favorably with two state-of-the-art techniques while being comparable in accuracy.
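The error-guided refinement loop described above can be sketched as a quadtree over a 2D grid. This is a toy analogue under assumed interfaces (a plain array standing in for surface data, mean-approximation error standing in for reconstruction error), not the paper's sparse-coding pipeline:

```python
import numpy as np

def refine(grid, x, y, size, threshold, min_size=1):
    """Error-guided quadtree refinement over a 2D height grid.

    A patch is kept at the current resolution if its values are well
    approximated by their mean (low "reconstruction error"); otherwise
    it is split into four smaller patches. Flat areas stay coarse,
    detailed areas refine, giving a variable-resolution model.
    Returns a list of (x, y, size) patches.
    """
    region = grid[y:y + size, x:x + size]
    error = np.abs(region - region.mean()).mean()
    if error <= threshold or size <= min_size:
        return [(x, y, size)]
    half = size // 2
    patches = []
    for dy in (0, half):
        for dx in (0, half):
            patches += refine(grid, x + dx, y + dy, half, threshold, min_size)
    return patches
```

On a flat grid this returns a single coarse patch; introducing variation in one quadrant refines only that quadrant, mirroring the variable-resolution behavior the abstract describes.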

AAAI Conference 2014 Conference Paper

Learning from Unscripted Deictic Gesture and Language for Human-Robot Interactions

  • Cynthia Matuszek
  • Liefeng Bo
  • Luke Zettlemoyer
  • Dieter Fox

As robots become more ubiquitous, it is increasingly important for untrained users to be able to interact with them intuitively. In this work, we investigate how people refer to objects in the world during relatively unstructured communication with robots. We collect a corpus of deictic interactions from users describing objects, which we use to train language and gesture models that allow our robot to determine what objects are being indicated. We introduce a temporal extension to state-of-the-art hierarchical matching pursuit features to support gesture understanding, and demonstrate that combining multiple communication modalities captures user intent more effectively than relying on a single type of input. Finally, we present initial interactions with a robot that uses the learned models to follow commands.

ICRA Conference 2014 Conference Paper

Learning to identify new objects

  • Yuyin Sun
  • Liefeng Bo
  • Dieter Fox

Identifying objects based on language descriptions is an important capability for robots interacting with people in everyday environments. People naturally use attributes and names to refer to objects of interest. Due to the complexity of indoor environments and the fact that people use various ways to refer to objects, a robot frequently encounters new objects or object names. To deal with such situations, a robot must be able to continuously grow its object knowledge base. In this work we introduce a system that organizes objects and names in a semantic hierarchy. Similarity between name words is learned via a hierarchy embedded vector representation. The hierarchy enables reasoning about unknown objects and names. Novel objects are inserted automatically into the knowledge base, where the exact location in the hierarchy is determined by asking a user questions. The questions are informed by the current hierarchy and the appearance of the object. Experiments demonstrate that the learned representation captures the meaning of names and is helpful for object identification with new names.

ICRA Conference 2014 Conference Paper

ST-HMP: Unsupervised Spatio-Temporal feature learning for tactile data

  • Marianna Madry
  • Liefeng Bo
  • Danica Kragic
  • Dieter Fox

Tactile sensing plays an important role in robot grasping and object recognition. In this work, we propose a new descriptor named Spatio-Temporal Hierarchical Matching Pursuit (ST-HMP) that captures properties of a time series of tactile sensor measurements. It is based on the concept of unsupervised hierarchical feature learning realized using sparse coding. The ST-HMP extracts rich spatio-temporal structures from raw tactile data without the need to predefine discriminative data characteristics. We apply it to two different applications: (1) grasp stability assessment and (2) object instance recognition, presenting its universal properties. An extensive evaluation on several synthetic and real datasets collected using the Schunk Dexterous, Schunk Parallel and iCub hands shows that our approach outperforms previously published results by a large margin.

ICRA Conference 2014 Conference Paper

Unsupervised feature learning for 3D scene labeling

  • Kevin Lai 0001
  • Liefeng Bo
  • Dieter Fox

This paper presents an approach for labeling objects in 3D scenes. We introduce HMP3D, a hierarchical sparse coding technique for learning features from 3D point cloud data. HMP3D classifiers are trained using a synthetic dataset of virtual scenes generated using CAD models from an online database. Our scene labeling system combines features learned from raw RGB-D images and 3D point clouds directly, without any hand-designed features, to assign an object label to every 3D point in the scene. Experiments on the RGB-D Scenes Dataset v.2 demonstrate that the proposed approach can be used to label indoor scenes containing both small tabletop objects and large furniture pieces.

ICRA Conference 2013 Conference Paper

Attribute based object identification

  • Yuyin Sun
  • Liefeng Bo
  • Dieter Fox

Over the last few years, the robotics community has made substantial progress in detection and 3D pose estimation of known and unknown objects. However, the question of how to identify objects based on language descriptions has not been investigated in detail. While the computer vision community recently started to investigate the use of attributes for object recognition, these approaches do not consider the task settings typically observed in robotics, where a combination of appearance attributes and object names might be used in referral language to identify specific objects in a scene. In this paper, we introduce an approach for identifying objects based on natural language containing appearance and name attributes. To learn rich RGB-D features needed for attribute classification, we extend recently introduced sparse coding techniques so as to automatically learn attribute-dependent features. We introduce a large data set of attribute descriptions of objects in the RGB-D object dataset. Experiments on this data set demonstrate the strong performance of our approach to language based object identification. We also show that our attribute-dependent features provide significantly better generalization to previously unseen attribute values, thereby enabling more rapid learning of new attribute values.

AAAI Conference 2013 Conference Paper

Compact RGBD Surface Models Based on Sparse Coding

  • Michael Ruhnke
  • Liefeng Bo
  • Dieter Fox
  • Wolfram Burgard

In this paper, we describe a novel approach to construct compact colored 3D environment models representing local surface attributes via sparse coding. Our method decomposes a set of colored point clouds into local surface patches and encodes them based on an overcomplete dictionary. Instead of storing the entire point cloud, we store a dictionary, surface patch positions, and a sparse code description of the depth and RGB attributes for every patch. The dictionary is learned in an unsupervised way from surface patches sampled from indoor maps. We show that better dictionaries can be learned by extending the K-SVD method with a binary weighting scheme that ignores undefined surface cells. Through experimental evaluation on real world laser and RGBD datasets we demonstrate that our method produces compact and accurate models. Furthermore, we clearly outperform an existing state of the art method in terms of compactness, accuracy, and computation time. Additionally, we demonstrate that our sparse code descriptions can be utilized for other important tasks such as object detection.

ICRA Conference 2012 Conference Paper

Detection-based object labeling in 3D scenes

  • Kevin Lai 0001
  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

We propose a view-based approach for labeling objects in 3D scenes reconstructed from RGB-D (color+depth) videos. We utilize sliding window detectors trained from object views to assign class probabilities to pixels in every RGB-D frame. These probabilities are projected into the reconstructed 3D scene and integrated using a voxel representation. We perform efficient inference on a Markov Random Field over the voxels, combining cues from view-based detection and 3D shape, to label the scene. Our detection-based approach produces accurate scene labeling on the RGB-D Scenes Dataset and improves the robustness of object detection.
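The per-voxel integration step can be sketched as a naive-Bayes style product of the class distributions projected from each frame. This is an illustrative stand-in (function name assumed, and the paper's MRF smoothing over neighboring voxels is omitted):

```python
import numpy as np

def fuse_voxel_labels(frame_probs):
    """Fuse per-frame class probabilities for one voxel.

    frame_probs: (frames, classes) array, each row the class
    distribution a sliding-window detector assigned to the pixel this
    voxel projects to in that frame. The distributions are combined in
    log space (a product of per-view likelihoods) and renormalized, so
    consistent evidence across views sharpens the voxel's label.
    """
    log_p = np.log(np.clip(frame_probs, 1e-9, 1.0)).sum(axis=0)
    p = np.exp(log_p - log_p.max())   # subtract max for numerical stability
    return p / p.sum()
```

Two views that each mildly favor the same class yield a fused distribution that favors it more strongly than either view alone.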

NeurIPS Conference 2012 Conference Paper

Discriminatively Trained Sparse Code Gradients for Contour Detection

  • Ren Xiaofeng
  • Liefeng Bo

Finding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most existing approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD and Orthogonal Matching Pursuit for efficient dictionary learning and encoding, and use multi-scale pooling and power transforms to code oriented local neighborhoods before computing gradients and applying linear SVM. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and find contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth images and surface normals lead to promising contour detection using depth and depth+color, as verified on the NYU Depth Dataset. Our work combines the concept of oriented gradients with sparse representation and opens up future possibilities for learning contour detection and segmentation.

NeurIPS Conference 2012 Conference Paper

Unsupervised Template Learning for Fine-Grained Object Recognition

  • Shulin Yang
  • Liefeng Bo
  • Jue Wang
  • Linda Shapiro

Fine-grained recognition refers to a subordinate level of recognition, such as recognizing different species of birds, animals or plants. It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape or structure shared within a category, and the differences are in the details of the object parts. We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. We propose a template model for the purpose, which captures common shape patterns of object parts, as well as the co-occurrence relation of the shape patterns. Once the image regions are aligned, extracted features are used for classification. Learning of the template model is efficient, and the recognition results we achieve significantly outperform the state-of-the-art algorithms.

ICRA Conference 2011 Conference Paper

A large-scale hierarchical multi-view RGB-D object dataset

  • Kevin Lai 0001
  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

Over the last decade, the availability of public image repositories and recognition benchmarks has enabled rapid progress in visual object category and instance detection. Today we are witnessing the birth of a new generation of sensing technologies capable of providing high quality synchronized videos of both color and depth, the RGB-D (Kinect-style) camera. With its advanced sensing capabilities and the potential for mass adoption, this technology represents an opportunity to dramatically increase robotic object recognition, manipulation, navigation, and interaction capabilities. In this paper, we introduce a large-scale, hierarchical multi-view object dataset collected using an RGB-D camera. The dataset contains 300 objects organized into 51 categories and has been made publicly available to the research community so as to enable rapid progress based on this promising technology. This paper describes the dataset collection procedure and introduces techniques for RGB-D based object recognition and detection, demonstrating that combining color and depth information substantially improves quality of results.

AAAI Conference 2011 Conference Paper

A Scalable Tree-Based Approach for Joint Object and Pose Recognition

  • Kevin Lai
  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

Recognizing possibly thousands of objects is a crucial capability for an autonomous agent to understand and interact with everyday environments. Practical object recognition comes in multiple forms: Is this a coffee mug? (category recognition). Is this Alice’s coffee mug? (instance recognition). Is the mug with the handle facing left or right? (pose recognition). We present a scalable framework, Object-Pose Tree, which efficiently organizes data into a semantically structured tree. The tree structure enables both scalable training and testing, allowing us to solve recognition over thousands of object poses in near real-time. Moreover, by simultaneously optimizing all three tasks, our approach outperforms standard nearest neighbor and 1-vs-all classifications, with large improvements on pose recognition. We evaluate the proposed technique on a dataset of 300 household objects collected using a Kinect-style 3D camera. Experiments demonstrate that our system achieves robust and efficient object category, instance, and pose recognition on challenging everyday objects.

IROS Conference 2011 Conference Paper

Depth kernel descriptors for object recognition

  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

Consumer depth cameras, such as the Microsoft Kinect, are capable of providing frames of dense depth values in real time. One fundamental question in utilizing depth cameras is how to best extract features from depth frames. Motivated by local descriptors on images, in particular kernel descriptors, we develop a set of kernel features on depth images that model size, 3D shape, and depth edges in a single framework. Through extensive experiments on object recognition, we show that (1) our local features capture different aspects of cues from a depth frame/view that complement one another; (2) our kernel features significantly outperform traditional 3D features (e.g., spin images); and (3) we significantly improve the capabilities of depth and RGB-D (color+depth) recognition, achieving 10–15% improvement in accuracy over the state of the art.

ICRA Conference 2011 Conference Paper

Gambit: An autonomous chess-playing robotic system

  • Cynthia Matuszek
  • Brian Mayton
  • Roberto Aimi
  • Marc Peter Deisenroth
  • Liefeng Bo
  • Robert Chu
  • Mike Kung
  • Louis LeGrand

This paper presents Gambit, a custom, mid-cost 6-DoF robot manipulator system that can play physical board games against human opponents in non-idealized environments. Historically, unconstrained robotic manipulation in board games has often proven to be more challenging than the underlying game reasoning, making it an ideal testbed for small-scale manipulation. The Gambit system includes a low-cost Kinect-style visual sensor, a custom manipulator, and state-of-the-art learning algorithms for automatic detection and recognition of the board and objects on it. As a use-case, we describe playing chess quickly and accurately with arbitrary, uninstrumented boards and pieces, demonstrating that Gambit's engineering and design represent a new state-of-the-art in fast, robust tabletop manipulation.

NeurIPS Conference 2011 Conference Paper

Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports).
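The spatial pyramid max pooling module above can be illustrated with a short numpy sketch. This is a hedged approximation, not the paper's implementation: the grid sizes, pyramid levels, and function name are hypothetical, and the matching pursuit encoder that would produce the codes is omitted:

```python
import numpy as np

def spatial_pyramid_max_pool(codes, levels=(1, 2)):
    """Max-pool a grid of per-patch codes over a spatial pyramid.

    codes  : (H, W, m) sparse codes laid out on the image grid
    levels : pyramid levels; level n splits the grid into n x n cells
    Returns the concatenated per-cell max-pooled features.
    """
    H, W, m = codes.shape
    pooled = []
    for n in levels:
        row_cells = np.array_split(np.arange(H), n)
        col_cells = np.array_split(np.arange(W), n)
        for rows in row_cells:
            for cols in col_cells:
                cell = codes[np.ix_(rows, cols)]
                # Max of absolute responses within the cell, per code dim.
                pooled.append(np.abs(cell).max(axis=(0, 1)))
    return np.concatenate(pooled)

# A 4x4 grid of 3-dimensional codes -> (1 + 4) cells x 3 dims = 15 features.
rng = np.random.default_rng(0)
grid = rng.normal(size=(4, 4, 3))
feat = spatial_pyramid_max_pool(grid)
```

Max pooling over pyramid cells keeps coarse spatial layout while being invariant to small translations of the underlying patches.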

ICRA Conference 2011 Conference Paper

Sparse distance learning for object recognition combining RGB and depth information

  • Kevin Lai 0001
  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

In this work we address joint object category and instance recognition in the context of RGB-D (depth) cameras. Motivated by local distance learning, where a novel view of an object is compared to individual views of previously seen objects, we define a view-to-object distance where a novel view is compared simultaneously to all views of a previous object. This novel distance is based on a weighted combination of feature differences between views. We show, through jointly learning per-view weights, that this measure leads to superior classification performance on object category and instance recognition. More importantly, the proposed distance allows us to find a sparse solution via Group-Lasso regularization, where a small subset of representative views of an object is identified and used, with the rest discarded. This significantly reduces computational cost without compromising recognition accuracy. We evaluate the proposed technique, Instance Distance Learning (IDL), on the RGB-D Object Dataset, which consists of 300 object instances in 51 everyday categories and about 250,000 views of objects with both RGB color and depth. We empirically compare IDL to several alternative state-of-the-art approaches and also validate the use of visual and shape cues and their combination.
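A minimal sketch of the view-to-object distance idea follows. This is an assumption-heavy illustration, not IDL itself: the weights are fixed rather than learned, the per-view feature difference is reduced to a plain Euclidean distance, and all names and numbers are hypothetical; only the pruning effect of zeroed (Group-Lasso-style) view weights is shown:

```python
import numpy as np

def view_to_object_distance(x, object_views, view_weights):
    """Compare a novel view x to all stored views of one object at once.

    object_views : (V, d) feature vectors of the object's stored views
    view_weights : (V,) nonnegative per-view weights; zeros mean the view
                   was pruned and need not be stored or compared at all
    """
    diffs = np.linalg.norm(object_views - x, axis=1)   # distance to each view
    active = view_weights > 0                          # skip pruned views
    return float(view_weights[active] @ diffs[active])

views = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
weights = np.array([0.7, 0.3, 0.0])    # third view zeroed out entirely
d = view_to_object_distance(np.array([1.0, 0.0]), views, weights)
```

Because pruned views contribute nothing, they can be discarded from storage, which is the computational saving the abstract describes.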

NeurIPS Conference 2010 Conference Paper

Kernel Descriptors for Visual Recognition

  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT and HOG, are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA). Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.
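The KPCA compression step above can be sketched with plain numpy. This is a generic kernel-PCA sketch under assumed names and data, not the paper's pipeline: the basis patches, the RBF stand-in for a match kernel, and all function names are hypothetical:

```python
import numpy as np

def kpca_features(K, K_new, n_components):
    """Project query points onto the top principal directions in kernel space.

    K      : (n, n) kernel matrix over a set of basis points
    K_new  : (q, n) kernel values between query points and the basis
    Returns (q, n_components) compact descriptors.
    """
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    # Center the kernel matrix in feature space.
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    alphas = vecs[:, order] / np.sqrt(np.maximum(vals[order], 1e-12))
    # Center the query rows consistently with the basis, then project.
    one_q = np.ones((K_new.shape[0], n)) / n
    Kq = K_new - one_q @ K - K_new @ one_n + one_q @ K @ one_n
    return Kq @ alphas

def rbf(A, B, gamma=0.5):
    """Gaussian kernel, standing in for a patch-level match kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
basis = rng.normal(size=(20, 2))
K = rbf(basis, basis)
desc = kpca_features(K, rbf(basis, basis), n_components=5)  # 5-dim descriptors
```

The point of the construction is that an expensive kernel evaluation against every basis point is replaced by a fixed low-dimensional descriptor whose inner products approximate the kernel.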

NeurIPS Conference 2009 Conference Paper

Conditional Neural Fields

  • Jian Peng
  • Liefeng Bo
  • Jinbo Xu

Conditional random fields (CRF) are quite successful on sequence labeling tasks such as natural language processing and biological sequence analysis. CRF models use linear potential functions to represent the relationship between input features and outputs. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and outputs is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input features and outputs we propose Conditional Neural Fields (CNF), a new conditional probabilistic graphical model for sequence labeling. Our CNF model extends CRF by adding one middle layer (or possibly several) between input features and outputs. The middle layer consists of a number of hidden parameterized gates, each acting as a local neural network node or feature extractor to capture the nonlinear relationship between input features and outputs. Therefore, conceptually this CNF model is much more expressive than the linear CRF model. To better control the complexity of the CNF model, we also present a hyperparameter optimization procedure within the evidence framework. Experiments on two widely-used benchmarks indicate that this CNF model performs significantly better than a number of popular methods. In particular, our CNF model is the best among about ten machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.
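The gate layer that distinguishes CNF from CRF can be shown in a toy numpy sketch. This is an illustration of the general idea only, with hand-picked (not learned) parameters and hypothetical names; it computes a single node potential, not the full sequence model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cnf_potential(x, U, c, w):
    """CNF-style potential: hidden gates sigmoid(U^T x + c) feed a linear
    output layer, so the potential can be a nonlinear function of x."""
    return w @ sigmoid(U.T @ x + c)

# Hand-picked gates that make the potential behave like XOR on two binary
# features -- a relationship no linear CRF potential w @ x can represent.
U = np.array([[ 20.0, -20.0],
              [-20.0,  20.0]])
c = np.array([-10.0, -10.0])
w = np.array([1.0, 1.0])
vals = [cnf_potential(np.array(x, dtype=float), U, c, w)
        for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each gate here fires for exactly one of the two "one feature on" inputs, so their sum approximates XOR, which is the classic example of a function outside the linear family.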

NeurIPS Conference 2009 Conference Paper

Efficient Match Kernel between Sets of Features for Visual Recognition

  • Liefeng Bo
  • Cristian Sminchisescu

In visual recognition, images are frequently modeled as sets of local features (bags). We show that bag of words, a common method to handle such cases, can be viewed as a special match kernel, which counts 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse. It is, therefore, appealing to design match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels on large datasets due to their significant computational cost. To address this problem, we propose an efficient match kernel (EMK), which maps local features to a low-dimensional feature space, averages the resulting feature vectors to form a set-level feature, and then applies a linear classifier. The local feature maps are learned so that their inner products preserve, to the best possible, the values of the specified kernel function. EMK is linear both in the number of images and in the number of local features. We demonstrate that EMK is extremely efficient and achieves the current state of the art performance on three difficult real world datasets: Scene-15, Caltech-101 and Caltech-256.
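The bag-of-words-as-match-kernel observation above can be verified directly in numpy: the dot product of two BoW histograms equals the number of cross-image feature pairs assigned to the same visual word. The codebook and features below are random toy data, not from any real pipeline:

```python
import numpy as np

def assign_words(features, centers):
    """Quantize local features to their nearest visual word (codebook center)."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bow_histogram(features, centers):
    """Unnormalized bag-of-words histogram over the codebook."""
    return np.bincount(assign_words(features, centers), minlength=len(centers))

def match_kernel_count(fa, fb, centers):
    """Count feature pairs (one per image) that share a visual word."""
    wa = assign_words(fa, centers)
    wb = assign_words(fb, centers)
    return int((wa[:, None] == wb[None, :]).sum())

rng = np.random.default_rng(2)
centers = rng.normal(size=(4, 2))     # a 4-word codebook
img_a = rng.normal(size=(6, 2))       # 6 local features in image A
img_b = rng.normal(size=(5, 2))       # 5 local features in image B
dot = int(bow_histogram(img_a, centers) @ bow_histogram(img_b, centers))
mk = match_kernel_count(img_a, img_b, centers)   # equals dot by construction
```

The equality holds for any data, since both sides count pairs landing in the same Voronoi cell; EMK replaces this 0/1 comparison with a smoother learned similarity.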

UAI Conference 2008 Conference Paper

Greedy Block Coordinate Descent for Large Scale Gaussian Process Regression

  • Liefeng Bo
  • Cristian Sminchisescu

We propose a variable decomposition algorithm, greedy block coordinate descent (GBCD), in order to make dense Gaussian process regression practical for large scale problems. GBCD breaks a large scale optimization into a series of small sub-problems. The challenge in variable decomposition algorithms is the identification of a subproblem (the active set of variables) that yields the largest improvement. We analyze the limitations of existing methods and cast the active set selection into a zero-norm constrained optimization problem that we solve using greedy methods. By directly estimating the decrease in the objective function, we obtain not only efficient approximate solutions for GBCD, but we are also able to demonstrate that the method is globally convergent. Empirical comparisons against competing dense methods like Conjugate Gradient or SMO show that GBCD is an order of magnitude faster. Comparisons against sparse GP methods show that GBCD is both accurate and capable of handling datasets of 100,000 samples or more.
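The decrease-based greedy selection described above can be sketched for the simplest case of single-coordinate blocks. This is a hedged toy version, not the paper's GBCD: block size, stopping rule, and all names are assumptions, and the GP part is reduced to solving the linear system (K + noise·I) alpha = y:

```python
import numpy as np

def greedy_cd_solve(A, y, iters=500):
    """Greedy coordinate descent for A @ alpha = y, A symmetric positive
    definite (as in GP regression, where A = K + noise * I).

    Each step picks the coordinate whose exact update most decreases
    f(alpha) = 0.5 * alpha^T A alpha - y^T alpha, then solves it exactly.
    """
    alpha = np.zeros(len(y))
    grad = -y.astype(float)               # gradient of f at alpha = 0
    diag = np.diag(A)
    for _ in range(iters):
        # Exact decrease from re-optimizing coordinate j: grad_j^2 / (2 A_jj).
        j = int(np.argmax(grad ** 2 / diag))
        step = grad[j] / diag[j]
        alpha[j] -= step
        grad -= step * A[:, j]            # keep the full gradient current
    return alpha

# Toy 1-D GP regression: RBF kernel over 12 inputs, noisy targets.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=12)
y = np.sin(3 * X)
K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 0.1)
A = K + 1.0 * np.eye(12)                  # kernel matrix plus noise term
alpha = greedy_cd_solve(A, y)             # approximates np.linalg.solve(A, y)
```

Maintaining the full gradient incrementally is what makes the "largest improvement" selection cheap: each iteration costs one column of A rather than a full matrix-vector product.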