TMLR 2026 Journal Article
Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG
- Yufeng Wang
- Lu Wei
- Haibin Ling
Retrieval-Augmented Generation (RAG) improves factuality, but retrieving for every query often hurts answer quality while inflating token counts and latency. We propose Training-free Adaptive Retrieval Gating (\textbf{TARG}), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft's prefix logits, TARG computes lightweight uncertainty scores (mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-$N$ variance across a handful of stochastic prefixes) and triggers retrieval only when the score exceeds a threshold. The gate is model-agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently advances the accuracy–efficiency frontier: compared with Always-RAG\footnote{\textsc{Always-RAG}: retrieve for every query; \textsc{Never-RAG}: never retrieve.}, TARG matches or improves EM/F1 while reducing retrieval by 70–90\% and cutting end-to-end latency, and its overhead remains close to that of Never-RAG. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), while small-$N$ variance offers a conservative, budget-first alternative. We provide ablations over gate type and prefix length, and use a $\Delta$-latency view to make budget trade-offs explicit.
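The entropy and margin gates described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the particular monotone link (a sigmoid of the negated mean logit gap), and the threshold values are all illustrative assumptions.

```python
import numpy as np

def mean_token_entropy(prefix_logits):
    """Mean Shannon entropy (nats) over the draft prefix's next-token
    distributions. prefix_logits: shape (T, V), one logit row per token."""
    logits = np.asarray(prefix_logits, dtype=np.float64)
    # Log-softmax computed stably by subtracting the row-wise max.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=-1)  # per-token entropy
    return float(entropy.mean())

def margin_score(prefix_logits):
    """Uncertainty from the top-1/top-2 logit gap, mapped through a
    monotone link (here an illustrative sigmoid of the negated mean gap)
    so that a SMALLER gap yields a LARGER uncertainty score."""
    logits = np.asarray(prefix_logits, dtype=np.float64)
    top2 = np.sort(logits, axis=-1)[:, -2:]   # (T, 2): second-best, best
    gap = (top2[:, 1] - top2[:, 0]).mean()    # mean top-1/top-2 gap
    return float(1.0 / (1.0 + np.exp(gap)))   # monotone decreasing in gap

def should_retrieve(prefix_logits, gate="margin", threshold=0.2):
    """Single-shot gate: trigger retrieval only when the chosen
    uncertainty score exceeds the threshold (threshold is illustrative)."""
    score_fn = {"entropy": mean_token_entropy, "margin": margin_score}[gate]
    return score_fn(prefix_logits) > threshold
```

A confidently peaked draft prefix (large top-1/top-2 gaps) yields a low score and skips retrieval; a near-uniform prefix triggers it. The small-$N$ variance gate would instead compare scores (or answers) across $N$ stochastic draft prefixes and retrieve when they disagree.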