Author name cluster

Xiaobo Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers

2 author rows

EAAI Journal 2026 Journal Article

Class label enhanced Wasserstein distance for classification of remote sensing smoke-related scenes

Shikun Chen
Xin Lu
Xiaobo Lu

Classification of remote sensing (RS) smoke-related scenes is difficult because of their high inter-class similarity. To improve classification performance, deep learning models should separate the scenes more clearly in feature space. Optimal transport (OT), which measures the difference between probability distributions by the Wasserstein distance (WD), fits well with this idea. Used as a loss function, WD pulls together samples that are matched with each other in the transportation plan. In classification tasks, samples belonging to the same scene are naturally expected to be matched. In traditional WD methods, distances between samples are measured only in feature space. With class labels available, samples can also be compared in label space to further reduce distances between same-class samples in the transportation plan. Because samples of the same scene should form a cluster in feature space, class labels can also be predicted from the spatial relationships of feature representations, in addition to the classifier output. In this work, we use class labels predicted from both the spatial relationships and the classifier to reduce mis-mappings in the transportation plan. The proposed algorithm is named Class Label Enhanced Wasserstein Distance (CLEWD), and extensive experiments show that CLEWD outperforms other state-of-the-art (SOTA) algorithms in classification of RS smoke-related scenes.
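The idea of comparing samples in both feature space and label space can be illustrated with a small sketch. This is not the CLEWD algorithm from the paper: the additive cost with weight `lam`, the squared Euclidean distances, and the use of entropic Sinkhorn iterations to obtain a transport plan are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)

def label_enhanced_cost(feat_a, feat_b, prob_a, prob_b, lam=1.0):
    """Assumed cost: feature-space distance plus lam times label-space distance."""
    return pairwise_sq_dists(feat_a, feat_b) + lam * pairwise_sq_dists(prob_a, prob_b)

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between two uniform distributions (Sinkhorn iterations)."""
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)   # uniform source weights
    nu = np.full(m, 1.0 / m)   # uniform target weights
    K = np.exp(-cost / eps)    # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]
```

When two samples have identical features but different predicted labels, the feature-only cost cannot separate them and the plan is spread evenly; with the label term added, most of the mass moves onto same-label pairs.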


AAAI Conference 2026 Conference Paper

CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking

Sifan Zhou
Yichao Cao
Jiahao Nie
Yuqian Fu
Ziyu Zhao
Xiaobo Lu
Shuo Wang

3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
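The low-rank compression step mentioned in the abstract can be pictured with a plain truncated SVD. The sketch below is an assumption-laden illustration, not the IB-DTC module: the energy threshold used to pick the rank, the choice of scaled right singular vectors as "proxy tokens", and the names `adaptive_rank` and `compress_tokens` are all hypothetical. It also assumes the token matrix is not all zeros.

```python
import numpy as np

def adaptive_rank(singular_values, energy=0.95):
    """Smallest rank whose squared singular values reach the given energy share."""
    power = singular_values ** 2
    cumulative = np.cumsum(power) / power.sum()
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(singular_values))

def compress_tokens(tokens, energy=0.95):
    """Compress an (N, D) token matrix into k proxy tokens of shape (k, D).

    Returns (proxy, coeffs) with tokens approximately equal to coeffs @ proxy.
    """
    U, s, Vt = np.linalg.svd(tokens, full_matrices=False)
    k = adaptive_rank(s, energy)
    proxy = s[:k, None] * Vt[:k]   # k scaled right singular vectors
    coeffs = U[:, :k]              # per-token mixing coefficients
    return proxy, coeffs
```

For a token matrix whose rows span only two directions, the sketch keeps two proxy tokens and `coeffs @ proxy` reproduces the original tokens up to floating-point error.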


ICLR Conference 2024 Conference Paper

LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection

Sifan Zhou
Liang Li 0003
Xinyu Zhang 0015
Bo Zhang 0046
Shipeng Bai
Miao Sun
Ziyu Zhao
Xiaobo Lu

Due to highly constrained computing power and memory, deploying 3D lidar-based detectors on edge devices in autonomous vehicles and robots poses a crucial challenge. As a convenient and straightforward model compression approach, Post-Training Quantization (PTQ) has been widely adopted in 2D vision tasks. However, applying it directly to 3D lidar-based tasks inevitably leads to performance degradation. As a remedy, we propose an effective PTQ method called LiDAR-PTQ, which is particularly curated for 3D lidar detection (both SPConv-based and SPConv-free). Our LiDAR-PTQ features three main components: (1) a sparsity-based calibration method to determine the initialization of quantization parameters, (2) an adaptive rounding-to-nearest operation to minimize the layerwise reconstruction error, and (3) a Task-guided Global Positive Loss (TGPL) to reduce the disparity between the final predictions before and after quantization. Extensive experiments demonstrate that our LiDAR-PTQ can achieve state-of-the-art quantization performance when applied to CenterPoint (both Pillar-based and Voxel-based). To our knowledge, for the very first time in lidar-based 3D detection tasks, the PTQ INT8 model's accuracy is almost the same as the FP32 model while enjoying 3X inference speedup. Moreover, our LiDAR-PTQ is cost-effective, being 6X faster than the quantization-aware training method. The code will be released.
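For readers unfamiliar with post-training quantization, the baseline that such methods start from can be sketched as follows. This shows only generic symmetric INT8 quantization with max-based calibration and plain round-to-nearest. It does not implement the sparsity-based calibration, adaptive rounding, or TGPL described in the abstract, and the function names are hypothetical. It assumes the calibration tensor contains at least one nonzero value.

```python
import numpy as np

QMAX = 127  # largest magnitude used for symmetric signed 8-bit values

def calibrate_scale(x):
    """Max-based calibration: map the largest absolute value onto QMAX."""
    return float(np.abs(x).max()) / QMAX

def quantize(x, scale):
    """Round to the nearest integer step and clip to the INT8 range."""
    q = np.clip(np.round(x / scale), -QMAX, QMAX)
    return q.astype(np.int8)

def dequantize(q, scale):
    """Map INT8 codes back to floating point."""
    return q.astype(np.float32) * scale
```

With round-to-nearest, the reconstruction error of any in-range value is at most half a quantization step, which is the quantity that more elaborate rounding schemes try to distribute more favorably across a layer.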


AAAI Conference 2023 Conference Paper

Coarse2Fine: Local Consistency Aware Re-prediction for Weakly Supervised Object Localization

Yixuan Pan
Yao Yao
Yichao Cao
Chongjin Chen
Xiaobo Lu

Weakly supervised object localization aims to localize objects of interest by using only image-level labels. Existing methods generally segment the activation map by a threshold to obtain a mask and generate a bounding box. However, the activation map is locally inconsistent, i.e., similar neighboring pixels of the same object are not equally activated, which leads to the blurred boundary issue: the localization result is sensitive to the threshold, and the mask obtained directly from the activation map loses the fine contours of the object, making it difficult to obtain a tight bounding box. In this paper, we introduce the Local Consistency Aware Re-prediction (LCAR) framework, which aims to recover the complete fine object mask from the locally inconsistent activation map and hence obtain a tight bounding box. To this end, we propose the self-guided re-prediction module (SGRM), which employs a novel superpixel aggregation network to replace the post-processing of threshold segmentation. To derive more reliable pseudo labels from the activation map to supervise the SGRM, we further design an affinity refinement module (ARM) that utilizes the original image feature to better align the activation map with the image appearance, and design a self-distillation CAM (SD-CAM) to alleviate the locator's dependence on saliency. Experiments demonstrate that our LCAR outperforms the state-of-the-art on both the CUB-200-2011 and ILSVRC datasets, achieving 95.89% and 70.72% GT-Known localization accuracy, respectively.


NeurIPS Conference 2023 Conference Paper

Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models

Yichao Cao
Qingfei Tang
Xiu Su
Song Chen
Shan You
Xiaobo Lu
Chang Xu

Human-object interaction (HOI) detection aims to comprehend the intricate relationships between humans and objects, predicting triplets, and serving as the foundation for numerous computer vision tasks. The complexity and diversity of human-object interactions in the real world, however, pose significant challenges for both annotation and recognition, particularly in recognizing interactions within an open-world context. This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). The proposed method is dubbed UniHOI. We conduct a deep analysis of the three hierarchical features inherent in visual HOI detectors and propose a method for high-level relation extraction aimed at VL foundation models, which we call HO prompt-based learning. Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. Furthermore, we utilize an LLM (i.e., GPT) for interaction interpretation, generating a richer linguistic understanding for complex HOIs. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence. Our efficient architecture design and learning methods effectively unleash the potential of the VL foundation models and LLMs, allowing UniHOI to surpass all existing methods by a substantial margin, under both supervised and zero-shot settings. The code and pre-trained weights will be made publicly available.


EAAI Journal 2023 Journal Article

Keypoint-enhanced adaptive weighting model with effective frequency channel attention for driver action recognition

Mingqi Lu
Xiaobo Lu

Traffic accidents caused by distracted driving have become a threat to people’s lives and properties, so it is necessary to recognize driver actions for early warning effectively. Since it is sometimes impossible to distinguish similar activities only by global driving image features, we explicitly extract the action-related keypoint features and propose a keypoint-enhanced model for classification. Specifically, we construct an effective frequency channel attention module to generate discriminative global representations. Considering the diversity of keypoint information, we design an adaptive weighted residual bottleneck to make the model weights of the input keypoint features dynamic. Furthermore, we propose a keypoint-guided conditional computation module. Under the guidance of keypoint features, the expert weights generated by conditional computation enable the model to adapt to different categories of driving images. Essentially, the model generates keypoint-enhanced attention to scale the classification feature channels. We also propose a model backbone training strategy combining self-supervised and supervised contrastive learning so that the model can achieve better results without large-scale labeled driver behavior data.


EAAI Journal 2022 Journal Article

A pose-aware dynamic weighting model using feature integration for driver action recognition

Mingqi Lu
Yaocong Hu
Xiaobo Lu

Traffic accidents caused by distracted driving are on the rise, posing a serious threat to the safety of people’s lives and property. Recognition and early warning of the driver’s actions are particularly important. Considering the differences in local details of driver actions, we use the keypoint information of drivers that reflects the category differences. Specifically, we explicitly model keypoint features and propose a pose-aware driver action recognition model. We design a pose-based feature fusion module incorporating the attention mechanism to fuse the global features and keypoint features of different scales in driving images. In addition, we propose an input-dependent weighting module to enhance the discrimination of the fused features. We use dynamic convolution to apply channel attention to convolution weights. According to the input driving image, the corresponding weights are adaptively generated for multiple convolution kernels, and the final weighted summation is carried out. This is a soft gate scheme, which opens a new perspective for driver action recognition. The proposed model achieves 90.7% and 95.3% accuracy on the StateFarm and SEU-Driving datasets, improvements of 1.4% and 3.1%, respectively, over the SOTA (MSA-CNN).
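The soft gate described in the abstract, where input-dependent weights are generated for several convolution kernels and the kernels are then summed, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the paper's module: the gate here is a hypothetical linear map applied to globally average-pooled features, followed by a softmax, and the names `gate_logits` and `aggregate_kernels` are invented for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def gate_logits(x, gate_w):
    """Hypothetical gate: global average pool a (C, H, W) input, then project to K logits."""
    pooled = x.mean(axis=(1, 2))   # shape (C,)
    return pooled @ gate_w         # gate_w has shape (C, K)

def aggregate_kernels(kernels, logits):
    """Soft gate: softmax-weighted sum of K kernels of shape (K, C_out, C_in, kh, kw)."""
    weights = softmax(logits)                    # shape (K,), sums to 1
    return np.tensordot(weights, kernels, axes=1)
```

Because the weights come from a softmax, every kernel contributes to the aggregated kernel in proportion to its weight. Equal logits give the plain average of the kernels, and one dominant logit recovers a single kernel almost exactly.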


NeurIPS Conference 2022 Conference Paper

Searching for Better Spatio-temporal Alignment in Few-Shot Action Recognition

Yichao Cao
Xiu Su
Qingfei Tang
Shan You
Xiaobo Lu
Chang Xu

Spatio-temporal feature matching and alignment are essential for few-shot action recognition as they determine the coherence and effectiveness of the temporal patterns. Nevertheless, this process can be unreliable, especially when dealing with complex video scenarios. In this paper, we propose to improve the performance of matching and alignment through the end-to-end design of models. Our solution is twofold. First, we enhance the spatio-temporal representations extracted from few-shot videos from the perspective of architectures. To this end, we propose a specialized transformer search method for videos, so that the spatial and temporal attention can be well organized and optimized for stronger feature representations. Second, we design an efficient non-parametric spatio-temporal prototype alignment strategy to better handle the high variability of motion. In particular, a query-specific class prototype is generated for each query sample and category, which can better match query sequences against all support sequences. With these components, our method SST shows significant superiority on the benchmark UCF101 and HMDB51 datasets. For example, with no pretraining, our method achieves a 17.1% Top-1 accuracy improvement over the baseline TRX on the UCF101 5-way 1-shot setting while using 3x fewer FLOPs.
