Arrow Research search

Author name cluster

Chenglong Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers

26

AAAI Conference 2026 Conference Paper

Apo2Mol: 3D Molecule Generation via Dynamic Pocket-Aware Diffusion Models

  • Xinzhe Zheng
  • Shiyu Jiang
  • Gustavo Seabra
  • Chenglong Li
  • Yanjun Li

Deep generative models are rapidly advancing structure-based drug design, offering substantial promise for generating small molecule ligands that bind to specific protein targets. However, most current approaches assume a rigid protein binding pocket, neglecting the intrinsic flexibility of proteins and the conformational rearrangements induced by ligand binding, limiting their applicability in practical drug discovery. Here, we propose Apo2Mol, a diffusion-based generative framework for 3D molecule design that explicitly accounts for conformational flexibility in protein binding pockets. To support this, we curate a dataset of over 24,000 experimentally resolved apo-holo structure pairs from the Protein Data Bank, enabling the characterization of protein structure changes associated with ligand binding. Apo2Mol employs a full-atom hierarchical graph-based diffusion model that simultaneously generates 3D ligand molecules and their corresponding holo pocket conformations from input apo states. Empirical studies demonstrate that Apo2Mol can achieve state-of-the-art performance in generating high-affinity ligands and accurately capture realistic protein pocket conformational changes.

AAAI Conference 2026 Conference Paper

CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

  • Kaiwen Xue
  • Chenglong Li
  • Zhonghong Ou
  • Guoxin Zhang
  • Kaoyan Lu
  • Shuai Lyu
  • Yifan Zhu
  • Ping Zong

Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity in a way that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering multiple dimensions from creative idea to process to product; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset consisting of 2.2K multimodal samples from diverse sources, 79.2K human feedback annotations, and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine the human feedback to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.

AAAI Conference 2026 Conference Paper

Dual-Teacher Interactive Knowledge Distillation Network for Text-to-Visible & Infrared Person Retrieval

  • Chenglong Li
  • Zhengyu Chen
  • Yifei Deng
  • Aihua Zheng

Text-to-visible & infrared person retrieval aims to retrieve the corresponding visible (RGB) and thermal infrared (TIR) images given text descriptions. Existing methods perform semantic decoupling by aligning RGB and TIR features separately to different attributes, thereby facilitating the alignment between the fused multimodal representation and the text. However, the insufficient TIR representation ability and the limited cross-view representation capabilities of the RGB and TIR modalities restrict retrieval accuracy and robustness. To address these issues, we propose a novel Dual-teacher Interactive Knowledge Distillation Network, called DIKDNet, for robust text-to-visible & infrared person retrieval. It performs interactive knowledge distillation between two modality-specific teachers with rich cross-view representation capabilities to enhance TIR representations, and collaborative knowledge distillation from both teachers to the corresponding students to enhance cross-modal, cross-view representations. Specifically, to enhance the representation ability of the TIR backbone network while preserving modality-specific characteristics, we design an Interactive Knowledge Distillation Module (IKDM), which introduces a boundary-constrained distillation strategy between the RGB and TIR backbones to transfer the semantic features of the RGB backbone to the TIR one. To enhance the cross-modal, cross-view representation capability, we design a Collaborative Knowledge Distillation Module (CKDM) to transfer the cross-modal similarity relations and the cross-view multimodal representations from the teacher networks to the student ones. Experimental results demonstrate that our method consistently achieves significant performance gains on both the RGBT-PEDES and RGBNT201-PEDES datasets. The code will be released upon acceptance.
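
The distillation modules described above are, at their core, penalties that pull one backbone's features toward another's. As a loose, hypothetical illustration of a margin-tolerant ("boundary-constrained" in spirit) feature distillation term in PyTorch, and not DIKDNet's actual IKDM or CKDM, one might write:

    import torch
    import torch.nn.functional as F

    def margin_feature_distillation(student_feat: torch.Tensor,
                                    teacher_feat: torch.Tensor,
                                    margin: float = 0.1) -> torch.Tensor:
        # Hypothetical sketch: pull student features toward detached teacher
        # features, penalizing only element-wise residuals beyond the margin
        # so the student is not forced to copy the teacher exactly.
        diff = (student_feat - teacher_feat.detach()).abs()
        return F.relu(diff - margin).pow(2).mean()

In a dual-teacher setup, one such term per teacher-student pair would simply be added to the overall training loss.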

EAAI Journal 2026 Journal Article

Morphology-aware hierarchical mixture of experts for Chest X-ray anatomy segmentation

  • Lili Huang
  • Yuanjun He
  • Xiaowei Zhao
  • Chenglong Li
  • Jin Tang

Medical anatomical segmentation of Chest X-ray (CXR) images is critical for accurately delineating lesion areas, aiding diagnosis, and alleviating physicians’ workload. However, CXR image segmentation presents significant challenges, including blurred foreground and background, complex anatomical structures, and indistinct edges. As a result, existing methods often struggle to capture fine edge details and fail to account for diverse morphological characteristics, such as curved ribs, large lungs, and the slender trachea, during segmentation. To address these challenges, we propose a novel Morphology-aware Hierarchical Mixture of Experts (MH-MoE) architecture that explicitly incorporates morphological features to improve segmentation performance for CXR images. Specifically, MH-MoE comprises multiple cascaded blocks, each integrating two key components: the Edge-Enhanced Mixture of Experts (EE-MoE) and the Morphology-Aware Mixture of Experts (MA-MoE). First, EE-MoE employs directional difference convolutions with an adaptive gating mechanism to extract prominent edge features across varying orientations and spatial frequencies, effectively enhancing anatomical boundary perception. Then, MA-MoE further enables the model to adaptively capture morphological variations. It utilizes gating mechanisms to dynamically select our proposed Dilated-based Strip Convolutions and refines feature representations through an attention mechanism, ensuring a more comprehensive understanding of anatomical structures. Extensive experiments on two CXR datasets, the Chest X-ray Segmentation (CXRS) and VinDr-RibCXR datasets, validate the effectiveness of our method. Our approach achieves state-of-the-art performance, as confirmed by both evaluation results and visual demonstrations. The code is released at: MHMoEcode.

AAAI Conference 2026 Conference Paper

ProxyTTT: Proxy-driven Test-Time Training for Multi-modal Re-identification

  • Aihua Zheng
  • Zhaojun Liu
  • Xixi Wan
  • Chenglong Li
  • Jin Tang
  • Yan Yan

Multi-modal object re-identification (ReID) aims to retrieve specific targets by leveraging complementary cues from different sensing modalities. Despite recent progress, two key challenges remain: (1) the limited ability to jointly address both modality and viewpoint discrepancies, and (2) the difficulty of effectively leveraging reliable target-domain data to improve generalization. To address these challenges, we propose Proxy-driven Test-Time Training (ProxyTTT), a unified framework that enhances both multi-modal identity representation learning and model generalization. During training, we propose a Multi-Proxy Learning (MPL) mechanism to address the representation bias across different views and modalities. MPL disentangles fine-grained modality-specific and modality-common identity proxies as semantic anchors to align identity features across diverse perspectives and sensing modalities. This alignment strategy enables the model to learn robust and discriminative global identity representations under heterogeneous modality conditions. At test time, to reliably exploit target domain data, we propose Proxy-guided Entropy-based Selective Adaptation (PESA) for test-time training. Specifically, PESA leverages the semantic structure encoded by identity proxies to estimate prediction uncertainty via entropy, and selectively adapts the model using only high-confidence samples. This selective adaptation effectively mitigates the domain shift between training and deployment environments, improving the model’s generalization in real-world scenarios. Extensive experiments on four public multi-modal ReID benchmarks (RGBNT201, RGBNT100, MSVR310, and WMVeID863) demonstrate the effectiveness of ProxyTTT.
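
The entropy-based sample selection used for test-time adaptation can be illustrated with a minimal PyTorch sketch; the function name and threshold below are assumptions for illustration and do not reproduce ProxyTTT's PESA module:

    import torch
    import torch.nn.functional as F

    def select_low_entropy(logits: torch.Tensor, threshold: float) -> torch.Tensor:
        # Softmax entropy per sample; lower entropy means a more confident prediction.
        probs = F.softmax(logits, dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        # Boolean mask: only these high-confidence samples would contribute
        # to the adaptation loss at test time.
        return entropy < threshold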

AAAI Conference 2026 Conference Paper

Unaligned UAV RGBT Tracking: A Large-scale Benchmark and a Novel Approach

  • Yun Xiao
  • Yuhang Wang
  • Jiandong Jin
  • Wankang Zhang
  • Chenglong Li

With the rapid development of the low-altitude economy, multimodal visual tracking in UAV scenarios has attracted extensive attention. UAVs are typically equipped with independent visible (RGB) and thermal infrared (TIR) sensors, resulting in an inherent spatial misalignment between the two modalities. However, existing RGBT tracking methods generally rely on spatially aligned data inputs, making them unsuitable for the unaligned RGBT tracking task in UAV scenarios. In this work, we introduce a new task called unaligned UAV RGBT tracking and construct the first large-scale unaligned RGB and TIR video dataset to promote research and development in this field. The dataset contains 1,453 pairs of UAV-captured RGBT sequences with precise dual-modal bounding box annotations, and covers 42 object categories, 22 typical challenge attributes, and diverse spatial misalignment scales to better simulate real-world challenging scenarios. To address the limitations of existing methods that fail to handle the spatial misalignment issue in UAV scenarios, we propose a novel RGBT tracking approach. In particular, we design a mixture of shift estimation experts module to adaptively estimate the spatial shifts across the two modalities at different scales, and a cross-modal alignment and fusion module to further compensate for nonlinear deformations and integrate multimodal information. Extensive experiments on the created dataset demonstrate that the proposed tracker significantly outperforms existing state-of-the-art tracking methods, validating its practicality and robustness in real-world unaligned UAV tracking scenarios.

AAAI Conference 2025 Conference Paper

Alignment-Free RGB-T Salient Object Detection: A Large-Scale Dataset and Progressive Correlation Network

  • Kunpeng Wang
  • Keke Chen
  • Chenglong Li
  • Zhengzheng Tu
  • Bin Luo

Alignment-free RGB-Thermal (RGB-T) salient object detection (SOD) aims to achieve robust performance in complex scenes by directly leveraging the complementary information from unaligned visible-thermal image pairs, without requiring manual alignment. However, the labor-intensive process of collecting and annotating image pairs limits the scale of existing benchmarks, hindering the advancement of alignment-free RGB-T SOD. In this paper, we construct a large-scale and high-diversity unaligned RGB-T SOD dataset named UVT20K, comprising 20,000 image pairs, 407 scenes, and 1,256 object categories. All samples are collected from real-world scenarios with various challenges, such as low illumination, image clutter, and complex salient objects. To support further research, each sample in UVT20K is annotated with a comprehensive set of ground truths, including saliency masks, scribbles, boundaries, and challenge attributes. In addition, we propose a Progressive Correlation Network (PCNet), which models inter- and intra-modal correlations on the basis of explicit alignment to achieve accurate predictions on unaligned image pairs. Extensive experiments conducted on two unaligned, three weakly aligned, and three aligned datasets demonstrate the effectiveness of our method.

AAAI Conference 2025 Conference Paper

Cross-modulated Attention Transformer for RGBT Tracking

  • Yun Xiao
  • Jiacong Zhao
  • Andong Lu
  • Chenglong Li
  • Bing Yin
  • Yin Lin
  • Cong Liu

Existing Transformer-based RGBT trackers achieve remarkable performance by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction and search-template correlation. Nevertheless, the independent search-template correlation calculations are prone to being affected by low-quality data, which can result in contradictory and ambiguous correlation weights. This not only limits the intra-modal feature representation, but also harms the robustness of cross-attention for multi-modal feature interaction and search-template correlation computation. To address these issues, we propose a novel approach called Cross-modulated Attention Transformer (CAFormer), which integrates inter-modality interaction into the search-template correlation computation within a typical attention mechanism, for RGBT tracking. In particular, we first independently generate correlation maps for each modality and feed them into the designed correlation modulated enhancement module, which can correct inaccurate correlation weights by seeking consensus between modalities. This design unifies the self-attention and cross-attention schemes, which not only alleviates inaccurate attention weight computation in self-attention but also eliminates the redundant computation introduced by an extra cross-attention scheme. In addition, we design a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.

EAAI Journal 2025 Journal Article

Cross-scale prediction of aluminum dust concentration based on Image Fusion Physics-Informed Neural Networks

  • Nanxi Ding
  • Wenzhong Lou
  • Zihao Zhang
  • Yizhe Wu
  • Chenglong Li
  • Wenlong Ma
  • Zhengqian Zhang

Research on predicting dust concentration can effectively help reduce the occurrence of dust explosion accidents. However, existing methods struggle to accurately and quickly reconstruct and predict concentration fields across scales in turbulent dust diffusion processes. This paper proposes a neural network framework that combines particle physics information with image inverse mapping to achieve rapid, multi-source, and cross-scale prediction of turbulent dust concentration fields. We conducted a 280 kg aluminum dust dispersion experiment, collecting ultrasound attenuation signals and image data for our dataset. Based on this, by incorporating the Maxwell-Stefan equation, our approach addresses the underfitting issues of existing neural networks in predicting microscale turbulence within concentration fields. Additionally, the inverse mapping of images provides macroscopic diffusion trends for the concentration field. Results demonstrate that our method reconstructs the aluminum dust concentration field and predicts its state 0.06 s ahead within 0.011 s, with a mean squared error of only 0.0003. Compared to existing Convolutional Neural Network, Physics-Informed Neural Network, and Computational Fluid Dynamics methods, our approach shows significant improvement in cross-scale prediction, making accurate concentration prediction possible. This advancement offers quantitative prediction data crucial for preventing dust explosions.

NeurIPS Conference 2025 Conference Paper

DecoyDB: A Dataset for Graph Contrastive Learning in Protein-Ligand Binding Affinity Prediction

  • Yupu Zhang
  • Zelin Xu
  • Tingsong Xiao
  • Gustavo Seabra
  • Yanjun Li
  • Chenglong Li
  • Zhe Jiang

Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. Unfortunately, progress has been hindered by the lack of large-scale and high-quality binding affinity labels. The widely used PDBbind dataset has fewer than 20K labeled complexes. Self-supervised learning, especially graph contrastive learning (GCL), provides a unique opportunity to break the barrier by pretraining graph neural network models based on vast unlabeled complexes and fine-tuning the models on much fewer labeled complexes. However, the problem faces unique challenges, including a lack of a comprehensive unlabeled dataset with well-defined positive/negative complex pairs and the need to design GCL algorithms that incorporate the unique characteristics of such data. To fill the gap, we propose DecoyDB, a large-scale, structure-aware dataset specifically designed for self-supervised GCL on protein–ligand complexes. DecoyDB consists of high-resolution ground truth complexes and diverse decoy structures with computationally generated binding poses that range from realistic to suboptimal. Each decoy is annotated with a Root Mean Square Deviation (RMSD) from the native pose. We further design a customized GCL framework to pretrain graph neural networks based on DecoyDB and fine-tune the models with labels from PDBbind. Extensive experiments confirm that models pretrained with DecoyDB achieve superior accuracy, sample efficiency, and generalizability.
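
For reference, the RMSD value attached to each decoy can be computed from atom coordinates that share a common ordering; the snippet below is a generic sketch (no superposition or symmetry correction), not DecoyDB's exact annotation pipeline:

    import numpy as np

    def rmsd(pose_a: np.ndarray, pose_b: np.ndarray) -> float:
        # pose_a, pose_b: (N, 3) coordinate arrays with identical atom ordering.
        diff = np.asarray(pose_a, dtype=float) - np.asarray(pose_b, dtype=float)
        return float(np.sqrt((diff ** 2).sum(axis=1).mean()))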

EAAI Journal 2025 Journal Article

Erasure-based interaction network for red-green-blue and thermal object detection and a unified benchmark

  • Qishun Wang
  • Zhengzheng Tu
  • Chenglong Li
  • Hongshun Wang
  • Kunpeng Wang

Recently, many breakthroughs have been made in the field of video object detection, but performance is still limited by the imaging limitations of RGB (red-green-blue) sensors in adverse illumination conditions. To alleviate this issue, this work introduces a new computer vision task called RGBT (red-green-blue and thermal) video object detection, which incorporates the thermal modality that is insensitive to adverse illumination conditions. To promote the research and development of RGBT video object detection, we design a novel Erasure-based Interaction Network (EINet) and establish a comprehensive benchmark dataset for this task. Traditional methods often leverage temporal information by using many auxiliary frames and thus have a large computational burden. Considering that thermal images exhibit less noise than RGB ones, we develop a negative activation function that erases the noise of RGB features with the help of thermal image features. Furthermore, benefiting from thermal images, we rely only on a small temporal window to model the spatio-temporal information, greatly improving efficiency while maintaining detection accuracy. Our dataset consists of 50 pairs of RGBT video sequences with complex backgrounds, various objects, and different illuminations, collected in real traffic scenarios. Extensive experiments on the proposed dataset demonstrate the effectiveness and efficiency of EINet. Compared with existing detectors, EINet achieves a relatively balanced performance with a detection accuracy of 46.3% and a speed of 92.6 frames per second. This project will be released to the public for free academic usage at https://github.com/tzz-ahu.

EAAI Journal 2025 Journal Article

Keypoint-guided feature enhancement and alignment for cross-resolution vehicle re-identification

  • Aihua Zheng
  • Longfei Zhang
  • Weijun Zhang
  • Zi Wang
  • Chenglong Li
  • Xiaofei Sheng

Resolution mismatch between low-resolution query images and high-resolution gallery images in vehicle re-identification is rarely studied but ubiquitous in real-world applications. An intuitive approach to solving cross-resolution vehicle re-identification is to utilize super-resolution algorithms to recover detailed information from low-resolution query images. However, vehicle super-resolution algorithms not only recover the detailed information of the vehicle but also enhance the background noise, which would degrade the re-identification performance. In addition, the view mismatch problem also significantly limits the performance of vehicle re-identification. To handle these problems, we propose a novel Keypoint Guiding Network, which simultaneously addresses the problems of resolution mismatch and view mismatch from the perspective of keypoints in an end-to-end learning framework, for cross-resolution vehicle re-identification. In particular, we first generate a set of vehicle keypoints via an effective Gaussian localization method, and then adaptively construct two keypoint-based guidances using attention models. We integrate these two guidances into vehicle super-resolution and view alignment to handle the problems of resolution mismatch and view mismatch respectively. Moreover, to alleviate the heterogeneity between super-resolution query images and high-resolution gallery ones, we design a dual-path teacher–student distillation scheme to narrow their feature distributions. Comprehensive experiments on two down-sampled benchmark datasets demonstrate the effectiveness of our Keypoint Guiding Network against the state-of-the-art methods.

AAAI Conference 2025 Conference Paper

Pedestrian Attribute Recognition: A New Benchmark Dataset and a Large Language Model Augmented Framework

  • Jiandong Jin
  • Xiao Wang
  • Qian Zhu
  • Haiyang Wang
  • Chenglong Li

Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources), conducting only simple random splits, and performance on these datasets has already approached saturation. In the past five years, no large-scale dataset has been opened to the public. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset, termed MSP60K, to fill the data gap. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also applied to further narrow the gap between the dataset and real-world challenging scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with an LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets thoroughly validate the efficacy of our proposed framework.

AAAI Conference 2025 Conference Paper

RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba

  • Andong Lu
  • Wanyu Wang
  • Chenglong Li
  • Jin Tang
  • Bin Luo

Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion at each layer, but cannot perform feature interactions among all layers, which play a critical role in robust multimodal representation, due to the large computational burden. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions across all modalities and layers in a progressive fusion Mamba, for robust RGBT tracking. Even though modality features in different layers are known to contain different cues, building multimodal interactions in each layer is always challenging because of the difficulty of balancing interaction capability and efficiency. Meanwhile, considering that the feature discrepancy between RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of different modalities with linear complexity. When interacting with features from all layers, a huge number of token sequences (3,840 tokens in this work) are involved, and the computational burden is thus large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) to perform efficient and effective feature interactions across all layers by dynamically adjusting the scan order of different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance against existing state-of-the-art methods. We will release the code upon acceptance of the paper.

ICRA Conference 2025 Conference Paper

Robo-GS: A Physics Consistent Spatial-Temporal Model for Robotic Arm with Hybrid Representation

  • Haozhe Lou
  • Yurong Liu
  • Yike Pan
  • Yiran Geng
  • Jianteng Chen
  • Wenlong Ma 0006
  • Chenglong Li
  • Lin Wang

The Real2Sim2Real (R2S2R) paradigm is critical for advancing robotic learning. Existing methods lack a comprehensive solution to accurately reconstruct real-world objects with both spatial representations and their associated physics attributes in the Real2Sim stage. We propose a Real2Sim pipeline to generate digital assets that enable high-fidelity simulation. We design a hybrid representation model that integrates mesh geometry, 3D Gaussian kernels, and physics attributes to enhance the representation of robotic arms in digital assets. This hybrid representation is implemented through a Gaussian-Mesh-Pixel binding technique, which establishes an isomorphic mapping between mesh vertices and the Gaussian model. This enables a fully differentiable rendering pipeline that can be optimized through numerical solvers, achieves high-fidelity rendering via Gaussian Splatting, and facilitates physically plausible simulation of the robotic arm's interaction with its environment through mesh geometry. With these digital assets, we propose a fully manipulable Real2Sim pipeline that standardizes coordinate systems and scales, ensuring the seamless integration of multiple components. To demonstrate its effectiveness, we include datasets covering various robotic manipulation tasks along with their mesh reconstructions. Our model achieves state-of-the-art results in realistic rendering and mesh reconstruction quality for robotic applications. Our code and datasets will be made publicly available at robostudioapp.com.

IJCAI Conference 2025 Conference Paper

Template-based Uncertainty Multimodal Fusion Network for RGBT Tracking

  • Zhaodong Ding
  • Chenglong Li
  • Shengqing Miao
  • Jin Tang

RGBT tracking aims to localize predefined targets in video sequences by effectively leveraging information from both the visible light (RGB) and thermal infrared (TIR) modalities. However, the quality of different modalities changes dynamically in complex scenes, and effectively perceiving modal quality for multimodal fusion remains a significant challenge. To address this challenge, we propose to employ the reliability of the initial template to explore the uncertainty across different modalities, and design a novel template-based uncertainty computation framework for robust multimodal fusion in RGBT tracking. In particular, we introduce an Uncertainty-aware Multimodal Fusion Module (UMFM), which estimates the uncertainty of each modality by leveraging the correlation between the template and search region within the Subjective Logic framework, aiming to achieve robust multimodal fusion. In addition, existing methods focus on dynamic template update while overlooking the potential role of a reliable initial template in the template updating process. To this end, we design a simple yet effective Contrastive Template Update Module (CTUM) to assess the reliability of the new template by comparing its quality with that of the initial template. Extensive experiments show that our method outperforms existing approaches on four RGBT tracking benchmarks.
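
In the Subjective Logic framework referenced above, an uncertainty mass is commonly derived from non-negative evidence over K classes; the snippet below is the standard textbook formulation, shown only for illustration rather than as UMFM's template-based computation:

    import torch

    def subjective_logic_opinion(evidence: torch.Tensor):
        # evidence: non-negative tensor of shape (B, K).
        K = evidence.shape[1]
        alpha = evidence + 1.0               # Dirichlet parameters
        S = alpha.sum(dim=1, keepdim=True)   # Dirichlet strength
        belief = evidence / S                # per-class belief masses, shape (B, K)
        uncertainty = K / S.squeeze(1)       # uncertainty mass per sample, shape (B,)
        return belief, uncertainty

Under this formulation, a modality whose template-search correlation yields weak evidence receives a larger uncertainty mass and would correspondingly contribute less to the fusion.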

NeurIPS Conference 2025 Conference Paper

UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification

  • Xixi Wan
  • Aihua Zheng
  • Bo Jiang
  • Beibei Wang
  • Chenglong Li
  • Jin Tang

Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. At present, multi-modal object ReID faces two core challenges: (1) learning robust features under fine-grained local noise caused by occlusion, frame loss, and other disruptions; and (2) effectively integrating heterogeneous modalities to enhance multi-modal representation. To address the above challenges, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local- and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose a Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. This boosts the expressiveness of modality-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty, effectively suppressing noise-induced instability and enhancing robustness. Meanwhile, we design an uncertainty-guided routing mechanism to strengthen multi-modal interaction and further improve performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise robustness. Our code is available at https://github.com/wanxixi11/UGG-ReID.

EAAI Journal 2024 Journal Article

Lane detection via disentangled representation network with slope consistency loss

  • Zhaodong Ding
  • Yifei Deng
  • Chenglong Li
  • Rui Ruan
  • Jin Tang

Existing works in lane detection focus on learning a general robust representation across different scenarios to overcome the impact of missing visual cues. However, the factors leading to the absence of visual cues vary across scenarios, and the training data from challenging conditions is relatively small compared to common conditions. These problems prevent existing methods from maintaining robust lane detection across different scenarios in practical applications. To address these problems, this work presents a novel Disentangled Representation Network called DRNet, which disentangles lane feature representations so as to efficiently learn the representation corresponding to each specific condition, while also mitigating the adverse effects of data imbalance. Specifically, we disentangle the lane representation via five branches, corresponding to common scenes, crowded objects, low light, dazzle light, and other conditions, respectively. Because different conditions are modeled separately, each branch can be represented with a small number of parameters, which can be sufficiently learned from the corresponding training subset. Moreover, existing works perform lane classification or regression using pixel-level losses, which neglect important shape information. To this end, we design a novel slope consistency loss that takes both global and local slope consistency between prediction and ground truth into account, which can adaptively adjust the lane shape and location. Extensive experiments on the CULane and TuSimple datasets show that our DRNet outperforms state-of-the-art methods, reaching 81.07% F1 on CULane and 97.97% on TuSimple.
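
The slope consistency idea can be pictured as a loss that matches both local (point-to-point) and global (endpoint-to-endpoint) slopes between predicted and ground-truth lane points; the code below is an illustrative assumption, not DRNet's published loss:

    import torch

    def slope_consistency_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
        # pred, gt: (N, 2) lane points ordered along the lane, columns are (x, y).
        def local_slopes(pts):
            d = pts[1:] - pts[:-1]
            return d[:, 1] / (d[:, 0] + 1e-6)

        def global_slope(pts):
            d = pts[-1] - pts[0]
            return d[1] / (d[0] + 1e-6)

        local_term = (local_slopes(pred) - local_slopes(gt)).abs().mean()
        global_term = (global_slope(pred) - global_slope(gt)).abs()
        return local_term + global_term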

AAAI Conference 2024 Conference Paper

Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception

  • Xiao Wang
  • Wentao Wu
  • Chenglong Li
  • Zhicheng Zhao
  • Zhe Chen
  • Yukai Shi
  • Jin Tang

Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks and might thus lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. To be specific, we explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction. More comprehensive knowledge distilled from the CLIP model, based on the similarity between paired/unpaired vehicle image-text samples, is further taken into consideration to help achieve a better understanding of vehicles. We build a large-scale dataset, termed Autobot1M, to pre-train our model; it contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE.

EAAI Journal 2023 Journal Article

Multimodal salient object detection via adversarial learning with collaborative generator

  • Zhengzheng Tu
  • Wenfang Yang
  • Kunpeng Wang
  • Amir Hussain
  • Bin Luo
  • Chenglong Li

Multimodal salient object detection (MSOD), which utilizes multimodal information (e.g., an RGB image together with a thermal infrared or depth image) to detect common salient objects, has received much attention recently. Different modalities reflect different appearance properties of salient objects, some of which can contribute to improving the precision and/or recall of MSOD. To improve both precision and recall by fully exploiting multimodal data, in this work we propose an effective adversarial learning framework based on a novel collaborative generator for accurate multimodal salient object detection. In particular, the collaborative generator consists of three generators (generator1, generator2, and generator3), which aim to decrease the false positives and false negatives of the generated saliency maps and to improve the F-measure of the final saliency maps, respectively. Generator1 and generator2 contain two encoder–decoder networks for the multimodal inputs, and we propose a new co-attention model to perform adaptive interactions between different modalities. Furthermore, we apply generator3 to integrate feature maps from generator1 and generator2 in a complementary way. By adversarially learning the collaborative generator and discriminator, both the precision and recall of the predicted maps are boosted with the complementary benefits of multimodal data. Extensive experiments on three RGBT datasets and six RGBD datasets show that our method performs quite well against state-of-the-art MSOD methods.

AAAI Conference 2022 Conference Paper

Attribute-Based Progressive Fusion Network for RGBT Tracking

  • Yun Xiao
  • MengMeng Yang
  • Chenglong Li
  • Lei Liu
  • Jin Tang

RGBT tracking usually suffers from various challenging factors such as fast motion, scale variation, illumination variation, thermal crossover, and occlusion, to name a few. Existing works often study fusion models that aim to solve all challenges simultaneously, which requires fusion models to be sufficiently complex and training data to be sufficiently large, both of which are usually difficult to obtain in real-world scenarios. In this work, we disentangle the fusion process via challenge attributes, and thus propose a novel Attribute-Based Progressive Fusion Network (APFNet) to increase the fusion capacity with a small number of parameters while reducing the dependence on large-scale training data. In particular, we design five attribute-specific fusion branches to integrate RGB and thermal features under the challenges of thermal crossover, illumination variation, scale variation, occlusion, and fast motion, respectively. By disentangling the fusion process, we can use a small number of parameters for each branch to achieve robust fusion of different modalities and train each branch using the small training subset with the corresponding attribute annotation. Then, to adaptively fuse the features of all branches, we design an aggregation fusion module based on SKNet. Finally, we also design an enhancement fusion transformer to strengthen the aggregated feature and the modality-specific features. Experimental results on benchmark datasets demonstrate the effectiveness of our APFNet against other state-of-the-art methods.
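
An SKNet-style aggregation over the five branch outputs can be sketched as channel attention softmax-normalized across branches; the module and parameter names below are assumptions for illustration, not APFNet's released implementation:

    import torch
    import torch.nn as nn

    class BranchAggregation(nn.Module):
        # SKNet-style fusion: a shared global descriptor produces per-branch
        # channel attention, softmax-normalized across branches.
        def __init__(self, channels: int, num_branches: int = 5, reduction: int = 4):
            super().__init__()
            self.num_branches = num_branches
            self.channels = channels
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels * num_branches),
            )

        def forward(self, branch_feats):                      # list of (B, C, H, W)
            stacked = torch.stack(branch_feats, dim=1)        # (B, M, C, H, W)
            descriptor = stacked.sum(dim=1).mean(dim=(2, 3))  # (B, C)
            attn = self.fc(descriptor).view(-1, self.num_branches, self.channels)
            attn = attn.softmax(dim=1).unsqueeze(-1).unsqueeze(-1)  # (B, M, C, 1, 1)
            return (stacked * attn).sum(dim=1)                # (B, C, H, W)

For example, BranchAggregation(channels=256) would fuse five (B, 256, H, W) branch features into a single map of the same shape.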

AAAI Conference 2022 Conference Paper

Cross-Modal Object Tracking: Modality-Aware Representations and a Unified Benchmark

  • Chenglong Li
  • Tianhao Zhu
  • Lei Liu
  • Xiaonan Si
  • Zilin Fan
  • Sulan Zhai

In many visual systems, visual tracking is often based on RGB image sequences, in which some targets become invalid in low-light conditions, and tracking performance is thus affected significantly. Introducing other modalities such as depth and infrared data is an effective way to handle the imaging limitations of individual sources, but multi-modal imaging platforms usually require elaborate designs and cannot be applied in many real-world applications at present. Near-infrared (NIR) imaging has become an essential part of many surveillance cameras, whose imaging is switchable between RGB and NIR based on the light intensity. These two modalities are heterogeneous, with very different visual properties, and thus bring big challenges to visual tracking. However, existing works have not studied this challenging problem. In this work, we address the cross-modal object tracking problem and contribute a new video dataset, including 644 cross-modal image sequences with over 478K frames in total and an average video length of more than 742 frames. To promote the research and development of cross-modal object tracking, we propose a new algorithm, which learns a modality-aware target representation to mitigate the appearance gap between the RGB and NIR modalities in the tracking process. It is plug-and-play and can thus be flexibly embedded into different tracking frameworks. Extensive experiments on the dataset are conducted, and we demonstrate the effectiveness of the proposed algorithm in two representative tracking frameworks against 19 state-of-the-art tracking methods.

AAAI Conference 2022 Conference Paper

Interact, Embed, and EnlargE: Boosting Modality-Specific Representations for Multi-Modal Person Re-identification

  • Zi Wang
  • Chenglong Li
  • Aihua Zheng
  • Ran He
  • Jin Tang

Multi-modal person Re-ID introduces more complementary information to assist the traditional Re-ID task. Existing multi-modal methods ignore the importance of modality-specific information in the feature fusion stage. To this end, we propose a novel method to boost modality-specific representations for multi-modal person Re-ID: Interact, Embed, and EnlargE (IEEE). First, we propose a cross-modal interacting module to exchange useful information between different modalities in the feature extraction phase. Second, we propose a relation-based embedding module to enhance the richness of feature descriptors by embedding the global feature into the fine-grained local information. Finally, we propose a multi-modal margin loss that forces the network to learn modality-specific information for each modality by enlarging the intra-class discrepancy. Superior performance on the multi-modal Re-ID dataset RGBNT201 and three constructed Re-ID datasets validates the effectiveness of the proposed method compared with state-of-the-art approaches.

AAAI Conference 2021 Conference Paper

Robust Multi-Modality Person Re-identification

  • Aihua Zheng
  • Zi Wang
  • Zihan Chen
  • Chenglong Li
  • Jin Tang

To avoid the illumination limitation of visible person re-identification (Re-ID) and the heterogeneity issue in cross-modality Re-ID, we propose to utilize the complementary advantages of multiple modalities, including visible (RGB), near-infrared (NI), and thermal infrared (TI), for robust person Re-ID. A novel progressive fusion network is designed to learn effective multi-modal features from single to multiple modalities and from local to global views. Our method works well in diversely challenging scenarios, even in the presence of missing modalities. Moreover, we contribute a comprehensive benchmark dataset, RGBNT201, including 201 identities captured under various challenging conditions, to facilitate research on RGB-NI-TI multi-modality person Re-ID. Comprehensive experiments on the RGBNT201 dataset against state-of-the-art methods demonstrate the contribution of multi-modality person Re-ID and the effectiveness of the proposed approach, establishing a new benchmark and a new baseline for multi-modality person Re-ID.

AAAI Conference 2020 Conference Paper

Multi-Spectral Vehicle Re-Identification: A Challenge

  • Hongchao Li
  • Chenglong Li
  • Xianpeng Zhu
  • Aihua Zheng
  • Bin Luo

Vehicle re-identification (Re-ID) is a crucial task in smart cities and intelligent transportation, aiming to match vehicle images across non-overlapping surveillance camera views. Currently, most works focus on RGB-based vehicle Re-ID, which limits their applicability in real-life adverse environments such as dark scenes and bad weather. IR (infrared) spectrum imaging offers complementary information to relieve the illumination issue in computer vision tasks. Furthermore, vehicle Re-ID suffers from the big challenge of diverse appearance across different views, such as for trucks. In this work, we address the RGB and IR vehicle Re-ID problem and contribute a multi-spectral vehicle Re-ID benchmark named RGBN300, including RGB and NIR (near-infrared) vehicle images of 300 identities from 8 camera views, giving 50,125 RGB images and 50,125 NIR images in total. In addition, we have acquired additional TIR (thermal infrared) data for 100 vehicles from RGBN300 to form another dataset for three-spectral vehicle Re-ID. Furthermore, we propose a Heterogeneity-collaboration Aware Multi-stream convolutional Network (HAMNet) to automatically fuse different spectrum features in an end-to-end learning framework. Comprehensive experiments on prevalent networks show that our HAMNet can effectively integrate multi-spectral data for robust vehicle Re-ID in day and night. Our work provides a benchmark dataset for RGB-NIR and RGB-NIR-TIR multi-spectral vehicle Re-ID and a baseline network for both research and industrial communities. The dataset and baseline code are available at: https://github.com/ttaalle/multi-modal-vehicle-Re-ID.

AAAI Conference 2017 Conference Paper

Learning Patch-Based Dynamic Graph for Visual Tracking

  • Chenglong Li
  • Liang Lin
  • Wangmeng Zuo
  • Jin Tang

Existing visual tracking methods usually localize the object with a bounding box, in which the foreground object trackers/detectors are often disturbed by the introduced background information. To handle this problem, we aim to learn a more robust object representation for visual tracking. In particular, the tracked object is represented with a graph structure (i.e., a set of non-overlapping image patches), in which the weight of each node (patch) indicates how likely it belongs to the foreground, and edges are also weighted to indicate the appearance compatibility of two neighboring nodes. This graph is dynamically learned (i.e., the nodes and edges receive weights) and applied in object tracking and model updating. We constrain the graph learning from two aspects: i) the global low-rank structure over all nodes and ii) the local sparseness of node neighbors. During the tracking process, our method performs the following steps at each frame. First, the graph is initialized by assigning a weight of either 1 or 0 to some image patches according to the predicted bounding box. Second, the graph is optimized through a newly designed ALM (Augmented Lagrange Multiplier) based algorithm. Third, the object feature representation is updated by imposing the weights of patches on the extracted image features. The object location is finally predicted by adopting the Struck tracker (Hare, Saffari, and Torr 2011). Extensive experiments show that our approach outperforms state-of-the-art tracking methods on two standard benchmarks, i.e., OTB100 and NUS-PRO.