Arrow Research search

Author name cluster

Shuang Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

45 papers
2 author rows

Possible papers

45

AAAI Conference 2026 Conference Paper

Dynamic-Static Collaboration for Unsupervised Domain Adaptive Video-Based Visible-Infrared Person Re-Identification

  • Jiaxu Leng
  • Zhengjie Wang
  • Shuang Li
  • Xinbo Gao

Video-based visible-infrared person re-identification (VVI-ReID) aims to match pedestrian sequences across modalities for all-day surveillance. While supervised methods have shown progress, their dependence on large-scale cross-modal annotations limits scalability. We investigate the task of unsupervised domain adaptation for VVI-ReID (UDA-VVI-ReID), where a model trained on a labeled source domain is adapted to an unlabeled target domain. Directly extending existing image-based unsupervised VI-ReID methods to video scenarios by simply averaging frame-level features is suboptimal, as this naive strategy neglects the rich temporal dynamics in video data and leads to unreliable pseudo-labels due to occlusion-induced noise. To overcome these limitations, we propose a Dynamic-Static Collaboration (DSC) framework that explicitly leverages the complementary strengths of motion and appearance cues. The Dynamic-Static Label Unification (DSLU) module refines pseudo-labels by validating the consistency between static and dynamic predictions. Based on these labels, the Dynamic-Static Joint Learning (DSJL) module performs neighbor-aware contrastive learning in both feature spaces, promoting robust representation learning under cross-modal and temporal variations. Experiments on HITSZ-VCM and BUPTCampus show that DSC sets a strong baseline for this new task, enabling robust cross-modal video ReID without target labels.

AAAI Conference 2026 Conference Paper

Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification

  • Linhan Zhou
  • Shuang Li
  • Neng Dong
  • Yonghang Tai
  • Yafei Zhang
  • Huafeng Li

Person re-identification (ReID) aims to retrieve target pedestrian images given either visual queries (image-to-image, I2I) or textual descriptions (text-to-image, T2I). Although both tasks share a common retrieval objective, they pose distinct challenges: I2I emphasizes discriminative identity learning, while T2I requires accurate cross-modal semantic alignment. Existing methods often treat these tasks separately, which may lead to representation entanglement and suboptimal performance. To address this, we propose a unified framework named Hierarchical Prompt Learning (HPL), which leverages task-aware prompt modeling to jointly optimize both tasks. Specifically, we first introduce a Task-Routed Transformer, which incorporates dual classification tokens into a shared visual encoder to route features for I2I and T2I branches respectively. On top of this, we develop a hierarchical prompt generation scheme that integrates identity-level learnable tokens with instance-level pseudo-text tokens. These pseudo-tokens are derived from image or text features via modality-specific inversion networks, injecting fine-grained, instance-specific semantics into the prompts. Furthermore, we propose a Cross-Modal Prompt Regularization strategy to enforce semantic alignment in the prompt token space, ensuring that pseudo-prompts preserve source-modality characteristics while enhancing cross-modal transferability. Extensive experiments on multiple ReID benchmarks validate the effectiveness of our method, achieving state-of-the-art performance on both I2I and T2I tasks.

TMLR Journal 2026 Journal Article

MIRA: Multi-view Information Retrieval with Adaptive Routing for Test-time Long-video Comprehension

  • Zecheng Hao
  • Wayne Ma
  • Yufeng Cui
  • Shuang Li
  • Xinlong Wang
  • Tiejun Huang

Foundational Multi-modal Large Language Models (MLLMs) have achieved rapid progress in handling complex tasks across diverse modalities. However, they still struggle to deliver satisfactory performance on Long-video Comprehension (LVC) tasks involving thousands of frames. Existing optimization strategies can be broadly categorized into LVC-specific fine-tuning, built-in token compression and training-free keyframe extraction, with the latter being most suitable for flexible deployment across various MLLMs. Unfortunately, current training-free approaches predominantly focus on query-frame relevance retrieval, overlooking other levels of visual information and the inherent heterogeneity of LVC tasks. In this work, we propose the $\textbf{M}$ulti-view $\textbf{I}$nformation $\textbf{R}$etrieval with $\textbf{A}$daptive Routing ($\textbf{MIRA}$) framework, which evaluates video frames using distinct metrics for relevance and causality, combines these scores to select a balanced pool of keyframes, and employs an adaptive feedback loop to tailor the retrieval process to different user queries, enabling more precise and sample-grained video comprehension. Extensive experiments demonstrate the advanced performance of our scheme across multiple challenging LVC benchmarks. For instance, integrating $\textbf{MIRA}$ with Qwen-2.5-VL yields performance gains of 3.5% to 13.1% on LVB, VideoMME and MLVU.

AAAI Conference 2026 Conference Paper

PortraitSR: Artist-Inspired Prior Learning for Progressive Face Super-Resolution

  • Miaoqing Wang
  • Jiaxu Leng
  • Shuang Li
  • Changjiang Kuang
  • Long Sun

Face super-resolution (FSR) aims to reconstruct high-resolution (HR) face images from low-resolution (LR) inputs. While recent methods have advanced this task through architectural innovations and generative modeling, they often lead to semantically inconsistent structures and unrealistic textures, particularly under high magnification. To mitigate these limitations, we draw inspiration from the human artistic process of “structuring before detailing” and propose a progressive prior-guided restoration strategy. Specifically, we first introduce a Sketching Structure Prior (SSP) module that embeds global semantics and refines local geometry through implicit parsing guidance and explicit spatial modulation. Then, an Associative Texture Prior (ATP) module leverages a High-Quality Dictionary (HD) learned from high-quality reconstruction to guide fine-grained detail recovery. Finally, to unify structure and detail features, we design a Holistic Prior Fusion (HPF) module that adaptively integrates them within semantically consistent facial regions. Our method surpasses state-of-the-art approaches on CelebA and Helen in both structural fidelity and texture realism.

AAAI Conference 2026 Conference Paper

SPSC: Sparse and Scalable Multi-Modal 3D Occupancy Prediction for Autonomous Driving

  • Qingju Guo
  • Shuang Li
  • Binhui Xie
  • Jing Geng
  • Wei Li

3D semantic occupancy prediction offers a nuanced representation of the surrounding environment, which is crucial for ensuring the safety of autonomous driving. However, fine-grained scene representations inevitably result in cubic growth in data scale, which imposes substantial demands on model architecture and computational complexity, especially in high-resolution scenarios. Existing approaches for handling high-resolution scenes typically obtain fine-grained features by grid sampling on low-resolution feature maps, resulting in limited sparsity and insufficient feature interaction. This paper presents a framework leveraging SParse representation and SCalable feature interaction to address the aforementioned challenges, called SPSC. Specifically, we maintain sparsity by progressively pruning unoccupied queries during the coarse-to-fine process, thereby reducing the scale of data that the model needs to handle. Subsequently, we introduce query serialization, which transforms queries into an ordered sequence while preserving their spatial structure. This enables fine-grained feature interaction while maintaining linear computational complexity and a larger receptive field. Without complex architectural designs, SPSC significantly outperforms SOTA approaches, improving mIoU by a relative 12.0%, 11.0% and 4.8% on the nuScenes-Occupancy dataset under the multi-modal, LiDAR and camera settings, respectively.
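The pruning and serialization steps described above follow a common sparse-3D recipe: discard queries predicted unoccupied, then order the survivors along a space-filling curve so spatial neighbors stay adjacent in the 1D sequence. A minimal generic sketch of that pattern (not the authors' SPSC code; `serialize_occupied`, the z-order key, and the threshold are illustrative assumptions):

```python
import numpy as np

def _spread_bits(v: np.ndarray) -> np.ndarray:
    """Insert two zero bits between each bit of a 10-bit integer
    (standard 3D Morton-code bit spreading)."""
    v = v.astype(np.uint64) & 0x3FF
    v = (v | (v << 16)) & 0x30000FF
    v = (v | (v << 8)) & 0x300F00F
    v = (v | (v << 4)) & 0x30C30C3
    v = (v | (v << 2)) & 0x9249249
    return v

def serialize_occupied(coords: np.ndarray, occ_prob: np.ndarray,
                       thresh: float = 0.5) -> np.ndarray:
    """Drop queries predicted unoccupied, then order the survivors
    along a z-order (Morton) curve so spatial neighbors stay close
    in the resulting 1D sequence."""
    keep = occ_prob > thresh          # prune unoccupied queries
    xyz = coords[keep]
    key = (_spread_bits(xyz[:, 0])    # interleave x, y, z bits
           | (_spread_bits(xyz[:, 1]) << 1)
           | (_spread_bits(xyz[:, 2]) << 2))
    return xyz[np.argsort(key)]

# The middle query is pruned; the rest are z-order sorted.
ordered = serialize_occupied(np.array([[1, 1, 1], [0, 0, 0], [0, 0, 1]]),
                             np.array([0.9, 0.2, 0.9]))
```

The ordered sequence can then be fed to any linear-complexity sequence operator, which is what makes the interaction scalable.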

AAAI Conference 2026 Conference Paper

Thermal-Physics Guided Infrared Image Super-Resolution with Dynamic High-Frequency Amplification

  • Mingxuan Zhou
  • Yirui Shen
  • Shuang Li
  • Jing Geng
  • Yutang Zhang
  • Shuigen Wang

The practical deployment of infrared imaging is hindered by its inherent output of low-resolution (LR) images. While the super-resolution (SR) technique is a promising remedy, we discover two major challenges concerning infrared image SR: preserving accurate thermal distributions, which are fundamental to infrared imaging, and addressing the ambiguity of high-frequency elements compared to visible images. To tackle these issues, we propose ThesIS, a tailored framework that utilizes Thermal-Physics guidance and dynamic high-frequency amplification for Infrared image Super-resolution to produce high-resolution (HR) images with accurate physical properties and delicate visual details. Specifically, Thermal Regularization is introduced to reconstruct the accurate thermal radiation distribution via the introduced Infrared Radiation Intensity Alignment Loss, mitigating the adverse effects of complex degradations while conducting initial upscaling. Additionally, we design a guidance mechanism to counter the randomness of the diffusion model, further refining the preservation of physical information. The proposed Dynamic High-Frequency Amplification effectively strengthens the ambiguous high-frequency information present in infrared images, leading to improved texture details and superior visual quality. Extensive experiments demonstrate that ThesIS successfully recovers accurate thermal information while delivering visually satisfying results with state-of-the-art performance. Furthermore, we introduce the InfraredSR dataset, which comprises 39,833 images at a resolution of 512 × 512, hoping to advance research in this field.

AAAI Conference 2026 Conference Paper

VirtualEnv: A Platform for Embodied AI Research

  • Kabir Swain
  • Sijie Han
  • Ayush Raina
  • Jin Zhang
  • Shuang Li
  • Michael Stopa
  • Antonio Torralba

As large language models (LLMs) continue to improve in reasoning and decision-making, there is a growing need for realistic and interactive environments where their abilities can be rigorously evaluated. We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5 that enables fine-grained benchmarking of LLMs in embodied and interactive scenarios. VirtualEnv supports rich agent–environment interactions, including object manipulation, navigation, and adaptive multi-agent collaboration, as well as game-inspired mechanics like escape rooms and procedurally generated environments. We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents using natural language instructions. We integrate large-scale LLMs and vision-language models (VLMs), such as GPT-based models, to generate novel environments and structured tasks from multimodal inputs. Our experiments benchmark the performance of several popular LLMs across tasks of increasing complexity, analyzing differences in adaptability, planning, and multi-agent coordination. We also describe our methodology for procedural task generation, task validation, and real-time environment control. By releasing VirtualEnv as an open-source platform, we aim to advance research at the intersection of AI and gaming, enable standardized evaluation of LLMs in embodied AI settings, and pave the way for future developments in immersive simulations and interactive entertainment.

ICML Conference 2025 Conference Paper

Convergence of Mean-Field Langevin Stochastic Descent-Ascent for Distributional Minimax Optimization

  • Zhangyi Liu
  • Feng Liu
  • Rui Gao
  • Shuang Li

We study convergence properties of the discrete-time Mean-Field Langevin Stochastic Gradient Descent-Ascent (MFL-SGDA) algorithm for solving distributional minimax optimization. These problems arise in various applications, such as zero-sum games, generative adversarial networks and distributionally robust learning. Despite the significance of MFL-SGDA in these contexts, the discrete-time convergence rate remains underexplored. To address this gap, we establish a last-iterate convergence rate of $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ for MFL-SGDA. This rate is nearly optimal when compared to the complexity lower bound of its Euclidean counterpart. This rate also matches the complexity of mean-field Langevin stochastic gradient descent for distributional minimization and the outer-loop iteration complexity of an existing double-loop algorithm for distributional minimax problems. By leveraging an elementary analysis framework that avoids PDE-based techniques, we overcome previous limitations and achieve a faster convergence rate.

NeurIPS Conference 2025 Conference Paper

DeepHalo: A Neural Choice Model with Controllable Context Effects

  • Shuhan Zhang
  • Zhi Wang
  • Rui Gao
  • Shuang Li

Modeling human decision-making is central to applications such as recommendation, preference learning, and human-AI alignment. While many classic models assume context-independent choice behavior, a large body of behavioral research shows that preferences are often influenced by the composition of the choice set itself---a phenomenon known as the context effect or Halo effect. These effects can manifest as pairwise (first-order) or even higher-order interactions among the available alternatives. Recent models that attempt to capture such effects either focus on the featureless setting or, in the feature-based setting, rely on restrictive interaction structures or entangle interactions across all orders, which limits interpretability. In this work, we propose DeepHalo, a neural modeling framework that incorporates features while enabling explicit control over interaction order and principled interpretation of context effects. Our model enables systematic identification of interaction effects by order and serves as a universal approximator of context-dependent choice functions when specialized to a featureless setting. Experiments on synthetic and real-world datasets demonstrate strong predictive performance while providing greater transparency into the drivers of choice.

AAAI Conference 2024 Conference Paper

Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding

  • Jingping Liu
  • Mingchuan Zhang
  • Weichen Li
  • Chao Wang
  • Shuang Li
  • Haiyun Jiang
  • Sihang Jiang
  • Yanghua Xiao

Much effort has been devoted to building multi-modal knowledge graphs by visualizing entities on images, while ignoring the multi-modal information of the relations between entities. Hence, in this paper, we aim to construct a new large-scale multi-modal knowledge graph with triplet facts grounded on images that reflect not only entities but also their relations. To achieve this purpose, we propose a novel pipeline method, including triplet fact filtering, image retrieving, entity-based image filtering, relation-based image filtering, and image clustering. In this way, a multi-modal knowledge graph named ImgFact is constructed, which contains 247,732 triplet facts and 3,730,805 images. In experiments, the manual and automatic evaluations prove the reliable quality of our ImgFact. We further use the obtained images to enhance model performance on two tasks. In particular, the model optimized by our ImgFact achieves an impressive 8.38% and 9.87% improvement over the solutions enhanced by an existing multi-modal knowledge graph and VisualChatGPT on F1 of relation classification. We release ImgFact and its instructions at https://github.com/kleinercubs/ImgFact.

NeurIPS Conference 2024 Conference Paper

Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation

  • Mingjia Li
  • Shuang Li
  • Tongrui Su
  • Longhui Yuan
  • Jian Liang
  • Wei Li

Capitalizing on the complementary advantages of generative and discriminative models has always been a compelling vision in machine learning, backed by a growing body of research. This work discloses the hidden semantic structure within score-based generative models, unveiling their potential as effective discriminative priors. Inspired by our theoretical findings, we propose DUSA to exploit the structured semantic priors underlying diffusion score to facilitate the test-time adaptation of image classifiers or dense predictors. Notably, DUSA extracts knowledge from a single timestep of denoising diffusion, lifting the curse of Monte Carlo-based likelihood estimation over timesteps. We demonstrate the efficacy of our DUSA in adapting a wide variety of competitive pre-trained discriminative models on diverse test-time scenarios. Additionally, a thorough ablation study is conducted to dissect the pivotal elements in DUSA. Code is publicly available at https://github.com/BIT-DA/DUSA.

TMLR Journal 2024 Journal Article

From Persona to Personalization: A Survey on Role-Playing Language Agents

  • Jiangjie Chen
  • Xintao Wang
  • Rui Xu
  • Siyu Yuan
  • Yikai Zhang
  • Wei Shi
  • Jian Xie
  • Shuang Li

Recent advancements in large language models (LLMs) have significantly boosted the rise of Role-Playing Language Agents (RPLAs), i.e., specialized AI systems designed to simulate assigned personas. By harnessing multiple advanced abilities of LLMs, including in-context learning, instruction following, and social intelligence, RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance. RPLAs can mimic a wide range of personas, ranging from historical figures and fictional characters to real-life individuals. Consequently, they have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots, and digital clones. In this paper, we conduct a comprehensive survey of this field, illustrating the evolution and recent progress in RPLAs integrating with cutting-edge LLM technologies. We categorize personas into three types: 1) Demographic Persona, which leverages statistical stereotypes; 2) Character Persona, focused on well-established figures; and 3) Individualized Persona, customized through ongoing user interactions for personalized services. We begin by presenting a comprehensive overview of current methodologies for RPLAs, followed by the details for each persona type, covering corresponding data sourcing, agent construction, and evaluation. Afterward, we discuss the fundamental risks, existing limitations, and prospects of RPLAs. Additionally, we provide a brief review of RPLAs in AI products in the market, which reflects practical user demands that shape and drive RPLA research. Through this survey, we aim to establish a clear taxonomy of RPLA research and applications, facilitate future research in this critical and ever-evolving field, and pave the way for a future where humans and RPLAs coexist in harmony.

NeurIPS Conference 2024 Conference Paper

HyperLogic: Enhancing Diversity and Accuracy in Rule Learning with HyperNets

  • Yang Yang
  • Wendi Ren
  • Shuang Li

Exploring the integration of if-then logic rules within neural network architectures is an intriguing area. This integration seamlessly transforms the rule learning task into neural network training using backpropagation and stochastic gradient descent. From a well-trained sparse and shallow neural network, one can interpret each layer and neuron through the language of logic rules, and a global explanatory rule set can be directly extracted. However, ensuring interpretability may impose constraints on the flexibility, depth, and width of neural networks. In this paper, we propose HyperLogic: a novel framework leveraging hypernetworks to generate weights of the main network. HyperLogic can unveil multiple diverse rule sets, each capable of capturing heterogeneous patterns in data. This provides a simple yet effective method to increase model flexibility and preserve interpretability. We theoretically analyze the benefits of HyperLogic by examining the approximation error and generalization capabilities under two types of regularization terms: sparsity and diversity regularizations. Experiments on real data demonstrate that our method can learn more diverse, accurate, and concise rules.

AAAI Conference 2024 Conference Paper

Translate Meanings, Not Just Words: IdiomKB’s Role in Optimizing Idiomatic Translation with Language Models

  • Shuang Li
  • Jiangjie Chen
  • Siyu Yuan
  • Xinyi Wu
  • Hao Yang
  • Shimin Tao
  • Yanghua Xiao

To translate well, machine translation (MT) systems and general-purpose language models (LMs) need a deep understanding of both source and target languages and cultures. Therefore, idioms, with their non-compositional nature, pose particular challenges for Transformer-based systems, as literal translations often miss the intended meaning. Traditional methods, which replace idioms using existing knowledge bases (KBs), often lack scale and context-awareness. Addressing these challenges, our approach prioritizes context-awareness and scalability, allowing for offline storage of idioms in a manageable KB size. This ensures efficient serving with smaller models and provides a more comprehensive understanding of idiomatic expressions. We introduce a multilingual idiom KB (IdiomKB) developed using large LMs to address this. This KB facilitates better translation by smaller models, such as BLOOMZ (7.1B), Alpaca (7B), and InstructGPT (6.7B), by retrieving idioms' figurative meanings. We present a novel, GPT-4-powered metric for human-aligned evaluation, demonstrating that IdiomKB considerably boosts model performance. Human evaluations further validate our KB's quality.

NeurIPS Conference 2024 Conference Paper

Weight Diffusion for Future: Learn to Generalize in Non-Stationary Environments

  • Mixue Xie
  • Shuang Li
  • Binhui Xie
  • Chi H. Liu
  • Jian Liang
  • Zixun Sun
  • Ke Feng
  • Chengwei Zhu

Enabling deep models to generalize in non-stationary environments is vital for real-world machine learning, as data distributions are often found to continually change. Recently, evolving domain generalization (EDG) has emerged to tackle domain generalization in a time-varying system, where the domain gradually evolves over time in an underlying continuous structure. Nevertheless, it typically assumes that multiple source domains are simultaneously available. It remains an open problem to address EDG in the domain-incremental setting, where source domains are non-static and arrive sequentially to mimic the evolution of training domains. To this end, we propose Weight Diffusion (W-Diff), a novel framework that utilizes the conditional diffusion model in the parameter space to learn the evolving pattern of classifiers during the domain-incremental training process. Specifically, the diffusion model is conditioned on the classifier weights of a historical domain (regarded as a reference point) and the prototypes of the current domain, to learn the evolution from the reference point to the classifier weights of the current domain (regarded as the anchor point). In addition, a domain-shared feature encoder is learned by enforcing prediction consistency among multiple classifiers, so as to mitigate overfitting and ensure that the evolving pattern is reflected in the classifier as much as possible. During inference, we ensemble a large number of target-domain-customized classifiers, cheaply obtained via the conditional diffusion model, for robust prediction. Comprehensive experiments on both synthetic and real-world datasets show the superior generalization performance of W-Diff on unseen domains in the future.

NeurIPS Conference 2023 Conference Paper

Annotator: A Generic Active Learning Baseline for LiDAR Semantic Segmentation

  • Binhui Xie
  • Shuang Li
  • Qingju Guo
  • Chi Liu
  • Xinjing Cheng

Active learning, a label-efficient paradigm, empowers models to interactively query an oracle for labeling new data. In the realm of LiDAR semantic segmentation, the challenges stem from the sheer volume of point clouds, rendering annotation labor-intensive and cost-prohibitive. This paper presents Annotator, a general and efficient active learning baseline, in which a voxel-centric online selection strategy is tailored to efficiently probe and annotate the salient and exemplar voxel grids within each LiDAR scan, even under distribution shift. Concretely, we first execute an in-depth analysis of several common selection strategies such as Random, Entropy, Margin, and then develop voxel confusion degree (VCD) to exploit the local topology relations and structures of point clouds. Annotator excels in diverse settings, with a particular focus on active learning (AL), active source-free domain adaptation (ASFDA), and active domain adaptation (ADA). It consistently delivers exceptional performance across LiDAR semantic segmentation benchmarks, spanning both simulation-to-real and real-to-real scenarios. Surprisingly, Annotator exhibits remarkable efficiency, requiring significantly fewer annotations, e.g., just labeling five voxels per scan in the SynLiDAR → SemanticKITTI task. This results in impressive performance, achieving 87.8% fully-supervised performance under AL, 88.5% under ASFDA, and 94.4% under ADA. We envision that Annotator will offer a simple, general, and efficient solution for label-efficient 3D applications.
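The Entropy strategy analyzed in this abstract is a standard active-learning baseline: rank candidates by predictive entropy and spend the annotation budget on the most uncertain ones. A minimal generic sketch of that baseline (not the Annotator implementation; `entropy_select` and its arguments are illustrative):

```python
import numpy as np

def entropy_select(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` most uncertain candidates by predictive entropy.

    probs: (n_items, n_classes) softmax outputs, e.g. per-voxel predictions.
    Returns indices of the selected items, most uncertain first.
    """
    eps = 1e-12  # guard against log(0)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)  # (n_items,)
    return np.argsort(-entropy)[:budget]

# A near-uniform prediction is maximally uncertain; a peaked one is not.
probs = np.array([[0.5, 0.5], [0.99, 0.01], [0.6, 0.4]])
picked = entropy_select(probs, budget=2)
```

The paper's VCD criterion goes further by exploiting local point-cloud topology, but this entropy ranking is the reference point it is compared against.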

NeurIPS Conference 2023 Conference Paper

Compositional Foundation Models for Hierarchical Planning

  • Anurag Ajay
  • Seungwook Han
  • Yilun Du
  • Shuang Li
  • Abhi Gupta
  • Tommi Jaakkola
  • Josh Tenenbaum
  • Leslie Kaelbling

To make effective decisions in novel environments with long-horizon goals, it is crucial to engage in hierarchical reasoning across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model that jointly leverages multiple expert foundation models, each trained individually on language, vision, or action data, to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. Generated video plans are then grounded to visual-motor control, through an inverse dynamics model that infers actions from generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We illustrate the efficacy and adaptability of our approach in three different long-horizon table-top manipulation tasks.

NeurIPS Conference 2023 Conference Paper

Discovering Intrinsic Spatial-Temporal Logic Rules to Explain Human Actions

  • Chengzhi Cao
  • Chao Yang
  • Ruimao Zhang
  • Shuang Li

We propose an interpretable model to uncover the behavioral patterns of human movements by analyzing their trajectories. Our approach is based on the belief that human actions are driven by intentions and are influenced by environmental factors such as spatial relationships with surrounding objects. To model this, we use a set of spatial-temporal logic rules that include intention variables as principles. These rules are automatically discovered and used to capture the dynamics of human actions. To learn the model parameters and rule content, we design an EM learning algorithm that treats the unknown rule content as a latent variable. In the E-step, we evaluate the posterior over the latent rule content, and in the M-step, we optimize the rule generator and model parameters by maximizing the expected log-likelihood. Our model has wide-ranging applications in areas such as sports analytics, robotics, and autonomous cars. We demonstrate the model's superior interpretability and prediction performance on both pedestrian and NBA basketball player datasets, achieving promising results.
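The E-step/M-step procedure described here follows the generic EM recipe: evaluate the posterior over the latent variable, then re-estimate parameters by maximizing the expected log-likelihood. As a sketch of that pattern only (a toy two-component Gaussian mixture, not the paper's spatial-temporal rule model; names are illustrative):

```python
import math

def em_two_gaussians(xs, mu0=-1.0, mu1=1.0, iters=50):
    """Generic EM for a balanced two-component, unit-variance Gaussian
    mixture: the E-step computes the posterior over the latent component,
    the M-step re-estimates the means from those responsibilities."""
    for _ in range(iters):
        # E-step: posterior responsibility of component 1 for each x.
        resp = []
        for x in xs:
            p0 = math.exp(-0.5 * (x - mu0) ** 2)
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            resp.append(p1 / (p0 + p1))
        # M-step: responsibility-weighted means maximize the expected
        # complete-data log-likelihood.
        w1 = sum(resp)
        w0 = len(xs) - w1
        mu0 = sum((1 - r) * x for r, x in zip(resp, xs)) / w0
        mu1 = sum(r * x for r, x in zip(resp, xs)) / w1
    return mu0, mu1

# Data clustered near -2 and +2; EM recovers the two means.
mu0, mu1 = em_two_gaussians([-2.1, -1.9, -2.0, 1.9, 2.1, 2.0])
```

In the paper the latent variable is the unknown rule content rather than a mixture component, but the alternation between posterior evaluation and expected-log-likelihood maximization is the same.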

NeurIPS Conference 2023 Conference Paper

Evolving Standardization for Continual Domain Generalization over Temporal Drift

  • Mixue Xie
  • Shuang Li
  • Longhui Yuan
  • Chi Liu
  • Zehui Dai

The capability of generalizing to out-of-distribution data is crucial for the deployment of machine learning models in the real world. Existing domain generalization (DG) mainly embarks on offline and discrete scenarios, where multiple source domains are simultaneously accessible and the distribution shift among domains is abrupt and violent. Nevertheless, such a setting may not be universally applicable to all real-world applications, as there are cases where the data distribution gradually changes over time due to various factors, e.g., the process of aging. Additionally, as the domain constantly evolves, new domains will continually emerge. Re-training and updating models with both new and previous domains using existing DG methods can be resource-intensive and inefficient. Therefore, in this paper, we present a problem formulation for Continual Domain Generalization over Temporal Drift (CDGTD). CDGTD addresses the challenge of gradually shifting data distributions over time, where domains arrive sequentially and models can only access the data of the current domain. The goal is to generalize to unseen domains that are not too far into the future. To this end, we propose an Evolving Standardization (EvoS) method, which characterizes the evolving pattern of feature distribution and mitigates the distribution shift by standardizing features with generated statistics of the corresponding domain. Specifically, inspired by the powerful ability of transformers to model sequence relations, we design a multi-scale attention module (MSAM) to learn the evolving pattern under sliding time windows of different lengths. MSAM can generate statistics of the current domain based on the statistics of previous domains and the learned evolving pattern. Experiments on multiple real-world datasets including images and texts validate the efficacy of our EvoS.

NeurIPS Conference 2023 Conference Paper

FIND: A Function Description Benchmark for Evaluating Interpretability Methods

  • Sarah Schwettmann
  • Tamar Shaham
  • Joanna Materzynska
  • Neil Chowdhury
  • Shuang Li
  • Jacob Andreas
  • David Bau
  • Antonio Torralba

Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate methods that use pretrained language models (LMs) to produce code-based and natural language descriptions of function behavior. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built with an off-the-shelf LM augmented with black-box access to functions, can sometimes infer function structure—acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, FIND also reveals that LM-based descriptions capture global function behavior while missing local details. These results suggest that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.

NeurIPS Conference 2023 Conference Paper

Language Semantic Graph Guided Data-Efficient Learning

  • Wenxuan Ma
  • Shuang Li
  • Lincan Cai
  • Jingxuan Kang

Developing generalizable models that can effectively learn from limited data and with minimal reliance on human supervision is a significant objective within the machine learning community, particularly in the era of deep neural networks. To achieve data-efficient learning, researchers typically explore approaches that can leverage more related or unlabeled data without necessitating additional manual labeling efforts, such as Semi-Supervised Learning (SSL), Transfer Learning (TL), and Data Augmentation (DA). SSL leverages unlabeled data in the training process, while TL enables the transfer of expertise from related data distributions. DA broadens the dataset by synthesizing new data from existing examples. However, the significance of additional knowledge contained within labels has been largely overlooked in research. In this paper, we propose a novel perspective on data efficiency that involves exploiting the semantic information contained in the labels of the available data. Specifically, we introduce a Language Semantic Graph (LSG), which is constructed from labels manifested as natural language descriptions. Upon this graph, an auxiliary graph neural network is trained to extract high-level semantic relations and is then used to guide the training of the primary model, enabling fuller utilization of label knowledge. Across image, video, and audio modalities, we apply the LSG method in both TL and SSL scenarios and illustrate its versatility in significantly enhancing performance compared to other data-efficient learning approaches. Additionally, our in-depth analysis shows that the LSG method also expedites the training process.

AAAI Conference 2023 Conference Paper

VBLC: Visibility Boosting and Logit-Constraint Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions

  • Mingjia Li
  • Binhui Xie
  • Shuang Li
  • Chi Harold Liu
  • Xinjing Cheng

Generalizing models trained on normal visual conditions to target domains under adverse conditions is demanding in practical systems. One prevalent solution is to bridge the domain gap between clear- and adverse-condition images to make satisfactory predictions on the target. However, previous methods often rely on additional reference images of the same scenes taken under normal conditions, which are quite difficult to collect in reality. Furthermore, most of them mainly focus on an individual adverse condition, such as nighttime or fog, weakening the model's versatility when encountering other adverse weather. To overcome the above limitations, we propose a novel framework, Visibility Boosting and Logit-Constraint learning (VBLC), tailored for superior normal-to-adverse adaptation. VBLC explores the potential of dispensing with reference images and resolving a mixture of adverse conditions simultaneously. In detail, we first propose the visibility boost module to dynamically improve target images via certain image-level priors. Then, we identify the overconfidence drawback of the conventional cross-entropy loss in self-training methods and devise logit-constraint learning, which enforces a constraint on logit outputs during training to mitigate this issue. To the best of our knowledge, this is a new perspective on tackling such a challenging task. Extensive experiments on two normal-to-adverse domain adaptation benchmarks, i.e., Cityscapes to ACDC and Cityscapes to FoggyCityscapes + RainCityscapes, verify the effectiveness of VBLC, where it establishes a new state of the art. Code is available at https://github.com/BIT-DA/VBLC.

NeurIPS Conference 2023 Conference Paper

“Why Not Looking backward?” A Robust Two-Step Method to Automatically Terminate Bayesian Optimization

  • Shuang Li
  • Ke Li
  • Wei Li

Bayesian Optimization (BO) is a powerful method for tackling expensive black-box optimization problems. As a sequential model-based optimization strategy, BO iteratively explores promising solutions until a predetermined budget, either iterations or time, is exhausted. The decision on when to terminate BO significantly influences both the quality of solutions and its computational efficiency. In this paper, we propose a simple, yet theoretically grounded, two-step method for automatically terminating BO. Our core idea is to proactively identify whether the search is within a convex region by examining previously observed samples. BO is halted once the local regret within this convex region falls below a predetermined threshold. To enhance numerical stability, we propose an approximation method for calculating the termination indicator by solving a bilevel optimization problem. We conduct extensive empirical studies on diverse benchmark problems, including synthetic functions, reinforcement learning, and hyperparameter optimization. Experimental results demonstrate that our proposed method saves up to approximately 80% of the computational budget while incurring an order-of-magnitude smaller performance degradation compared with other peer methods. In addition, our proposed termination method is robust to the setting of its termination criterion.

AAAI Conference 2022 Conference Paper

Active Learning for Domain Adaptation: An Energy-Based Approach

  • Binhui Xie
  • Longhui Yuan
  • Shuang Li
  • Chi Harold Liu
  • Xinjing Cheng
  • Guoren Wang

Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped to reach fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, dubbed active domain adaptation. We start from the observation that energy-based models exhibit free-energy biases when training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically show that a simple yet efficient energy-based sampling strategy selects the most valuable target samples more effectively than existing approaches, which require particular architectures or the computation of distances. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of target data that incorporate both domain characteristics and instance uncertainty into every selection round. Meanwhile, by aligning the free energy of target data compactly around the source domain via a regularization term, the domain gap can be implicitly diminished. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at https://github.com/BIT-DA/EADA.
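The free-energy score at the heart of this kind of sampling strategy can be illustrated with a minimal sketch. For a classifier with logits f_k(x), the free energy F(x) = -T log Σ_k exp(f_k(x)/T) tends to be lower for source-like samples, so target samples with high free energy are natural annotation candidates. The function names and the plain argsort-based selection below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def free_energy(logits, temperature=1.0):
    """Free energy from classifier logits:
    F(x) = -T * log sum_k exp(f_k(x) / T).
    Lower free energy ~ more source-like sample."""
    z = logits / temperature
    m = z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    return -temperature * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def select_queries(target_logits, budget):
    """Pick the `budget` target samples with the highest free energy
    (i.e., least source-like) as annotation candidates."""
    scores = free_energy(target_logits)
    return np.argsort(-scores)[:budget]
```

A confidently classified sample (one large logit) receives a much lower free energy than an ambiguous one, so the ambiguous, off-distribution samples are queried first.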

AAAI Conference 2022 Conference Paper

Local and Global Convergence of General Burer-Monteiro Tensor Optimizations

  • Shuang Li
  • Qiuwei Li

Tensor optimization is crucial to massive machine learning and signal processing tasks. In this paper, we consider tensor optimization with a convex and well-conditioned objective function and reformulate it into a nonconvex optimization using the Burer-Monteiro type parameterization. We analyze the local convergence of applying vanilla gradient descent to the factored formulation and establish a local regularity condition under mild assumptions. We also provide a linear convergence analysis of the gradient descent algorithm started in a neighborhood of the true tensor factors. Complementary to the local analysis, this work also characterizes the global geometry of the best rank-one tensor approximation problem and demonstrates that for orthogonally decomposable tensors the problem has no spurious local minima and all saddle points are strict except for the one at zero which is a third-order saddle point.

NeurIPS Conference 2022 Conference Paper

Pre-Trained Language Models for Interactive Decision-Making

  • Shuang Li
  • Xavier Puig
  • Chris Paxton
  • Yilun Du
  • Clinton Wang
  • Linxi Fan
  • Tao Chen
  • De-An Huang

Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. Next, we integrate an active data gathering procedure in which agents iteratively interact with the environment, relabel past "failed" experiences with new goals, and update their policies in a self-supervised loop. Active data gathering further improves combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and LM-based weight initialization are both important for generalization. Surprisingly, however, the format of the policy inputs encoding (e.g., as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.

AAAI Conference 2021 Conference Paper

Bi-Classifier Determinacy Maximization for Unsupervised Domain Adaptation

  • Shuang Li
  • Fangrui Lv
  • Binhui Xie
  • Chi Harold Liu
  • Jian Liang
  • Chen Qin

Unsupervised domain adaptation tackles the problem of transferring knowledge from a well-labelled source domain to an unlabelled target domain. Recently, adversarial learning with a bi-classifier has proven effective in pushing cross-domain distributions close. Prior approaches typically leverage the disagreement between the two classifiers to learn transferable representations; however, they often neglect classifier determinacy in the target domain, which can result in a lack of feature discriminability. In this paper, we present a simple yet effective method, namely Bi-Classifier Determinacy Maximization (BCDM), to tackle this problem. Motivated by the observation that target samples cannot always be separated distinctly by the decision boundary, in the proposed BCDM we design a novel classifier determinacy disparity (CDD) metric, which formulates classifier discrepancy as the class relevance of distinct target predictions and implicitly introduces a constraint on target feature discriminability. To this end, BCDM can generate discriminative representations by encouraging target predictive outputs to be consistent and determined while preserving the diversity of predictions in an adversarial manner. Furthermore, the properties of CDD as well as the theoretical guarantees of BCDM's generalization bound are elaborated. Extensive experiments show that BCDM compares favorably against existing state-of-the-art domain adaptation methods.

NeurIPS Conference 2021 Conference Paper

Learning to Compose Visual Relations

  • Nan Liu
  • Shuang Li
  • Yilun Du
  • Josh Tenenbaum
  • Antonio Torralba

The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure.

AAAI Conference 2021 Conference Paper

One-shot Face Reenactment Using Appearance Adaptive Normalization

  • Guangming Yao
  • Yi Yuan
  • Tianjia Shao
  • Shuang Li
  • Shanqi Liu
  • Yong Liu
  • Mengmeng Wang
  • Kun Zhou

The paper proposes a novel generative adversarial network for one-shot face reenactment, which can animate a single face image to a different pose-and-expression (provided by a driving image) while keeping its original appearance. The core of our network is a novel mechanism called appearance adaptive normalization, which can effectively integrate the appearance information from the input image into our face generator by modulating the feature maps of the generator using the learned adaptive parameters. Furthermore, we specially design a local net to reenact the local facial components (i.e., eyes, nose, and mouth) first, which is a much easier task for the network to learn and can in turn provide explicit anchors to guide our face generator to learn the global appearance and pose-and-expression. Extensive quantitative and qualitative experiments demonstrate the significant efficacy of our model compared with prior one-shot methods.

NeurIPS Conference 2021 Conference Paper

Pareto Domain Adaptation

  • Fangrui Lv
  • Jian Liang
  • Kaixiong Gong
  • Shuang Li
  • Chi Harold Liu
  • Han Li
  • Di Liu
  • Guoren Wang

Domain adaptation (DA) attempts to transfer knowledge from a labeled source domain to an unlabeled target domain that follows a different distribution from the source. To achieve this, DA methods include a source classification objective to extract the source knowledge and a domain alignment objective to diminish the domain shift, ensuring knowledge transfer. Typically, previous DA methods adopt weight hyper-parameters to linearly combine the training objectives into an overall objective. However, the gradient directions of these objectives may conflict with each other due to domain shift. Under such circumstances, the linear optimization scheme might decrease the overall objective value at the expense of damaging one of the training objectives, leading to restricted solutions. In this paper, we rethink the optimization scheme for DA from a gradient-based perspective. We propose a Pareto Domain Adaptation (ParetoDA) approach to control the overall optimization direction, aiming to cooperatively optimize all training objectives. Specifically, to reach a desirable solution on the target domain, we design a surrogate loss mimicking target classification. To improve target-prediction accuracy to support the mimicking, we propose a target-prediction refining mechanism which exploits domain labels via Bayes' theorem. On the other hand, since prior knowledge of weighting schemes for objectives is often unavailable to guide optimization toward the optimal solution on the target domain, we propose a dynamic preference mechanism to dynamically guide our cooperative optimization via the gradient of the surrogate loss on a held-out unlabeled target dataset. Our theoretical analyses show that the held-out data can guide but will not be over-fitted by the optimization. Extensive experiments on image classification and semantic segmentation benchmarks demonstrate the effectiveness of ParetoDA.

NeurIPS Conference 2021 Conference Paper

Unsupervised Learning of Compositional Energy Concepts

  • Yilun Du
  • Shuang Li
  • Yash Sharma
  • Josh Tenenbaum
  • Igor Mordatch

Humans are able to rapidly understand scenes by utilizing concepts extracted from prior experience. Such concepts are diverse, and include global scene descriptors, such as the weather or lighting, as well as local scene descriptors, such as the color or size of a particular object. So far, unsupervised discovery of concepts has focused on either modeling the global scene-level or the local object-level factors of variation, but not both. In this work, we propose COMET, which discovers and represents concepts as separate energy functions, enabling us to represent both global concepts as well as objects under a unified framework. COMET discovers energy functions through recomposing the input image, which we find captures independent factors without additional supervision. Sample generation in COMET is formulated as an optimization process on underlying energy functions, enabling us to generate images with permuted and composed concepts. Finally, discovered visual concepts in COMET generalize well, enabling us to compose concepts between separate modalities of images as well as with other concepts discovered by a separate instance of COMET trained on a different dataset. Code and data available at https://energy-based-model.github.io/comet/.

NeurIPS Conference 2020 Conference Paper

Compositional Visual Generation with Energy Based Models

  • Yilun Du
  • Shuang Li
  • Igor Mordatch

A vital aspect of human intelligence is the ability to compose increasingly complex concepts out of simpler ideas, enabling both rapid learning and adaptation of knowledge. In this paper we show that energy-based models can exhibit this ability by directly combining probability distributions. Samples from the combined distribution correspond to compositions of concepts. For example, given a distribution for smiling faces, and another for male faces, we can combine them to generate smiling male faces. This allows us to generate natural images that simultaneously satisfy conjunctions, disjunctions, and negations of concepts. We evaluate compositional generation abilities of our model on the CelebA dataset of natural faces and synthetic 3D scene images. We also demonstrate other unique advantages of our model, such as the ability to continually learn and incorporate new concepts, or infer compositions of concept properties underlying an image.
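The core idea of combining distributions can be sketched in one dimension: summing the energies of two concepts corresponds to taking the product of their densities (a conjunction), and Langevin dynamics on the summed energy draws samples that satisfy both concepts at once. The toy Gaussian energies and parameter values below are illustrative only, not the paper's trained models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy concept energies, each a 1-D Gaussian potential.
def energy_a(x):  # concept A, centered at -1
    return 0.5 * (x + 1.0) ** 2

def energy_b(x):  # concept B, centered at +1
    return 0.5 * (x - 1.0) ** 2

def grad(e, x, eps=1e-4):
    """Central finite-difference gradient of a scalar energy."""
    return (e(x + eps) - e(x - eps)) / (2 * eps)

def langevin(energies, steps=2000, lr=0.05):
    """Sample from the PRODUCT of the concept distributions (conjunction)
    by running Langevin dynamics on the SUM of their energies."""
    x = rng.normal()
    for _ in range(steps):
        g = sum(grad(e, x) for e in energies)
        x = x - lr * g + np.sqrt(2 * lr) * rng.normal()
    return x
```

Samples from `langevin([energy_a, energy_b])` concentrate near 0, the overlap of the two concepts; the product of the two unit Gaussians is itself a Gaussian centered at 0.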

AAAI Conference 2020 Conference Paper

Domain Conditioned Adaptation Network

  • Shuang Li
  • Chi Liu
  • Qiuxia Lin
  • Binhui Xie
  • Zhengming Ding
  • Gao Huang
  • Jian Tang

Tremendous research efforts have been made to advance deep domain adaptation (DA) by seeking domain-invariant features. Most existing deep DA models focus only on aligning feature representations of task-specific layers across domains while sharing a single convolutional architecture between source and target. However, we argue that such strongly shared convolutional layers can be harmful for domain-specific feature learning when the source and target data distributions differ to a large extent. In this paper, we relax the shared-convnets assumption made by previous DA methods and propose a Domain Conditioned Adaptation Network (DCAN), which aims to excite distinct convolutional channels with a domain conditioned channel attention mechanism. As a result, critical low-level domain-dependent knowledge can be explored appropriately. As far as we know, this is the first work to explore domain-wise convolutional channel activation for deep DA networks. Moreover, to effectively align high-level feature distributions across the two domains, we further deploy domain conditioned feature correction blocks after task-specific layers, which explicitly correct the domain discrepancy. Extensive experiments on three cross-domain benchmarks demonstrate that the proposed approach outperforms existing methods by a large margin, especially on very tough cross-domain learning tasks.
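A domain conditioned channel attention mechanism can be pictured as a squeeze-and-excitation-style gate with separate parameters per domain: pool the feature map over its spatial dimensions, pass the pooled vector through a domain-specific branch, and re-scale each channel by the resulting gate. The NumPy sketch below is a minimal illustration under that assumption, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def domain_conditioned_attention(feat, weights, domain):
    """Reweight convolutional channels with a per-domain gating branch.
    feat: (C, H, W) feature map; weights[domain]: (C, C) gating matrix."""
    squeeze = feat.mean(axis=(1, 2))            # global average pool -> (C,)
    gate = sigmoid(weights[domain] @ squeeze)   # domain-specific excitation in (0, 1)
    return feat * gate[:, None, None]           # channel-wise re-scaling

# Hypothetical dimensions and random parameters, for illustration only.
C, H, W = 8, 4, 4
feat = rng.normal(size=(C, H, W))
weights = {"source": rng.normal(size=(C, C)), "target": rng.normal(size=(C, C))}
out = domain_conditioned_attention(feat, weights, "source")
```

Because each domain indexes its own gating matrix, the same backbone features can be re-weighted differently for source and target inputs.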

NeurIPS Conference 2019 Conference Paper

The Landscape of Non-convex Empirical Risk with Degenerate Population Risk

  • Shuang Li
  • Gongguo Tang
  • Michael Wakin

The landscape of empirical risk has been widely studied in a series of machine learning problems, including low-rank matrix factorization, matrix sensing, matrix completion, and phase retrieval. In this work, we focus on the situation where the corresponding population risk is a degenerate non-convex loss function, namely, the Hessian of the population risk can have zero eigenvalues. Instead of analyzing the non-convex empirical risk directly, we first study the landscape of the corresponding population risk, which is usually easier to characterize, and then build a connection between the landscape of the empirical risk and its population risk. In particular, we establish a correspondence between the critical points of the empirical risk and its population risk without the strongly Morse assumption, which is required in existing literature but not satisfied in degenerate scenarios. We also apply the theory to matrix sensing and phase retrieval to demonstrate how to infer the landscape of empirical risk from that of the corresponding population risk.

NeurIPS Conference 2018 Conference Paper

Learning Temporal Point Processes via Reinforcement Learning

  • Shuang Li
  • Shuai Xiao
  • Shixiang Zhu
  • Nan Du
  • Yao Xie
  • Le Song

Social goods, such as healthcare, smart cities, and information networks, often produce ordered event data in continuous time. The generative processes of these event data can be very complex, requiring flexible models to capture their dynamics. Temporal point processes offer an elegant framework for modeling event data without discretizing the time. However, the existing maximum-likelihood-estimation (MLE) learning paradigm requires hand-crafting the intensity function beforehand and cannot directly monitor the goodness-of-fit of the estimated model in the process of training. To alleviate the risk of model-misspecification in MLE, we propose to generate samples from the generative model and monitor the quality of the samples in the process of training until the samples and the real data are indistinguishable. We take inspiration from reinforcement learning (RL) and treat the generation of each event as the action taken by a stochastic policy. We parameterize the policy as a flexible recurrent neural network and gradually improve the policy to mimic the observed event distribution. Since the reward function is unknown in this setting, we uncover an analytic and nonparametric form of the reward function using an inverse reinforcement learning formulation. This new RL framework allows us to derive an efficient policy gradient algorithm for learning flexible point process models, and we show that it performs well in both synthetic and real data.
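As background for the point-process machinery discussed here, an event sequence with a history-dependent intensity can be simulated with Ogata's thinning algorithm: propose candidate times from an upper bound on the intensity and accept each with probability proportional to the true intensity. The sketch below uses a self-exciting (Hawkes) intensity as a stand-in for a learned model; the parameter values and helper names are our own, for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    """Self-exciting intensity: each past event adds a decaying bump."""
    return mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in history if ti < t)

def simulate_thinning(t_max, mu=0.5, alpha=0.8, beta=1.0):
    """Ogata's thinning: propose from an upper bound, accept by ratio.
    Between events the intensity only decays, so the intensity just after
    the current time (plus the jump alpha of a possible new event) bounds it."""
    t, events = 0.0, []
    while t < t_max:
        lam_bar = hawkes_intensity(t, events, mu, alpha, beta) + alpha
        t += rng.exponential(1.0 / lam_bar)
        if t < t_max and rng.random() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)
    return events
```

With alpha/beta < 1 the process is stable, and the simulated trace shows the characteristic clustering of self-exciting event data.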

JMLR Journal 2017 Journal Article

COEVOLVE: A Joint Point Process Model for Information Diffusion and Network Evolution

  • Mehrdad Farajtabar
  • Yichen Wang
  • Manuel Gomez-Rodriguez
  • Shuang Li
  • Hongyuan Zha
  • Le Song

Information diffusion in online social networks is affected by the underlying network topology, but it also has the power to change it. Online users are constantly creating new links when exposed to new information sources, and in turn these links are altering the way information spreads. However, these two highly intertwined stochastic processes, information diffusion and network evolution, have been predominantly studied separately, ignoring their co-evolutionary dynamics. We propose a temporal point process model, Coevolve, for such joint dynamics, allowing the intensity of one process to be modulated by that of the other. This model allows us to efficiently simulate interleaved diffusion and network events, and generate traces obeying common diffusion and network patterns observed in real-world networks such as Twitter. Furthermore, we also develop a convex optimization framework to learn the parameters of the model from historical diffusion and network evolution traces. We experimented with both synthetic data and data gathered from Twitter, and show that our model provides a good fit to the data as well as more accurate predictions than alternatives.

NeurIPS Conference 2015 Conference Paper

COEVOLVE: A Joint Point Process Model for Information Diffusion and Network Co-evolution

  • Mehrdad Farajtabar
  • Yichen Wang
  • Manuel Gomez Rodriguez
  • Shuang Li
  • Hongyuan Zha
  • Le Song

Information diffusion in online social networks is affected by the underlying network topology, but it also has the power to change it. Online users are constantly creating new links when exposed to new information sources, and in turn these links are altering the way information spreads. However, these two highly intertwined stochastic processes, information diffusion and network evolution, have been predominantly studied separately, ignoring their co-evolutionary dynamics. We propose a temporal point process model, COEVOLVE, for such joint dynamics, allowing the intensity of one process to be modulated by that of the other. This model allows us to efficiently simulate interleaved diffusion and network events, and generate traces obeying common diffusion and network patterns observed in real-world networks. Furthermore, we also develop a convex optimization framework to learn the parameters of the model from historical diffusion and network evolution traces. We experimented with both synthetic data and data gathered from Twitter, and show that our model provides a good fit to the data as well as more accurate predictions than alternatives.

NeurIPS Conference 2015 Conference Paper

Efficient Learning of Continuous-Time Hidden Markov Models for Disease Progression

  • Yu-Ying Liu
  • Shuang Li
  • Fuxin Li
  • Le Song
  • James Rehg

The Continuous-Time Hidden Markov Model (CT-HMM) is an attractive approach to modeling disease progression due to its ability to describe noisy observations arriving irregularly in time. However, the lack of an efficient parameter learning algorithm for CT-HMM restricts its use to very small models or requires unrealistic constraints on the state transitions. In this paper, we present the first complete characterization of efficient EM-based learning methods for CT-HMM models. We demonstrate that the learning problem consists of two challenges: the estimation of posterior state probabilities and the computation of end-state conditioned statistics. We solve the first challenge by reformulating the estimation problem in terms of an equivalent discrete time-inhomogeneous hidden Markov model. The second challenge is addressed by adapting three approaches from the continuous time Markov chain literature to the CT-HMM domain. We demonstrate the use of CT-HMMs with more than 100 states to visualize and predict disease progression using a glaucoma dataset and an Alzheimer's disease dataset.

NeurIPS Conference 2015 Conference Paper

M-Statistic for Kernel Change-Point Detection

  • Shuang Li
  • Yao Xie
  • Hanjun Dai
  • Le Song

Detecting the emergence of an abrupt change-point is a classic problem in statistics and machine learning. Kernel-based nonparametric statistics have been proposed for this task; they make fewer assumptions on the distributions than traditional parametric approaches. However, none of the existing kernel statistics provides a computationally efficient way to characterize the extremal behavior of the statistic. Such characterization is crucial for setting the detection threshold, to control the significance level in the offline case as well as the average run length in the online case. In this paper we propose two related, computationally efficient M-statistics for kernel-based change-point detection when the amount of background data is large. A novel theoretical result of the paper is the characterization of the tail probability of these statistics using a new technique based on change of measure. Such characterization provides accurate detection thresholds for both the offline and online cases in a computationally efficient manner, without resorting to more expensive simulations such as bootstrapping. We show that our methods perform well on both synthetic and real-world data.
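The kernel statistics underlying this line of work build on the maximum mean discrepancy (MMD). A minimal sketch of an unbiased MMD² estimate between a background sample and a test window, using a Gaussian kernel, might look like the following (the M-statistic itself additionally averages over background blocks to control variance, which this sketch omits):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def mmd2_u(x, y, sigma=1.0):
    """Unbiased estimate of MMD^2 between two 1-D samples x and y."""
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x[:, None], x[None, :], sigma)
    kyy = gaussian_kernel(y[:, None], y[None, :], sigma)
    kxy = gaussian_kernel(x[:, None], y[None, :], sigma)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))  # exclude diagonal
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

# Synthetic illustration: a window drawn from the background distribution
# versus a window with a mean shift.
rng = np.random.default_rng(0)
background = rng.normal(0.0, 1.0, 200)   # pre-change data
window_null = rng.normal(0.0, 1.0, 50)   # same distribution
window_shift = rng.normal(2.0, 1.0, 50)  # after a mean shift
```

The estimate stays near zero when the window matches the background and grows when the window's distribution shifts, which is the signal a change-point detector thresholds.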