Author name cluster

Heng Fan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

1 author row

NeurIPS Conference 2025 Conference Paper

All You Need is One: Capsule Prompt Tuning with a Single Vector

Yiyang Liu
James Liang
Heng Fan
Wenhao Yang
Yiming Cui
Xiaotian Han
Lifu Huang
Dongfang Liu

Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods heavily rely on laborious grid searching for optimal prompt length and typically require a considerable number of prompts, introducing additional computational burden. Worse yet, our pioneering findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely "attention anchor", that incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03% average accuracy on T5-Large), serving as an "attention anchor," while enjoying high parameter efficiency (e.g., 0.003% of model parameters on Llama3.2-1B).


NeurIPS Conference 2025 Conference Paper

LoRATv2: Enabling Low-Cost Temporal Modeling in One-Stream Trackers

Liting Lin
Heng Fan
Zhipeng Zhang
Yuqing Huang
Yaowei Wang
Yong Xu
Haibin Ling

Transformer-based algorithms, such as LoRAT, have significantly enhanced object-tracking performance. However, these approaches rely on a standard attention mechanism, which incurs quadratic token complexity, making real-time inference computationally expensive. In this paper, we introduce LoRATv2, a novel tracking framework that addresses these limitations with three main contributions. First, LoRATv2 integrates frame-wise causal attention, which ensures full self-attention within each frame while enabling causal dependencies across frames, significantly reducing computational overhead. Moreover, key-value (KV) caching is employed to efficiently reuse past embeddings for further speedup. Second, building on LoRAT's parameter-efficient fine-tuning, we propose Stream-Specific LoRA Adapters (SSLA). As frame-wise causal attention introduces asymmetry in how streams access temporal information, SSLA assigns dedicated LoRA modules to the template and each search stream, with the main ViT backbone remaining frozen. This allows specialized adaptation for each stream's role in temporal tracking. Third, we introduce a two-phase progressive training strategy, which first trains a single-search-frame tracker and then gradually extends it to multi-search-frame inputs by introducing additional LoRA modules. This curriculum-based learning paradigm improves long-term tracking while maintaining training efficiency. In extensive experiments on multiple benchmarks, LoRATv2 achieves state-of-the-art performance, substantially improved efficiency, and a superior performance-to-FLOPs ratio over state-of-the-art trackers. The code is available at https://github.com/LitingLin/LoRATv2.
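The frame-wise causal attention pattern described in this abstract (full attention inside a frame, causal access to earlier frames) can be illustrated with a small mask builder. This is only a sketch under stated assumptions, not the LoRATv2 implementation: the function name `frame_causal_mask` and the per-frame token counts are hypothetical, and the template and search streams are treated simply as consecutive frames.

```python
from typing import List


def frame_causal_mask(tokens_per_frame: List[int]) -> List[List[bool]]:
    """Return mask[q][k] = True when query token q may attend to key token k.

    Hypothetical helper for illustration; not taken from the LoRATv2 code.
    """
    # Map every token position to the index of the frame it belongs to.
    frame_of = [f for f, n in enumerate(tokens_per_frame) for _ in range(n)]
    # A query sees every token in its own frame and in any earlier frame,
    # but no token from a later frame.
    return [[fk <= fq for fk in frame_of] for fq in frame_of]


# Two frames with 2 and 3 tokens (toy sizes chosen for readability).
mask = frame_causal_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# xx...
# xx...
# xxxxx
# xxxxx
# xxxxx
```

Because tokens of earlier frames never attend to later frames, their keys and values do not change when a new frame arrives, which is the property that makes the KV caching mentioned in the abstract possible.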


NeurIPS Conference 2025 Conference Paper

Robust Ego-Exo Correspondence with Long-Term Memory

Yijun Hu
Bing Fan
Xin Gu
Haiqing Ren
Dongfang Liu
Heng Fan
Libo Zhang

Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego-exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego-exo feature fusion and limited long-term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long-term memories by presenting a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). Compared to SAM 2, our approach features (i) a Memory-View MoE module which consists of a dual-branch routing mechanism to adaptively assign contribution weights to each expert feature along both channel and spatial dimensions, and (ii) a dual-memory bank system with a simple yet effective compression strategy to retain critical long-term information while eliminating redundancy. In the extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing its strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM-EEC.


NeurIPS Conference 2024 Conference Paper

In this paper, we propose a novel benchmark, named VastTrack, aiming to facilitate the development of general visual tracking via encompassing abundant classes and videos. VastTrack has several attractive properties: (1) Vast Object Category. In particular, it covers targets from 2,115 categories, significantly surpassing object classes of existing popular benchmarks (e.g., GOT-10k with 563 classes and LaSOT with 70 categories). Through providing such vast object classes, we expect to learn more general object tracking. (2) Larger Scale. Compared with current benchmarks, VastTrack provides 50,610 videos with 4.2 million frames, which makes it to date the largest dataset in terms of the number of videos, and hence could benefit training even more powerful visual trackers in the deep learning era. (3) Rich Annotation. Besides conventional bounding box annotations, VastTrack also provides linguistic descriptions with more than 50K sentences for the videos. Such rich annotations of VastTrack enable the development of both vision-only and vision-language tracking. In order to ensure precise annotation, each frame in the videos is manually labeled with multiple stages of careful inspection and refinement. To understand performance of existing trackers and to provide baselines for future comparison, we extensively evaluate 25 representative trackers. The results, not surprisingly, display significant drops compared to those on current datasets due to lack of abundant categories and videos from diverse scenarios for training, and more efforts are urgently required to improve general visual tracking. Our VastTrack, the toolkit, and evaluation results are publicly available at https://github.com/HengLan/VastTrack.


NeurIPS Conference 2022 Conference Paper

Divert More Attention to Vision-Language Tracking

Mingzhe Guo
Zhipeng Zhang
Heng Fan
Liping Jing

Relying on Transformer for complex visual feature learning, object tracking has witnessed the new standard for state-of-the-arts (SOTAs). However, this advancement is accompanied by larger training data and longer training periods, making tracking increasingly expensive. In this paper, we demonstrate that the Transformer-reliance is not necessary and the pure ConvNets are still competitive and even better yet more economical and friendly in achieving SOTA tracking. Our solution is to unleash the power of multimodal vision-language (VL) tracking, simply using ConvNets. The essence lies in learning novel unified-adaptive VL representations with our modality mixer (ModaMixer) and asymmetrical ConvNet search. We show that our unified-adaptive VL representation, learned purely with the ConvNets, is a simple yet strong alternative to Transformer visual features, by unbelievably improving a CNN-based Siamese tracker by 14.5% in SUC on challenging LaSOT (50.7% → 65.2%), even outperforming several Transformer-based SOTA trackers. Besides empirical results, we theoretically analyze our approach to evidence its effectiveness. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking and hope to open more possibilities for future tracking beyond Transformer. Code and models are released at https://github.com/JudasDie/SOTS.


IJCAI Conference 2022 Conference Paper

Learning Target-aware Representation for Visual Tracking via Informative Interactions

Mingzhe Guo
Zhipeng Zhang
Heng Fan
Liping Jing
Yilin Lyu
Bing Li
Weiming Hu

We introduce a novel backbone architecture to improve target-perception ability of feature representation for tracking. We observe that de facto frameworks perform feature matching simply using the backbone outputs for target localization, so there is no direct feedback from the matching module to the backbone network, especially the shallow layers. Concretely, only the matching module can directly access the target information, while the representation learning of the candidate frame is blind to the reference target. Therefore, the accumulated target-irrelevant interference in shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem by conducting multiple branch-wise interactions inside the Siamese-like backbone networks (InBN). The core of InBN is a general interaction modeler (GIM) that injects the target information to different stages of the backbone network, leading to better target-perception of candidate feature representation with negligible computation cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types including CNN and Transformer for improvements, as evidenced on multiple benchmarks. In particular, the CNN version improves the baseline with 3.2/6.9 absolute gains of SUC on LaSOT/TNL2K. The Transformer version obtains SUC of 65.7/52.0 on LaSOT/TNL2K, which are on par with recent SOTAs.


NeurIPS Conference 2022 Conference Paper

SwinTrack: A Simple and Strong Baseline for Transformer Tracking

Liting Lin
Heng Fan
Zhipeng Zhang
Yong Xu
Haibin Ling

Recently Transformer has been largely explored in tracking and shown state-of-the-art (SOTA) performance. However, existing efforts mainly focus on fusing and enhancing features generated by convolutional neural networks (CNNs). The potential of Transformer in representation learning remains under-explored. In this paper, we aim to further unleash the power of Transformer by proposing a simple yet efficient fully-attentional tracker, dubbed SwinTrack, within classic Siamese framework. In particular, both representation learning and feature fusion in SwinTrack leverage the Transformer architecture, enabling better feature interactions for tracking than pure CNN or hybrid CNN-Transformer frameworks. Besides, to further enhance robustness, we present a novel motion token that embeds historical target trajectory to improve tracking by providing temporal context. Our motion token is lightweight with negligible computation but brings clear gains. In our thorough experiments, SwinTrack exceeds existing approaches on multiple benchmarks. Particularly, on the challenging LaSOT, SwinTrack sets a new record with 0.713 SUC score. It also achieves SOTA results on other benchmarks. We expect SwinTrack to serve as a solid baseline for Transformer tracking and facilitate future research. Our codes and results are released at https://github.com/LitingLin/SwinTrack.


AAAI Conference 2018 Conference Paper

Graph Correspondence Transfer for Person Re-Identification

Qin Zhou
Heng Fan
Shibao Zheng
Hang Su
Xinzhe Li
Shuang Wu
Haibin Ling

In this paper, we propose a graph correspondence transfer (GCT) approach for person re-identification. Unlike existing methods, the GCT model formulates person re-identification as an off-line graph matching and on-line correspondence transferring problem. Specifically, during training, the GCT model aims to learn off-line a set of correspondence templates from positive training pairs with various pose-pair configurations via patch-wise graph matching. During testing, for each pair of test samples, we select a few training pairs with the most similar pose-pair configurations as references, and transfer the correspondences of these references to the test pair for feature distance calculation. The matching score is derived by aggregating distances from different references. For each probe image, the gallery image with the highest matching score is the re-identifying result. Compared to existing algorithms, our GCT can handle spatial misalignment caused by large variations in view angles and human poses owing to the benefits of patch-wise graph matching. Extensive experiments on five benchmarks including VIPeR, Road, PRID450S, 3DPES and CUHK01 evidence the superior performance of the GCT model over other state-of-the-art methods.


AAAI Conference 2017 Conference Paper

Robust Visual Tracking via Local-Global Correlation Filter

Heng Fan
Jinhai Xiang

Correlation filters have drawn increasing interest in visual tracking due to their high efficiency; however, they are sensitive to partial occlusion, which may result in tracking failure. To address this problem, we propose a novel local-global correlation filter (LGCF) for object tracking. Our LGCF model utilizes both local-based and global-based strategies, and effectively combines these two strategies by exploiting the relationship of circular shifts among local object parts and global target for their motion models to preserve the structure of the object. Specifically, our proposed model has two advantages: (1) Owing to the benefits of the local-based mechanism, our method is robust to partial occlusion by leveraging visible parts. (2) Taking into account the relationship of motion models among local parts and global target, our LGCF model is able to capture the inner structure of the object, which further improves its robustness to occlusion. In addition, to alleviate the issue of drifting away from the object, we incorporate temporal consistencies of both local parts and global target in our LGCF model. Besides, we adopt an adaptive method to accurately estimate the scale of the object. Extensive experiments on OTB15 with 100 videos demonstrate that our tracking algorithm performs favorably against state-of-the-art methods.
