Arrow Research

Author name cluster

Bineng Zhong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
1 author row

Possible papers (13)

AAAI Conference 2026 · Conference Paper

Aware Distillation for Robust Vision-Language Tracking Under Linguistic Sparsity

  • Guangtong Zhang
  • Bineng Zhong
  • Shirui Yang
  • Yang Wang
  • Tian Bai

Vision-language object tracking overcomes the limitations of relying solely on visual features by leveraging language descriptions of objects to provide cross-modal semantic information, thereby enhancing model robustness in complex scenarios. However, most existing high-performance vision-language trackers are trained jointly on pure visual data and vision-language multimodal data. Due to the relative sparsity of language annotations in the data, the trackers tend to prioritize the localization role of visual features, diminishing the model's attention to language information. To mitigate this issue, we propose a novel vision-language tracker: Aware Distillation for Robust Vision-Language Tracking under Linguistic Sparsity (ADTrack). We introduce a knowledge distillation framework employing a knowledge-rich teacher model and a lightweight student model to establish modality correlations between vision and language, enabling efficient modeling between visual information and language descriptions. Specifically, our lightweight student module simultaneously distills language encoding capabilities from large language models through teacher-guided learning on input language, while performing target-aware perception on template images using language descriptions to generate more effective template features for subsequent visual extraction. Furthermore, to ensure perceptual robustness in linguistically sparse scenarios, we simulate language-deficient conditions during training and employ contrastive learning to enhance model adaptability. Extensive experiments demonstrate that ADTrack reduces parameters by over 50% while achieving state-of-the-art (SOTA) performance and speed on vision-language tracking benchmarks, including LaSOT, LaSOText, TNL2K, OTB-Lang and MGIT.
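
The abstract describes two training ideas: distilling a lightweight student language encoder from a knowledge-rich teacher, and simulating language-deficient conditions with a contrastive term. The sketch below is a minimal, assumption-based illustration of that combination; the module interfaces, the dropout probability, and the InfoNCE temperature are all invented, not ADTrack's released code.

```python
# Minimal sketch (not the authors' code): simulating linguistic sparsity and
# distilling a frozen teacher text encoder into a lightweight student.
import torch
import torch.nn.functional as F

def distill_step(teacher_text_enc, student_text_enc, text_tokens, drop_prob=0.5):
    """One training step of language distillation under simulated sparsity."""
    # Simulate language-deficient conditions: sometimes blank out the description.
    if torch.rand(1).item() < drop_prob:
        text_tokens = torch.zeros_like(text_tokens)

    with torch.no_grad():                      # frozen, knowledge-rich teacher
        t_emb = teacher_text_enc(text_tokens)  # (B, D)
    s_emb = student_text_enc(text_tokens)      # (B, D) lightweight student

    # Distillation: pull the student embedding toward the teacher embedding.
    distill_loss = F.mse_loss(s_emb, t_emb)

    # Contrastive term: a student embedding should match its own teacher
    # embedding relative to the other samples in the batch (InfoNCE form).
    logits = F.normalize(s_emb, dim=-1) @ F.normalize(t_emb, dim=-1).t()
    labels = torch.arange(s_emb.size(0), device=s_emb.device)
    contrast_loss = F.cross_entropy(logits / 0.07, labels)
    return distill_loss + contrast_loss
```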

AAAI Conference 2026 · Conference Paper

Motion-Aware Object Tracking via Motion and Geometry-Aware Cues

  • Hongtao Yang
  • Bineng Zhong
  • Qihua Liang
  • Xiantao Hu
  • Yufei Tan
  • Haiying Xia
  • Shuxiang Song

Understanding motion is essential for visual object tracking, especially in complex and dynamic scenarios. Yet, many existing methods rely on simplistic strategies such as template updates or temporal feature propagation, often overlooking the deeper modeling of motion information. To mitigate this limitation, we introduce a motion-aware spatio-temporal framework that enhances motion perception by explicitly matching motion patterns and modeling inter-frame motion relationships. Central to our design is a motion pattern dictionary, which encodes a diverse set of representative motion cues as learnable features. During tracking, features from the search region interact with the dictionary to retrieve the most relevant motion patterns, allowing the model to adapt to the current motion state. A dedicated decoder further incorporates temporal correlations to refine motion awareness. To complement motion modeling, we embed geometric cues into the search region features, which strengthens spatial perception, reduces ambiguity under occlusion, and improves foreground-background separation. Extensive evaluations on seven challenging benchmarks demonstrate the effectiveness of our design. In particular, MoDTrack_384 surpasses recent SOTA trackers on LaSOT by 1.2% in AUC, highlighting the benefits of motion pattern modeling and geometry-guided enhancement in mitigating tracking drift.
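
The "motion pattern dictionary" described above can be read as a bank of learnable motion features that search-region tokens attend to in order to retrieve the most relevant patterns. The following sketch shows one plausible form of that retrieval; the class name, pattern count, and dimensions are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of attention-based retrieval from a learnable motion bank.
import torch
import torch.nn as nn

class MotionPatternDictionary(nn.Module):
    def __init__(self, num_patterns=64, dim=256):
        super().__init__()
        # Learnable set of representative motion cues.
        self.patterns = nn.Parameter(torch.randn(num_patterns, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, search_feats):
        # search_feats: (B, N, D) tokens from the search region.
        B = search_feats.size(0)
        bank = self.patterns.unsqueeze(0).expand(B, -1, -1)   # (B, K, D)
        # Queries come from the search region; keys/values from the dictionary,
        # so each token retrieves the motion patterns most relevant to it.
        retrieved, _ = self.attn(search_feats, bank, bank)
        return search_feats + retrieved                        # residual fusion

feats = torch.randn(2, 100, 256)
out = MotionPatternDictionary()(feats)   # (2, 100, 256)
```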

AAAI Conference 2026 · Conference Paper

MUTrack: A Memory-Aware Unified Representation Framework for Visual Tracking

  • Weijing Wu
  • Qihua Liang
  • Bineng Zhong
  • Xiaohu Tang
  • Yufei Tan
  • Ning Li
  • Yuanliang Xue

Building a unified target representation that simultaneously achieves short-term adaptability and long-term stability is crucial for robust visual tracking. However, existing trackers typically face an inherent trade-off. Methods primarily relying on short-term appearance and motion cues achieve rapid adaptation, but they often struggle with long-term identity consistency. Conversely, trackers that emphasize extensive temporal context provide strong robustness, yet this approach can compromise their short-term adaptability. To bridge this gap, we propose a novel tracker, MUTrack, which comprehensively integrates both long-term and short-term memories into a unified target representation for more robust tracking. Specifically, we design a unified memory bank that stores and manages long-term memory for maintaining long-term identity consistency, and short-term memory for adapting to instantaneous appearance changes. To fully leverage the complementary nature of both long-term and short-term temporal information, we introduce a perception interaction module that dynamically fuses these memory types through deep and bidirectional interactions, enabling mutual refinement where one guides the other. This ultimately generates a highly adaptive target representation, which effectively balances adaptability to instantaneous changes with robustness against long-term identity drift. Extensive experiments on GOT10k, TrackingNet, LaSOT, LaSOT_ext, NfS, and OTB100 consistently demonstrate that MUTrack achieves SOTA performance.
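
As a rough illustration of the unified memory bank and its bidirectional interaction, the sketch below keeps separate short-term and long-term slots and fuses them with two cross-attention passes so each memory type refines the other. All names, slot sizes, and the confidence-gated write rule are assumptions, not MUTrack's actual design.

```python
# Hedged sketch of a unified long/short-term memory with bidirectional fusion.
import torch
import torch.nn as nn

class UnifiedMemory(nn.Module):
    def __init__(self, dim=256, short_len=4, long_len=16):
        super().__init__()
        self.short, self.long = [], []      # lists of (B, N, D) target token maps
        self.short_len, self.long_len = short_len, long_len
        self.s2l = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.l2s = nn.MultiheadAttention(dim, 8, batch_first=True)

    def write(self, target_tokens, confident):
        self.short.append(target_tokens)
        self.short = self.short[-self.short_len:]        # FIFO short-term slots
        if confident:                                     # only reliable frames
            self.long.append(target_tokens)
            self.long = self.long[-self.long_len:]

    def read(self):
        s = torch.cat(self.short, dim=1)                  # (B, Ns, D)
        l = torch.cat(self.long, dim=1)                   # (B, Nl, D)
        # Bidirectional interaction: short-term queries long-term and vice versa.
        s_ref, _ = self.s2l(s, l, l)
        l_ref, _ = self.l2s(l, s, s)
        return torch.cat([s + s_ref, l + l_ref], dim=1)   # unified representation
```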

AAAI Conference 2025 · Conference Paper

Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking

  • Yaozong Zheng
  • Bineng Zhong
  • Qihua Liang
  • Ning Li
  • Shuxiang Song

The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel Self-Supervised Tracking framework, named SSTrack, designed to eliminate the need for box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables SSTrack to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that SSTrack surpasses SOTA self-supervised tracking methods, achieving an improvement of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on the GOT10K, LaSOT, and TrackingNet datasets, respectively.
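
The "instance contrastive loss" mentioned above can be pictured as a standard symmetric InfoNCE between two views of the same instance; the exact formulation used by SSTrack may differ, so the temperature and loss shape here are assumptions for illustration only.

```python
# Hedged sketch of a multi-view instance contrastive loss (InfoNCE form).
import torch
import torch.nn.functional as F

def instance_contrastive_loss(view_a, view_b, temperature=0.1):
    # view_a, view_b: (B, D) instance embeddings from two views of the same clip.
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: each view should identify its counterpart in the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```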

AAAI Conference 2025 · Conference Paper

Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

  • Xiantao Hu
  • Ying Tai
  • Xu Zhao
  • Chen Zhao
  • Zhenyu Zhang
  • Jun Li
  • Bineng Zhong
  • Jian Yang

Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that rely solely on updating reference information, we introduce a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduce the Mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios.
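
A temporal-state-generator style loop can be sketched as a small set of temporal tokens updated every frame from that frame's fused features and then carried forward to guide the next frame. The token count, update rule, and interfaces below are assumptions for illustration, not STTrack's actual modules.

```python
# Minimal, assumption-based sketch of temporal tokens propagated across frames.
import torch
import torch.nn as nn

class TemporalStateGenerator(nn.Module):
    def __init__(self, dim=256, num_tokens=4):
        super().__init__()
        self.state = nn.Parameter(torch.zeros(1, num_tokens, dim))  # initial state
        self.update = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, fused_feats, prev_state=None):
        # fused_feats: (B, N, D) RGB + auxiliary-modality features of this frame.
        B = fused_feats.size(0)
        state = self.state.expand(B, -1, -1) if prev_state is None else prev_state
        # Temporal tokens attend to the current frame to absorb its target state;
        # downstream, the returned tokens would guide localization in the next frame.
        new_state, _ = self.update(state, fused_feats, fused_feats)
        return new_state

tsg = TemporalStateGenerator()
state = None
for frame_feats in [torch.randn(2, 64, 256) for _ in range(3)]:
    state = tsg(frame_feats, state)        # carried across the sequence
```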

AAAI Conference 2025 · Conference Paper

Less Is More: Token Context-Aware Learning for Object Tracking

  • Chenlong Xu
  • Bineng Zhong
  • Qihua Liang
  • Yaozong Zheng
  • Guorong Li
  • Shuxiang Song

Recently, several studies have shown that utilizing contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive frame-context methods fail to consider the importance of each patch within a reference frame, making them susceptible to noise and redundant tokens, which deteriorates tracking performance. To address this challenge, we propose a new token context-aware tracking pipeline named LMTrack, designed to automatically learn high-quality reference tokens for efficient visual tracking. Embracing the principle of Less is More, the core idea of LMTrack is to analyze the importance distribution of all reference tokens, where important tokens are collected, continually attended to, and updated. Specifically, a novel Token Context Memory module is designed to dynamically collect high-quality spatio-temporal information of a target in an autoregressive manner, eliminating redundant background tokens from the reference frames. Furthermore, an effective Unidirectional Token Attention mechanism is designed to establish dependencies between reference tokens and the search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, achieving state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.
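
The token selection idea can be illustrated as a per-token importance score followed by a top-k cut, so only high-quality reference tokens are kept as memory. The scoring head, the number of kept tokens, and the class name below are assumptions, not the released LMTrack code.

```python
# Hedged sketch of "less is more" reference-token selection.
import torch
import torch.nn as nn

class TokenContextMemory(nn.Module):
    def __init__(self, dim=256, keep=32):
        super().__init__()
        self.score = nn.Linear(dim, 1)    # per-token importance head
        self.keep = keep

    def forward(self, reference_tokens):
        # reference_tokens: (B, N, D) tokens from a reference frame.
        weights = self.score(reference_tokens).squeeze(-1)          # (B, N)
        idx = weights.topk(self.keep, dim=1).indices                # (B, keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, reference_tokens.size(-1))
        # Keep only high-importance tokens; redundant background tokens are dropped.
        return torch.gather(reference_tokens, 1, idx)               # (B, keep, D)
```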

AAAI Conference 2025 · Conference Paper

MambaLCT: Boosting Tracking via Long-term Context State Space Model

  • Xiaohai Li
  • Bineng Zhong
  • Qihua Liang
  • Guorong Li
  • Zhiyi Mo
  • Shuxiang Song

Effectively constructing context information with long-term dependencies from video sequences is crucial for object tracking. However, the context length constructed by existing work is limited, only considering object information from adjacent frames or video clips, leading to insufficient utilization of contextual information. To address this issue, we propose MambaLCT, which constructs and utilizes target variation cues from the first frame to the current frame for robust tracking. First, a novel unidirectional Context Mamba module is designed to scan frame features along the temporal dimension, gathering target change cues throughout the entire sequence. Specifically, target-related information in frame features is compressed into a hidden state space through a selective scanning mechanism. The target information across the entire video is continuously aggregated into target variation cues. Next, we inject the target change cues into the attention mechanism, providing temporal information for modeling the relationship between the template and search frames. The advantage of MambaLCT is its ability to continuously extend the length of the context, capturing complete target change cues, which enhances the stability and robustness of the tracker. Extensive experiments show that long-term context information enhances the model's ability to perceive targets in complex scenarios. MambaLCT achieves new SOTA performance on six benchmarks while maintaining real-time running speed.
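
To make the "compress frame features into a running hidden state" idea concrete, the sketch below uses a simple gated recurrence as a stand-in; a real Mamba/selective-scan layer would replace this loop in practice, and all names and dimensions are assumptions.

```python
# Simplified, assumption-laden stand-in for a unidirectional context scan.
import torch
import torch.nn as nn

class ContextScan(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) one pooled feature per frame, first to current.
        B, T, D = frame_feats.shape
        h = frame_feats.new_zeros(B, D)               # hidden state = target cues
        for t in range(T):                            # scan along the time axis
            x = frame_feats[:, t]
            g = torch.sigmoid(self.gate(torch.cat([h, x], dim=-1)))
            h = g * h + (1 - g) * torch.tanh(self.proj(x))   # selective update
        return h                           # aggregated target variation cue
```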

AAAI Conference 2025 · Conference Paper

Robust Tracking via Mamba-based Context-aware Token Learning

  • Jinxia Xie
  • Bineng Zhong
  • Qihua Liang
  • Ning Li
  • Zhiyi Mo
  • Shuxiang Song

How to make a good trade-off between performance and computational cost is crucial for a tracker. However, current prominent methods typically rely on complicated and time-consuming learning that combines temporal and appearance information by feeding in more and more images (or features). Consequently, these methods not only increase the model's computational cost and learning burden but also introduce a great deal of useless and potentially interfering information. To alleviate the above issues, we propose a simple yet robust tracker that separates temporal information learning from appearance modeling and extracts temporal relations from a set of representative tokens rather than several images (or features). Specifically, we introduce one track token for each frame to collect the target's appearance information in the backbone. Then, we design a Mamba-based Temporal Module that makes track tokens context-aware by interacting with other track tokens within a sliding window. This module consists of a Mamba layer with autoregressive characteristics and a cross-attention layer with strong global perception ability, ensuring sufficient interaction for track tokens to perceive the appearance changes and movement trends of the target. Finally, the track tokens serve as guidance to adjust the appearance features for the final prediction in the head. Experiments show that our method is effective and achieves competitive performance on multiple benchmarks at real-time speed.
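
A rough picture of the sliding-window interaction: the current frame's track token cross-attends to the tokens of the last few frames so it becomes aware of recent appearance changes. The sketch below omits the paper's Mamba layer and uses only cross-attention; the window size and class name are illustrative assumptions.

```python
# Hedged sketch of sliding-window interaction among per-frame track tokens.
import torch
import torch.nn as nn

class TrackTokenWindow(nn.Module):
    def __init__(self, dim=256, window=8):
        super().__init__()
        self.window = window
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.history = []                               # list of (B, 1, D) tokens

    def forward(self, track_token):
        # track_token: (B, 1, D) token collected from the backbone this frame.
        self.history.append(track_token.detach())
        self.history = self.history[-self.window:]      # keep a sliding window
        context = torch.cat(self.history, dim=1)        # (B, W, D)
        refined, _ = self.cross(track_token, context, context)
        return track_token + refined                    # context-aware track token
```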

IJCAI Conference 2024 · Conference Paper

Diffusion Mask-Driven Visual-language Tracking

  • Guangtong Zhang
  • Bineng Zhong
  • Qihua Liang
  • Zhiyi Mo
  • Shuxiang Song

Most existing visual-language trackers rely heavily on the initial language descriptions of a target object to extract their multi-modal features. However, the initial language descriptions are often inaccurate in a highly time-varying video sequence, and the resulting low-quality multi-modal features greatly deteriorate tracking performance. To address this challenge, we propose a Diffusion Mask-Driven Visual-language Tracker (DMTrack) based on a diffusion model. Confronting the issue of low-quality multi-modal features caused by inaccurate language descriptions, we leverage the diffusion model to capture high-quality semantic information from multi-modal features and transform it into target mask features. During the training phase, we further enhance the diffusion model's perception of pixel-level features by calculating the loss between the target mask features and the ground-truth masks. Additionally, we perform joint localization of the target using both target mask features and visual features, instead of relying solely on multi-modal features for localization. Through extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOText, and OTB-Lang), we validate that our proposed Diffusion Mask-Driven Visual-language Tracker improves the robustness and effectiveness of the model.
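
The diffusion-based training signal can be illustrated with a standard DDPM-style noise-prediction objective conditioned on the multi-modal features; the denoiser interface, noise schedule, and step count below are assumptions, and DMTrack's actual formulation may differ.

```python
# Assumption-based sketch of a diffusion-style training loss for mask features.
import torch
import torch.nn.functional as F

def diffusion_mask_loss(denoiser, mask_feats, cond_feats, num_steps=1000):
    # mask_feats: (B, C, H, W) target mask features; cond_feats: conditioning input.
    t = torch.randint(0, num_steps, (mask_feats.size(0),), device=mask_feats.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps) ** 2  # cosine-style schedule
    alpha_bar = alpha_bar.view(-1, 1, 1, 1)
    noise = torch.randn_like(mask_feats)
    noisy = alpha_bar.sqrt() * mask_feats + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, t, cond_feats)          # predicts the injected noise
    return F.mse_loss(pred, noise)
```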

AAAI Conference 2024 · Conference Paper

Explicit Visual Prompts for Visual Object Tracking

  • Liangtao Shi
  • Bineng Zhong
  • Qihua Liang
  • Ning Li
  • Shengping Zhang
  • Xianxian Li

How to effectively exploit spatio-temporal information is crucial to capturing target appearance changes in visual tracking. However, most deep learning-based trackers mainly focus on designing a complicated appearance model or template-updating strategy, while lacking the exploitation of context between consecutive frames and thus entailing the when-and-how-to-update dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed EVPTrack. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without focusing on updating templates. As a result, we can not only alleviate the challenge of when-to-update, but also avoid the hyper-parameters associated with updating strategies. Then, we utilize the spatio-temporal tokens to generate explicit visual prompts that facilitate inference in the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing. Consequently, the efficiency of our model is improved by avoiding how-to-update. In addition, we consider multi-scale information as explicit visual prompts, providing multi-scale template features to enhance EVPTrack's ability to handle target scale changes. Extensive experimental results on six benchmarks (i.e., LaSOT, LaSOText, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that our EVPTrack can achieve competitive performance at real-time speed by effectively exploiting both spatio-temporal and multi-scale information. Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.
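
The prompts-as-extra-tokens step can be pictured as a plain concatenation of prompt tokens with image tokens fed through a standard transformer encoder, with no separate update machinery. Dimensions, layer counts, and names in this sketch are assumptions, not the EVPTrack code.

```python
# Illustrative sketch of explicit prompt tokens joined with image tokens.
import torch
import torch.nn as nn

dim = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=4,
)

image_tokens = torch.randn(2, 256, dim)      # template + search tokens
prompt_tokens = torch.randn(2, 8, dim)       # explicit spatio-temporal prompts

tokens = torch.cat([prompt_tokens, image_tokens], dim=1)   # no extra processing
out = encoder(tokens)                                       # joint attention
prompt_out, image_out = out[:, :8], out[:, 8:]              # split back for the head
```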

AAAI Conference 2024 · Conference Paper

ODTrack: Online Dense Temporal Token Learning for Visual Tracking

  • Yaozong Zheng
  • Bineng Zhong
  • Qihua Liang
  • Zhiyi Mo
  • Shengping Zhang
  • Xianxian Li

Online contextual reasoning and association across consecutive video frames are critical to perceive instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each image-pair and establish limited temporal correlations. To alleviate the above problem, we propose a simple, flexible and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames in an online token propagation manner. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discrimination features (localization information) of a target into a token sequence to achieve frame-to-frame association. This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for the inference in the next video frame, whereby past information is leveraged to guide future inference; 2) the complex online update strategies are effectively avoided by the iterative propagation of token sequences, and thus we can achieve more efficient model representation and computation. ODTrack achieves a new SOTA performance on seven benchmarks, while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.
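
Read at a high level, the pipeline compresses each frame's target information into a short token sequence that is passed to the next frame as a prompt, replacing explicit template-update rules. The loop below is only a hedged interface sketch; `backbone` and `head` and their signatures are assumptions.

```python
# Assumption-based sketch of online dense temporal token propagation.
import torch

def track_video(backbone, head, frames, template, token_seq=None):
    results = []
    for frame in frames:
        # The backbone jointly encodes the template, the search frame, and the
        # propagated tokens; it returns refined features plus updated tokens.
        feats, token_seq = backbone(template, frame, token_seq)
        results.append(head(feats))          # box prediction for this frame
    return results, token_seq
```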

IJCAI Conference 2019 · Conference Paper

LRDNN: Local-refining based Deep Neural Network for Person Re-Identification with Attribute Discerning

  • Qinqin Zhou
  • Bineng Zhong
  • Xiangyuan Lan
  • Gan Sun
  • Yulun Zhang
  • Mengran Gou

Recently, pose or attribute information has been widely used to solve the person re-identification (re-ID) problem. However, inaccurate output from pose or attribute modules will impair the final person re-ID performance. Since re-ID, pose estimation and attribute recognition are all based on person appearance information, we propose a Local-refining based Deep Neural Network (LRDNN) that aggregates pose estimation and attribute recognition to improve re-ID performance. To this end, we add a pose branch to extract local spatial information and optimize the whole network on both person identity and attribute objectives. To diminish the negative effect of unstable pose estimation, a novel structure called the channel parse block (CPB) is introduced to learn weights on different feature channels in the pose branch. The two branches are then combined with compact bilinear pooling. Experimental results on the Market1501 and DukeMTMC-reID datasets illustrate the effectiveness of the proposed method.
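
A per-channel re-weighting block in the spirit of the channel parse block can be sketched as a squeeze-and-excitation style gate over the pose-branch channels, so channels degraded by unstable pose estimation are down-weighted. The exact CPB design is not given here, so this layout and the reduction ratio are assumptions.

```python
# Illustrative channel re-weighting sketch for the pose branch.
import torch
import torch.nn as nn

class ChannelParseBlock(nn.Module):
    def __init__(self, channels=512, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, pose_feats):
        # pose_feats: (B, C, H, W) features from the pose branch.
        w = self.fc(pose_feats).view(pose_feats.size(0), -1, 1, 1)   # (B, C, 1, 1)
        return pose_feats * w        # suppress channels hurt by unstable pose
```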

AAAI Conference 2019 · Conference Paper

Structured and Sparse Annotations for Image Emotion Distribution Learning

  • Haitao Xiong
  • Hongfu Liu
  • Bineng Zhong
  • Yun Fu

Label distribution learning methods effectively address the label ambiguity problem and have achieved great success in image emotion analysis. However, these methods ignore the structured and sparse information naturally contained in the annotations of emotions. For example, emotions can be grouped and ordered according to their polarities and degrees. Meanwhile, emotions vary in intensity, which is reflected in different levels of annotation sparsity. Motivated by these observations, we present a convolutional neural network based framework called Structured and Sparse annotations for image emotion Distribution Learning (SSDL) to tackle these two challenges. To utilize structured annotations, the Earth Mover's Distance is employed to calculate the minimal cost required to transform one distribution into another over ordered emotions and emotion groups. Combined with Kullback-Leibler divergence, we design the loss to penalize mispredictions according to the dissimilarities of same emotions and different emotions simultaneously. Moreover, to handle sparse annotations, sparse regularization based on emotional intensity is adopted. Through the combined loss and sparse regularization, SSDL effectively leverages structured and sparse annotations for predicting emotion distributions. Experimental results demonstrate that our proposed SSDL significantly outperforms state-of-the-art methods.
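
A worked form of the combined objective can be sketched as an Earth Mover's Distance term over the ordered emotion axis (which for 1-D ordered labels reduces to the L1 distance between cumulative distributions), a KL term over the full distribution, and a sparsity penalty. The loss weights and the exact sparsity term are assumptions standing in for the paper's intensity-based regularization.

```python
# Hedged sketch of an EMD + KL + sparsity objective for emotion distributions.
import torch
import torch.nn.functional as F

def ssdl_loss(pred_logits, target_dist, lambda_kl=1.0, lambda_sparse=0.01):
    pred = F.softmax(pred_logits, dim=-1)                    # (B, E) predicted distribution
    # 1-D EMD over the ordered emotion axis = L1 distance between the CDFs.
    emd = (pred.cumsum(-1) - target_dist.cumsum(-1)).abs().sum(-1).mean()
    kl = F.kl_div(pred.clamp_min(1e-8).log(), target_dist, reduction='batchmean')
    # Stand-in for intensity-based sparse regularization: L1 on the raw scores.
    sparse = pred_logits.abs().mean()
    return emd + lambda_kl * kl + lambda_sparse * sparse
```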