Exploiting All Mamba Fusion for Efficient RGB-D Tracking

Ge Ying; Dawei Zhang; Chengzhuan Yang; Wei Liu; Sang-Woon Jeon; Hua Wang; Changqin Huang; Zhonglong Zheng

doi:10.1609/aaai.v40i14.38195

AAAI 2026

Conference Paper; AAAI Technical Track on Computer Vision XI; Artificial Intelligence

Abstract

Despite the progress made through deep learning, existing Visual Object Tracking (VOT) frameworks struggle with real-world challenges. Recent approaches incorporate additional modalities such as Depth, Thermal Infrared, and Language to make VOT more robust, and improvements in depth sensor precision have made RGB-D tracking in particular more practical. However, current RGB-D trackers often copy RGB tracking paradigms. This leads to inefficiency: two-stream architectures fail to exploit heterogeneous features, and fusion relies on methods that are either simplistic or parameter-heavy. To address these challenges, we propose AMTrack, a one-stream RGB-D tracker that leverages Mamba's linear complexity for simultaneous feature extraction and two-stage cross-modal feature fusion. Our innovation also includes a low-parameter Multimodal Mix Mamba (3M) module, which optimizes deep feature fusion and reduces computational overhead. The advantage of the 3M module stems from our Multimodal State Space Model (MSSM), a multimodal feature interaction component reconstructed from the State Space Model (SSM). Experiments across multiple RGB-D tracking datasets indicate that AMTrack achieves superior performance with fewer parameters and lower memory demands than state-of-the-art trackers.
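
The linear complexity mentioned in the abstract comes from the recurrent form of a state space model, which touches each token exactly once. The sketch below is a minimal illustration of that recurrence in plain Python and is not the authors' MSSM or 3M module. It assumes a fixed diagonal SSM with zero-order-hold discretization, whereas Mamba makes its parameters input-dependent. The function name ssm_scan, the parameter values, and the idea of placing RGB and depth tokens in one sequence are all hypothetical choices made for illustration.

```python
import math


def ssm_scan(xs, a, b, c, dt):
    """Run a diagonal, single-channel state space model over a sequence.

    Illustrative only: parameters are fixed, not input-dependent as in Mamba.
    xs: input sequence of floats
    a, b, c: per-state parameters (a must be nonzero, negative for stability)
    dt: discretization step size
    """
    # Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t).
    a_bar = [math.exp(dt * ai) for ai in a]
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b)]

    h = [0.0] * len(a)  # hidden state, one entry per state dimension
    ys = []
    for x in xs:  # single pass, so cost grows linearly with len(xs)
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# Hypothetical one-stream input: RGB and depth tokens in a single sequence.
rgb_tokens = [0.5, 1.0, -0.25]
depth_tokens = [0.1, 0.2, 0.3]
fused = ssm_scan(rgb_tokens + depth_tokens,
                 a=[-1.0, -0.5], b=[1.0, 1.0], c=[0.5, 0.5], dt=0.1)
```

Because the hidden state has a fixed size, doubling the number of tokens doubles the work in this loop, in contrast to the quadratic growth of pairwise attention.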
