EAAI Journal, 2026 (journal article)
Beyond reconstruction: Enhancing masked autoencoders with contrastive learning for video representation learning
- Yawei Feng
- Lijun Guo
- Guitao Yu
- Rong Zhang
- Jiangbo Qian
- Chong Wang
- Shangce Gao
Self-supervised video representation learning primarily employs two methods: contrastive learning and masked video modeling, each with distinct advantages. Some studies have attempted to combine these two approaches to fully exploit their respective strengths. However, the intrinsic heterogeneity of the two methods poses challenges for existing models that integrate them, including complex model architectures, unstable training, and limited performance gains. To address these issues, this study proposes a novel video pre-training framework called Beyond Reconstruction (BR), which introduces a dual-track heterogeneous learning strategy. This strategy lets contrastive learning and masked video modeling play complementary roles at different layers of Vision Transformers (ViTs), seamlessly integrating them into a unified framework to improve the quality of video representations. Additionally, BR incorporates a motion-aware progressive masking strategy to strengthen spatiotemporal saliency modeling and stabilize training. By leveraging contrastive learning's strength in capturing globally salient moving objects, this strategy overcomes the limitations of previous masking methods. Experiments on multiple benchmarks, including action recognition and video object segmentation, show that BR achieves performance comparable to or better than existing approaches under both fine-tuning and linear probing settings. These results demonstrate BR's strong adaptability and efficiency in practical deployment: its stable fine-tuning performance enables effective adaptation to complex scenarios with limited annotations, while its strong linear probing capability allows the backbone to remain frozen, facilitating shared usage across multiple tasks and reducing overall computational cost without compromising performance.
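The abstract does not give the paper's exact formulation, but the dual-track idea (a contrastive objective on global embeddings plus a reconstruction objective on masked patches) and a progressive masking schedule can be sketched in a minimal, framework-agnostic way. Everything below is an assumption for illustration: the linear schedule, the InfoNCE form of the contrastive term, and the weighting factor `lam` are not specified in the abstract.

```python
import numpy as np

def progressive_mask_ratio(step, total_steps, start=0.5, end=0.9):
    # Assumed linear schedule: ramp the masking ratio up over training.
    # The paper's motion-aware schedule is not described in the abstract.
    t = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * t

def info_nce(z1, z2, temperature=0.1):
    # Contrastive InfoNCE loss between two batches of view embeddings;
    # matching rows of z1 and z2 are treated as positive pairs.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

def dual_track_loss(z1, z2, pred, target, mask, lam=1.0):
    # Combined objective: contrastive term on global embeddings plus
    # mean-squared reconstruction error over masked patches only.
    recon = np.mean(((pred - target) ** 2)[mask])
    return info_nce(z1, z2) + lam * recon
```

In a real pipeline, `z1`/`z2` would come from two augmented clips passed through the ViT, while `pred`/`target` would be the decoder's patch predictions and the ground-truth patch values at masked positions; here they are just arrays to show how the two tracks combine into one scalar loss.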