AAAI 2025
Exploring the Better Multimodal Synergy Strategy for Vision-Language Models
Abstract
Vision-language models (VLMs) have shown great potential for open-world visual concept comprehension. Recent research has focused on finding an optimal multimodal collaboration strategy, which significantly advances CLIP-based few-shot tasks. However, existing prompt-based solutions suffer from unidirectional information flow and increased parameter counts, since they explicitly condition the vision prompts on textual prompts across transformer layers using non-shareable coupling functions. To address these issues, we propose a Dual-shared mechanism based on LoRA (DsRA) for VLM adaptation in low-data regimes. DsRA offers several merits. First, we design an inter-modal shared coefficient that captures patterns common to the visual and textual modalities, ensuring effective mutual synergy between image and text features. Second, we propose an intra-modal shared matrix that enables parameter-efficient fine-tuning: the shared matrix is combined with the different coefficients to generate layer-wise adapters placed in the encoder layers. Extensive experiments demonstrate that DsRA improves generalizability under few-shot classification, base-to-new generalization, and domain generalization settings. Our code will be released soon.
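The abstract's core idea — one low-rank matrix pair shared within a modality across layers, plus per-layer coefficients shared between the vision and text branches — can be sketched as follows. This is a minimal illustration under assumed names (`SharedLoRA`, `delta`) and is not the paper's released implementation; rank, initialization, and where the adapter attaches are all assumptions.

```python
import torch
import torch.nn as nn


class SharedLoRA(nn.Module):
    """Hypothetical sketch of a dual-shared LoRA adapter.

    Intra-modal sharing: a single low-rank pair (A, B) is reused by all
    encoder layers of a modality, instead of one pair per layer.
    Inter-modal sharing: the per-layer scaling coefficients are shared
    between the vision and text encoders.
    """

    def __init__(self, dim: int, rank: int, num_layers: int):
        super().__init__()
        # Intra-modal shared matrices (one pair for the whole encoder).
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, dim))
        # Inter-modal shared coefficients: one scalar per layer, reused by
        # both modalities to generate the layer-wise adapters.
        self.coef = nn.Parameter(torch.ones(num_layers))

    def delta(self, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # Layer-wise adapter output: coef[l] * x @ A @ B, added to the
        # frozen layer's output by the caller.
        return self.coef[layer_idx] * (x @ self.A @ self.B)
```

Because `A` and `B` are shared, the trainable parameter count is `2 * dim * rank + num_layers` rather than `num_layers * 2 * dim * rank` for per-layer LoRA, which is the parameter-efficiency argument the abstract makes.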
Context
- Venue: AAAI Conference on Artificial Intelligence
- Archive span: 1980-2026
- Indexed papers: 28718
- Paper id: 169362126806954263