AAAI 2025
Exploring the Better Multimodal Synergy Strategy for Vision-Language Models
Abstract
Vision-language models (VLMs) have shown great potential for open-world visual concept comprehension. Recent research has focused on finding an optimal multimodal collaboration strategy, which significantly advances CLIP-based few-shot tasks. However, existing prompt-based solutions suffer from unidirectional information flow and increased parameter counts, since they explicitly condition the vision prompts on textual prompts across transformer layers using non-shareable coupling functions. To address these issues, we propose a Dual-shared mechanism based on LoRA (DsRA) for VLM adaptation in low-data regimes. DsRA offers several merits. First, we design an inter-modal shared coefficient that captures patterns common to the visual and textual modalities, ensuring effective mutual synergy between image and text features. Second, we propose an intra-modal shared matrix that enables parameter-efficient fine-tuning: the shared matrix is combined with the different coefficients to generate layer-wise adapters placed in the encoder layers. Extensive experiments demonstrate that DsRA improves generalizability under few-shot classification, base-to-new generalization, and domain generalization settings. Our code will be released soon.
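The abstract's core idea — one low-rank matrix pair shared within a modality across layers, plus per-layer coefficients shared between the vision and text branches — can be sketched as follows. This is a minimal illustration under assumed names (`SharedLoRA`, `delta`) and is not the paper's released implementation; rank, initialization, and where the adapter attaches are all assumptions.

```python
import torch
import torch.nn as nn


class SharedLoRA(nn.Module):
    """Hypothetical sketch of a dual-shared LoRA adapter.

    Intra-modal sharing: a single low-rank pair (A, B) is reused by all
    encoder layers of a modality, instead of one pair per layer.
    Inter-modal sharing: the per-layer scaling coefficients are shared
    between the vision and text encoders.
    """

    def __init__(self, dim: int, rank: int, num_layers: int):
        super().__init__()
        # Intra-modal shared matrices (one pair for the whole encoder).
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, dim))
        # Inter-modal shared coefficients: one scalar per layer, reused by
        # both modalities to generate the layer-wise adapters.
        self.coef = nn.Parameter(torch.ones(num_layers))

    def delta(self, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # Layer-wise adapter output: coef[l] * x @ A @ B, added to the
        # frozen layer's output by the caller.
        return self.coef[layer_idx] * (x @ self.A @ self.B)
```

Because `A` and `B` are shared, the trainable parameter count is `2 * dim * rank + num_layers` rather than `num_layers * 2 * dim * rank` for per-layer LoRA, which is the parameter-efficiency argument the abstract makes.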
Context
- Venue: AAAI Conference on Artificial Intelligence
- Archive span: 1980-2026
- Indexed papers: 28718
- Paper id: 169362126806954263