
AAAI 2025

When Open-Vocabulary Visual Question Answering Meets Causal Adapter: Benchmark and Approach

Conference Paper · AAAI Technical Track on Computer Vision VIII · Artificial Intelligence

Abstract

Visual Question Answering (VQA) is a multifaceted task that integrates computer vision and natural language processing to produce textual answers from images and questions. Existing VQA benchmarks predominantly adhere to a closed-set paradigm, limiting their ability to address arbitrary, unseen answers, and thus falling short in real-world scenarios. To address this limitation, we introduce the Open-Vocabulary Visual Question Answering (OVVQA) benchmark, specifically designed to evaluate models under open-world conditions by assessing their performance on both base classes (seen, common answers) and novel classes (unseen, rare answers). In conjunction with this benchmark, we propose a model-agnostic Causal Adapter to combat the inherent bias found in current VQA tasks. Our approach leverages front-door adjustment to enhance causal reasoning, significantly improving model performance on novel categories while maintaining accuracy on base classes. Additionally, we introduce an adaptive transfer loss to facilitate the transfer of more knowledge from the pretrained model to our OVVQA task. Extensive experiments across multiple datasets validate the superiority of our method over existing state-of-the-art approaches, demonstrating its robust generalization and adaptability in open-world VQA scenarios.
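The paper's Causal Adapter builds on Pearl's front-door adjustment. The paper's own implementation is not reproduced here; as background, the front-door formula P(y | do(x)) = Σ_m P(m | x) Σ_x' P(x') P(y | m, x') can be sketched on a toy discrete example. All distributions below are illustrative stand-ins (e.g. X as a biased input, M as a mediator feature, Y as the answer), not values from the paper.

```python
import numpy as np

# Illustrative toy distributions (binary X, M, Y); not from the paper.
p_x = np.array([0.6, 0.4])                    # P(X)
p_m_given_x = np.array([[0.7, 0.3],           # P(M | X=0)
                        [0.2, 0.8]])          # P(M | X=1)
# P(Y | M, X), indexed [m, x, y]; each row sums to 1 over y.
p_y_given_mx = np.array([[[0.9, 0.1], [0.5, 0.5]],
                         [[0.3, 0.7], [0.2, 0.8]]])

def front_door(p_x, p_m_given_x, p_y_given_mx, x, y):
    """P(Y=y | do(X=x)) via the front-door adjustment formula:
    sum over mediator m of P(m|x) * sum over x' of P(x') * P(y|m,x')."""
    total = 0.0
    for m in range(p_m_given_x.shape[1]):
        inner = sum(p_x[xp] * p_y_given_mx[m, xp, y]
                    for xp in range(len(p_x)))
        total += p_m_given_x[x, m] * inner
    return total

print(front_door(p_x, p_m_given_x, p_y_given_mx, 0, 1))  # → 0.404
```

Unlike the observational P(y | x), the inner sum averages over all values x' of the treatment, which blocks the confounding path through X; this is the deconfounding effect the paper exploits to improve generalization to novel answer classes.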

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980-2026
Indexed papers
28718
Paper id
108046255691919001