Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Yuyang Hong; Jiaqi Gu; Yang Qi; Lubin Fan; Yue Wu; Ying Wang; Kun Ding; Shiming Xiang; Jieping Ye

NeurIPS 2025

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

Abstract

Knowledge-Based Visual Question Answering (KB-VQA) requires a model to understand visual features and retrieve external knowledge. Retrieval-Augmented Generation (RAG) has been employed to address this problem by querying a knowledge base. However, existing work exhibits two limitations: insufficient interactivity during knowledge retrieval and ineffective organization of the retrieved information for the Visual-Language Model (VLM). To address these challenges, we propose a three-stage framework, Visual Language Model with Process, Retrieve and Filter (VLM-PRF). For interactive retrieval, VLM-PRF uses reinforcement learning (RL) to guide the model to strategically process information via tool-driven operations. For knowledge filtering, our method trains the VLM to transform the raw retrieved information into task-specific knowledge. With a dual reward as the supervisory signal, VLM-PRF enables the model to optimize its retrieval strategy and its answer generation capability simultaneously. Experiments on two datasets demonstrate the effectiveness of our framework.
