Arrow Research search
Back to NeurIPS

NeurIPS 2025

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Conference Paper Main Conference Track Artificial Intelligence ยท Machine Learning

Abstract

The task of Knowlegde-Based Visual Question Answering (KB-VQA) requires the model to understand visual features and retrieve external knowledge. Retrieval-Augmented Generation (RAG) have been employed to address this problem through knowledge base querying. However, existing work demonstrate two limitations: insufficient interactivity during knowledge retrieval and ineffective organization of retrieved information for Visual-Language Model (VLM). To address these challenges, we propose a three-stage visual language model with Process, Retrieve and Filter (VLM-PRF) framework. For interactive retrieval, VLM-PRF uses reinforcement learning (RL) to guide the model to strategically process information via tool-driven operations. For knowledge filtering, our method trains the VLM to transform the raw retrieved information into into task-specific knowledge. With a dual reward as supervisory signals, VLM-PRF successfully enable model to optimize retrieval strategies and answer generation capabilities simultaneously. Experiments on two datasets demonstrate the effectiveness of our framework.

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
Annual Conference on Neural Information Processing Systems
Archive span
1987-2025
Indexed papers
30776
Paper id
879420422311450300