AAAI 2025

ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions

Conference Paper | AAAI Technical Track on Computer Vision VI | Artificial Intelligence

Abstract

In this study, we establish a benchmark and a baseline approach for Multimodal referring and grounding with Chain-of-Questions (MCQ), opening up a promising direction for ‘logical’ multimodal dialogues. The newly collected dataset, named CB-300K, covers challenges including probing dialogues about spatial relationships among multiple objects, consistent reasoning, and complex question chains. The baseline approach, termed ChatterBox, uses a modularized design and a referent feedback mechanism to ensure logical coherence in continuous referring and grounding tasks. This design reduces the risk of referential confusion, simplifies the training process, and retains the language model’s generation ability. Experiments show that ChatterBox demonstrates superiority in MCQ both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with logical interactions.

Context

Venue
AAAI Conference on Artificial Intelligence