ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions

Yunjie Tian; Tianren Ma; Lingxi Xie; Qixiang Ye

doi:10.1609/aaai.v39i7.32796

AAAI 2025

Conference Paper AAAI Technical Track on Computer Vision VI Artificial Intelligence

Abstract

In this study, we establish a benchmark and a baseline approach for Multimodal referring and grounding with Chain-of-Questions (MCQ), opening up a promising direction for ‘logical’ multimodal dialogues. The newly collected dataset, named CB-300K, covers challenges including probing dialogues with spatial relationships among multiple objects, consistent reasoning, and complex question chains. The baseline approach, termed ChatterBox, uses a modularized design and a referent feedback mechanism to ensure logical coherence in continuous referring and grounding tasks. This design reduces the risk of referential confusion, simplifies the training process, and proves effective at retaining the language model’s generation ability. Experiments show that ChatterBox achieves superior MCQ performance both quantitatively and qualitatively, paving a new path toward multimodal dialogue scenarios with logical interactions.
