
AAAI 2026

The Visual Prism: Refracting Images into Parallel Multilingual Descriptions with Structured Visual Guidance

Conference Paper | AAAI Technical Track on Natural Language Processing | Artificial Intelligence

Abstract

Parallel corpora, as the foundation of machine translation, remain crucial even in the era of large language models (LLMs) for pre-training and fine-tuning. However, annotating parallel corpora is extremely costly, as it requires annotators proficient in multiple languages. To reduce this cost, prior work has explored image-pivoted corpus synthesis, generating multilingual captions for the same image as pseudo-parallel data. Unfortunately, these pseudo corpora suffer from the serious issue of multilingual focus divergence, i.e., the model attending to distinct aspects of the image when generating captions in different languages. To address this problem, we propose a method called PRISMS (Parallel Refracting ImageS into Multilingual descriptions with Structured visual guidance), which leverages semantic graphs as structured visual guidance to unify the focus of multilingual captions. To ensure adherence to this guidance, we introduce two key techniques: supervised fine-tuning on self-generated instructional data, and reinforcement learning with a reward signal based on semantic graph consistency. Experimental results on five languages show that PRISMS significantly improves image-pivoted parallel corpus synthesis, enabling LLMs to achieve translation performance comparable to that of models trained on manually annotated corpora.
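The abstract describes a reinforcement-learning reward based on semantic graph consistency between captions. One plausible instantiation, sketched below purely for illustration, represents each caption's semantic graph as a set of (subject, relation, object) triples in a shared schema and scores cross-lingual agreement as the F1 overlap of the two triple sets. The function name, the triple representation, and the example graphs are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: semantic-graph-consistency reward as triple-level F1.
# A caption in each language is assumed to be parsed into a set of
# (subject, relation, object) triples in a shared schema.

def graph_consistency_reward(triples_a, triples_b):
    """F1 overlap between two sets of semantic-graph triples."""
    set_a, set_b = set(triples_a), set(triples_b)
    if not set_a or not set_b:
        return 0.0
    overlap = len(set_a & set_b)       # triples both captions express
    precision = overlap / len(set_a)
    recall = overlap / len(set_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two captions of the same image that share one triple but
# each mention one aspect the other omits (focus divergence).
en_graph = [("dog", "chase", "ball"), ("ball", "color", "red")]
de_graph = [("dog", "chase", "ball"), ("dog", "on", "grass")]
print(graph_consistency_reward(en_graph, de_graph))  # 0.5
```

A reward like this would penalize exactly the failure mode the abstract names: captions that are individually fluent but describe different aspects of the image receive a low consistency score.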



Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980-2026
Indexed papers
28718
Paper id
812734420926691472