Knowledge-Enhanced Image Captioning with Adaptive Graph-based Multimodal Alignment and LLM

Guoyi Li; Die Hu; Haozhe Li; Zhongjiang Yao; Wei Mi; Zongzhen Liu; Xiaodan Zhang; Honglei Lyu

doi:10.1609/aaai.v40i18.38532

Back to AAAI

AAAI 2026

Knowledge-Enhanced Image Captioning with Adaptive Graph-based Multimodal Alignment and LLM

Conference Paper AAAI Technical Track on Data Mining & Knowledge Management II Artificial Intelligence

PDF Details DOI

Abstract

Image captioning is crucial for multimodal understanding, bridging visual content and natural language. Despite recent advancements in Large Multimodal Models (LMMs), when faced with unseen entities or scenes in the open world, even when attempting to leverage learned knowledge, models still struggle with vague and inaccurate descriptions, and may even generate knowledge hallucinations. A key reason is that the model fails to effectively integrate knowledge with visual information, limiting its understanding of visual content. Thus, we propose Adaptive Knowledge Graph-guided Multimodal Alignment (AKGMA) for image captioning, which enhances semantic understanding in open-world scenes through visual knowledge reasoning, reducing knowledge hallucinations and improving caption quality. It consist three key components: Entity-guided Knowledge Aligner (EKA), Adaptive Knowledge Graph Construction (AKGC), and Scene-Context Knowledge Adapter (SCKA). EKA connects visual entities to knowledge graphs, providing structured knowledge to a small language model, which interacts with a visual encoder to acquire visual knowledge. AKGC uses reinforcement learning to build image-relevant subgraphs to optimize knowledge prompts and improve knowledge hallucinations. SCKA leverages scene graph annotations to extract visual contextual knowledge and inject it into Large Language Models (LLMs), ensuring the generated descriptions are consistent with the image's details. Additionally, we introduce UniKnowCap, a new image knowledge description dataset spanning various open-world knowledge domains, designed to evaluate the knowledge accuracy and detail consistency of model-generated descriptions. Extensive experiments show our model outperforms baselines across multiple metrics.

Knowledge-Enhanced Image Captioning with Adaptive Graph-based Multimodal Alignment and LLM

Abstract

Authors

Keywords

Context