Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

Jiuxiang Gu; Jianfei Cai; Gang Wang; Tsuhan Chen

AAAI 2018

Conference Paper AAAI Technical Track: Vision Artificial Intelligence

Abstract

Existing image captioning approaches typically train a one-stage sentence decoder, which struggles to generate rich, fine-grained descriptions. On the other hand, multi-stage image captioning models are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which operates on the output of the previous stage and produces increasingly refined image descriptions. Our proposed learning approach addresses the difficulty of vanishing gradients during training by providing a learning objective function that enforces intermediate supervision. In particular, we optimize our model with a reinforcement learning approach that uses the output of each intermediate decoder's test-time inference algorithm, as well as the output of its preceding decoder, to normalize the rewards. This simultaneously addresses the well-known exposure bias problem and the loss-evaluation mismatch problem. We extensively evaluate the proposed approach on MSCOCO and show that it achieves state-of-the-art performance.
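The reward normalization described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function names `stage_advantages` and `overlap_reward` are hypothetical, the baseline for a stage is taken to be the reward of that stage's test-time (greedy) caption, averaged with the reward of the preceding stage's test-time caption when one exists, and a toy word-overlap score stands in for a real captioning metric.

```python
from typing import Callable, List, Sequence

Caption = Sequence[str]


def stage_advantages(
    sampled: List[Caption],
    greedy: List[Caption],
    reward: Callable[[Caption], float],
) -> List[float]:
    """Return one normalized reward (advantage) per decoder stage.

    sampled[i] is a caption sampled from stage i during training;
    greedy[i] is the caption stage i produces with its test-time inference.
    """
    if len(sampled) != len(greedy):
        raise ValueError("sampled and greedy must have one caption per stage")

    advantages = []
    for i in range(len(sampled)):
        # Baseline from the stage's own test-time output.
        baseline = reward(greedy[i])
        if i > 0:
            # ASSUMPTION: combine with the preceding stage's output by a
            # simple average; the paper's actual combination may differ.
            baseline = 0.5 * (baseline + reward(greedy[i - 1]))
        advantages.append(reward(sampled[i]) - baseline)
    return advantages


def overlap_reward(reference: Caption) -> Callable[[Caption], float]:
    """Toy reward: fraction of caption words that appear in the reference."""
    ref = set(reference)

    def reward(caption: Caption) -> float:
        if not caption:
            return 0.0
        return sum(1 for word in caption if word in ref) / len(caption)

    return reward
```

Each stage receives its own advantage, which is one way to realize the intermediate supervision the abstract mentions: every decoder gets a direct learning signal instead of relying on gradients propagated back from the final stage.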
