TIST 2026 Journal Article
Cascade Transformer for Hierarchical Semantic Reasoning in Text-Based Visual Question Answering
- Yuan Gao
- Dezhen Feng
- Laurence T. Yang
- Jing Yang
- Xiaowen Jiang
- Jieming Yang
Text-based visual question answering (TextVQA) aims to answer questions by understanding scene text in images. However, many current methods depend heavily on the accuracy of Optical Character Recognition (OCR) systems while overlooking the role of visual objects, and they tend to perform poorly when a question involves the relationships between visual objects and scene text. To address these issues, we elevate the role of visual objects and propose a hierarchical semantic reasoning network based on a cascade transformer architecture (CT-HSR), which achieves fine-grained cross-modal reasoning and visual semantic enhancement. Specifically, visual representations enriched with the semantics of the question modality are first obtained through a cross-modal transformer-based vision-language pre-training model. A uni-modal transformer then performs unified modality encoding to capture the visual objects most semantically related to the OCR text. In addition, a feature filtering strategy further alleviates cross-modal noise. Finally, we align the three modalities by introducing TextVQA pre-training tasks and generate answers through multi-step iterative prediction during fine-tuning. Extensive experiments on the TextVQA, ST-VQA, and OCR-VQA datasets demonstrate the effectiveness of the proposed model compared with state-of-the-art methods. The code will be released at https://github.com/FTFWO/CT-HSR.
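The abstract describes a four-stage cascade: cross-modal encoding of question and visual features, uni-modal encoding that relates objects to OCR text, a feature filtering step, and multi-step iterative answer prediction. The following is a minimal data-flow sketch of that cascade; all function names, interfaces, and placeholder logic are assumptions for illustration and do not reflect the paper's actual modules.

```python
# Hypothetical sketch of the CT-HSR cascade described in the abstract.
# Every stage here is a stand-in; the real model uses transformer modules.

def cross_modal_encode(question, visual_feats):
    """Stage 1 (assumed): infuse question semantics into visual features."""
    return [f"{v}|q:{question}" for v in visual_feats]

def unimodal_encode(object_feats, ocr_feats):
    """Stage 2 (assumed): jointly encode objects and OCR tokens so that
    objects semantically related to the scene text are paired with it."""
    return [(obj, tok) for obj in object_feats for tok in ocr_feats]

def feature_filter(pairs, keep=2):
    """Stage 3 (assumed): discard noisy cross-modal pairs, keeping the top-k."""
    return pairs[:keep]

def iterative_decode(filtered_pairs, steps=3):
    """Stage 4 (assumed): predict the answer token-by-token over several steps."""
    answer_tokens = []
    for step in range(steps):
        # A real decoder would attend over filtered_pairs at each step.
        answer_tokens.append(f"tok{step}")
    return " ".join(answer_tokens)

# Example end-to-end pass with dummy features.
fused = cross_modal_encode("what is the sign?", ["obj_a", "obj_b"])
pairs = unimodal_encode(fused, ["ocr_stop", "ocr_exit"])
answer = iterative_decode(feature_filter(pairs, keep=2), steps=2)
```

The sketch only captures the order of the stages and the data handed between them, which is the structural point the abstract makes about the cascade design.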