Author name cluster

Yihao Ding

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers

1 author row

AAAI Conference 2026 Conference Paper

A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation

Puzhen Wu
Hexin Dong
Yi Lin
Yihao Ding
Yifan Peng

Radiology report generation from chest X-rays is an important task in artificial intelligence with the potential to greatly reduce radiologists' workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease-awareness in visual representations and adequate vision-language alignment to meet the specialized requirements of medical image analysis. As a result, these models usually overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage 1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage 2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on benchmark datasets (i.e., CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.
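
As a rough illustration of the retrieval idea mentioned in the abstract above (combining visual and disease-specific similarities to pick exemplars), the sketch below ranks stored exemplars by a weighted sum of two cosine similarities. The abstract does not say how the two similarities are combined, so the weighted sum, the `alpha` parameter, the function names, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_visual, query_disease, bank, k=2, alpha=0.5):
    """Return ids of the top-k exemplars by combined similarity.

    bank  : list of (exemplar_id, visual_vec, disease_vec) tuples
    alpha : hypothetical weight on the visual similarity (assumption)
    """
    scored = []
    for ex_id, vis, dis in bank:
        score = (alpha * cosine(query_visual, vis)
                 + (1.0 - alpha) * cosine(query_disease, dis))
        scored.append((score, ex_id))
    # Highest combined similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex_id for _, ex_id in scored[:k]]

# Toy exemplar bank: made-up vectors, not real features
bank = [
    ("report_a", [1.0, 0.0], [1.0, 0.0, 0.0]),
    ("report_b", [0.0, 1.0], [1.0, 0.0, 0.0]),
    ("report_c", [0.0, 1.0], [0.0, 1.0, 0.0]),
]
top = retrieve([1.0, 0.0], [1.0, 0.0, 0.0], bank, k=2)
print(top)
```

In this toy setup, `report_b` differs from the query visually but shares its disease vector, so it still outranks `report_c`, which matches on neither signal.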


IJCAI Conference 2025 Conference Paper

VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding

Yihao Ding
Soyeon Caren Han
Yan Li
Josiah Poon

Visually Rich Document Understanding (VRDU) has emerged as a critical field in document intelligence, enabling automated extraction of key information from complex documents across domains such as medical, financial, and educational applications. However, form-like documents pose unique challenges due to their complex layouts, multi-stakeholder involvement, and high structural variability. Addressing these issues, the VRD-IU Competition was introduced, focusing on extracting and localizing key information from multi-format forms within the Form-NLU dataset, which includes digital, printed, and handwritten documents. This paper presents insights from the competition, which featured two tracks: Track A, emphasizing entity-based key information retrieval, and Track B, targeting end-to-end key information localization from raw document images. With over 20 participating teams, the competition showcased various state-of-the-art methodologies, including hierarchical decomposition, transformer-based retrieval, multimodal feature fusion, and advanced object detection techniques. The top-performing models set new benchmarks in VRDU, providing valuable insights into document intelligence.


IJCAI Conference 2024 Conference Paper

MMVQA: A Comprehensive Dataset for Investigating Multipage Multimodal Information Retrieval in PDF-based Visual Question Answering

Yihao Ding
Kaixuan Ren
Jiabin Huang
Siwen Luo
Soyeon Caren Han

Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly with lengthy textual content. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. The paper introduces PDF-MVQA, tailored for research journal articles, encompassing multiple pages and multimodal retrieval. Our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. The main contribution is introducing a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. We aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA. Code and Appendix are in https://github.com/adlnlp/pdfmvqa


AAAI Conference 2024 Short Paper

The Language Model Can Have the Personality: Joint Learning for Personality Enhanced Language Model (Student Abstract)

Tianyi Chen
Feiqi Cao
Yihao Ding
Caren Han

With the introduction of large language models, chatbots are becoming more conversational, communicating more effectively and handling increasingly complex tasks. To make a chatbot more relatable and engaging, we propose a new language model idea that maps human-like personality. In this paper, we propose a systematic Personality-Enhanced Language Model (PELM) approach that jointly learns personality classification and language generation tasks. The proposed PELM leverages a dataset built on a defined personality typology, the Myers-Briggs Type Indicator, and uses a joint learning and cross-teaching structure, consisting of classification and language modelling, to incorporate personalities via both distinctive types and textual information. The results show that PELM can generate better personality-based outputs than baseline models.
