Arrow Research search

Author name cluster

Ting Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

JBHI Journal 2025 Journal Article

Adapter-Enhanced Hierarchical Cross-Modal Pre-Training for Lightweight Medical Report Generation

  • Ting Yu
  • Wangwen Lu
  • Yan Yang
  • Weidong Han
  • Qingming Huang
  • Jun Yu
  • Ke Zhang

Automatic medical report generation is an emerging field that aims to transform medical images into descriptive, clinically relevant narratives, potentially reducing radiologists' workload significantly. Despite substantial progress, ever-larger model parameter sizes that yield only marginal performance gains have limited further development and application. To address this challenge, we introduce an Adapter-enhanced Hierarchical cross-modal Pre-training (AHP) strategy for lightweight medical report generation. This approach significantly reduces the pre-trained model's parameter size while maintaining superior report generation performance through our proposed spatial adapters. To further address the inadequate representation of visual spatial details, we employ a convolutional stem combined with hierarchical injectors and extractors, fully integrated with traditional Vision Transformers to achieve more comprehensive visual representations. Additionally, our cross-modal pre-training model effectively handles the inherently complex visual-textual relationships in medical imaging. Extensive experiments on multiple datasets, including IU X-Ray, MIMIC-CXR, and bladder pathology, demonstrate our model's exceptional generalization and transfer performance in downstream medical report generation tasks, highlighting AHP's potential to significantly reduce model parameters while enhancing report generation accuracy and efficiency.
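
As a rough illustration of the adapter idea described in this abstract (not the paper's AHP implementation; the module name, bottleneck width, and placement are assumptions), an adapter-style block typically adds a small trainable bottleneck with a residual connection around a frozen transformer sub-layer, so only a few parameters are updated during pre-training:

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Illustrative adapter: a small trainable bottleneck MLP with a
        residual connection, inserted after a frozen transformer sub-layer."""
        def __init__(self, dim: int, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)   # project to a narrow space
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck, dim)     # project back to model width

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # the residual path keeps the frozen backbone's features intact
            return x + self.up(self.act(self.down(x)))

    # Usage sketch: freeze a pre-trained backbone and train only the adapters.
    # backbone = ...  # e.g. a ViT encoder
    # for p in backbone.parameters():
    #     p.requires_grad = False
    # adapter = BottleneckAdapter(dim=768)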

JBHI Journal 2025 Journal Article

Consistency Conditioned Memory Augmented Dynamic Diagnosis Model for Medical Visual Question Answering

  • Ting Yu
  • Binhui Ge
  • Shuhui Wang
  • Yan Yang
  • Qingming Huang
  • Jun Yu

Medical Visual Question Answering (Med-VQA) holds immense promise as an invaluable medical assistance tool, offering timely diagnostic outcomes based on medical images and accompanying questions, thereby supporting medical professionals in making accurate clinical decisions. However, Med-VQA is still in its infancy, with existing solutions falling short in imitating human diagnostic processes and ensuring result consistency. To address these challenges, we propose a Consistency Conditioned Memory augmented Dynamic diagnosis model (CoCoMeD), incorporating two core components: a dynamic memory diagnosis engine and a consistency-conditioned enforcer. The dynamic memory diagnosis engine enables intricate diagnostic interactions by retaining vital visual cues from medical images and iteratively updating pertinent memories. This dynamic reasoning capability mirrors the cognitive processes observed in skilled medical diagnosticians, effectively enhancing the model's ability to reason over diverse medical visual facts and patient-specific questions. Moreover, to strengthen diagnostic coherence, the consistency-conditioned enforcer imposes coherence constraints linking interrelated questions about identical medical facts, ensuring the credibility and reliability of its diagnostic outcomes. Additionally, we present C-SLAKE, an extended Med-VQA dataset encompassing diverse medical image types with categorized diagnostic question-answer pairs for consistent Med-VQA evaluation on rich medical sources. Comprehensive experiments on DME and C-SLAKE showcase CoCoMeD's superior performance and potential to advance trustworthy multi-source medical question answering.
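
The iterative memory update described here can be pictured with a generic gated-write step; this is only a sketch of the general memory-augmentation pattern, not CoCoMeD's actual engine, and the module name and gating form are assumptions:

    import torch
    import torch.nn as nn

    class GatedMemoryUpdate(nn.Module):
        """Illustrative gated write: blend a memory slot with a new visual cue."""
        def __init__(self, dim: int):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)       # how much to overwrite
            self.candidate = nn.Linear(2 * dim, dim)  # proposed new content

        def forward(self, memory: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
            h = torch.cat([memory, cue], dim=-1)
            z = torch.sigmoid(self.gate(h))           # write gate in [0, 1]
            c = torch.tanh(self.candidate(h))
            # keep (1 - z) of the old memory, write z of the new candidate
            return (1 - z) * memory + z * c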

AAAI Conference 2025 Conference Paper

Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering

  • Ting Yu
  • Zixuan Tong
  • Jun Yu
  • Ke Zhang

Medical Visual Question Answering (MedVQA) serves as an automated medical assistant, capable of answering patient queries and aiding physician diagnoses based on medical images and questions. Recent advancements have shown that incorporating Large Language Models (LLMs) into MedVQA tasks significantly enhances the capability for answer generation. However, for tasks requiring fine-grained organ-level precise localization, relying solely on language prompts struggles to accurately locate relevant regions within medical images due to substantial background noise. To address this challenge, we explore the use of visual prompts in MedVQA tasks for the first time and propose fine-grained adaptive visual prompts to enhance generative MedVQA. Specifically, we introduce an Adaptive Visual Prompt Creator that adaptively generates region-level visual prompts based on image characteristics of various organs, providing fine-grained references for LLMs during answer retrieval and generation from the medical domain, thereby improving the model's precise cross-modal localization capabilities on original images. Furthermore, we incorporate a Hierarchical Answer Generator with Parameter-Efficient Fine-Tuning (PEFT) techniques, significantly enhancing the model's understanding of spatial and contextual information with minimal parameter increase, promoting the alignment of representation learning with the medical space. Extensive experiments on VQA-RAD, SLAKE, and DME datasets validate the effectiveness of our proposed method, demonstrating its potential in generative MedVQA.
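
To make the visual-prompt idea concrete, here is a minimal sketch (assumed module name and shapes; not the paper's Adaptive Visual Prompt Creator) that projects pooled region-level features into the LLM's embedding space and prepends them to the question tokens:

    import torch
    import torch.nn as nn

    class RegionPromptProjector(nn.Module):
        """Illustrative: map region-level visual features to LLM prompt tokens."""
        def __init__(self, vis_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vis_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, region_feats: torch.Tensor,
                    question_embeds: torch.Tensor) -> torch.Tensor:
            # region_feats: (B, R, vis_dim); question_embeds: (B, T, llm_dim)
            prompts = self.proj(region_feats)          # (B, R, llm_dim)
            # prepend the region-level visual prompts to the question embeddings
            return torch.cat([prompts, question_embeds], dim=1)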

TIST Journal 2021 Journal Article

TARA-Net: A Fusion Network for Detecting Takeaway Rider Accidents

  • Yifan He
  • Zhao Li
  • Lei Fu
  • Anhui Wang
  • Peng Zhang
  • Shuigeng Zhou
  • Ji Zhang
  • Ting Yu

In the emerging business of food delivery, rider traffic accidents raise financial costs and add to the social traffic burden. Although there has been much effort on traffic accident forecasting using temporal-spatial prediction models, no existing work studies the problem of detecting takeaway rider accidents from food delivery trajectory data. In this article, we aim to detect whether a takeaway rider has an accident during a certain time period based on food delivery trajectories and riders’ contextual information. The food delivery data has a heterogeneous information structure and carries contextual information such as weather and delivery history, while trajectory data are collected as spatial-temporal sequences. We propose a TakeAway Rider Accident detection fusion network, TARA-Net, to jointly model these heterogeneous and spatial-temporal sequence data. We utilize a residual network to extract basic contextual features and a transformer encoder to capture trajectory features. These embedding features are concatenated and fed into a pyramidal feed-forward neural network. We jointly train the above three components to combine the benefits of spatial-temporal trajectory data and sparse basic contextual data for early detection of traffic accidents. Furthermore, because traffic accidents rarely happen in food delivery, we propose a sampling mechanism to alleviate the sample imbalance when training the model. We evaluate the model on the transportation mode classification dataset Geolife and a real-world Ele.me dataset with over 3 million riders. The experimental results show that the proposed model is superior to the state of the art.
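
One common way to realize such a sampling mechanism (a generic inverse-frequency oversampler for illustration, not necessarily the scheme proposed in the paper) is to weight the rare accident examples more heavily when drawing training batches:

    import torch
    from torch.utils.data import WeightedRandomSampler

    def make_balanced_sampler(labels):
        """Illustrative: oversample the rare accident class (label 1)."""
        labels = torch.as_tensor(labels, dtype=torch.long)
        counts = torch.bincount(labels).float()       # per-class frequencies
        weights = 1.0 / counts[labels]                # inverse-frequency weights
        return WeightedRandomSampler(weights.tolist(),
                                     num_samples=len(labels),
                                     replacement=True)

    # Usage sketch (dataset and its label list are placeholders):
    # sampler = make_balanced_sampler(dataset.labels)
    # loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)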

AAAI Conference 2019 Conference Paper

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

  • Zhou Yu
  • Dejing Xu
  • Jun Yu
  • Ting Yu
  • Zhou Zhao
  • Yueting Zhuang
  • Dacheng Tao

Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, VideoQA datasets are limited in scale or automatically generated, which restricts their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos.