Yanhui Chen Papers

AAAI Conference 2026 Conference Paper

From Attribution to Action: Jointly ALIGNing Predictions and Explanations

Dongsheng Hong
Chao Chen
Yanhui Chen
Shanshan Lin
Zhihao Chen
Xiangwen Liao

Explanation-guided learning (EGL) has shown promise in aligning model predictions with interpretable reasoning, particularly in computer vision tasks. However, most approaches rely on external annotations or heuristic-based segmentation to supervise model explanations, which can be noisy, imprecise, and difficult to scale. In this work, we provide both empirical and theoretical evidence that low-quality supervision signals can degrade model performance rather than improve it. In response, we propose ALIGN, a novel framework that jointly trains a classifier and a masker in an iterative manner. The masker learns to produce soft, task-relevant masks that highlight informative regions, while the classifier is optimized for both prediction accuracy and alignment between its saliency maps and the learned masks. By leveraging high-quality masks as guidance, ALIGN improves both interpretability and generalizability. Experiments on two domain generalization benchmarks, VLCS and Terra Incognita, show that ALIGN consistently outperforms six strong baselines in both in-distribution and out-of-distribution settings. ALIGN also yields superior explanation quality in terms of sufficiency and comprehensiveness, highlighting its effectiveness in producing accurate and interpretable models.
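
The classifier objective described in the abstract combines a prediction term with a saliency-to-mask alignment term. The abstract does not give the exact form, so the following is only a minimal sketch under stated assumptions: cross-entropy for the prediction term, a mean squared difference between a max-normalised saliency map and the soft mask for the alignment term, and a hypothetical weight `lam`. All function names are illustrative and not taken from the paper.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class given predicted probabilities."""
    return -math.log(max(probs[label], 1e-12))

def alignment_penalty(saliency, mask):
    """Assumed alignment term: mean squared difference between a
    max-normalised absolute saliency map and a soft mask in [0, 1].
    Both inputs are flat lists of equal length."""
    if len(saliency) != len(mask):
        raise ValueError("saliency and mask must have the same length")
    peak = max(abs(s) for s in saliency) or 1.0  # avoid division by zero
    normalised = [abs(s) / peak for s in saliency]
    return sum((n - m) ** 2 for n, m in zip(normalised, mask)) / len(mask)

def joint_loss(probs, label, saliency, mask, lam=1.0):
    """Prediction loss plus lam times the alignment penalty (lam is hypothetical)."""
    return cross_entropy(probs, label) + lam * alignment_penalty(saliency, mask)
```

Under these assumptions, a saliency map that concentrates on the masked region adds nothing to the loss, while one that highlights regions outside the mask is penalised in proportion to `lam`.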


EAAI Journal 2025 Journal Article

Diffusion-based vision-language model for zero-shot anomaly detection in medical images

Yanhui Chen
Hongkang Tao
Zan Yang
Yunkang Cao
Chen Jiang
Longhua Hu
Pengwen Xiong
Haobo Qiu

With the rapid advancement of diagnostic technology, the ability to detect pathological areas such as tumors and polyps has significantly improved. This progress provides medical imaging specialists with more precise visual information to support anomaly identification, diagnosis, treatment planning, and patient monitoring. However, existing unsupervised and semi-supervised anomaly detection methods struggle with data privacy constraints, limited annotated medical datasets, and challenges in generalization. Zero-Shot Anomaly Detection (ZSAD), which enables the detection of unseen categories without requiring class-specific training, has emerged as a promising solution by leveraging the vision-language alignment capabilities of Vision-Language Models (VLMs), such as Contrastive Language-Image Pretraining (CLIP). Despite recent progress, ZSAD remains hindered by high noise levels, sparse targets, and poor adaptability in complex medical imaging scenarios. To address these issues, we propose a novel framework: DiffusionCLIP, a diffusion-based VLM for zero-shot anomaly detection in two-dimensional medical images. Specifically, DiffusionCLIP integrates diffusion models into the VLM to progressively denoise multi-level features extracted from the CLIP visual encoder, enhancing feature robustness and discriminability. A multi-level feature fusion strategy is designed to aggregate multi-scale representations from different depths of the visual encoder, ensuring complementary semantic alignment across layers. In addition, a dynamically modulated weight loss function is introduced to adaptively balance the learning of hard and easy samples, further improving model generalization. Extensive experiments on multiple benchmark medical imaging datasets demonstrate that the proposed method significantly outperforms existing zero-shot anomaly detection approaches in terms of accuracy, robustness, and generalization.
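
The abstract does not specify the form of the dynamically modulated weight loss. As an illustration of the general idea of balancing hard and easy samples, the sketch below uses the well-known focal loss as a stand-in, not the paper's own function. The names `focal_weight` and `focal_loss` and the default `gamma` are assumptions made for this example.

```python
import math

def focal_weight(p_true, gamma=2.0):
    """Modulating factor (1 - p)^gamma: close to 0 for confidently correct
    (easy) samples, close to 1 for poorly predicted (hard) samples."""
    return (1.0 - p_true) ** gamma

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample, where p_true is the predicted probability
    of the true class. With gamma = 0 it reduces to cross-entropy."""
    p = min(max(p_true, 1e-12), 1.0)  # clamp for numerical safety
    return -focal_weight(p, gamma) * math.log(p)

# An easy sample (p = 0.9) is down-weighted far more than a hard one (p = 0.3).
easy, hard = focal_loss(0.9), focal_loss(0.3)
```

Raising `gamma` shifts more of the training signal toward hard samples; the weighting scheme actually used in DiffusionCLIP may differ.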

