Arrow Research search

Author name cluster

Liwen Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
1 author row

Possible papers

4

AAAI Conference 2026 Conference Paper

Is Your (Reasoning) Multimodal Language Model Vulnerable Toward Distractions?

  • Ming Liu
  • Hao Chen
  • Jindong Wang
  • Liwen Wang
  • Jingchen Sun
  • Wensheng Zhang

Vision-Language Models (VLMs) have achieved success in tasks such as visual question answering, yet their resilience to distractions remains underexplored. Understanding how distractions affect VLMs' performance is crucial for real-world applications, as input data often contains noisy or irrelevant content. This paper assesses the robustness of VLMs—including general-purpose models and those specialized for reasoning—against distractions in the context of science question answering. We introduce I-ScienceQA, a new benchmark based on the ScienceQA dataset, which systematically injects distractions into both visual and textual contexts. We evaluate how distractions perturb the underlying reasoning processes of these models by analyzing changes in the textual explanations leading to answers. Our findings show that most VLMs are vulnerable to distractions, with a noticeable degradation in reasoning when extraneous content is present. Notably, however, some models (including GPT-o4 mini) exhibit a higher degree of robustness. We also observe that textual distractions generally cause greater performance declines than visual distractions. Finally, we explore mitigation strategies such as prompt engineering. Although these strategies improve resilience modestly, our analysis highlights considerable room for further improvement in the robustness of VLMs.

AAAI Conference 2025 Conference Paper

Anatomical Knowledge Mining and Matching for Semi-supervised Medical Multi-structure Detection

  • Bin Pu
  • Liwen Wang
  • Jiewen Yang
  • Xingbo Dong
  • Benteng Ma
  • Zhuangzhuang Chen
  • Lei Zhao
  • Shengli Li

In medical image analysis, detecting multiple structures is crucial for evaluation and diagnosis but is often limited by the lack of high-quality annotations. Semi-supervised object detection has emerged as a potent methodology for enhancing model performance and generalization by leveraging a vast pool of unlabeled data alongside a minimal set of labeled data. A striking observation is that both unlabeled and labeled medical images contain a priori anatomical knowledge derived from human screening. In this work, we introduce a novel semi-supervised approach named Semi-akmm for mining and matching anatomical knowledge in ultrasound images. We develop an Adaptive Prior Knowledge Transfer (APKT) module to mine and exploit the distribution and knowledge of potential proposal boxes via a proposal proportion constraint. Furthermore, within a teacher-student learning framework, we put forward an Anatomical Structure Matching (ASM) module to facilitate co-learning of consistent topological prior knowledge between the student and teacher models. To our knowledge, this is the first efficient semi-supervised medical multi-structure detection model. Our experiments across five publicly available ultrasound datasets demonstrate that Semi-akmm sets a new performance benchmark, outperforming existing methods.

NeurIPS Conference 2025 Conference Paper

Learning to Zoom with Anatomical Relations for Medical Structure Detection

  • Bin Pu
  • Liwen Wang
  • Xingbo Dong
  • Xingguo Lv
  • Zhe Jin

Accurate anatomical structure detection is a critical preliminary step for diagnosing diseases characterized by structural abnormalities. In clinical practice, medical experts frequently adjust the zoom level of medical images to obtain comprehensive views for diagnosis. This common interaction results in significant variations in the apparent scale of anatomical structures across different images or fields of view. However, the information embedded in these zoom-induced scale changes is often overlooked by existing detection algorithms. In addition, human organs possess fixed, a priori topological relations that existing detectors rarely exploit. To address these limitations, we propose ZR-DETR, a zoom-aware probabilistic framework tailored for medical object detection. ZR-DETR uniquely incorporates scale-sensitive zoom embeddings, anatomical relation constraints, and a Gaussian Process-based detection head. This architecture enables the framework to jointly model semantic context, enforce anatomical plausibility, and quantify detection uncertainty. Empirical validation across three diverse medical imaging benchmarks demonstrates that ZR-DETR consistently outperforms strong baselines in both single-domain and unsupervised domain adaptation scenarios.

NeurIPS Conference 2025 Conference Paper

On Fairness of Unified Multimodal Large Language Model for Image Generation

  • Ming Liu
  • Hao Chen
  • Jindong Wang
  • Liwen Wang
  • Bhiksha Raj
  • Wensheng Zhang

Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in end-to-end visual understanding and generation tasks. However, compared to generation-only systems (e.g., Stable Diffusion), the unified architecture of U-MLLMs introduces new risks of propagating demographic stereotypes. In this paper, we benchmark several state-of-the-art U-MLLMs and show that they exhibit significant gender and race biases in their generated outputs. To diagnose the source of these biases, we propose a locate-then-fix framework: we first audit the vision and language components, using techniques such as linear probing and controlled generation, and find that the language model appears to be a primary origin of the observed generative bias. Moreover, we observe a "partial alignment" phenomenon, where U-MLLMs exhibit less bias in understanding tasks yet produce substantially biased images. To address this, we introduce a novel balanced preference loss that enforces uniform generation probabilities across demographics by leveraging a synthetically balanced dataset. Extensive experiments show that our approach significantly reduces demographic bias while preserving semantic fidelity and image quality. Our findings underscore the need for targeted debiasing strategies in unified multimodal systems and introduce a practical approach to mitigating biases.