Arrow Research search

Author name cluster

Yifei Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

45 papers
2 author rows

Possible papers

45

AAAI Conference 2026 Conference Paper

MIRA: Evaluating Multimodal AI on Complex Clinical Reasoning in Interventional Radiology

  • Jingxiong Li
  • Chenglu Zhu
  • Sunyi Zheng
  • Yuxuan Sun
  • Yifei Wang
  • He Liu
  • Yunlong Zhang
  • Yixuan Si

We present MIRA (Multimodal Interventional RAdiology evaluation), a comprehensive benchmark for evaluating large multimodal models in expert-level interventional radiology tasks requiring specialized domain knowledge and advanced visual reasoning capabilities. Unlike existing medical benchmarks that primarily provide binary labels without contextual depth, MIRA offers diverse question formats, including open-ended, closed-ended, single-choice, and multiple-choice categories, each accompanied by detailed expert-validated explanations. The benchmark incorporates approximately 184K high-quality medical images spanning multiple imaging modalities with 1.2M meticulously generated question-answer pairs across various anatomical regions. These pairs were created through a sophisticated cascade methodology involving expert interventional radiologists at both the data collection and validation stages. Our comprehensive evaluation, encompassing zero-shot testing and fine-tuning experiments on large multimodal models, reveals significant performance gaps between AI systems and human specialists. Fine-tuning experiments demonstrate substantial improvements, with models achieving up to 0.80 accuracy on single-choice questions. MIRA establishes a challenging benchmark that suggests promising directions for developing specialized clinical AI systems for interventional radiology.

TMLR Journal 2026 Journal Article

StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

  • Jialin Yang
  • Dongfu Jiang
  • Tony He
  • Sherman Siu
  • Yuxuan Zhang
  • Disen Liao
  • Zhuofeng Li
  • Huaye Zeng

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce $\textbf{StructEval}$, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: $\textbf{1)}$ generation tasks, producing structured output from natural language prompts, and $\textbf{2)}$ conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps—even state-of-the-art models like o1-mini achieve an average score of only $75.58$, with open-source alternatives lagging approximately $10$ points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
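The format-adherence idea described above can be sketched as a small scoring function (an illustrative sketch only; the function name and scoring weights here are hypothetical, not StructEval's actual metric code): an output earns credit for parsing as valid JSON and further credit for containing the keys the task requires.

```python
import json

def adherence_score(output: str, required_keys: set[str]) -> float:
    """Hypothetical format-adherence score: 0.0 if unparseable,
    0.5 for valid JSON of the wrong shape, up to 1.0 when all
    required keys are present."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0                      # not even syntactically valid
    if not isinstance(obj, dict):
        return 0.5                      # valid JSON, wrong structure
    present = required_keys & obj.keys()
    return 0.5 + 0.5 * len(present) / len(required_keys)

print(adherence_score('{"name": "a", "age": 3}', {"name", "age"}))  # 1.0
print(adherence_score('not json', {"name"}))                         # 0.0
```

Separating syntactic validity from structural correctness in this way mirrors the benchmark's distinction between producing a parseable format and producing the *right* structure within it.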

JBHI Journal 2026 Journal Article

USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification

  • Changmiao Wang
  • Songqi Zhang
  • Yongquan Zhang
  • Yifei Wang
  • Liya Liu
  • Nannan Li
  • Xingzhi Li
  • Jiexin Pan

Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: https://github.com/fancccc/KidneyStoneSC.

NeurIPS Conference 2025 Conference Paper

$\texttt{G1}$: Teaching LLMs to Reason on Graphs with Reinforcement Learning

  • Xiaojun Guo
  • Ang Li
  • Yifei Wang
  • Stefanie Jegelka
  • Yisen Wang

Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce $\texttt{G1}$, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate \erdos, the largest graph reasoning dataset to date, comprising 50 diverse graph-theoretic tasks of varying difficulty with 100k training and 5k test examples, all derived from real-world graphs. With RL on \erdos, $\texttt{G1}$ obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x its size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully. Our implementation is open-sourced at https://github.com/PKU-ML/G1, with models and datasets hosted on Hugging Face collections at https://huggingface.co/collections/PKU-ML/g1-683d659e992794fc99618cf2 for broader accessibility.

NeurIPS Conference 2025 Conference Paper

A Signed Graph Approach to Understanding and Mitigating Oversmoothing

  • Jiaqi Wang
  • Xinyi Wu
  • James Cheng
  • Yifei Wang

Deep graph neural networks (GNNs) often suffer from oversmoothing, where node representations become overly homogeneous with increasing depth. While techniques like normalization, residual connections, and edge dropout have been proposed to mitigate oversmoothing, they are typically developed independently, with limited theoretical understanding of their underlying mechanisms. In this work, we present a unified theoretical perspective based on the framework of signed graphs, showing that many existing strategies implicitly introduce negative edges that alter message-passing to resist oversmoothing. However, we show that merely adding negative edges in an unstructured manner is insufficient—the asymptotic behavior of signed propagation depends critically on the strength and organization of positive and negative edges. To address this limitation, we leverage the theory of structural balance, which promotes stable, cluster-preserving dynamics by connecting similar nodes with positive edges and dissimilar ones with negative edges. We propose Structural Balanced Propagation (SBP), a plug-and-play method that assigns signed edges based on either labels or feature similarity to explicitly enhance structural balance in the constructed signed graphs. Experiments on nine benchmarks across both homophilic and heterophilic settings demonstrate that SBP consistently improves classification accuracy and mitigates oversmoothing, even at depths of up to 300 layers. Our results provide a principled explanation for prior oversmoothing remedies and introduce a new direction for signed message-passing design in deep GNNs. Our code is available at https://github.com/kokolerk/sbp.
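The contrast the abstract describes can be sketched on a toy four-node graph (a minimal sketch, not the paper's SBP implementation; the averaging update and the complete-graph topology are simplifying assumptions): with all-positive edges, repeated propagation collapses the two clusters together, while structurally balanced signs (positive within clusters, negative across) keep them separated.

```python
import numpy as np

# Two clusters of two nodes each, with opposite scalar features.
x = np.array([1.0, 1.0, -1.0, -1.0])

# Unsigned complete graph: every edge positive.
A_pos = np.ones((4, 4)) - np.eye(4)

# Structurally balanced signed graph: cross-cluster edges made negative.
A_sgn = A_pos.copy()
A_sgn[:2, 2:] = -1.0
A_sgn[2:, :2] = -1.0

def propagate(A, x, steps=30):
    """Simple residual mean-aggregation: x <- (x + mean of signed neighbors)/2."""
    deg = np.abs(A).sum(axis=1)
    for _ in range(steps):
        x = 0.5 * x + 0.5 * (A @ x) / deg
    return x

smooth = propagate(A_pos, x)   # unsigned: features converge (oversmoothing)
signed = propagate(A_sgn, x)   # balanced signs: cluster gap is preserved
```

Here the unsigned gap shrinks geometrically toward zero, whereas the balanced signed update leaves the cluster-separating features as a fixed point, which is the cluster-preserving dynamic that structural balance is meant to guarantee.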

JMLR Journal 2025 Journal Article

An Augmentation Overlap Theory of Contrastive Learning

  • Qi Zhang
  • Yifei Wang
  • Yisen Wang

Recently, self-supervised contrastive learning has achieved great success on various tasks. However, its underlying working mechanism is yet unclear. In this paper, we first provide the tightest bounds based on the widely adopted assumption of conditional independence. Further, we relax the conditional independence assumption to a more practical assumption of augmentation overlap and derive the asymptotically closed bounds for the downstream performance. Our proposed augmentation overlap theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations, thus simply aligning the positive samples (augmented views of the same sample) could make contrastive learning cluster intra-class samples together. Moreover, from the newly derived augmentation overlap perspective, we develop an unsupervised metric for the representation evaluation of contrastive learning, which aligns well with the downstream performance almost without relying on additional modules. Code is available at https://github.com/PKU-ML/GARC.
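The alignment idea at the heart of the abstract can be illustrated numerically (a toy sketch under stated assumptions, not the paper's GARC metric: Gaussian noise stands in for data augmentation and the identity map for the encoder): positive pairs are two augmented views of the same sample, and alignment is their average representation distance, which is small relative to distances between unrelated samples.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, (100, 8))   # 100 "samples", 8-dim features

def augment(x, strength=0.1):
    # Small additive noise as a stand-in for a data augmentation.
    return x + rng.normal(0.0, strength, x.shape)

view1, view2 = augment(samples), augment(samples)   # positive pairs

# Alignment: mean distance between positive pairs (lower = better aligned).
align = np.linalg.norm(view1 - view2, axis=1).mean()

# Baseline: mean distance between views of *different* samples.
cross = np.linalg.norm(view1 - np.roll(view2, 1, axis=0), axis=1).mean()
```

Under aggressive augmentations the augmented supports of intra-class samples start to overlap, which is why, per the theory, merely keeping `align` small can suffice to cluster intra-class samples together.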

IROS Conference 2025 Conference Paper

An Evidence-Based Tri-Branch Cross-Pseudo Supervision Method for Semi-Supervised Medical Image Segmentation

  • Dongyue Li
  • Aocheng Luo
  • Shaoan Wang
  • Yaoqing Hu
  • Jie Pan 0008
  • Yifei Wang
  • Junzhi Yu 0001

Semi-supervised medical image segmentation with only a few annotated samples can provide significant help in robot-assisted surgery. This step plays a pivotal role in the identification of pathological regions, more appropriate planning of surgical procedures, and so on. In this work, we develop an evidence-based tri-branch cross-pseudo supervision model, which integrates evidence-based uncertainty estimation and multi-branch cross supervision to bolster the effectiveness of semi-supervised learning. The overall framework consists of a vanilla network and an evidential dual-branch network. Two evidential branches, EPB and ERB, are proposed to complement each other and improve the quality of pseudo-labels. The EPB places more focus on classification accuracy at the pixel level, while the ERB emphasizes the similarity and overall integrity of the segmented regions. Then, a novel cross-pseudo supervision strategy among the three branches is designed to guarantee that valuable and diverse unlabeled knowledge is explored and transferred for segmentation improvement. The effectiveness of the proposed method was verified on the ACDC dataset, achieving outstanding performance compared with other state-of-the-art methods. In addition, we conducted an ablation study to validate the effectiveness of the evidential branches (EPB and ERB) and the tri-branch cross-supervision strategy, respectively.

JBHI Journal 2025 Journal Article

BianCang: A Traditional Chinese Medicine Large Language Model

  • Sibo Wei
  • Xueping Peng
  • Yifei Wang
  • Tao Shen
  • Jiasheng Si
  • Weiyu Zhang
  • Fa Zhu
  • Athanasios V. Vasilakos

The surge of large language models (LLMs) has driven significant progress in medical applications, including traditional Chinese medicine (TCM). However, current medical LLMs struggle with TCM diagnosis and syndrome differentiation due to substantial differences between TCM and modern medical theory, and the scarcity of specialized, high-quality corpora. To this end, in this paper we propose BianCang (扁仓), a TCM-specific LLM, using a two-stage training process that first injects domain-specific knowledge and then aligns it through targeted stimulation to enhance diagnostic and differentiation capabilities. Specifically, we constructed pre-training corpora, instruction-aligned datasets based on real hospital records, and the ChP-TCM dataset derived from the Pharmacopoeia of the People's Republic of China. We compiled extensive TCM and medical corpora for continual pre-training and supervised fine-tuning, building a comprehensive dataset to refine the model's understanding of TCM. Evaluations across 11 test sets involving 31 models and 4 tasks demonstrate the effectiveness of BianCang, offering valuable insights for future research. Code, datasets, and models are available on GitHub.

JBHI Journal 2025 Journal Article

DRL-HNet: A Deep Residual Learning Framework for Microbe-Drug Associations Prediction Using Heterogeneous Network Feature

  • Jing Chen
  • Leyang Zhang
  • Yifei Wang
  • Susu Cui
  • Zhipan Liang
  • Xu Lu

In the field of biomedicine, predicting microbe-drug associations (MDAs) is crucial for advancing drug discovery and personalized therapy. However, traditional experimental approaches often fall short in meeting requirements for accuracy and scalability. Previous studies have primarily relied on feature similarities to predict microbe-drug associations, largely ignoring the complex interdependencies essential for improved prediction. In this paper, we propose a novel framework named Deep Residual Learning Framework Using Heterogeneous Network Feature (DRL-HNet) for MDAs prediction. DRL-HNet constructs a heterogeneous network representation by integrating relationships and features from multiple data sources for both microbes and drugs. The model incorporates deep residual learning with bottleneck layers to effectively reduce computational complexity while enhancing network expressiveness. Multi-source feature fusion is leveraged to capture complex interaction patterns, while residual connections mitigate overfitting and enhance training efficiency. Extensive cross-validation experiments demonstrate that DRL-HNet outperforms existing models across multiple evaluation metrics, validating its efficacy in accurately predicting microbe-drug associations.

IROS Conference 2025 Conference Paper

HEATS: A Hierarchical Framework for Efficient Autonomous Target Search with Mobile Manipulators

  • Hao Zhang
  • Yifei Wang
  • Weifan Zhang
  • Yu Wang
  • Haoyao Chen

Utilizing robots for autonomous target search in complex and unknown environments can greatly improve the efficiency of search and rescue missions. However, existing methods have shown inadequate performance due to hardware platform limitations, inefficient viewpoint selection strategies, and conservative motion planning. In this work, we propose HEATS, which enhances the search capability of mobile manipulators in complex and unknown environments. We design a target viewpoint planner tailored to the strengths of mobile manipulators, ensuring efficient and comprehensive viewpoint planning. Supported by this, a whole-body motion planner integrates global path search with local IPC optimization, enabling the mobile manipulator to safely and agilely visit target viewpoints, significantly improving search performance. We present extensive simulated and real-world tests, in which our method demonstrates reduced search time, higher target search completeness, and lower movement cost compared to classic and state-of-the-art approaches. Our method will be open-sourced for community benefit.

JBHI Journal 2025 Journal Article

HGBHAN: A Novel Framework for Microbe-Drug Interaction Prediction Using Heterogeneous Graphs and Bi-LSTM With Hierarchical Attention

  • Jing Chen
  • Leyang Zhang
  • Yifei Wang
  • Susu Cui
  • Zhipan Liang
  • Xu Lu

Predicting microbe–drug associations (MDAs) is vital for accelerating drug discovery and optimizing clinical interventions in biomedical research. Traditional laboratory-based methods, though reliable, are constrained by high costs and limited scalability. While many computational approaches have utilized feature similarities to infer MDAs, they often overlook the complex and heterogeneous relationships inherent in biological networks, as well as the challenge posed by imbalanced datasets. In this study, we propose HGBHAN, a novel framework for MDAs prediction using heterogeneous graphs and bidirectional long short-term memory (Bi-LSTM) with hierarchical attention, for robust MDAs prediction. HGBHAN constructs a comprehensive heterogeneous network by integrating microbe and drug similarities with known association information, capturing multi-level structural and sequential dependencies. The model employs Bi-LSTM modules and a hierarchical attention mechanism to learn discriminative node embeddings, while residual connections are incorporated to address the over-smoothing issue in graph neural networks. Extensive experiments conducted on three public benchmark datasets demonstrate that HGBHAN outperforms existing models across multiple evaluation metrics, validating its efficacy in accurately predicting microbe–drug associations.

NeurIPS Conference 2025 Conference Paper

Language Ranker: A Lightweight Ranking Framework for LLM Decoding

  • Chenheng Zhang
  • Tianqi Du
  • Jizhe Zhang
  • Mingqing Xiao
  • Yifei Wang
  • Yisen Wang
  • Zhouchen Lin

Conventional research on large language models (LLMs) has primarily focused on refining output distributions, while paying less attention to the decoding process that transforms these distributions into final responses. Recent advances, such as scaling inference-time computation with reward models, have underscored the importance of decoding, but these methods often suffer from high computational costs and limited applicability. In this paper, we revisit LLM generation through the lens of recommender systems, conceptualizing the decoding process as analogous to the ranking stage in recommendation pipelines. From this perspective, we observe that both traditional decoding methods and reward models exhibit clear limitations such as redundancy. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses using features extracted by the base model. Experiments across a wide range of tasks show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only <0.5M additional parameters, significantly reducing the computational overhead during both training and inference stages. This highlights the efficiency and effectiveness of our method, showcasing its potential to fully unlock the capabilities of LLMs.

NeurIPS Conference 2025 Conference Paper

Next Semantic Scale Prediction via Hierarchical Diffusion Language Models

  • Cai Zhou
  • Chenyu Wang
  • Dinghuai Zhang
  • Shangyuan Tong
  • Yifei Wang
  • Stephen Bates
  • Tommi Jaakkola

In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.

ICLR Conference 2025 Conference Paper

Scaling Large Language Model-based Multi-Agent Collaboration

  • Chen Qian
  • Zihao Xie
  • Yifei Wang
  • Wei Liu 0161
  • Kunlun Zhu
  • Hanchen Xia
  • Yufan Dang
  • Zhuoyun Du

Recent breakthroughs in large language model-driven autonomous agents have revealed that multi-agent collaboration often surpasses each individual through collective reasoning. Inspired by the neural scaling law (increasing neurons enhances performance), this study explores whether the continuous addition of collaborative agents can yield similar benefits. Technically, we utilize directed acyclic graphs to organize agents into a multi-agent collaboration network (MacNet), upon which their interactive reasoning is topologically orchestrated for autonomous task solving. Extensive evaluations reveal that it effectively supports collaboration among over a thousand agents, with irregular topologies outperforming regular ones. We also identify a collaborative scaling law: the overall performance follows a logistic growth pattern as agents scale, with collaborative emergence occurring earlier than traditional neural emergence. We speculate this may be because scaling agents catalyzes their multidimensional considerations during interactive reflection and refinement, thereby producing more comprehensive artifacts. The code is available at https://github.com/OpenBMB/ChatDev/tree/macnet.

AAAI Conference 2025 Conference Paper

Semi-IIN: Semi-Supervised Intra-Inter Modal Interaction Learning Network for Multimodal Sentiment Analysis

  • Jinhao Lin
  • Yifei Wang
  • Yanwu Xu
  • Qi Liu

Despite multimodal sentiment analysis being a fertile research ground that merits further investigation, current approaches incur high annotation costs and suffer from label ambiguity, which hinders the acquisition of high-quality labeled data. Furthermore, choosing the right interactions is essential because the significance of intra- or inter-modal interactions can differ among various samples. To this end, we propose Semi-IIN, a Semi-supervised Intra-inter modal Interaction learning Network for multimodal sentiment analysis. Semi-IIN integrates masked attention and gating mechanisms, enabling effective dynamic selection after independently capturing intra- and inter-modal interactive information. Combined with the self-training approach, Semi-IIN fully utilizes the knowledge learned from unlabeled data. Experimental results on two public datasets, MOSI and MOSEI, demonstrate the effectiveness of Semi-IIN, establishing a new state-of-the-art on several metrics.

NeurIPS Conference 2025 Conference Paper

Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction

  • Yifei Wang
  • Weimin Bai
  • Colin Zhang
  • Debing Zhang
  • Weijian Luo
  • He Sun

In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc., inside a theory-driven framework which we name \textbf{\emph{Uni-Instruct}}. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. We then introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Fréchet Inception Distance (FID) values of \textbf{\emph{1.46}} for unconditional generation and \textbf{\emph{1.38}} for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step diffusion FID value of \textbf{\emph{1.06}}, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.29 (1.06 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation, where it gives decent results, slightly outperforming previous methods such as SDS and VSD in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transfer in diffusion models.

NeurIPS Conference 2024 Conference Paper

A Canonicalization Perspective on Invariant and Equivariant Learning

  • George Ma
  • Yifei Wang
  • Derek Lim
  • Stefanie Jegelka
  • Yisen Wang

In many applications, we desire neural networks to exhibit invariance or equivariance to certain groups due to symmetries inherent in the data. Recently, frame-averaging methods have emerged as a unified framework for attaining symmetries efficiently by averaging over input-dependent subsets of the group, i.e., frames. What we currently lack is a principled understanding of the design of frames. In this work, we introduce a canonicalization perspective that provides an essential and complete view of the design of frames. Canonicalization is a classic approach for attaining invariance by mapping inputs to their canonical forms. We show that there exists an inherent connection between frames and canonical forms. Leveraging this connection, we can efficiently compare the complexity of frames as well as determine the optimality of certain frames. Guided by this principle, we design novel frames for eigenvectors that are strictly superior to existing methods --- some are even optimal --- both theoretically and empirically. The reduction to the canonicalization perspective further uncovers equivalences between previous methods. These observations suggest that canonicalization provides a fundamental understanding of existing frame-averaging methods and unifies existing equivariant and invariant learning methods. Code is available at https://github.com/PKU-ML/canonicalization.

NeurIPS Conference 2024 Conference Paper

A Theoretical Understanding of Self-Correction through In-context Alignment

  • Yifei Wang
  • Yuyang Wu
  • Zeming Wei
  • Stefanie Jegelka
  • Yisen Wang

Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, as seen in models like OpenAI o1. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we propose a simple self-correction strategy, Checking as Context (CaC), which finds novel applications in alleviating social bias and defending against LLM jailbreaks. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models. Code is at https://github.com/yifeiwang77/Self-Correction.

NeurIPS Conference 2024 Conference Paper

An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations

  • Weimin Bai
  • Yifei Wang
  • Wenzheng Chen
  • He Sun

Diffusion models excel in solving imaging inverse problems due to their ability to model complex image priors. However, their reliance on large, clean datasets for training limits their practical use where clean data is scarce. In this paper, we propose EMDiffusion, an expectation-maximization (EM) approach to train diffusion models from corrupted observations. Our method alternates between reconstructing clean images from corrupted data using a known diffusion model (E-step) and refining diffusion model weights based on these reconstructions (M-step). This iterative process leads the learned diffusion model to gradually converge to a local optimum, that is, to approximate the true clean data distribution. We validate our method through extensive experiments on diverse computational imaging tasks, including random inpainting, denoising, and deblurring, achieving new state-of-the-art performance.
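The E-step/M-step alternation described above can be sketched on a toy 1-D analogue (an illustrative sketch only: a Gaussian prior stands in for the diffusion model, and the corruption is additive noise of known scale; none of this is the paper's EMDiffusion code): the E-step reconstructs clean samples from noisy observations under the current prior, and the M-step refits the prior to those reconstructions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown clean-data distribution and known corruption (a denoising task).
true_mean, true_std = 3.0, 1.0
noise_std = 2.0
y = rng.normal(true_mean, true_std, 5000) + rng.normal(0.0, noise_std, 5000)

# Current "model" of the clean data, initialized badly.
mu, sigma2 = 0.0, 1.0
for _ in range(50):
    # E-step: reconstruct clean samples via the posterior mean under the prior.
    w = sigma2 / (sigma2 + noise_std**2)
    x_hat = mu + w * (y - mu)
    post_var = sigma2 * noise_std**2 / (sigma2 + noise_std**2)
    # M-step: refit the prior (mean and variance) to the reconstructions.
    mu = x_hat.mean()
    sigma2 = x_hat.var() + post_var

print(round(mu, 2))  # converges close to the true clean mean (3.0)
```

In the paper, the E-step reconstruction is produced by posterior sampling with the current diffusion model and the M-step retrains the model's weights; the toy shows why the alternation drifts toward the clean data distribution rather than the corrupted one.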

NeurIPS Conference 2024 Conference Paper

Autonomous Agents for Collaborative Task under Information Asymmetry

  • Wei Liu
  • Chenxi Wang
  • Yifei Wang
  • Zihao Xie
  • Rennai Qiu
  • Yufan Dang
  • Zhuoyun Du
  • Weize Chen

Large Language Model Multi-Agent Systems (LLM-MAS) have greatly progressed in solving complex tasks. Agents within such a system communicate with one another to collaboratively solve tasks, under the premise of shared information. However, when agents' collaborations are leveraged to perform multi-person tasks, a new challenge arises due to information asymmetry, since each agent can only access the information of its human user. Previous MAS struggle to complete tasks under this condition. To address this, we propose a new MAS paradigm termed iAgents, which denotes Informative Multi-Agent Systems. In iAgents, the human social network is mirrored in the agent network, where agents proactively exchange human information necessary for task resolution, thereby overcoming information asymmetry. iAgents employs a novel agent reasoning mechanism, InfoNav, to navigate agents' communication towards effective information exchange. Together with InfoNav, iAgents organizes human information in a mixed memory to provide agents with accurate and comprehensive information for exchange. Additionally, we introduce InformativeBench, the first benchmark tailored for evaluating LLM agents' task-solving ability under information asymmetry. Experimental results show that iAgents can collaborate within a social network of 140 individuals and 588 relationships, autonomously communicate over 30 turns, and retrieve information from nearly 70,000 messages to complete tasks within 3 minutes.

NeurIPS Conference 2024 Conference Paper

Dissecting the Failure of Invariant Learning on Graphs

  • Qixun Wang
  • Yifei Wang
  • Yisen Wang
  • Xianghua Ying

Enhancing node-level Out-Of-Distribution (OOD) generalization on graphs remains a crucial area. In this paper, we develop a Structural Causal Model (SCM) to theoretically dissect the performance of two prominent invariant learning methods--Invariant Risk Minimization (IRM) and Variance-Risk Extrapolation (VREx)--in node-level OOD settings. Our analysis reveals a critical limitation: these methods may struggle to identify invariant features due to the complexities introduced by the message-passing mechanism, which can obscure causal features within a range of neighboring samples. To address this, we propose Cross-environment Intra-class Alignment (CIA), which explicitly eliminates spurious features by aligning representations within the same class, bypassing the need for explicit knowledge of underlying causal patterns. To adapt CIA to node-level OOD scenarios where environment labels are hard to obtain, we further propose CIA-LRA (Localized Reweighting Alignment) that leverages the distribution of neighboring labels to selectively align node representations, effectively distinguishing and preserving invariant features while removing spurious ones, all without relying on environment labels. We theoretically prove CIA-LRA's effectiveness by deriving an OOD generalization error bound based on PAC-Bayesian analysis. Experiments on graph OOD benchmarks validate the superiority of CIA and CIA-LRA, marking a significant advancement in node-level OOD generalization.

ECAI Conference 2024 Conference Paper

Dynamic Multimodal Prompt Tuning: Boost Few-Shot Learning with VLM-Guided Point Cloud Models

  • Xiang Gu
  • Shuchao Pang
  • Anan Du
  • Yifei Wang
  • Jixiang Miao
  • Jorge Díez 0001

Few-shot learning is crucial for downstream tasks involving point clouds, given the challenge of obtaining sufficient datasets due to extensive collecting and labeling efforts. Pre-trained VLM-Guided point cloud models, containing abundant knowledge, can compensate for the scarcity of training data, potentially leading to very good performance. However, adapting these pre-trained point cloud models to specific few-shot learning tasks is challenging due to their huge number of parameters and high computational cost. To this end, we propose a novel Dynamic Multimodal Prompt Tuning method, named DMMPT, for boosting few-shot learning with pre-trained VLM-Guided point cloud models. Specifically, we build a dynamic knowledge collector capable of gathering task- and data-related information from various modalities. Then, a multimodal prompt generator is constructed to integrate collected dynamic knowledge and generate multimodal prompts, which efficiently direct pre-trained VLM-guided point cloud models toward few-shot learning tasks and address the issue of limited training data. Our method is evaluated on benchmark datasets not only in a standard N-way K-shot few-shot learning setting, but also in a more challenging setting with all classes and K-shot few-shot learning. Notably, our method outperforms other prompt-tuning techniques, achieving highly competitive results comparable to full fine-tuning methods while significantly enhancing computational efficiency.

NeurIPS Conference 2024 Conference Paper

In-Context Symmetries: Self-Supervised Learning through Contextual World Models

  • Sharut Gupta
  • Chenyu Wang
  • Yifei Wang
  • Tommi Jaakkola
  • Stefanie Jegelka

At the core of self-supervised learning for vision is the idea of learning invariant or equivariant representations with respect to a set of data transformations. This approach, however, introduces strong inductive biases, which can render the representations fragile in downstream tasks that do not conform to these symmetries. In this work, drawing insights from world models, we propose to instead learn a general representation that can adapt to be invariant or equivariant to different transformations by paying attention to context --- a memory module that tracks task-specific states, actions and future states. Here, the action is the transformation, while the current and future states respectively represent the input's representation before and after the transformation. Our proposed algorithm, Contextual Self Supervised Learning (ContextSSL), learns equivariance to all transformations (as opposed to invariance). In this way, the model can learn to encode all relevant features as general representations while having the versatility to tailor itself to task-wise symmetries when given a few examples as the context. Empirically, we demonstrate significant performance gains over existing methods on equivariance-related tasks, supported by both qualitative and quantitative evaluations.

ICML Conference 2024 Conference Paper

On the Duality Between Sharpness-Aware Minimization and Adversarial Training

  • Yihao Zhang
  • Hangzhou He
  • Jingyu Zhu
  • Huanran Chen
  • Yifei Wang
  • Zeming Wei

Adversarial Training (AT), which adversarially perturbs the input samples during training, has been acknowledged as one of the most effective defenses against adversarial attacks, yet it suffers from inevitably decreased clean accuracy. Instead of perturbing the samples, Sharpness-Aware Minimization (SAM) perturbs the model weights during training to find a flatter loss landscape and improve generalization. However, as SAM is designed for better clean accuracy, its effectiveness in enhancing adversarial robustness remains unexplored. In this work, considering the duality between SAM and AT, we investigate the adversarial robustness derived from SAM. Intriguingly, we find that using SAM alone can improve adversarial robustness. To understand this unexpected property of SAM, we first provide empirical and theoretical insights into how SAM can implicitly learn more robust features, and conduct comprehensive experiments to show that SAM can improve adversarial robustness notably without sacrificing any clean accuracy, shedding light on the potential of SAM to be a substitute for AT when accuracy comes at a higher priority. Code is available at https://github.com/weizeming/SAM_AT.
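The weight-perturbation step SAM performs can be illustrated on a toy quadratic loss; the names `rho` and `lr`, the quadratic objective, and the two-step update below are a minimal sketch of the generic SAM procedure, not the paper's training setup.

```python
import numpy as np

def loss_grad(w, A):
    # Gradient of the toy quadratic loss L(w) = 0.5 * w^T A w.
    return A @ w

def sam_step(w, A, rho=0.05, lr=0.1):
    g = loss_grad(w, A)
    # Ascend to the approximate worst-case weights in an L2 ball of radius rho ...
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # ... then descend using the gradient evaluated at the perturbed weights.
    return w - lr * loss_grad(w + eps, A)

A = np.diag([1.0, 10.0])            # ill-conditioned toy loss surface
w = np.array([1.0, 1.0])
initial_loss = 0.5 * w @ A @ w      # = 5.5
for _ in range(50):
    w = sam_step(w, A)
final_loss = 0.5 * w @ A @ w
print(initial_loss, final_loss)
```

The duality the paper studies is visible in the structure: AT would apply the ascent step to the inputs, while SAM applies it to the weights.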

NeurIPS Conference 2024 Conference Paper

On the Role of Attention Masks and LayerNorm in Transformers

  • Xinyi Wu
  • Amir Ajorlou
  • Yifei Wang
  • Stefanie Jegelka
  • Ali Jadbabaie

Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, sparse or local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first show that for certain classes of value matrices, collapse to a rank one subspace still happens exponentially. However, through construction of nontrivial counterexamples, we then establish that with proper choice of value matrices, a general class of sequences may not converge to a rank one subspace, and the self-attention dynamics with LayerNorm can simultaneously possess a rich set of equilibria with any possible rank between one and full. Our result refutes the previous hypothesis that LayerNorm plays no role in the rank collapse of self-attention and suggests that self-attention with LayerNorm constitutes a much more expressive, versatile nonlinear dynamical system than what was originally thought.
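The exponential rank collapse of pure self-attention described above is easy to reproduce numerically. The sketch below (no masks, no LayerNorm, illustrative initialization scale) iterates softmax attention and tracks how far the token matrix is from rank one; it illustrates the phenomenon, not the paper's constructions.

```python
import numpy as np

def attention(X):
    # One round of pure softmax self-attention: X <- softmax(X X^T / sqrt(d)) X.
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # row-stochastic attention matrix
    return P @ X

def rank_one_residual(X):
    # Trailing singular values relative to the top one; 0 means exactly rank one.
    s = np.linalg.svd(X, compute_uv=False)
    return s[1:].sum() / s[0]

rng = np.random.default_rng(0)
X = 0.3 * rng.standard_normal((8, 4))   # 8 tokens, dimension 4
before = rank_one_residual(X)
for _ in range(30):
    X = attention(X)
after = rank_one_residual(X)
print(before, after)  # residual collapses by many orders of magnitude
```

Since each output row is a convex combination of the input rows, repeated application drives all tokens toward a common vector, which is exactly the rank-one subspace collapse the paper analyzes.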

ICML Conference 2024 Conference Paper

OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift

  • Lin Li 0070
  • Yifei Wang
  • Chawin Sitawarin
  • Michael W. Spratling

Existing works have made great progress in improving adversarial robustness, but typically test their method only on data from the same distribution as the training data, i.e. in-distribution (ID) testing. As a result, it is unclear how such robustness generalizes under input distribution shifts, i.e. out-of-distribution (OOD) testing. This omission is concerning as such distribution shifts are unavoidable when methods are deployed in the wild. To address this issue we propose a benchmark named OODRobustBench to comprehensively assess OOD adversarial robustness using 23 dataset-wise shifts (i.e. naturalistic shifts in input distribution) and 6 threat-wise shifts (i.e., unforeseen adversarial threat models). OODRobustBench is used to assess 706 robust models using 60.7K adversarial evaluations. This large-scale analysis shows that: 1) adversarial robustness suffers from a severe OOD generalization issue; 2) ID robustness correlates strongly with OOD robustness in a positive linear way. The latter enables the prediction of OOD robustness from ID robustness. We then predict and verify that existing methods are unlikely to achieve high OOD robustness. Novel methods are therefore required to achieve OOD robustness beyond our prediction. To facilitate the development of these methods, we investigate a wide range of techniques and identify several promising directions. Code and models are available at: https://github.com/OODRobustBench/OODRobustBench.

NeurIPS Conference 2024 Conference Paper

Understanding the Role of Equivariance in Self-supervised Learning

  • Yifei Wang
  • Kaiwen Hu
  • Sharut Gupta
  • Ziyu Ye
  • Yisen Wang
  • Stefanie Jegelka

Contrastive learning has been a leading paradigm for self-supervised learning, but it is widely observed that it comes at the price of sacrificing useful features (e.g., colors) by being invariant to data augmentations. Given this limitation, there has been a surge of interest in equivariant self-supervised learning (E-SSL) that learns features to be augmentation-aware. However, even for the simplest rotation prediction method, there is a lack of rigorous understanding of why, when, and how E-SSL learns useful features for downstream tasks. To bridge this gap between practice and theory, we establish an information-theoretic perspective to understand the generalization ability of E-SSL. In particular, we identify a critical explaining-away effect in E-SSL that creates a synergy between the equivariant and classification tasks. This synergy effect encourages models to extract class-relevant features to improve their equivariant predictions, which, in turn, benefits downstream tasks requiring semantic features. Based on this perspective, we theoretically analyze the influence of data transformations and reveal several principles for practical designs of E-SSL. Our theory not only aligns well with existing E-SSL methods but also sheds light on new directions by exploring the benefits of model equivariance. We believe that a theoretically grounded understanding of the role of equivariance will inspire more principled and advanced designs in this field. Code is available at https://github.com/kaotty/Understanding-ESSL.

NeurIPS Conference 2023 Conference Paper

Adversarial Examples Are Not Real Features

  • Ang Li
  • Yifei Wang
  • Yiwen Guo
  • Yisen Wang

The existence of adversarial examples has been a mystery for years and attracted much interest. A well-known theory by Ilyas et al. (2019) explains adversarial vulnerability from a data perspective by showing that one can extract non-robust features from adversarial examples and these features alone are useful for classification. However, the explanation remains quite counter-intuitive since non-robust features are mostly noise features to humans. In this paper, we re-examine the theory from a larger context by incorporating multiple learning paradigms. Notably, we find that contrary to their good usefulness under supervised learning, non-robust features attain poor usefulness when transferred to other self-supervised learning paradigms, such as contrastive learning, masked image modeling, and diffusion models. It reveals that non-robust features are not really as useful as robust or natural features that enjoy good transferability between these paradigms. Meanwhile, for robustness, we also show that naturally trained encoders from robust features are largely non-robust under AutoAttack. Our cross-paradigm examination suggests that the non-robust features are not really useful but more like paradigm-wise shortcuts, and robust features alone might be insufficient to attain reliable model robustness. Code is available at https://github.com/PKU-ML/AdvNotRealFeatures.

NeurIPS Conference 2023 Conference Paper

Architecture Matters: Uncovering Implicit Mechanisms in Graph Contrastive Learning

  • Xiaojun Guo
  • Yifei Wang
  • Zeming Wei
  • Yisen Wang

With the prosperity of contrastive learning for visual representation learning (VCL), it has also been adapted to the graph domain and yields promising performance. However, through a systematic study of various graph contrastive learning (GCL) methods, we observe several common phenomena among existing GCL methods that differ markedly from the original VCL methods, including: 1) positive samples are not a must for GCL; 2) negative samples are not necessary for graph classification, nor for node classification when adopting specific normalization modules; 3) data augmentations have much less influence on GCL, as simple domain-agnostic augmentations (e.g., Gaussian noise) can also attain fairly good performance. By uncovering how the implicit inductive bias of GNNs works in contrastive learning, we theoretically provide insights into the above intriguing properties of GCL. Rather than directly porting existing VCL methods to GCL, we advocate for more attention toward the unique architecture of graph learning and consider its implicit influence when designing GCL methods. Code is available at https://github.com/PKU-ML/ArchitectureMattersGCL.

NeurIPS Conference 2023 Conference Paper

Balance, Imbalance, and Rebalance: Understanding Robust Overfitting from a Minimax Game Perspective

  • Yifei Wang
  • Liangchen Li
  • Jiansheng Yang
  • Zhouchen Lin
  • Yisen Wang

Adversarial Training (AT) has become arguably the state-of-the-art algorithm for extracting robust features. However, researchers have recently noticed that AT suffers from severe robust overfitting problems, particularly after learning rate (LR) decay. In this paper, we explain this phenomenon by viewing adversarial training as a dynamic minimax game between the model trainer and the attacker. Specifically, we analyze how LR decay breaks the balance of the minimax game by empowering the trainer with a stronger memorization ability, and show that such imbalance induces robust overfitting as a result of memorizing non-robust features. We validate this understanding with extensive experiments, and provide a holistic view of robust overfitting from the dynamics of both game players. This understanding further inspires us to alleviate robust overfitting by rebalancing the two players, either by regularizing the trainer's capacity or by improving the attack strength. Experiments show that the proposed ReBalanced Adversarial Training (ReBAT) can attain good robustness and does not suffer from robust overfitting even after very long training. Code is available at https://github.com/PKU-ML/ReBAT.

JBHI Journal 2023 Journal Article

BMAnet: Boundary Mining With Adversarial Learning for Semi-Supervised 2D Myocardial Infarction Segmentation

  • Chenchu Xu
  • Yifei Wang
  • Dong Zhang
  • Longfei Han
  • Yanping Zhang
  • Jie Chen
  • Shuo Li

Automatic segmentation of myocardial infarction (MI) regions in late gadolinium-enhanced cardiac magnetic resonance images is an essential step in the computed diagnosis of myocardial infarction. Most of the current myocardial infarction region segmentation methods are based on fully supervised deep learning. However, cardiologists' annotation of myocardial infarction regions in cardiac magnetic resonance images during the diagnosis process is time-consuming and expensive. This paper proposes a semi-supervised myocardial infarction segmentation method. It consists of two models: 1) a boundary mining model and 2) an adversarial learning model. The boundary mining model can solve the boundary ambiguity problem by enlarging the gap between the foreground and background features, thus segmenting the myocardial infarction region accurately. The adversarial learning model can make the boundary mining model learn from additional unlabeled data by evaluating the segmentation performance and providing pseudo supervision, which significantly increases the robustness of the boundary mining model. We conduct extensive experiments on an in-house myocardial magnetic resonance dataset. The experimental results on six evaluation metrics demonstrate that our method achieves excellent results in myocardial infarction segmentation and outperforms the state-of-the-art semi-supervised methods.

IJCAI Conference 2023 Conference Paper

Contrastive Label Enhancement

  • Yifei Wang
  • Yiyang Zhou
  • Jihua Zhu
  • Xinyuan Liu
  • Wenbiao Yan
  • Zhiqiang Tian

Label distribution learning (LDL) is a new machine learning paradigm for solving label ambiguity. Since it is difficult to directly obtain label distributions, many studies are focusing on how to recover label distributions from logical labels, dubbed label enhancement (LE). Existing LE methods estimate label distributions by simply building a mapping relationship between features and label distributions under the supervision of logical labels. They typically overlook the fact that both features and logical labels are descriptions of the instance from different views. Therefore, we propose a novel method called Contrastive Label Enhancement (ConLE) which integrates features and logical labels into a unified projection space to generate high-level features via a contrastive learning strategy. In this approach, features and logical labels belonging to the same sample are pulled closer, while those of different samples are projected farther away from each other in the projection space. Subsequently, we leverage the obtained high-level features to estimate label distributions through a well-designed training strategy that considers the consistency of label attributes. Extensive experiments on LDL benchmark datasets demonstrate the effectiveness and superiority of our method.

NeurIPS Conference 2023 Conference Paper

Identifiable Contrastive Learning with Automatic Feature Importance Discovery

  • Qi Zhang
  • Yifei Wang
  • Yisen Wang

Existing contrastive learning methods rely on pairwise sample contrast $z_x^\top z_{x'}$ to learn data representations, but the learned features often lack clear interpretability from a human perspective. Theoretically, they lack feature identifiability, and different initializations may lead to totally different features. In this paper, we study a new method named tri-factor contrastive learning (triCL) that involves a 3-factor contrast in the form of $z_x^\top S z_{x'}$, where $S=\text{diag}(s_1, \dots, s_k)$ is a learnable diagonal matrix that automatically captures the importance of each feature. We show that by this simple extension, triCL can not only obtain identifiable features that eliminate randomness but also obtain more interpretable features that are ordered according to the importance matrix $S$. We show that features with high importance have nice interpretability by capturing common classwise features, and obtain superior performance when evaluated for image retrieval using a few features. The proposed triCL objective is general and can be applied to different contrastive learning methods like SimCLR and CLIP. We believe that it is a better alternative to existing 2-factor contrastive learning by improving its identifiability and interpretability with minimal overhead. Code is available at https://github.com/PKU-ML/Tri-factor-Contrastive-Learning.
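The 3-factor contrast $z_x^\top S z_{x'}$ is a one-line change to standard pairwise similarity. The NumPy sketch below assumes an InfoNCE wrapper and a noisy second view for illustration; in triCL the diagonal $S$ would be learned jointly with the encoder.

```python
import numpy as np

def tri_similarity(z1, z2, s):
    # 3-factor contrast z_x^T S z_{x'} with S = diag(s_1, ..., s_k).
    return z1 @ np.diag(s) @ z2.T

def info_nce(sim):
    # InfoNCE over the similarity matrix; matched pairs sit on the diagonal.
    sim = sim - sim.max(axis=1, keepdims=True)
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))

rng = np.random.default_rng(1)
z = rng.standard_normal((8, 16))
z1 = z / np.linalg.norm(z, axis=1, keepdims=True)
z2 = z1 + 0.1 * rng.standard_normal((8, 16))   # noisy second view of the same samples
s = np.ones(16)   # uniform importance: reduces to the usual 2-factor contrast
loss = info_nce(tri_similarity(z1, z2, s))
print(loss)
```

With $s_i = 1$ for all $i$ the objective is exactly the standard 2-factor contrast; learning $s$ is what lets the model order features by importance.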

NeurIPS Conference 2023 Conference Paper

Laplacian Canonization: A Minimalist Approach to Sign and Basis Invariant Spectral Embedding

  • George Ma
  • Yifei Wang
  • Yisen Wang

Spectral embedding is a powerful graph embedding technique that has received a lot of attention recently due to its effectiveness on Graph Transformers. However, from a theoretical perspective, the universal expressive power of spectral embedding comes at the price of losing two important invariance properties of graphs, sign and basis invariance, which also limits its effectiveness on graph data. To remedy this issue, many previous methods developed costly approaches to learn new invariants and suffer from high computational complexity. In this work, we explore a minimal approach that resolves the ambiguity issues by directly finding canonical directions for the eigenvectors, named Laplacian Canonization (LC). As a pure pre-processing method, LC is lightweight and can be applied to any existing GNNs. We provide a thorough investigation, from theory to algorithm, on this approach, and discover an efficient algorithm named Maximal Axis Projection (MAP) that works for both sign and basis invariance and successfully canonizes more than 90% of all eigenvectors. Experiments on real-world benchmark datasets like ZINC, MOLTOX21, and MOLPCBA show that MAP consistently outperforms existing methods while bringing minimal computation overhead. Code is available at https://github.com/PKU-ML/LaplacianCanonization.

AAAI Conference 2023 Conference Paper

On the Connection between Invariant Learning and Adversarial Training for Out-of-Distribution Generalization

  • Shiji Xin
  • Yifei Wang
  • Jingtong Su
  • Yisen Wang

Despite impressive success in many tasks, deep learning models are shown to rely on spurious features, which will catastrophically fail when generalized to out-of-distribution (OOD) data. Invariant Risk Minimization (IRM) is proposed to alleviate this issue by extracting domain-invariant features for OOD generalization. Nevertheless, recent work shows that IRM is only effective for a certain type of distribution shift (e.g., correlation shift) while it fails for other cases (e.g., diversity shift). Meanwhile, another line of work, Adversarial Training (AT), has shown better domain transfer performance, suggesting that it has the potential to be an effective candidate for extracting domain-invariant features. This paper investigates this possibility by exploring the similarity between the IRM and AT objectives. Inspired by this connection, we propose Domain-wise Adversarial Training (DAT), an AT-inspired method for alleviating distribution shift by domain-specific perturbations. Extensive experiments show that our proposed DAT can effectively remove domain-varying features and improve OOD generalization under both correlation shift and diversity shift.

AAAI Conference 2023 Conference Paper

USER: Unsupervised Structural Entropy-Based Robust Graph Neural Network

  • Yifei Wang
  • Yupan Wang
  • Zeyu Zhang
  • Song Yang
  • Kaiqi Zhao
  • Jiamou Liu

Unsupervised/self-supervised graph neural networks (GNN) are susceptible to the inherent randomness in the input graph data, which adversely affects the model's performance in downstream tasks. In this paper, we propose USER, an unsupervised and robust version of GNN based on structural entropy, to alleviate the interference of graph perturbations and learn appropriate representations of nodes without label information. To mitigate the effects of undesirable perturbations, we analyze the property of intrinsic connectivity and define the intrinsic connectivity graph. We also identify the rank of the adjacency matrix as a crucial factor in revealing a graph that provides the same embeddings as the intrinsic connectivity graph. To capture such a graph, we introduce structural entropy in the objective function. Extensive experiments conducted on clustering and link prediction tasks under random-perturbation and meta-attack over three datasets show that USER outperforms benchmarks and is robust to heavier perturbations.
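For intuition, the one-dimensional structural entropy that this line of work builds on has a simple closed form, $H = -\sum_i \frac{d_i}{2m}\log_2\frac{d_i}{2m}$ over node degrees $d_i$. The snippet below computes it for a 4-cycle and a star; USER's objective uses richer encoding-tree variants, so this is only the base quantity.

```python
import numpy as np

def structural_entropy_1d(A):
    # One-dimensional structural entropy: H = -sum_i (d_i / 2m) log2 (d_i / 2m),
    # where d_i is the degree of node i and 2m is the total degree.
    d = A.sum(axis=1)
    p = d / d.sum()
    return float(-(p * np.log2(p)).sum())

# 4-cycle: all degrees equal, so entropy is maximal, log2(4) = 2 bits.
C4 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)
# Star on 4 nodes: a hub makes the degree distribution more compressible.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]], dtype=float)
h_cycle = structural_entropy_1d(C4)
h_star = structural_entropy_1d(star)
print(h_cycle, h_star)  # the regular cycle has higher entropy than the star
```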

NeurIPS Conference 2022 Conference Paper

Beyond the Best: Distribution Functional Estimation in Infinite-Armed Bandits

  • Yifei Wang
  • Tavor Baharav
  • Yanjun Han
  • Jiantao Jiao
  • David Tse

In the infinite-armed bandit problem, each arm's average reward is sampled from an unknown distribution, and each arm can be sampled further to obtain noisy estimates of the average reward of that arm. Prior work focuses on the best arm, i.e., estimating the maximum of the average reward distribution. We consider a general class of distribution functionals beyond the maximum and obtain optimal sample complexities in both offline and online settings. We show that online estimation, where the learner can sequentially choose whether to sample a new or existing arm, offers no advantage over the offline setting for estimating the mean functional, but significantly reduces the sample complexity for other functionals such as the median, maximum, and trimmed mean. We propose unified meta algorithms for the online and offline settings and derive matching lower bounds using different Wasserstein distances. For the special case of median estimation, we identify a curious thresholding phenomenon on the indistinguishability between Gaussian convolutions with respect to the noise level, which may be of independent interest.

NeurIPS Conference 2022 Conference Paper

How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders

  • Qi Zhang
  • Yifei Wang
  • Yisen Wang

Masked Autoencoders (MAE) based on a reconstruction task have risen to be a promising paradigm for self-supervised learning (SSL) and achieve state-of-the-art performance across different benchmark datasets. However, despite its impressive empirical success, there is still limited theoretical understanding of it. In this paper, we propose a theoretical understanding of how masking matters for MAE to learn meaningful features. We establish a close connection between MAE and contrastive learning, which shows that MAE implicitly aligns the mask-induced positive pairs. Built upon this connection, we develop the first downstream guarantees for MAE methods and analyze the effect of the mask ratio. Besides, as a result of the implicit alignment, we also point out the dimensional collapse issue of MAE, and propose a Uniformity-enhanced MAE (U-MAE) loss that can effectively address this issue and bring significant improvements on real-world datasets, including CIFAR-10, ImageNet-100, and ImageNet-1K. Code is available at https://github.com/zhangq327/U-MAE.
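What a "uniformity-enhanced" loss targets can be illustrated with the standard log-sum-exp uniformity measure of Wang and Isola (2020): dimensionally collapsed features score poorly (high), well-spread features score well (low). The exact U-MAE term differs, so treat this as an illustrative sketch with assumed names and temperature.

```python
import numpy as np

def uniformity(Z, t=2.0):
    # log E exp(-t * ||z_i - z_j||^2) over distinct pairs; lower = more uniform.
    n = len(Z)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    off = d2[~np.eye(n, dtype=bool)]          # drop the zero self-distances
    return float(np.log(np.mean(np.exp(-t * off))))

rng = np.random.default_rng(0)
spread = rng.standard_normal((32, 8))
spread /= np.linalg.norm(spread, axis=1, keepdims=True)   # points on the unit sphere
collapsed = np.ones((32, 8)) / np.sqrt(8.0)               # dimensional collapse: all equal
u_spread, u_collapsed = uniformity(spread), uniformity(collapsed)
print(u_spread, u_collapsed)  # spread features score strictly lower
```

Adding a term like this to a reconstruction loss penalizes the collapsed configuration, which is the intuition behind augmenting MAE with a uniformity regularizer.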

NeurIPS Conference 2022 Conference Paper

Improving Out-of-Distribution Generalization by Adversarial Training with Structured Priors

  • Qixun Wang
  • Yifei Wang
  • Hong Zhu
  • Yisen Wang

Deep models often fail to generalize well in test domains when the data distribution differs from that in the training domain. Among numerous approaches to address this Out-of-Distribution (OOD) generalization problem, there has been a growing surge of interest in exploiting Adversarial Training (AT) to improve OOD performance. Recent works have revealed that the robust model obtained by conducting sample-wise AT also retains transferability to biased test domains. In this paper, we empirically show that sample-wise AT has limited improvement on OOD performance. Specifically, we find that AT can only maintain performance at smaller scales of perturbation while Universal AT (UAT) is more robust to larger-scale perturbations. This provides us with clues that adversarial perturbations with universal (low dimensional) structures can enhance the robustness against large data distribution shifts that are common in OOD scenarios. Inspired by this, we propose two AT variants with low-rank structures to train OOD-robust models. Extensive experiments on DomainBed benchmark show that our proposed approaches outperform Empirical Risk Minimization (ERM) and sample-wise AT. Our code is available at https://github.com/NOVAglow646/NIPS22-MAT-and-LDAT-for-OOD.

NeurIPS Conference 2022 Conference Paper

When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture

  • Yichuan Mo
  • Dongxian Wu
  • Yifei Wang
  • Yiwen Guo
  • Yisen Wang

Vision Transformers (ViTs) have recently achieved competitive performance in broad vision tasks. Unfortunately, on popular threat models, naturally trained ViTs are shown to provide no more adversarial robustness than convolutional neural networks (CNNs). Adversarial training is still required for ViTs to defend against such adversarial attacks. In this paper, we provide the first and comprehensive study on the adversarial training recipe of ViTs via extensive evaluation of various training techniques across benchmark datasets. We find that pre-training and SGD optimizer are necessary for ViTs' adversarial training. Further considering ViT as a new type of model architecture, we investigate its adversarial robustness from the perspective of its unique architectural components. We find, when randomly masking gradients from some attention blocks or masking perturbations on some patches during adversarial training, the adversarial robustness of ViTs can be remarkably improved, which may potentially open up a line of work to explore the architectural information inside the newly designed models like ViTs. Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.

NeurIPS Conference 2021 Conference Paper

Dissecting the Diffusion Process in Linear Graph Convolutional Networks

  • Yifei Wang
  • Yisen Wang
  • Jiansheng Yang
  • Zhouchen Lin

Graph Convolutional Networks (GCNs) have attracted increasing attention in recent years. A typical GCN layer consists of a linear feature propagation step and a nonlinear transformation step. Recent works show that a linear GCN can achieve comparable performance to the original non-linear GCN while being much more computationally efficient. In this paper, we dissect the feature propagation steps of linear GCNs from a perspective of continuous graph diffusion, and analyze why linear GCNs fail to benefit from more propagation steps. Following that, we propose Decoupled Graph Convolution (DGC) that decouples the terminal time and the feature propagation steps, making it more flexible and capable of exploiting a very large number of feature propagation steps. Experiments demonstrate that our proposed DGC improves linear GCNs by a large margin and makes them competitive with many modern variants of non-linear GCNs.
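The decoupling can be sketched as K Euler steps of graph heat diffusion up to a terminal time T, so that the step count and the total diffusion time become independent knobs. The toy graph, the combinatorial Laplacian, and the parameter values below are illustrative assumptions, not the paper's exact propagation operator.

```python
import numpy as np

# Toy 4-node graph and its combinatorial Laplacian L = D - A.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

def dgc_propagate(X, L, T=2.0, K=100):
    # K Euler steps of the heat equation dX/dt = -L X with step size T/K:
    # the terminal time T is decoupled from the number of propagation steps K.
    step = np.eye(len(L)) - (T / K) * L
    for _ in range(K):
        X = step @ X
    return X

X = np.array([[1.0], [0.0], [0.0], [0.0]])  # one-hot signal on node 0
out = dgc_propagate(X, L)
print(out.ravel())  # mass diffuses toward a smoother signal, total mass preserved
```

Fixing T while growing K refines the discretization instead of over-smoothing, which is the flexibility the abstract refers to.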

NeurIPS Conference 2021 Conference Paper

Residual Relaxation for Multi-view Representation Learning

  • Yifei Wang
  • Zhengyang Geng
  • Feng Jiang
  • Chuming Li
  • Yisen Wang
  • Jiansheng Yang
  • Zhouchen Lin

Multi-view methods learn representations by aligning multiple views of the same image and their performance largely depends on the choice of data augmentation. In this paper, we notice that some other useful augmentations, such as image rotation, are harmful for multi-view methods because they cause a semantic shift that is too large to be aligned well. This observation motivates us to relax the exact alignment objective to better cultivate stronger augmentations. Taking image rotation as a case study, we develop a generic approach, Pretext-aware Residual Relaxation (Prelax), that relaxes the exact alignment by allowing an adaptive residual vector between different views and encoding the semantic shift through pretext-aware learning. Extensive experiments on different backbones show that our method can not only improve multi-view methods with existing augmentations, but also benefit from stronger image augmentations like rotation.

ICRA Conference 2007 Conference Paper

Real-Time High-Accuracy Micropipette Aspiration for Characterizing Mechanical Properties of Biological Cells

  • Xinyu Liu 0002
  • Yifei Wang
  • Yu Sun 0001

This paper presents a micropipette aspiration system and a cell contour visual tracking algorithm for real-time, high-accuracy mechanical characterization of individual cells. The computer vision tracking algorithm measures cell deformation parameters in real time (30 Hz) with a resolution down to 0.21 pixel, significantly enhancing the accuracy and efficiency of the micropipette aspiration technique. As another advantage over manual measurements in terms of characterization accuracy, the micropipette aspiration system features precise synchronization between cell deformations and applied pressure changes. Experimental results on both solid-like cells (interstitial cells) and liquid-like cells (neutrophils) demonstrate the effectiveness of the system and the visual tracking algorithm. Among several characterized mechanical parameters, the viscoelastic properties of porcine aortic valve interstitial cells were, for the first time, quantified in this study.