Arrow Research · Search

Author name cluster

Yuxuan Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers (8)

AAAI 2026 · Conference Paper

MIRA: Evaluating Multimodal AI on Complex Clinical Reasoning in Interventional Radiology

  • Jingxiong Li
  • Chenglu Zhu
  • Sunyi Zheng
  • Yuxuan Sun
  • Yifei Wang
  • He Liu
  • Yunlong Zhang
  • Yixuan Si

We present MIRA (Multimodal Interventional RAdiology evaluation), a comprehensive benchmark for evaluating large multimodal models on expert-level interventional radiology tasks requiring specialized domain knowledge and advanced visual reasoning capabilities. Unlike existing medical benchmarks that primarily provide binary labels without contextual depth, MIRA offers diverse question formats, including open-ended, closed-ended, single-choice, and multiple-choice categories, each accompanied by detailed expert-validated explanations. The benchmark incorporates approximately 184K high-quality medical images spanning multiple imaging modalities, with 1.2M meticulously generated question-answer pairs across various anatomical regions. These pairs were created through a cascade methodology involving expert interventional radiologists at both the data collection and validation stages. Our comprehensive evaluation, encompassing zero-shot testing and fine-tuning experiments on large multimodal models, reveals significant performance gaps between AI systems and human specialists. Fine-tuning experiments demonstrate substantial improvements, with models achieving up to 0.80 accuracy on single-choice questions. MIRA establishes a challenging benchmark and suggests promising directions for developing specialized clinical AI systems for interventional radiology.
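As a concrete illustration of how a benchmark mixing question formats might be scored, here is a minimal Python sketch; the record schema and field names are assumptions for exposition, not MIRA's released format.

```python
# Hypothetical scoring loop over mixed question formats; the record schema
# ('format', 'answer', 'prediction') is illustrative only.
from collections import defaultdict

def score(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["format"]] += 1
        if r["format"] == "multiple-choice":
            # Exact set match: all correct options selected, no extras.
            correct = set(r["answer"]) == set(r["prediction"])
        else:  # open-ended answers would instead need expert or model-based grading
            correct = r["prediction"].strip().lower() == r["answer"].strip().lower()
        hits[r["format"]] += int(correct)
    return {fmt: hits[fmt] / totals[fmt] for fmt in totals}

print(score([
    {"format": "single-choice", "answer": "B", "prediction": "B"},
    {"format": "multiple-choice", "answer": ["A", "C"], "prediction": ["C", "A"]},
]))
```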

AAAI 2026 · Conference Paper

Towards Effective and Efficient Context-aware Nucleus Detection in Histopathology Whole Slide Images

  • Zhongyi Shui
  • Honglin Li
  • Yunlong Zhang
  • Yuxuan Sun
  • Yiwen Ye
  • Pingyi Chen
  • Ruizhe Guo
  • Lei Cui

Nucleus detection in histopathology whole slide images (WSIs) is crucial for a broad spectrum of clinical applications. The gigapixel size of WSIs necessitates the use of sliding window methodology for nucleus detection. However, mainstream methods process each sliding window independently, which overlooks broader contextual information and easily leads to inaccurate predictions. To address this limitation, recent studies additionally crop a large Field-of-View (LFoV) patch centered on each sliding window to extract contextual features. However, such methods substantially increase whole-slide inference latency. In this work, we propose an effective and efficient context-aware nucleus detection approach. Specifically, instead of using LFoV patches, we aggregate contextual clues from off-the-shelf features of historically visited sliding windows, which greatly enhances the inference efficiency. Moreover, compared to the LFoV patches used in previous works, the sliding window patches have higher magnification and provide finer-grained tissue details, thereby enhancing the classification accuracy. To develop the proposed context-aware model, we utilize annotated patches along with their surrounding unlabeled patches for training. Beyond exploiting high-level tissue context from these surrounding regions, we design a post-training strategy that leverages abundant unlabeled nucleus samples within them to enhance the model's context adaptability. Extensive experimental results on three challenging benchmarks demonstrate the superiority of our method.
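The efficiency idea, reusing features of already-visited sliding windows as context instead of encoding a large-FoV patch per window, can be pictured as a small feature cache. A minimal PyTorch sketch follows; the mean-pooled fusion is an illustrative placeholder for the paper's aggregation module.

```python
import torch

class ContextCache:
    """Cache per-window features keyed by grid position and aggregate
    already-visited neighbors as context for the current window."""
    def __init__(self):
        self.feats: dict[tuple[int, int], torch.Tensor] = {}

    def add(self, pos: tuple[int, int], feat: torch.Tensor) -> None:
        self.feats[pos] = feat.detach()  # off-the-shelf features, no backprop

    def context(self, pos: tuple[int, int], radius: int = 1) -> torch.Tensor | None:
        i, j = pos
        neighbors = [self.feats[(i + di, j + dj)]
                     for di in range(-radius, radius + 1)
                     for dj in range(-radius, radius + 1)
                     if (di, dj) != (0, 0) and (i + di, j + dj) in self.feats]
        # Mean pooling stands in for the paper's learned fusion.
        return torch.stack(neighbors).mean(0) if neighbors else None
```

In raster-order inference, the windows above and to the left of the current one are already encoded, so their cached features supply context at no additional encoding cost.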

NeurIPS 2025 · Conference Paper

CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic

  • Yuxuan Sun
  • Yixuan Si
  • Chenglu Zhu
  • Kai Zhang
  • Zhongyi Shui
  • Bowen Ding
  • Tao Lin
  • Lin Yang

Recent advances in computational pathology have led to the emergence of numerous foundation models. These models typically rely on general-purpose encoders with multi-instance learning for whole slide image (WSI) classification or apply multimodal approaches to generate reports directly from images. However, these models cannot emulate the diagnostic approach of pathologists, who systematically examine slides at low magnification to obtain an overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. Instead, existing models directly output final diagnoses without revealing the underlying reasoning process. To address this gap, we introduce CPathAgent, an innovative agent-based approach that mimics pathologists' diagnostic workflow by autonomously navigating across WSIs through zoom-in/out and move operations based on observed visual features, thereby generating substantially more transparent and interpretable diagnostic summaries. To achieve this, we develop a multi-stage training strategy that unifies patch-level, region-level, and WSI-level capabilities within a single model, which is essential for replicating how pathologists understand and reason across diverse image scales. Additionally, we construct PathMMU-HR², the first expert-validated benchmark for large region analysis. This represents a critical intermediate scale between patches and whole slides, reflecting the clinical reality that pathologists typically examine several key large regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across benchmarks at three different image scales, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for computational pathology.
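The navigation behavior the abstract describes can be pictured as a simple viewport loop over the slide. In the sketch below, `policy` is a hypothetical stand-in for the agent's vision-language model and the action set is an assumption; it is not CPathAgent's actual interface.

```python
import numpy as np

ACTIONS = ("zoom_in", "zoom_out", "left", "right", "up", "down", "stop")

def navigate(wsi: np.ndarray, policy, max_steps: int = 20, view: int = 512):
    """Iteratively crop the slide and let `policy(crop) -> action` decide
    where to look next, starting from a low-magnification overview."""
    x = y = 0
    scale = 8  # downsampling factor; 8 ~= a low-power overview
    trace = []
    for _ in range(max_steps):
        size = view * scale
        crop = wsi[y:y + size:scale, x:x + size:scale]  # cheap stand-in for pyramid access
        action = policy(crop)
        trace.append((x, y, scale, action))
        if action == "stop":
            break
        if action == "zoom_in":
            scale = max(1, scale // 2)
        elif action == "zoom_out":
            scale = min(8, scale * 2)
        elif action == "right":
            x = max(0, min(wsi.shape[1] - size, x + size // 2))
        elif action == "left":
            x = max(0, x - size // 2)
        elif action == "down":
            y = max(0, min(wsi.shape[0] - size, y + size // 2))
        elif action == "up":
            y = max(0, y - size // 2)
    return trace
```

The returned trace of (position, scale, action) tuples is what makes the diagnostic process inspectable, in contrast to models that emit only a final label.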

ICML 2025 · Conference Paper

FlatQuant: Flatness Matters for LLM Quantization

  • Yuxuan Sun
  • Ruikang Liu
  • Haoli Bai
  • Han Bao
  • Kang Zhao
  • Yuening Li
  • Jiaxin Hu
  • Xianzhi Yu

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies an optimal affine transformation for each linear layer, calibrated in hours via a lightweight objective. To reduce the runtime overhead of these transformations, we construct each one as a Kronecker product of two lightweight matrices and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at https://github.com/ruikangliu/FlatQuant.
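To make the Kronecker trick concrete: for a transform factored as A ⊗ B, the product x @ (A ⊗ B) reduces to two small matrix multiplies on a reshaped x, so the full (n1·n2)² matrix never needs to be materialized. The PyTorch sketch below pairs this with a simple symmetric 4-bit fake-quantizer; it illustrates the idea under these assumptions and is not FlatQuant's actual implementation.

```python
import torch

def fake_quant_int4(x: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor 4-bit fake quantization (illustrative)."""
    scale = x.abs().max() / 7  # signed int4 range is [-8, 7]
    return torch.round(x / scale).clamp(-8, 7) * scale

def kron_transform(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Compute x @ (A kron B) without materializing the (n1*n2)^2 matrix.

    x: (batch, n1*n2), A: (n1, n1), B: (n2, n2). Row-wise this equals
    reshaping each row to (n1, n2) and computing A^T @ X @ B.
    """
    n1, n2 = A.shape[0], B.shape[0]
    X = x.reshape(-1, n1, n2)
    return (A.T @ X @ B).reshape(x.shape[0], n1 * n2)

# Toy comparison: per-tensor int4 error with and without an orthogonal transform.
torch.manual_seed(0)
x = torch.randn(4, 64)
x[:, 3] *= 50  # inject an outlier channel
A = torch.linalg.qr(torch.randn(8, 8)).Q
B = torch.linalg.qr(torch.randn(8, 8)).Q
err_plain = (fake_quant_int4(x) - x).norm()
x_t = kron_transform(x, A, B)                            # flatten, then quantize
x_back = kron_transform(fake_quant_int4(x_t), A.T, B.T)  # invert the orthogonal transform
print(err_plain, (x_back - x).norm())
```

FlatQuant additionally learns these matrices per layer from a calibration objective; random orthogonal factors here merely show the mechanics and the cost savings.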

TMLR 2025 · Journal Article

Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust

  • Zhuo Zhi
  • Yuxuan Sun
  • Qiangqiang Wu
  • Ziquan Liu
  • Miguel R. D. Rodrigues

Multimodal fusion with a multimodal transformer is an effective method for both early and late fusion paradigms. However, in a multimodal transformer, modality fusion is performed solely through the self-attention mechanism, which was originally designed for unimodal token sequences. To improve the self-attention mechanism for handling multimodal input, a parametric adapter model, such as the Q-Former in BLIP-2, is often used to align tokens from different modalities. Our empirical study unveils that using only the self-attention layer to perform modality fusion makes the model less robust to missing modalities and input noise, as the model comes to rely too heavily on a single modality. To improve the robustness of the transformer, our paper proposes an implicit approach based on Wasserstein distance that aligns tokens from different modalities without using any additional trainable parameters. Our empirical study shows that implicit modality alignment improves the effectiveness of the multimodal transformer in discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four downstream task datasets, covering both 2-modality and 3-modality tasks, and consider different fusion paradigms, i.e., early and late fusion. The experimental results show that our proposed method yields significant improvements in both performance and robustness over all baselines across all datasets and fusion paradigms.
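As an illustration of a parameter-free alignment term, the sketch below computes an entropic-regularized Wasserstein cost between two modalities' token sets with plain Sinkhorn iterations. Treat it as an assumed approximation for exposition, not necessarily the exact estimator used in the paper.

```python
import torch

def sinkhorn_wasserstein(a: torch.Tensor, b: torch.Tensor,
                         eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropic-regularized OT cost between token sets a: (n, d), b: (m, d).

    Uniform marginals, squared Euclidean ground cost, no trainable
    parameters; the result is differentiable w.r.t. both token sets.
    """
    C = torch.cdist(a, b) ** 2
    C = C / C.max()  # normalize cost so exp(-C/eps) stays well-conditioned
    n, m = C.shape
    mu = torch.full((n,), 1.0 / n, device=C.device)
    nu = torch.full((m,), 1.0 / m, device=C.device)
    K = torch.exp(-C / eps)
    u = torch.ones_like(mu)
    for _ in range(iters):  # Sinkhorn fixed-point updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan
    return (P * C).sum()

# Hypothetical use: add the alignment term to the task loss during training.
img_tokens, txt_tokens = torch.randn(16, 256), torch.randn(12, 256)
align_loss = sinkhorn_wasserstein(img_tokens, txt_tokens)
```

Because the term has no parameters of its own, it regularizes the encoders toward aligned token distributions without adding an adapter like the Q-Former.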

AAAI 2024 · Conference Paper

PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology

  • Yuxuan Sun
  • Chenglu Zhu
  • Sunyi Zheng
  • Kai Zhang
  • Lin Sun
  • Zhongyi Shui
  • Yunlong Zhang
  • Honglin Li

As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, offering significant applications in interpreting natural images. However, the field of pathology has largely remained untapped, particularly in gathering high-quality data and designing comprehensive model frameworks. To bridge the gap in pathology MLLMs, we present PathAsst, a multimodal generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. The development of PathAsst involves three pivotal steps: data acquisition, CLIP model adaptation, and the training of PathAsst's multimodal generative capabilities. Firstly, we collect over 207K high-quality pathology image-text pairs from authoritative sources. Leveraging ChatGPT, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data specifically tailored for invoking eight pathology-specific sub-models we prepared, allowing PathAsst to collaborate effectively with these models and enhancing its diagnostic ability. Secondly, by leveraging the collected data, we construct PathCLIP, a pathology-dedicated CLIP, to enhance PathAsst's capabilities in interpreting pathology images. Finally, we integrate PathCLIP with Vicuna-13B and utilize pathology-specific instruction-tuning data to enhance the multimodal generation capacity of PathAsst and bolster its synergistic interactions with sub-models. The experimental results of PathAsst show the potential of harnessing AI-powered generative foundation models to improve pathology diagnosis and treatment processes. We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing, at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology.
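One way to picture the sub-model invocation step: the assistant emits a structured call, and a dispatcher routes it to the matching pathology model. The tag syntax, registry, and model names below are invented for illustration; the paper's actual invocation protocol may differ.

```python
# Hypothetical dispatcher routing model-emitted tool calls to pathology
# sub-models; names and the <call:...> tag syntax are illustrative only.
import re

SUB_MODELS = {
    "nucleus_detection": lambda img: f"nucleus detections for {img}",
    "her2_scoring": lambda img: f"HER2 score for {img}",
}

def dispatch(generation: str, image_path: str) -> str:
    match = re.search(r"<call:(\w+)>", generation)
    if match and match.group(1) in SUB_MODELS:
        return SUB_MODELS[match.group(1)](image_path)
    return generation  # no tool call: return the plain answer

print(dispatch("Let me check the stain. <call:her2_scoring>", "slide_017.png"))
```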

AAAI 2023 · Conference Paper

A Data Source for Reasoning Embodied Agents

  • Jack Lanchantin
  • Sainbayar Sukhbaatar
  • Gabriel Synnaeve
  • Yuxuan Sun
  • Kavya Srinet
  • Arthur Szlam

Recent progress in using machine learning models for reasoning tasks has been driven by novel model architectures, large-scale pre-training protocols, and dedicated reasoning datasets for fine-tuning. In this work, to further pursue these advances, we introduce a new data generator for machine reasoning that integrates with an embodied agent. The generated data consists of templated text queries and answers, matched with world-states encoded into a database. The world-states are a result of both world dynamics and the actions of the agent. We report results for several baseline models on instantiations of the generated training sets, including pre-trained language models fine-tuned on a text-formatted representation of the database and graph-structured Transformers operating on a knowledge-graph representation of the database. We find that these models can answer some questions about the world-state but struggle with others. These results hint at new research directions in designing neural reasoning models and database representations. Code to generate the data and train the models will be released at github.com/facebookresearch/neuralmemory.
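The generator's core move, pairing a templated text query with an answer computed directly from a world-state database, can be sketched in a few lines; the world-state schema and template below are invented for illustration.

```python
# Toy version of templated QA over a world-state; the schema is illustrative.
world = {
    "agent": {"pos": (3, 1, 4)},
    "objects": [
        {"name": "red_cube", "pos": (2, 1, 4)},
        {"name": "blue_sphere", "pos": (9, 1, 0)},
    ],
}

def closest_object_qa(state):
    """Instantiate one template: the answer is derived from the state,
    so question-answer pairs are correct by construction."""
    agent_pos = state["agent"]["pos"]
    closest = min(state["objects"],
                  key=lambda o: sum((a - b) ** 2 for a, b in zip(agent_pos, o["pos"])))
    return "What is the closest object to the agent?", closest["name"]

q, a = closest_object_qa(world)
print(q, "->", a)  # What is the closest object to the agent? -> red_cube
```

Serializing the same state as flat text or as a knowledge graph is what lets the two baseline families (fine-tuned language models and graph-structured Transformers) consume identical supervision.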

AAAI 2023 · Conference Paper

Mind the Gap: Polishing Pseudo Labels for Accurate Semi-supervised Object Detection

  • Lei Zhang
  • Yuxuan Sun
  • Wei Wei

Exploiting pseudo labels (e.g., categories and bounding boxes) of unannotated objects produced by a teacher detector has underpinned much of the recent progress in semi-supervised object detection (SSOD). However, due to the limited generalization capacity of the teacher detector caused by scarce annotations, the produced pseudo labels often deviate from the ground truth, especially those with relatively low classification confidence, thus limiting the generalization performance of SSOD. To mitigate this problem, we propose a dual pseudo-label polishing framework for SSOD. Instead of directly exploiting the pseudo labels produced by the teacher detector, we make a first attempt at reducing their deviation from the ground truth through dual polishing learning, where two differently structured polishing networks are developed and trained on synthesized pairs of pseudo labels and the corresponding ground truth for categories and bounding boxes on the given annotated objects, respectively. By doing this, both polishing networks can infer more accurate pseudo labels for unannotated objects by sufficiently exploiting their context knowledge based on the initially produced pseudo labels, and thus improve the generalization performance of SSOD. Moreover, this scheme can be seamlessly plugged into an existing SSOD framework for joint end-to-end learning. In addition, we propose to disentangle the polished pseudo categories and bounding boxes of unannotated objects for separate category classification and bounding-box regression in SSOD, which enables introducing more unannotated objects during model training and thus further improves performance. Experiments on both PASCAL VOC and MS-COCO benchmarks demonstrate the superiority of the proposed method over existing state-of-the-art baselines. The code can be found at https://github.com/snowdusky/DualPolishLearning.
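The box-polishing half of the framework can be sketched as a small regressor trained on labeled images, where noisy pseudo boxes are synthesized by jittering the ground truth. The architecture below is a deliberately simplified stand-in; the paper's polishing networks also consume context features.

```python
import torch
import torch.nn as nn

class BoxPolisher(nn.Module):
    """Refine a pseudo box: input (x, y, w, h, confidence), output a
    correction added to the box coordinates. A simplified stand-in for
    the paper's bounding-box polishing network."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(5, hidden), nn.ReLU(), nn.Linear(hidden, 4))

    def forward(self, pseudo: torch.Tensor) -> torch.Tensor:
        return pseudo[:, :4] + self.net(pseudo)

# Training pairs on annotated objects: jitter GT boxes to mimic pseudo-label noise.
gt = torch.rand(32, 4)
pseudo = torch.cat([gt + 0.05 * torch.randn(32, 4), torch.rand(32, 1)], dim=1)
model = BoxPolisher()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.SmoothL1Loss()(model(pseudo), gt)
opt.zero_grad(); loss.backward(); opt.step()
```

At inference on unlabeled images, the trained polisher corrects the teacher's boxes before they are used as supervision, which is what reduces the pseudo-label deviation the abstract describes.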