Arrow Research search

Author name cluster

Junjun He

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

14

JBHI Journal 2026 Journal Article

DiffM4RI: A Latent Diffusion Model With Modality Inpainting for Synthesizing Missing Modalities in MRI Analysis

  • Wen Ye
  • Zhetao Guo
  • Yuxiang Ren
  • Yi Tian
  • Yushi Shen
  • Zan Chen
  • Junjun He
  • Jing Ke

Foundation Models (FMs) have shown great promise for multimodal medical image analysis such as Magnetic Resonance Imaging (MRI). However, certain MRI sequences may be unavailable due to various constraints, such as limited scanning time, patient discomfort, or scanner limitations. The absence of certain modalities can hinder the performance of FMs in clinical applications, making effective missing modality imputation crucial for ensuring their applicability. Previous approaches, including generative adversarial networks (GANs), have been employed to synthesize missing modalities in either a one-to-one or many-to-one manner. However, these methods have limitations: they require training a new model for each missing-modality scenario and are prone to mode collapse, generating limited diversity in the synthesized images. To address these challenges, we propose DiffM4RI, a diffusion model for many-to-many missing modality imputation in MRI. DiffM4RI innovatively formulates missing modality imputation as a modality-level inpainting task, enabling it to handle arbitrary missing modality situations without training multiple networks. Experiments on the BraTS datasets demonstrate that DiffM4RI achieves an average SSIM improvement of 0.15 over MustGAN, 0.1 over SynDiff, and 0.02 over VQ-VAE-2. These results highlight the potential of DiffM4RI in enhancing the reliability of FMs in clinical applications.
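
To make the modality-inpainting formulation concrete, here is a minimal sketch (not the paper's code) of how observed MRI sequences might be kept while missing ones are initialized from noise behind a channel-wise mask; the four-modality layout and all names are illustrative assumptions.

```python
import torch

# Hypothetical illustration of modality-level inpainting: observed MRI
# sequences are kept, missing ones start from noise, and a channel-wise
# mask tells the denoiser which modalities to synthesize.
MODALITIES = ["T1", "T1ce", "T2", "FLAIR"]          # assumed BraTS-style layout

def build_inpainting_input(latents: torch.Tensor, present: list):
    """latents: (B, 4, H, W) latent stack, one channel per modality."""
    mask = torch.tensor([m in present for m in MODALITIES], dtype=latents.dtype)
    mask = mask.view(1, -1, 1, 1)                   # broadcast over B, H, W
    noise = torch.randn_like(latents)
    # Keep observed modalities, start missing ones from pure noise.
    return mask * latents + (1.0 - mask) * noise, mask

latents = torch.randn(2, 4, 32, 32)                 # toy latent batch
x, mask = build_inpainting_input(latents, present=["T1", "FLAIR"])
print(x.shape)                                      # mask keeps T1 and FLAIR only
```

Because the mask can select any subset of channels, a single trained model covers every missing-modality combination, which is the property the abstract contrasts with one-to-one GAN approaches.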

AAAI Conference 2026 Conference Paper

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and a Comprehensive Multimodal Dataset Towards General Medical AI

  • Tianbin Li
  • Yanzhou Su
  • Wei Li
  • Bin Fu
  • Zhe Chen
  • Ziyan Huang
  • Guoan Wang
  • Chenglong Ma

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we present GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a 7B-parameter general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.

JBHI Journal 2026 Journal Article

MedSegAgent: A Universal and Scalable Multi-Agent System for Instructive Medical Image Segmentation

  • Ziyan Huang
  • Haoyu Wang
  • Jin Ye
  • Yuanfeng Ji
  • Xiaowei Hu
  • Lihao Liu
  • Zhikai Yang
  • Wei Li

Medical image segmentation is vital for clinical diagnosis and treatment; however, current solutions face three major limitations: (1) the lack of a universal framework capable of handling diverse modalities and anatomical targets, (2) the limited scalability to adapt to evolving clinical needs and new datasets, and (3) the lack of instructive interfaces that make models usable for non-expert users. To address these challenges, this paper presents MedSegAgent, a universal and scalable multi-agent system for instructive medical image segmentation. Specifically, MedSegAgent comprises five agents: one query parsing agent that processes natural language requests, three coarse-to-fine filtering agents (modality filtering, anatomical filtering, and label selection) for identifying relevant datasets and label values, and one execution agent responsible for model inference and result integration. Based on this framework, MedSegAgent utilizes 23 diverse datasets and pre-trained models to perform 343 types of segmentation across various modalities and anatomical targets. Experimental results demonstrate that MedSegAgent simplifies model selection while maintaining high performance, accurately identifying matching datasets and labels in 94.27% of queries and locating at least one suitable match in 99.03% of queries. MedSegAgent offers a universal and scalable solution for diverse medical image segmentation tasks, bridging the gap between user-friendly queries and the complexities of model selection and deployment. Our code is publicly available at https://github.com/uni-medical/MedSegAgent.
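
As a rough sketch of the coarse-to-fine routing the abstract describes, the snippet below narrows a parsed query by modality, then anatomy, then label, over a toy registry; the registry entries and field names are invented for illustration, not MedSegAgent's actual schema.

```python
# Invented registry standing in for the 23 datasets the system indexes.
REGISTRY = [
    {"dataset": "AbdomenCT", "modality": "CT",  "anatomy": "liver", "labels": ["liver", "tumor"]},
    {"dataset": "BrainMR",   "modality": "MRI", "anatomy": "brain", "labels": ["edema", "tumor"]},
]

def route(query: dict):
    """Coarse-to-fine filtering: modality -> anatomy -> label selection."""
    hits = [e for e in REGISTRY if e["modality"] == query["modality"]]
    hits = [e for e in hits if e["anatomy"] == query["anatomy"]]
    return [(e["dataset"], lab) for e in hits for lab in e["labels"]
            if lab == query["target"]]

# A query-parsing agent would produce this dict from natural language.
print(route({"modality": "CT", "anatomy": "liver", "target": "tumor"}))
# -> [('AbdomenCT', 'tumor')], handed to the execution agent for inference
```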

AAAI Conference 2026 Conference Paper

S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything Without Supervision

  • Huihui Xu
  • Jin Ye
  • Hongqiu Wang
  • Changkai Ji
  • Jiashi Lin
  • Ming Hu
  • Ziyan Huang
  • Ying Chen

Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between training epochs. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these issues, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks.
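
For intuition, here is a toy single layer of agglomerative pooling in the spirit of UniAP; the mutual-nearest-neighbour rule and threshold are illustrative guesses, not the paper's exact merge criterion.

```python
import torch

def agglomerate_step(feats: torch.Tensor, thresh: float = 0.9):
    """feats: (N, D) node features; returns a group id per node."""
    f = torch.nn.functional.normalize(feats, dim=1)
    sim = f @ f.T                                   # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)
    nn_idx = sim.argmax(dim=1)                      # nearest neighbour per node
    group = torch.arange(feats.size(0))
    for i in range(feats.size(0)):
        j = nn_idx[i].item()
        # Merge only mutual nearest neighbours above the threshold.
        if nn_idx[j].item() == i and sim[i, j] >= thresh:
            group[max(i, j)] = group[min(i, j)]
    return group

feats = torch.randn(3, 8).repeat_interleave(2, dim=0)   # three duplicate pairs
print(agglomerate_step(feats))                          # tensor([0, 0, 2, 2, 4, 4])
```

Stacking such layers yields progressively coarser groups, which is how a single pass can emit multi-granular pseudo-masks.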

AIIM Journal 2025 Journal Article

A survey for large language models in biomedicine

  • Chong Wang
  • Mengyao Li
  • Junjun He
  • Zhongruo Wang
  • Erfan Darzi
  • Zan Chen
  • Jin Ye
  • Tianbin Li

Recent breakthroughs in large language models (LLMs) offer unprecedented natural language understanding and generation capabilities. However, existing surveys on LLMs in biomedicine often focus on specific applications or model architectures, lacking a comprehensive analysis that integrates the latest advancements across various biomedical domains. This review, based on an analysis of 484 publications sourced from databases including PubMed, Web of Science, and arXiv, provides an in-depth examination of the current landscape, applications, challenges, and prospects of LLMs in biomedicine, distinguishing itself by focusing on the practical implications of these models in real-world biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot learning across a broad spectrum of biomedical tasks, including diagnostic assistance, drug discovery, and personalized medicine, among others, with insights drawn from 137 key studies. Then, we discuss adaptation strategies for LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to enhance their performance in specialized biomedical contexts where zero-shot learning falls short, such as medical question answering and efficient processing of biomedical literature. Finally, we discuss the challenges that LLMs face in the biomedical domain, including data privacy concerns, limited model interpretability, and issues with dataset quality, as well as the ethical questions raised by the sensitive nature of biomedical data, the need for highly reliable model outputs, and the implications of deploying AI in healthcare. To address these challenges, we also identify future research directions for LLMs in biomedicine, including federated learning methods to preserve data privacy and explainable AI methodologies to enhance the transparency of LLMs. As this field rapidly evolves, continued research and development are essential to fully harness the capabilities of LLMs in biomedicine while ensuring their responsible and effective deployment.

NeurIPS Conference 2025 Conference Paper

AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation

  • Qingqiu Li
  • Zihang Cui
  • Seongsu Bae
  • Jilan Xu
  • Runtian Yuan
  • Yuejie Zhang
  • Rui Feng
  • Quanli Shen

Chest X-rays (CXRs) are the most frequently performed imaging examinations in clinical settings. Recent advancements in Medical Large Multimodal Models (MLMMs) have enabled automated CXR interpretation, improving diagnostic accuracy and efficiency. However, despite their strong visual understanding, current MLMMs still face two major challenges: (1) insufficient region-level understanding and interaction, and (2) limited accuracy and interpretability due to single-step prediction. In this paper, we address these challenges by empowering MLMMs with anatomy-centric reasoning capabilities to enhance their interactivity and explainability. Specifically, we propose an Anatomical Ontology-Guided Reasoning (AOR) framework that accommodates both textual and optional visual prompts, centered on region-level information to enable multimodal multi-step reasoning. We also develop AOR-Instruction, a large instruction dataset for MLMM training, curated under the guidance of expert physicians. Our experiments demonstrate AOR's superior performance in both Visual Question Answering (VQA) and report generation tasks. Code and data are available at: https://github.com/Liqq1/AOR.
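
Purely as an illustration of region-grounded multi-step prompting (the ontology entries and step wording below are invented, not AOR-Instruction content):

```python
# Toy anatomical ontology; real ontologies are far richer.
ONTOLOGY = {"right lung": ["right upper lobe", "right middle lobe", "right lower lobe"]}

def build_reasoning_prompt(region: str, question: str) -> str:
    """Chain a locate -> describe -> answer sequence around one region."""
    parts = ONTOLOGY.get(region, [])
    return "\n".join([
        f"Step 1: locate '{region}' in the chest X-ray.",
        f"Step 2: describe findings in '{region}' ({', '.join(parts)}).",
        f"Step 3: using steps 1-2, answer: {question}",
    ])

print(build_reasoning_prompt("right lung", "Is there consolidation?"))
```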

ICLR Conference 2025 Conference Paper

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation

  • Peng Gao 0007
  • Le Zhuo
  • Dongyang Liu
  • Ruoyi Du
  • Xu Luo
  • Longtian Qiu
  • Yuhang Zhang
  • Rongjie Huang 0001

Sora unveils the potential of scaling Diffusion Transformers (DiT) for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this paper, we introduce the Lumina-T2X family, a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a simple and scalable generative framework that can be adapted to various modalities, e.g., transforming noise into images, videos, multi-view 3D objects, or audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. Advanced techniques like RoPE, KQ-Norm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling Lumina-T2X models to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational cost of a 600-million-parameter naive DiT (PixArt-alpha), indicating that increasing the number of parameters significantly accelerates the convergence of generative models without compromising visual quality. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. All code and checkpoints of Lumina-T2X are released at https://github.com/Alpha-VLLM/Lumina-T2X to further foster creativity, transparency, and diversity in the generative AI community.
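
The abstract credits flow matching for Flag-DiT's training stability; below is the textbook rectified-flow loss in that family, shown generically (this is not Lumina-T2X's released code, and `model` stands in for any denoising network).

```python
import torch

def flow_matching_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Regress the constant velocity along a straight noise-to-data path."""
    b = x0.size(0)
    t = torch.rand(b, 1, 1, 1)               # per-sample time in [0, 1)
    x1 = torch.randn_like(x0)                # Gaussian endpoint
    xt = (1.0 - t) * x0 + t * x1             # linear interpolation path
    v_target = x1 - x0                       # target velocity field
    v_pred = model(xt, t.flatten())
    return torch.mean((v_pred - v_target) ** 2)

model = lambda x, t: torch.zeros_like(x)     # stand-in for a DiT backbone
print(flow_matching_loss(model, torch.randn(4, 3, 8, 8)))
```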

NeurIPS Conference 2025 Conference Paper

Reliable Lifelong Multimodal Editing: Conflict-Aware Retrieval Meets Multi-Level Guidance

  • Qiang Zhang
  • Fanrui Zhang
  • Jiawei Liu
  • Ming Hu
  • Junjun He
  • Zheng-Jun Zha

The dynamic nature of real-world information demands efficient knowledge editing in multimodal large language models (MLLMs) to ensure continuous knowledge updates. However, existing methods often struggle with precise matching in large-scale knowledge retrieval and lack multi-level guidance for coordinated editing, leading to less reliable outcomes. To tackle these challenges, we propose CARML, a novel retrieval-augmented editing framework that integrates conflict-aware dynamic retrieval with multi-level implicit and explicit guidance for reliable lifelong multimodal editing. Specifically, CARML introduces intra-modal uncertainty and inter-modal conflict quantification to dynamically integrate multi-channel retrieval results, so as to pinpoint the knowledge most relevant to the incoming edit sample. Afterwards, an edit scope classifier discerns whether the edit sample semantically aligns with the edit scope of the retrieved knowledge. If deemed in-scope, CARML refines the retrieved knowledge into information-rich continuous prompt prefixes that serve as the implicit knowledge guide. These prefixes not only include static knowledge prompts that capture key textual semantics but also incorporate token-level, context-aware dynamic prompts to explore fine-grained cross-modal associations between the edit sample and the retrieved knowledge. To further enhance reliability, CARML incorporates a "hard correction" mechanism, leveraging explicit label knowledge to adjust the model's output logits. Extensive experiments across multiple MLLMs and datasets indicate the superior performance of CARML in lifelong multimodal editing scenarios.
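
The exact conflict and uncertainty measures are not given in this listing; the sketch below is one plausible reading, using softmax entropy as intra-modal uncertainty and the per-candidate score gap between two retrieval channels as inter-modal conflict.

```python
import torch

def fuse_scores(img_sims: torch.Tensor, txt_sims: torch.Tensor) -> torch.Tensor:
    """Hypothetical conflict-aware fusion of two retrieval channels."""
    def confidence(s):
        p = torch.softmax(s, dim=0)
        entropy = -(p * p.clamp_min(1e-9).log()).sum()
        return 1.0 / (1.0 + entropy)          # low entropy -> high confidence
    w_img, w_txt = confidence(img_sims), confidence(txt_sims)
    conflict = (img_sims - txt_sims).abs()    # per-candidate disagreement
    fused = (w_img * img_sims + w_txt * txt_sims) / (w_img + w_txt)
    return fused - 0.5 * conflict             # penalise conflicting evidence

scores = fuse_scores(torch.tensor([0.9, 0.2, 0.1]), torch.tensor([0.3, 0.8, 0.1]))
print(scores.argmax().item())                 # index of the retrieved edit record
```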

NeurIPS Conference 2025 Conference Paper

Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery

  • Ming Hu
  • Zhengdi Yu
  • Feilong Tang
  • Kaiwen Chen
  • Yulong Li
  • Imran Razzak
  • Junjun He
  • Tolga Birdal

Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, data-driven motion priors with cross-view geometric consistency and biomechanical constraints, and collision-aware constraints for hand-instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks—bimanual hand pose estimation and hand–instrument interaction reconstruction—and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand–two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and a collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2 mm in Mean Per Joint Position Error (MPJPE) and up to 23% in ADD-S metrics for hand and instrument reconstruction, respectively.
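
Weak-perspective camera modeling, which the abstract names, has a standard form in hand-mesh recovery: depth is collapsed into a single scale plus a 2D translation. A minimal sketch follows (parameter names are generic, not taken from H-Net/OH-Net code).

```python
import torch

def weak_perspective(joints3d: torch.Tensor, s: torch.Tensor, t: torch.Tensor):
    """joints3d: (J, 3); s: scalar scale; t: (2,) image-plane translation."""
    return s * joints3d[:, :2] + t            # scaled orthographic projection

joints = torch.randn(21, 3)                   # 21 MANO-style hand joints
uv = weak_perspective(joints, torch.tensor(2.5), torch.tensor([0.1, -0.2]))
print(uv.shape)                               # torch.Size([21, 2])
```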

NeurIPS Conference 2024 Conference Paper

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

  • Pengcheng Chen
  • Jin Ye
  • Guoan Wang
  • Yanjun Li
  • Zhongying Deng
  • Wei Li
  • Tianbin Li
  • Haodong Duan

Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, focus mainly on a single domain, and lack varying perceptual granularities. Thus, they face specific challenges, including limited clinical relevance, incomplete evaluations, and insufficient guidance for interactive LVLMs. To address these limitations, we developed GMAI-MMBench, the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple perceptual granularities. It is constructed from 284 datasets across 38 medical image modalities, 18 clinically related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format. Additionally, we implemented a lexical tree structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o only achieves an accuracy of 53.96%, indicating significant room for improvement. Moreover, we identified five key insufficiencies in current cutting-edge LVLMs that need to be addressed to advance the development of better medical applications. We believe that GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.
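
The lexical tree is essentially a hierarchy from which users select evaluation subsets; a toy rendering of that idea (the tree contents below are invented, not benchmark data):

```python
# modality -> department -> task -> VQA items, all illustrative.
TREE = {
    "CT":  {"Radiology": {"organ recognition": ["q1", "q2"]}},
    "MRI": {"Neurology": {"tumor grading": ["q3"]}},
}

def collect(node):
    """Flatten every VQA item under a chosen subtree."""
    if isinstance(node, list):
        return node
    return [q for child in node.values() for q in collect(child)]

print(collect(TREE["CT"]))    # customise evaluation to CT tasks: ['q1', 'q2']
print(len(collect(TREE)))     # or take the whole benchmark: 3 items
```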

ICML Conference 2024 Conference Paper

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

  • Dongyang Liu
  • Renrui Zhang
  • Longtian Qiu
  • Siyuan Huang 0004
  • Weifeng Lin
  • Shitian Zhao
  • Shijie Geng
  • Ziyi Lin

We propose SPHINX-X, an extensive Multi-modality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying the multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain, multi-modal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending its diversity and generality. By training over different base LLMs, including TinyLlama-1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral-8×7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
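
The skip-token trick can be pictured as follows: a sub-image that is pure padding contributes one learnable placeholder instead of its full patch sequence, shortening the context. A hypothetical sketch with illustrative shapes, not SPHINX-X's actual code:

```python
import torch

skip_token = torch.nn.Parameter(torch.zeros(1, 64))      # learnable placeholder

def pack_subimages(sub_tokens: list, is_padded: list):
    """sub_tokens: per-sub-image (T, 64) sequences; padded ones collapse to 1 token."""
    parts = [skip_token if pad else toks
             for toks, pad in zip(sub_tokens, is_padded)]
    return torch.cat(parts, dim=0)

subs = [torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64)]
print(pack_subimages(subs, is_padded=[False, True, False]).shape)  # (33, 64)
```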

NeurIPS Conference 2024 Conference Paper

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

  • Pedro R. Bassi
  • Wenxuan Li
  • Yucheng Tang
  • Fabian Isensee
  • Zifu Wang
  • Jieneng Chen
  • Yu-Cheng Chou
  • Saikat Roy

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks (which, differing from algorithms, are more flexible and can support different algorithms), including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
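
The listing does not spell out the scoring metric, but abdominal-organ segmentation benchmarks are conventionally scored with the Dice similarity coefficient; a generic per-organ implementation for reference:

```python
import torch

def dice(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """pred, gt: boolean organ masks of the same shape."""
    inter = (pred & gt).sum().float()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

pred = torch.zeros(64, 64, dtype=torch.bool); pred[8:40, 8:40] = True
gt = torch.zeros(64, 64, dtype=torch.bool); gt[16:48, 16:48] = True
print(round(dice(pred, gt).item(), 3))        # 0.562 for two shifted squares
```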

ICLR Conference 2023 Conference Paper

Vision Transformer Adapter for Dense Predictions

  • Zhe Chen 0017
  • Yuchen Duan
  • Wenhai Wang
  • Junjun He
  • Tong Lu 0002
  • Jifeng Dai
  • Yu Qiao 0001

This work investigates a simple yet powerful dense prediction task adapter for the Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers from inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows a plain ViT to achieve performance comparable to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce image-related inductive biases into the model, making it suitable for these tasks. We verify the ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields a state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter can serve as an alternative for vision-specific transformers and facilitate future research. Code and models will be released at https://github.com/czczup/ViT-Adapter.
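
A heavily simplified sketch of the adapter idea (not the released ViT-Adapter code): a small convolutional branch supplies spatial priors that ViT tokens attend to, injecting image-related inductive bias without changing the backbone.

```python
import torch
import torch.nn as nn

class Injector(nn.Module):
    """Toy injector: ViT tokens cross-attend to a convolutional prior."""
    def __init__(self, dim: int = 192):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, dim, 16, stride=16), nn.GELU())
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, vit_tokens: torch.Tensor, image: torch.Tensor):
        prior = self.conv(image).flatten(2).transpose(1, 2)   # (B, HW, dim)
        injected, _ = self.attn(vit_tokens, prior, prior)     # tokens query prior
        return vit_tokens + injected                          # residual injection

tokens = torch.randn(1, 196, 192)             # tokens from a plain ViT block
print(Injector()(tokens, torch.randn(1, 3, 224, 224)).shape)  # (1, 196, 192)
```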

AAAI Conference 2020 Conference Paper

Dynamic Sampling Network for Semantic Segmentation

  • Bin Fu
  • Junjun He
  • Zhengfu Zhang
  • Yu Qiao

Sampling is a basic operation of modern convolutional neural networks (CNNs): down-sampling operators are employed to enlarge the receptive field, while up-sampling operators are adopted to increase resolution. Most existing deep segmentation networks employ regular grid sampling operators, which can be suboptimal for the semantic segmentation task due to large shape and scale variance. To address this problem, this paper proposes a Context Guided Dynamic Sampling (CGDS) module to obtain an effective representation with rich shape and scale information by adaptively sampling useful segmentation information in the spatial domain. Moreover, we utilize multi-scale contextual representations to guide the sampling process. Therefore, our CGDS can adaptively capture shape and scale information according to not only the input feature map but also the multi-scale semantic context. CGDS provides a plug-and-play module that can be easily incorporated into deep segmentation networks. We incorporate our proposed CGDS module into a Dynamic Sampling Network (DSNet) and perform extensive experiments on segmentation datasets. Experimental results show that CGDS significantly improves semantic segmentation performance and achieves state-of-the-art results on the PASCAL VOC 2012 and ADE20K datasets. Our model achieves 85.2% mIoU on the PASCAL VOC 2012 test set without MS COCO pre-training and 46.4% mIoU on the ADE20K validation set. The code will be made publicly available after publication.
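
A rough sketch of context-guided dynamic sampling (the offset head and scaling below are guesses, not the paper's architecture): a context map predicts per-pixel offsets, and features are re-sampled off the regular grid with grid_sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSampler(nn.Module):
    """Toy CGDS-style module: context predicts where to sample features."""
    def __init__(self, ctx_ch: int):
        super().__init__()
        self.offset = nn.Conv2d(ctx_ch, 2, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor, context: torch.Tensor):
        b, _, h, w = feats.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)  # regular grid
        delta = self.offset(context).permute(0, 2, 3, 1)         # learned shifts
        return F.grid_sample(feats, base + 0.1 * delta, align_corners=True)

feats, ctx = torch.randn(1, 64, 32, 32), torch.randn(1, 16, 32, 32)
print(DynamicSampler(16)(feats, ctx).shape)   # torch.Size([1, 64, 32, 32])
```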