Arrow Research search

Author name cluster

Ming Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers

18

AAAI Conference 2026 Conference Paper

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and a Comprehensive Multimodal Dataset Towards General Medical AI

  • Tianbin Li
  • Yanzhou Su
  • Wei Li
  • Bin Fu
  • Zhe Chen
  • Ziyan Huang
  • Guoan Wang
  • Chenglong Ma

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we construct GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a 7B-parameter general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.

JBHI Journal 2026 Journal Article

MedSegAgent: A Universal and Scalable Multi-Agent System for Instructive Medical Image Segmentation

  • Ziyan Huang
  • Haoyu Wang
  • Jin Ye
  • Yuanfeng Ji
  • Xiaowei Hu
  • Lihao Liu
  • Zhikai Yang
  • Wei Li

Medical image segmentation is vital for clinical diagnosis and treatment; however, current solutions face three major limitations: (1) the lack of a universal framework capable of handling diverse modalities and anatomical targets, (2) the limited scalability to adapt to evolving clinical needs and new datasets, and (3) the lack of instructive interfaces that make models usable for non-expert users. To address these challenges, this paper presents MedSegAgent, a universal and scalable multi-agent system for instructive medical image segmentation. Specifically, MedSegAgent comprises five agents: one query parsing agent that processes natural language requests, three coarse-to-fine filtering agents (modality filtering, anatomical filtering, and label selection) for identifying relevant datasets and label values, and one execution agent responsible for model inference and result integration. Based on this framework, MedSegAgent utilizes 23 diverse datasets and pre-trained models to perform 343 types of segmentation across various modalities and anatomical targets. Experimental results demonstrate that MedSegAgent simplifies model selection while maintaining high performance, accurately identifying matching datasets and labels in 94.27% of queries and locating at least one suitable match in 99.03% of queries. MedSegAgent offers a universal and scalable solution for diverse medical image segmentation tasks, bridging the gap between user-friendly queries and the complexities of model selection and deployment. Our code is publicly available at https://github.com/uni-medical/MedSegAgent.
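The coarse-to-fine filtering the abstract describes (query parsing, then modality, anatomy, and label filters) can be sketched in a few lines. Everything below (dataset names, fields, and the keyword-based parser) is invented for illustration; it is not MedSegAgent's actual code or data.

```python
# Toy registry standing in for MedSegAgent's 23 datasets (names invented).
DATASETS = [
    {"name": "ct_organs", "modality": "CT", "anatomy": "abdomen", "labels": ["liver", "kidney"]},
    {"name": "mri_brain", "modality": "MRI", "anatomy": "brain", "labels": ["tumor"]},
    {"name": "ct_chest", "modality": "CT", "anatomy": "chest", "labels": ["lung", "heart"]},
]

def parse_query(query: str) -> dict:
    """Toy query-parsing agent: keyword matching in place of an LLM."""
    q = query.lower()
    return {
        "modality": "CT" if "ct" in q.split() else ("MRI" if "mri" in q.split() else None),
        "anatomy": next((a for a in ("abdomen", "brain", "chest") if a in q), None),
        "label": next((l for d in DATASETS for l in d["labels"] if l in q), None),
    }

def select_dataset(query: str) -> list:
    """Coarse-to-fine filtering: modality -> anatomy -> label selection."""
    req = parse_query(query)
    candidates = DATASETS
    if req["modality"]:
        candidates = [d for d in candidates if d["modality"] == req["modality"]]
    if req["anatomy"]:
        candidates = [d for d in candidates if d["anatomy"] == req["anatomy"]]
    if req["label"]:
        candidates = [d for d in candidates if req["label"] in d["labels"]]
    return [d["name"] for d in candidates]
```

In the real system each filtering stage is an agent with its own model; here each stage is just a list comprehension, and the execution agent (inference and result integration) is omitted.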

AAAI Conference 2026 Conference Paper

S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything Without Supervision

  • Huihui Xu
  • Jin Ye
  • Hongqiu Wang
  • Changkai Ji
  • Jiashi Lin
  • Ming Hu
  • Ziyan Huang
  • Ying Chen

Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between each training epoch. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these problems, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks.

NeurIPS Conference 2025 Conference Paper

Decoding Causal Structure: End-to-End Mediation Pathways Inference

  • Yulong Li
  • Xiwei Liu
  • Feilong Tang
  • Ming Hu
  • Jionglong Su
  • Zongyuan Ge
  • Imran Razzak
  • Eran Segal

Causal mediation analysis is crucial for deconstructing complex mechanisms of action. However, in current mediation analysis, complex structures derived from causal discovery lack direct interpretation of mediation pathways, while traditional mediation analysis and effect estimation rely on pre-specified pathways, leading to a disconnection between structure discovery and causal mechanism understanding. A unified framework integrating structure discovery, pathway identification, and effect estimation is therefore needed to systematically quantify mediation pathways under structural uncertainty, enabling automated identification and inference of mediation pathways. To this end, we propose Structure-Informed Guided Mediation Analysis (SIGMA), which guides automated mediation pathway identification through probabilistic causal structure discovery and uncertainty quantification, enabling end-to-end propagation of structural uncertainty from structure learning to effect estimation. Specifically, SIGMA employs differentiable Flow-Structural Equation Models to learn structural posteriors, generating diverse Directed Acyclic Graphs (DAGs) to quantify structural uncertainty. Based on these DAGs, we introduce the Path Stability Score to evaluate the marginal probability of pathways, identifying high-confidence mediation paths. For identified mediation pathways, we integrate Efficient Influence Functions with Bayesian model averaging to fuse within-structure estimation uncertainty and between-structure effect variation, propagating uncertainty to the final effect estimates. In synthetic data experiments, SIGMA achieves state-of-the-art performance in pathway identification accuracy and effect quantification precision under structural uncertainty, concurrent multiple pathways, and nonlinear scenarios.
In real-world applications using Human Phenotype Project data, SIGMA identifies mediation effects of sleep quality on cardiovascular health through inflammatory and metabolic pathways, uncovering previously unspecified multiple mediation paths.
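The Path Stability Score idea (a pathway's marginal probability estimated over posterior DAG samples) reduces to a small computation. The sketch below is my own toy rendering of that idea, not SIGMA's implementation; the variables and samples are invented.

```python
def has_path(dag: set, path: list) -> bool:
    """dag: set of directed edges (u, v); path: node sequence u1 -> u2 -> ..."""
    return all((u, v) in dag for u, v in zip(path, path[1:]))

def path_stability_score(dag_samples: list, path: list) -> float:
    """Fraction of posterior DAG samples that contain every edge on the path."""
    return sum(has_path(d, path) for d in dag_samples) / len(dag_samples)

# Four toy posterior samples over T (treatment), M (mediator), Y (outcome).
samples = [
    {("T", "M"), ("M", "Y")},
    {("T", "M"), ("M", "Y"), ("T", "Y")},
    {("T", "Y")},
    {("T", "M"), ("M", "Y")},
]
score = path_stability_score(samples, ["T", "M", "Y"])  # 3 of 4 samples contain T->M->Y
```

A path would then be kept as a high-confidence mediation path when its score clears some threshold, with effect estimation averaged over the structures that contain it.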

IJCAI Conference 2025 Conference Paper

DONIS: Importance Sampling for Training Physics-Informed DeepONet

  • Shudong Huang
  • Rui Huang
  • Ming Hu
  • Wentao Feng
  • Jiancheng Lv

Deep Operator Network (DeepONet) effectively learns complex operator mappings, especially for systems governed by differential equations. Physics-informed DeepONet (PI-DeepONet) extends these capabilities by integrating physical constraints, enabling robust performance with limited or no labeled data. However, combining operator learning with these constraints increases computational complexity, which makes training more difficult and convergence slower, particularly for nonlinear or high-dimensional problems. In this work, we present an enhanced PI-DeepONet framework that applies importance sampling to both DeepONet inputs (i.e., the functions and the collocation points) to alleviate these training challenges. By focusing on critical data regions in both input domains, our approach achieves accelerated convergence and improved accuracy across various complex applications.
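A minimal sketch of the general residual-based importance-sampling idea for collocation points: sample training points in proportion to how badly the current model violates the physics there. The residual function and proportional weighting below are my assumptions for illustration, not DONIS's exact scheme (which also samples over the function inputs).

```python
import random

def toy_residual(x: float) -> float:
    """Stand-in for the |PDE residual| at collocation point x (larger = harder)."""
    return abs(x - 0.5) ** 3 + 1e-6  # toy: residual grows away from x = 0.5

def importance_sample(points, n, residual_fn, rng):
    """Draw n collocation points with probability proportional to the residual."""
    weights = [residual_fn(x) for x in points]
    return rng.choices(points, weights=weights, k=n)

rng = random.Random(0)
pool = [i / 100 for i in range(101)]          # uniform candidate grid on [0, 1]
batch = importance_sample(pool, 32, toy_residual, rng)
```

In a real training loop the residual would come from autodiff of the network against the governing equation and be refreshed periodically, so the sampler keeps chasing the regions where the physics loss is largest.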

AAAI Conference 2025 Conference Paper

MultiSFL: Towards Accurate Split Federated Learning via Multi-Model Aggregation and Knowledge Replay

  • Zeke Xia
  • Ming Hu
  • Dengke Yan
  • Ruixuan Liu
  • Anran Li
  • Xiaofei Xie
  • Mingsong Chen

Although Split Federated Learning (SFL) effectively enables knowledge sharing among resource-constrained clients, it suffers from low training performance due to the neglect of data heterogeneity and catastrophic forgetting problems. To address these issues, we propose a novel SFL approach named MultiSFL, which adopts i) an effective multi-model aggregation mechanism to alleviate gradient divergence caused by heterogeneous data and ii) a novel knowledge replay strategy to deal with the catastrophic forgetting problem. MultiSFL adopts two servers (i.e., the fed server and main server) to maintain multiple branch models for local training and an aggregated master model for knowledge sharing among branch models. To mitigate catastrophic forgetting, the main server of MultiSFL selects multiple assistant devices for knowledge replay according to the training data distribution of each full branch model. Experimental results obtained from various non-IID and IID scenarios demonstrate that MultiSFL significantly outperforms conventional SFL methods by up to a 23.25% test accuracy improvement.

AAAI Conference 2025 Conference Paper

Neighbor Does Matter: Density-Aware Contrastive Learning for Medical Semi-supervised Segmentation

  • Feilong Tang
  • Zhongxing Xu
  • Ming Hu
  • Wenxue Li
  • Peng Xia
  • Yiheng Zhong
  • Hanjun Wu
  • Jionglong Su

In medical image analysis, multi-organ semi-supervised segmentation faces challenges such as insufficient labels and low contrast in soft tissues. To address these issues, existing studies typically employ semi-supervised segmentation techniques using pseudo-labeling and consistency regularization. However, these methods mainly rely on individual data samples for training, ignoring the rich neighborhood information present in the feature space. In this work, we argue that supervisory information can be directly extracted from the geometry of the feature space. Inspired by the density-based clustering hypothesis, we propose using feature density to locate sparse regions within feature clusters. Our goal is to increase intra-class compactness by addressing sparsity issues. To achieve this, we propose a Density-Aware Contrastive Learning (DACL) strategy, pushing anchored features in sparse regions towards cluster centers approximated by high-density positive samples, resulting in more compact clusters. Specifically, our method constructs density-aware neighbor graphs using labeled and unlabeled data samples to estimate feature density and locate sparse regions. We also combine label-guided co-training with density-guided geometric regularization to form complementary supervision for unlabeled data. Experiments on the Multi-Organ Segmentation Challenge dataset demonstrate that our proposed method outperforms state-of-the-art methods, highlighting its efficacy in medical image segmentation tasks.

ICML Conference 2025 Conference Paper

One Arrow, Two Hawks: Sharpness-aware Minimization for Federated Learning via Global Model Trajectory

  • Yuhang Li
  • Tong Liu 0001
  • Yangguang Cui
  • Ming Hu
  • Xiaoqiang Li 0002

Federated learning (FL) presents a promising strategy for distributed and privacy-preserving learning, yet struggles with performance issues in the presence of heterogeneous data distributions. Recently, a series of works based on sharpness-aware minimization (SAM) have emerged to improve local learning generality, proving effective in mitigating the effects of data heterogeneity. However, most SAM-based methods do not directly consider the global objective and require two backward passes per iteration, resulting in diminished effectiveness. To overcome these two bottlenecks, we leverage the global model trajectory to directly measure sharpness for the global objective, requiring only a single backward pass. We further propose a novel and general algorithm, FedGMT, to overcome data heterogeneity and the pitfalls of previous SAM-based methods. We analyze the convergence of FedGMT and conduct extensive experiments on visual and text datasets in a variety of scenarios, demonstrating that FedGMT achieves competitive accuracy with state-of-the-art FL methods while minimizing computation and communication overhead. Code is available at https://github.com/harrylee999/FL-SAM.

NeurIPS Conference 2025 Conference Paper

Reliable Lifelong Multimodal Editing: Conflict-Aware Retrieval Meets Multi-Level Guidance

  • Qiang Zhang
  • Fanrui Zhang
  • Jiawei Liu
  • Ming Hu
  • Junjun He
  • Zheng-Jun Zha

The dynamic nature of real-world information demands efficient knowledge editing in multimodal large language models (MLLMs) to ensure continuous knowledge updates. However, existing methods often struggle with precise matching in large-scale knowledge retrieval and lack multi-level guidance for coordinated editing, leading to less reliable outcomes. To tackle these challenges, we propose CARML, a novel retrieval-augmented editing framework that integrates conflict-aware dynamic retrieval with multi-level implicit and explicit guidance for reliable lifelong multimodal editing. Specifically, CARML introduces intra-modal uncertainty and inter-modal conflict quantification to dynamically integrate multi-channel retrieval results, pinpointing the knowledge most relevant to the incoming edit samples. Afterwards, an edit scope classifier discerns whether the edit sample semantically aligns with the edit scope of the retrieved knowledge. If deemed in-scope, CARML refines the retrieved knowledge into information-rich continuous prompt prefixes, serving as the implicit knowledge guide. These prefixes not only include static knowledge prompts that capture key textual semantics but also incorporate token-level, context-aware dynamic prompts to explore fine-grained cross-modal associations between the edit sample and the retrieved knowledge. To further enhance reliability, CARML incorporates a "hard correction" mechanism, leveraging explicit label knowledge to adjust the model's output logits. Extensive experiments across multiple MLLMs and datasets demonstrate the superior performance of CARML in lifelong multimodal editing scenarios.

NeurIPS Conference 2025 Conference Paper

Rising from Ashes: Generalized Federated Learning via Dynamic Parameter Reset

  • Jiahao Wu
  • Ming Hu
  • Yanxin Yang
  • Xiaofei Xie
  • Zekai Chen
  • Chenyu Song
  • Mingsong Chen

Although Federated Learning (FL) is promising for privacy-preserving collaborative model training, it faces low inference performance due to heterogeneous data among clients. Because each client's data is heterogeneous, FL training easily learns client-specific overfitting features. Existing FL methods adopt a coarse-grained average aggregation strategy, which causes the global model to easily get stuck in local optima, resulting in low generalization of the global model. To address this issue, this paper presents a novel FL framework named FedPhoenix, which stochastically resets partial parameters to destroy some features of the global model in each round, guiding FL training to learn multiple generalized features for inference rather than specific overfitting features. Experimental results on various well-known datasets demonstrate that, compared to SOTA FL methods, FedPhoenix achieves up to a 20.73% accuracy improvement.
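The core mechanism, stochastically resetting a fraction of global-model parameters each round before dispatch, can be illustrated on a toy flat parameter vector. This sketch reflects my reading of the abstract, not FedPhoenix's released code; the reset fraction and the small-Gaussian initializer are invented.

```python
import random

def reset_partial(params: list, frac: float, rng: random.Random) -> list:
    """Return a copy of params with ~frac of its entries re-initialized."""
    out = list(params)
    k = int(len(out) * frac)
    for i in rng.sample(range(len(out)), k):  # k distinct positions to reset
        out[i] = rng.gauss(0.0, 0.01)         # fresh small random value
    return out

rng = random.Random(0)
global_model = [1.0] * 100                     # stand-in for flattened weights
mutated = reset_partial(global_model, frac=0.2, rng=rng)
changed = sum(a != b for a, b in zip(global_model, mutated))
```

Per the abstract's framing, destroying a random subset of learned features each round prevents the aggregate from settling on a few client-specific overfitting features; the reset model is what gets dispatched for the next round of local training.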

NeurIPS Conference 2025 Conference Paper

Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery

  • Ming Hu
  • Zhengdi Yu
  • Feilong Tang
  • Kaiwen Chen
  • Yulong Li
  • Imran Razzak
  • Junjun He
  • Tolga Birdal

Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, data-driven motion priors with cross-view geometric consistency and biomechanical constraints, and collision-aware interaction constraints for instruments. Building upon OphNet-3D, we establish two challenging benchmarks, bimanual hand pose estimation and hand-instrument interaction reconstruction, and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand, two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and a collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2mm in Mean Per Joint Position Error (MPJPE) and up to 23% in ADD-S metrics for hand and instrument reconstruction, respectively.

AAAI Conference 2025 Conference Paper

Towards Realistic Semi-supervised Medical Image Classification

  • Wenxue Li
  • Lie Ju
  • Feilong Tang
  • Peng Xia
  • Xinyu Xiong
  • Ming Hu
  • Lei Zhu
  • Zongyuan Ge

Existing semi-supervised learning (SSL) approaches follow the idealized closed-world assumption, neglecting the challenges present in realistic medical scenarios, such as open-set distribution and imbalanced class distribution. Although some methods in natural domains attempt to address the open-set problem, they are insufficient for medical domains, where intertwined challenges like class imbalance and small inter-class lesion discrepancies persist. Thus, this paper presents a novel self-recalibrated semantic training framework, tailored for SSL in medical imaging by ingeniously harvesting realistic unlabeled samples. Inspired by the observation that certain open-set samples share similar disease-related representations with in-distribution samples, we first propose an informative sample selection strategy that identifies high-value samples to serve as augmentations, thereby effectively enriching the semantics of known categories. Furthermore, we adopt a compact semantic clustering strategy to address the semantic confusion caused by the newly introduced open-set semantics. Moreover, to mitigate the interference of class imbalance in open-set SSL, we introduce a less biased dual-balanced classifier with similarity pseudo-label regularization and category-customized regularization. Extensive experiments on a variety of medical image datasets demonstrate the superior performance of our proposed method over state-of-the-art closed-set and open-set SSL methods.

NeurIPS Conference 2025 Conference Paper

UniViT: Unifying Image and Video Understanding in One Vision Encoder

  • Feilong Tang
  • xiangan xiangan
  • Haolin Yang
  • Yin Xie
  • Kaicheng Yang
  • Ming Hu
  • Zheng Cheng
  • Xingyu Zhou

Despite the impressive progress of recent pretraining methods on multimodal tasks, existing methods are inherently biased towards either spatial modeling (e.g., CLIP) or temporal modeling (e.g., V-JEPA), limiting their joint capture of spatial details and temporal dynamics. To this end, we propose UniViT, a cluster-driven unified self-supervised learning framework that effectively captures the structured semantics of both image spatial content and video temporal dynamics through event-level and object-level clustering and discrimination. Specifically, we leverage offline clustering to generate semantic clusters across both modalities. For videos, multi-granularity event-level clustering progressively expands from single-event to structured multi-event segments, capturing coarse-to-fine temporal semantics; for images, object-level clustering captures fine-grained spatial semantics. However, while global clustering provides semantically consistent clusters, it lacks modeling of structured semantic relations (e.g., temporal event structures). To address this, we introduce a contrastive objective that leverages these semantic clusters as pseudo-label supervision to explicitly enforce structural constraints, including temporal event relations and spatial object co-occurrences, capturing structured semantics beyond categories. Meanwhile, UniViT jointly embeds structured object-level and event-level semantics into a unified representation space. Furthermore, UniViT introduces two key components: (i) Unified Rotary Position Embedding integrates relative positional embedding with frequency-aware dimension allocation to support position-invariant semantic learning and enhance the stability of structured semantics in the discrimination stage; and (ii) Variable Spatiotemporal Streams adapt to inputs of varying frame lengths, addressing the rigidity of conventional fixed-input approaches.
Extensive experiments across varying model scales demonstrate that UniViT achieves state-of-the-art performance on linear probing, attentive probing, question answering, and spatial understanding tasks.

AAAI Conference 2024 Conference Paper

FedMut: Generalized Federated Learning via Stochastic Mutation

  • Ming Hu
  • Yue Cao
  • Anran Li
  • Zhiming Li
  • Chengwei Liu
  • Tianlin Li
  • Mingsong Chen
  • Yang Liu

Although Federated Learning (FL) enables collaborative model training without sharing clients' raw data, it encounters low performance in various heterogeneous scenarios. Because the same global model is dispatched to every client for local training, traditional Federated Averaging (FedAvg)-based FL models easily get stuck in a sharp solution, which results in a low-performance global model. To address this problem, this paper presents a novel FL approach named FedMut, which mutates the global model according to the gradient change to generate several intermediate models for the next round of training. Each intermediate model is dispatched to a client for local training. Eventually, the global model converges into a flat area within the range of the mutated models and generalizes better than a global model trained by FedAvg. Experimental results on well-known datasets demonstrate the effectiveness of our FedMut approach in various data heterogeneity scenarios.
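A rough sketch of the mutation step as the abstract describes it: perturb the global model along the round's update direction with random per-coordinate signs, spawning several intermediate models around the global model. The sign-based scheme and the alpha scale are my assumptions, not FedMut's published algorithm.

```python
import random

def mutate(global_w, delta, alpha, n_models, rng):
    """Spawn n_models variants: w_i = w + alpha * s * delta with random signs s."""
    models = []
    for _ in range(n_models):
        signs = [rng.choice((-1.0, 1.0)) for _ in global_w]
        models.append([w + alpha * s * d for w, s, d in zip(global_w, signs, delta)])
    return models

rng = random.Random(42)
w = [0.5, -0.2, 1.0]        # toy global model
delta = [0.1, 0.05, -0.02]  # this round's change of the global model
variants = mutate(w, delta, alpha=0.5, n_models=4, rng=rng)  # one per client
```

Because every variant sits a bounded distance from the global model along the recent update direction, aggregating the locally trained variants pulls the next global model toward a region that is flat across all of them.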

AAAI Conference 2024 Conference Paper

Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models

  • Yihao Huang
  • Felix Juefei-Xu
  • Qing Guo
  • Jie Zhang
  • Yutong Wu
  • Ming Hu
  • Tianlin Li
  • Geguang Pu

Although recent personalization methods have democratized high-resolution image synthesis by enabling swift concept acquisition with minimal examples and lightweight computation, they also present an exploitable avenue for highly accessible backdoor attacks. This paper investigates a critical and unexplored aspect of text-to-image (T2I) diffusion models: their potential vulnerability to backdoor attacks via personalization. By studying the prompt processing of popular personalization methods (epitomized by Textual Inversion and DreamBooth), we have devised dedicated personalization-based backdoor attacks according to the different ways of handling unseen tokens, dividing them into two families: nouveau-token and legacy-token backdoor attacks. In comparison to conventional backdoor attacks involving the fine-tuning of the entire text-to-image diffusion model, our proposed personalization-based backdoor attack method can facilitate more tailored, efficient, and few-shot attacks. Through a comprehensive empirical study, we endorse the utilization of the nouveau-token backdoor attack due to its impressive effectiveness, stealthiness, and integrity, markedly outperforming the legacy-token backdoor attack.

NeurIPS Conference 2024 Conference Paper

SampDetox: Black-box Backdoor Defense via Perturbation-based Sample Detoxification

  • Yanxin Yang
  • Chentao Jia
  • Dengke Yan
  • Ming Hu
  • Tianlin Li
  • Xiaofei Xie
  • Xian Wei
  • Mingsong Chen

The advancement of Machine Learning has enabled the widespread deployment of Machine Learning as a Service (MLaaS) applications. However, the untrustworthy nature of third-party ML services poses backdoor threats. Existing defenses in MLaaS are limited by their reliance on training samples or white-box model analysis, highlighting the need for a black-box backdoor purification method. In our paper, we attempt to use diffusion models for purification by introducing noise in a forward diffusion process to destroy backdoors and recovering clean samples through a reverse generative process. However, since higher noise also destroys the semantics of the original samples, this still results in low restoration performance. To investigate the effectiveness of noise in eliminating different types of backdoors, we conducted a preliminary study, which demonstrates that backdoors with low visibility can be easily destroyed by lightweight noise, while those with high visibility require intensive noise to destroy but can be easily detected. Based on this study, we propose SampDetox, which strategically combines lightweight and intensive noise. SampDetox applies weak noise to eliminate low-visibility backdoors and compares the structural similarity between the recovered and original samples to localize high-visibility backdoors. Intensive noise is then applied to these localized areas, destroying the high-visibility backdoors while preserving global semantic information. As a result, detoxified samples can be used for inference, even by poisoned models. Comprehensive experiments demonstrate the effectiveness of SampDetox in defending against various state-of-the-art backdoor attacks.

NeurIPS Conference 2023 Conference Paper

NurViD: A Large Expert-Level Video Database for Nursing Procedure Activity Understanding

  • Ming Hu
  • Lin Wang
  • Siyuan Yan
  • Don Ma
  • Qingli Ren
  • Peng Xia
  • Wei Feng
  • Peibo Duan

The application of deep learning to nursing procedure activity understanding has the potential to greatly enhance the quality and safety of nurse-patient interactions. Such techniques can facilitate training and education, improve quality control, and enable operational compliance monitoring. However, the development of automatic recognition systems in this field is currently hindered by the scarcity of appropriately labeled datasets. Existing video datasets pose several limitations: 1) they are too small in scale to support comprehensive investigations of nursing activity; 2) they primarily focus on single procedures, lacking expert-level annotations for various nursing procedures and action steps; and 3) they lack temporally localized annotations, which prevents the effective localization of targeted actions within longer video sequences. To mitigate these limitations, we propose NurViD, a large video dataset with expert-level annotation for nursing procedure activity understanding. NurViD consists of over 1.5k videos totaling 144 hours, making it approximately four times longer than the existing largest nursing activity datasets. Notably, it encompasses 51 distinct nursing procedures and 177 action steps, providing much more comprehensive coverage than existing datasets that primarily focus on limited procedures. To evaluate the efficacy of current deep learning methods on nursing activity understanding, we establish three benchmarks on NurViD: procedure recognition on untrimmed videos, procedure and action recognition on trimmed videos, and action detection. Our benchmark and code will be available at https://github.com/minghu0830/NurViD-benchmark.