
Author name cluster

Xiaowei Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
1 author row

Possible papers

20

AAAI Conference 2026 Conference Paper

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and a Comprehensive Multimodal Dataset Towards General Medical AI

  • Tianbin Li
  • Yanzhou Su
  • Wei Li
  • Bin Fu
  • Zhe Chen
  • Ziyan Huang
  • Guoan Wang
  • Chenglong Ma

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we formulate GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a 7B-parameter general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.
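
As a rough illustration of the dataset-construction idea above (and not the actual GMAI-VL-5.5M pipeline), a classification-style medical annotation can be turned into an image-text pair with a simple template; the record fields and caption wording below are invented for the example.

```python
# Illustrative sketch only: converting one annotated record into an
# image-text training pair. Field names and the caption template are
# assumptions, not the GMAI-VL-5.5M conversion rules.

def to_image_text_pair(record: dict) -> dict:
    caption = (
        f"A {record['modality']} image of the {record['body_part']} "
        f"showing {record['finding']}."
    )
    return {"image": record["image"], "text": caption}


if __name__ == "__main__":
    sample = {"image": "chest_xray_001.png", "modality": "X-ray",
              "body_part": "chest", "finding": "pneumonia"}
    print(to_image_text_pair(sample))
```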

AAAI Conference 2026 Conference Paper

IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation

  • Donghao Zhou
  • Jingyu Lin
  • Guibao Shen
  • Quande Liu
  • Jialin Gao
  • Lihao Liu
  • Lan Du
  • Cunjian Chen

Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.

JBHI Journal 2026 Journal Article

MedSegAgent: A Universal and Scalable Multi-Agent System for Instructive Medical Image Segmentation

  • Ziyan Huang
  • Haoyu Wang
  • Jin Ye
  • Yuanfeng Ji
  • Xiaowei Hu
  • Lihao Liu
  • Zhikai Yang
  • Wei Li

Medical image segmentation is vital for clinical diagnosis and treatment; however, current solutions face three major limitations: (1) the lack of a universal framework capable of handling diverse modalities and anatomical targets, (2) the limited scalability to adapt to evolving clinical needs and new datasets, and (3) the lack of instructive interfaces that make models usable for non-expert users. To address these challenges, this paper presents MedSegAgent, a universal and scalable multi-agent system for instructive medical image segmentation. Specifically, MedSegAgent comprises five agents: one query parsing agent that processes natural language requests, three coarse-to-fine filtering agents (modality filtering, anatomical filtering, and label selection) for identifying relevant datasets and label values, and one execution agent responsible for model inference and result integration. Based on this framework, MedSegAgent utilizes 23 diverse datasets and pre-trained models to perform 343 types of segmentation across various modalities and anatomical targets. Experimental results demonstrate that MedSegAgent simplifies model selection while maintaining high performance, accurately identifying matching datasets and labels in 94.27% of queries and locating at least one suitable match in 99.03% of queries. MedSegAgent offers a universal and scalable solution for diverse medical image segmentation tasks, bridging the gap between user-friendly queries and the complexities of model selection and deployment. Our code is publicly available at https://github.com/uni-medical/MedSegAgent.
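
A minimal sketch of the coarse-to-fine filtering stage described above (modality, then anatomy, then label selection); the registry entries and field names are invented for illustration and are not MedSegAgent's actual datasets or interfaces.

```python
# Minimal sketch of coarse-to-fine dataset/label filtering over a toy registry.
# Entries and field names are illustrative assumptions.

REGISTRY = [
    {"dataset": "LiverCT",   "modality": "CT",  "anatomy": "liver", "labels": ["liver", "tumor"]},
    {"dataset": "CardiacMR", "modality": "MRI", "anatomy": "heart", "labels": ["left ventricle", "myocardium"]},
    {"dataset": "BrainMR",   "modality": "MRI", "anatomy": "brain", "labels": ["tumor core", "edema"]},
]

def filter_by_modality(entries, modality):
    return [e for e in entries if e["modality"] == modality]

def filter_by_anatomy(entries, anatomy):
    return [e for e in entries if e["anatomy"] == anatomy]

def select_labels(entries, keyword):
    matches = []
    for e in entries:
        for label in e["labels"]:
            if keyword in label:
                matches.append((e["dataset"], label))
    return matches

# Example query: "segment the tumor in this brain MRI"
candidates = filter_by_modality(REGISTRY, "MRI")
candidates = filter_by_anatomy(candidates, "brain")
print(select_labels(candidates, "tumor"))   # [('BrainMR', 'tumor core')]
```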

AAAI Conference 2026 Conference Paper

SurgPub-Video: A Comprehensive Surgical Video Framework for Enhanced Surgical Intelligence in Vision-Language Model

  • Yaoqian Li
  • Xikai Yang
  • Dunyuan Xu
  • Yang Yu
  • Litao Zhao
  • Xiaowei Hu
  • Jinpeng Li
  • Pheng-Ann Heng

Vision-Language Models (VLMs) have shown significant potential in surgical scene analysis, yet existing models are limited by frame-level datasets and lack high-quality video data with procedural surgical knowledge. To address these challenges, we make the following contributions: (i) SurgPub-Video, a comprehensive dataset of over 3,000 surgical videos and 25 million annotated frames across 11 specialities, sourced from peer-reviewed clinical journals, (ii) SurgLLaVA-Video, a specialized VLM for surgical video understanding, built upon the TinyLLaVA-Video architecture that supports both video-level and frame-level inputs, and (iii) a video-level surgical Visual Question Answering (VQA) benchmark, covering 11 diverse surgical specialities, such as vascular, cardiology, and thoracic. Extensive experiments, conducted on the proposed benchmark and three additional surgical downstream tasks (action recognition, skill assessment, and triplet recognition), show that SurgLLaVA-Video significantly outperforms both general-purpose and surgical-specific VLMs with only three billion parameters.

IJCAI Conference 2025 Conference Paper

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

  • Donghao Zhou
  • Jiancheng Huang
  • Jinbin Bai
  • Jiaze Wang
  • Hao Chen
  • Guangyong Chen
  • Xiaowei Hu
  • Pheng-Ann Heng

Text-to-image diffusion models can generate high-quality images but lack fine-grained control of visual concepts, limiting their creativity. Thus, we introduce component-controllable personalization, a new task that enables users to customize and reconfigure individual components within concepts. This task faces two challenges: semantic pollution, where undesired elements disrupt the target concept, and semantic imbalance, which causes disproportionate learning of the target concept and component. To address these, we design MagicTailor, a framework that uses Dynamic Masked Degradation to adaptively perturb unwanted visual semantics and Dual-Stream Balancing for more balanced learning of desired visual semantics. The experimental results show that MagicTailor achieves superior performance in this task and enables more personalized and creative image generation.

NeurIPS Conference 2025 Conference Paper

Real-World Adverse Weather Image Restoration via Dual-Level Reinforcement Learning with High-Quality Cold Start

  • Fuyang Liu
  • Jiaqi Xu
  • Xiaowei Hu

Adverse weather severely impairs real-world visual perception, while existing vision models trained on synthetic data with fixed parameters struggle to generalize to complex degradations. To address this, we first construct HFLS-Weather, a physics-driven, high-fidelity dataset that simulates diverse weather phenomena, and then design a dual-level reinforcement learning framework initialized with HFLS-Weather for cold-start training. Within this framework, at the local level, weather-specific restoration models are refined through perturbation-driven image quality optimization, enabling reward-based learning without paired supervision; at the global level, a meta-controller dynamically orchestrates model selection and execution order according to scene degradation. This framework enables continuous adaptation to real-world conditions and achieves state-of-the-art performance across a wide range of adverse weather scenarios. Code is available at https://github.com/xxclfy/AgentRL-Real-Weather.
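
The global level can be pictured as a controller that decides which weather-specific restorers to run and in what order. The sketch below is a plain heuristic stand-in for that orchestration (the paper uses a learned meta-controller); the degradation scores and restorer callables are placeholders.

```python
# Illustrative orchestration sketch, not the paper's RL controller: pick and
# order weather-specific restorers from per-degradation scores.

def restore(image, degradation_scores, restorers, threshold=0.5):
    """Apply restorers for detected degradations, worst degradation first."""
    plan = sorted(
        (name for name, s in degradation_scores.items() if s >= threshold),
        key=lambda name: degradation_scores[name],
        reverse=True,
    )
    for name in plan:
        image = restorers[name](image)
    return image

# Example with trivial stand-in restorers operating on a string "image".
restorers = {
    "rain": lambda x: x + " -> derained",
    "fog":  lambda x: x + " -> defogged",
    "snow": lambda x: x + " -> desnowed",
}
scores = {"rain": 0.9, "fog": 0.6, "snow": 0.1}
print(restore("frame_001", scores, restorers))  # rain handled before fog; snow skipped
```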

NeurIPS Conference 2025 Conference Paper

SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency

  • Quanjian Song
  • Donghao Zhou
  • Jingyu Lin
  • Fei Shen
  • Jiaze Wang
  • Xiaowei Hu
  • Cunjian Chen
  • Pheng-Ann Heng

Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial role of scenes in storytelling, which restricts their creativity in practice. This paper introduces scene-oriented story generation, addressing two key challenges: (i) scene planning, where current methods fail to ensure scene-level narrative coherence by relying solely on text descriptions, and (ii) scene consistency, which remains largely unexplored in terms of maintaining scene consistency across multiple stories. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.

AAAI Conference 2024 Conference Paper

Semi-supervised TEE Segmentation via Interacting with SAM Equipped with Noise-Resilient Prompting

  • Sen Deng
  • Yidan Feng
  • Haoneng Lin
  • Yiting Fan
  • Alex Pui-Wai Lee
  • Xiaowei Hu
  • Jing Qin

Semi-supervised learning (SSL) is a powerful tool to address the challenge of insufficient annotated data in medical segmentation problems. However, existing semi-supervised methods mainly rely on internal knowledge for pseudo labeling, which is biased due to the distribution mismatch between the highly imbalanced labeled and unlabeled data. Segmenting the left atrial appendage (LAA) from transesophageal echocardiogram (TEE) images is a typical medical image segmentation task characterized by a scarcity of professional annotations and diverse data distributions, for which existing SSL models cannot achieve satisfactory performance. In this paper, we propose a novel strategy to mitigate the inherent challenge of distribution mismatch in SSL by, for the first time, incorporating a large foundation model (i.e., SAM in our implementation) into an SSL model to improve the quality of pseudo labels. We further propose a new self-reconstruction mechanism that generates both noise-resilient prompts, which substantially improve SAM’s generalization capability over TEE images, and self-perturbations, which stabilize the training process and reduce the impact of noisy labels. We conduct extensive experiments on an in-house TEE dataset; experimental results demonstrate that our method achieves better performance than state-of-the-art SSL models.
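
One simplified way to picture how a foundation model could improve pseudo-label quality is to keep only pixels where the SSL model and a SAM-style mask agree; this is an illustrative reduction of the idea, not the paper's noise-resilient prompting mechanism, and the threshold is an assumption.

```python
import numpy as np

# Simplified sketch: keep pseudo-label pixels only where the SSL model's
# prediction and a foundation-model (e.g. SAM) mask agree, so disagreement
# is excluded from the unsupervised loss. Not the paper's full method.

def agreement_pseudo_label(ssl_probs: np.ndarray, sam_mask: np.ndarray,
                           conf_thresh: float = 0.8):
    """ssl_probs: (H, W) foreground probability; sam_mask: (H, W) binary mask.

    Returns (pseudo_label, valid) where `valid` marks pixels used for training.
    """
    ssl_mask = (ssl_probs >= 0.5).astype(np.uint8)
    confident = (ssl_probs >= conf_thresh) | (ssl_probs <= 1 - conf_thresh)
    valid = confident & (ssl_mask == sam_mask)
    return ssl_mask, valid

probs = np.random.rand(4, 4)
sam = (np.random.rand(4, 4) > 0.5).astype(np.uint8)
label, valid = agreement_pseudo_label(probs, sam)
print(valid.mean())  # fraction of pixels kept for the unsupervised loss
```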

NeurIPS Conference 2023 Conference Paper

IDRNet: Intervention-Driven Relation Network for Semantic Segmentation

  • Zhenchao Jin
  • Xiaowei Hu
  • Lingting Zhu
  • Luchuan Song
  • Li Yuan
  • Lequan Yu

Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks, which inspires the development of numerous context modeling paradigms, e.g., multi-scale-driven and similarity-driven context schemes. Despite the impressive results, these existing paradigms often suffer from inadequate or ineffective contextual information aggregation due to reliance on large amounts of predetermined priors. To alleviate the issues, we propose a novel Intervention-Driven Relation Network (IDRNet), which leverages a deletion diagnostics procedure to guide the modeling of contextual relations among different pixels. Specifically, we first group pixel-level representations into semantic-level representations with the guidance of pseudo labels and further improve the distinguishability of the grouped representations with a feature enhancement module. Next, a deletion diagnostics procedure is conducted to model relations of these semantic-level representations via perceiving the network outputs and the extracted relations are utilized to guide the semantic-level representations to interact with each other. Finally, the interacted representations are utilized to augment original pixel-level representations for final predictions. Extensive experiments are conducted to validate the effectiveness of IDRNet quantitatively and qualitatively. Notably, our intervention-driven context scheme brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmark datasets, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes.
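
The grouping step (pixel-level to semantic-level representations via pseudo labels) can be sketched as a per-class feature average; the code below shows only that first stage, not the deletion-diagnostics interaction, and the tensor shapes are illustrative.

```python
import torch

# Sketch of the grouping step: average pixel features that share a pseudo label
# into one semantic-level representation per class.

def group_by_pseudo_label(feats: torch.Tensor, pseudo: torch.Tensor, num_classes: int):
    """feats: (C, H, W) pixel features; pseudo: (H, W) int64 pseudo labels.

    Returns (num_classes, C) semantic-level representations (zeros for absent classes).
    """
    c, h, w = feats.shape
    flat = feats.reshape(c, h * w).t()          # (HW, C)
    labels = pseudo.reshape(-1)                 # (HW,)
    reps = torch.zeros(num_classes, c)
    for k in labels.unique():
        reps[k] = flat[labels == k].mean(dim=0)
    return reps

feats = torch.randn(16, 8, 8)
pseudo = torch.randint(0, 4, (8, 8))
print(group_by_pseudo_label(feats, pseudo, num_classes=4).shape)  # torch.Size([4, 16])
```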

AAAI Conference 2022 Conference Paper

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

  • Zhengyuan Yang
  • Zhe Gan
  • Jianfeng Wang
  • Xiaowei Hu
  • Yumao Lu
  • Zicheng Liu
  • Lijuan Wang

Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3’s power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.
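
The prompting recipe is concrete enough to sketch: each in-context example contributes a caption, question, and answer, followed by the test caption and question. The template wording below is an approximation rather than the paper's exact prompt.

```python
# Sketch of PICa-style prompt construction. The header and field labels are
# an approximation of the format, not the paper's verbatim prompt.

def build_prompt(in_context, test_caption, test_question):
    """in_context: list of (caption, question, answer) triples."""
    header = "Please answer the question according to the context.\n\n"
    blocks = [
        f"Context: {cap}\nQuestion: {q}\nAnswer: {a}\n"
        for cap, q, a in in_context
    ]
    query = f"Context: {test_caption}\nQuestion: {test_question}\nAnswer:"
    return header + "\n".join(blocks) + "\n" + query

examples = [
    ("A man riding a surfboard on a wave.", "What sport is this?", "surfing"),
    ("A bowl of ramen with chopsticks.", "Which country is this dish from?", "japan"),
]
prompt = build_prompt(examples, "A red double-decker bus on a street.",
                      "Which city is this likely in?")
print(prompt)  # this text would then be sent to GPT-3 for completion
```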

AAAI Conference 2022 Conference Paper

Enhancing Pseudo Label Quality for Semi-supervised Domain-Generalized Medical Image Segmentation

  • Huifeng Yao
  • Xiaowei Hu
  • Xiaomeng Li

Generalizing the medical image segmentation algorithms to unseen domains is an important research topic for computer-aided diagnosis and surgery. Most existing methods require a fully labeled dataset in each source domain. Although some researchers developed a semi-supervised domain generalized method, it still requires the domain labels. This paper presents a novel confidence-aware cross pseudo supervision algorithm for semi-supervised domain generalized medical image segmentation. The main goal is to enhance the pseudo label quality for unlabeled images from unknown distributions. To achieve it, we perform the Fourier transformation to learn low-level statistic information across domains and augment the images to incorporate cross-domain information. With these augmentations as perturbations, we feed the input to a confidence-aware cross pseudo supervision network to measure the variance of pseudo labels and regularize the network to learn with more confident pseudo labels. Our method sets new records on public datasets, i.e., M&Ms and SCGM. Notably, without using domain labels, our method surpasses the prior art that even uses domain labels by 11.67% on Dice on the M&Ms dataset with 2% labeled data. Code is available at https://github.com/XMed-Lab/EPL_SemiDG.
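
The Fourier-based augmentation can be sketched as swapping low-frequency amplitude between images from different domains while keeping the phase, so low-level style changes but anatomy is preserved; the window size `beta` below is an illustrative parameter, not the paper's setting.

```python
import numpy as np

# Sketch of a Fourier-based cross-domain augmentation in the spirit described
# above: swap the low-frequency amplitude of one image with another domain's
# image while keeping the phase.

def fourier_mix(src: np.ndarray, ref: np.ndarray, beta: float = 0.1) -> np.ndarray:
    """src, ref: (H, W) grayscale images; beta controls the low-frequency window."""
    fft_src = np.fft.fftshift(np.fft.fft2(src))
    fft_ref = np.fft.fftshift(np.fft.fft2(ref))
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_ref = np.abs(fft_ref)

    h, w = src.shape
    bh, bw = int(h * beta), int(w * beta)
    cy, cx = h // 2, w // 2
    amp_src[cy - bh:cy + bh, cx - bw:cx + bw] = amp_ref[cy - bh:cy + bh, cx - bw:cx + bw]

    mixed = np.fft.ifft2(np.fft.ifftshift(amp_src * np.exp(1j * pha_src)))
    return np.real(mixed)

src = np.random.rand(64, 64)
ref = np.random.rand(64, 64)
print(fourier_mix(src, ref).shape)  # (64, 64)
```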

TMLR Journal 2022 Journal Article

GIT: A Generative Image-to-text Transformer for Vision and Language

  • Jianfeng Wang
  • Zhengyuan Yang
  • Xiaowei Hu
  • Linjie Li
  • Kevin Lin
  • Zhe Gan
  • Zicheng Liu
  • Ce Liu

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state-of-the-art results on numerous challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
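
A toy sketch of the "one image encoder + one text decoder with a language-modeling loss" idea follows. Note that GIT itself feeds image and text tokens into a single transformer; using a standard TransformerDecoder with image tokens as memory is a simplification for illustration, and all sizes below are arbitrary.

```python
import torch
import torch.nn as nn

# Simplified captioner sketch: image features act as memory for a causal text
# decoder trained with next-token prediction. An approximation, not GIT's
# single-transformer architecture.

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, img_feat_dim=768):
        super().__init__()
        self.img_proj = nn.Linear(img_feat_dim, d_model)   # project encoder features
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, text_ids):
        # img_feats: (B, num_image_tokens, img_feat_dim), text_ids: (B, T)
        memory = self.img_proj(img_feats)
        tgt = self.embed(text_ids)
        t = text_ids.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                            # next-token logits

model = TinyCaptioner()
logits = model(torch.randn(2, 49, 768), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```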

NeurIPS Conference 2022 Conference Paper

GLIPv2: Unifying Localization and Vision-Language Understanding

  • Haotian Zhang
  • Pengchuan Zhang
  • Xiaowei Hu
  • Yen-Chun Chen
  • Liunian Li
  • Xiyang Dai
  • Lijuan Wang
  • Lu Yuan

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks.
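
A generic region-word contrastive term can be sketched as a CLIP-style symmetric cross-entropy over a similarity matrix of matched region and word embeddings; GLIPv2's actual loss is defined across images in a batch, so this per-pair version is only illustrative.

```python
import torch
import torch.nn.functional as F

# Generic sketch of a region-word contrastive objective over N matched pairs.
# Not GLIPv2's exact inter-image formulation.

def region_word_contrastive(region_emb, word_emb, temperature=0.07):
    """region_emb, word_emb: (N, D) embeddings of N matched region-word pairs."""
    region_emb = F.normalize(region_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    logits = region_emb @ word_emb.t() / temperature       # (N, N) similarity scores
    targets = torch.arange(logits.size(0))
    # symmetric cross-entropy: region-to-word and word-to-region
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = region_word_contrastive(torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())
```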

NeurIPS Conference 2022 Conference Paper

K-LITE: Learning Transferable Visual Models with External Knowledge

  • Sheng Shen
  • Chunyuan Li
  • Xiaowei Hu
  • Yujia Xie
  • Jianwei Yang
  • Pengchuan Zhang
  • Zhe Gan
  • Lijuan Wang

The new generation of state-of-the-art computer vision systems is trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, based on the broad concept coverage achieved through a large-scale data collection process. Alternatively, we argue that learning with external knowledge about images is a promising way that leverages a much more structured source of supervision and offers sample efficiency. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge for building transferable visual systems: in training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts; in evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is released at https://github.com/microsoft/klite.
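
The knowledge-enrichment step can be sketched with WordNet glosses from NLTK: append a class name's definition to the prompt before text encoding. The prompt template below is illustrative, not the paper's format.

```python
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

# Sketch of knowledge-augmented prompts in the spirit of K-LITE: append a
# WordNet gloss to the class name before encoding the text. Template wording
# is an assumption for illustration.

def knowledge_prompt(class_name: str) -> str:
    synsets = wordnet.synsets(class_name.replace(" ", "_"))
    gloss = synsets[0].definition() if synsets else ""
    prompt = f"a photo of a {class_name}"
    return f"{prompt}, which is {gloss}" if gloss else prompt

print(knowledge_prompt("goldfish"))
# e.g. "a photo of a goldfish, which is small golden or orange-red freshwater fishes ..."
```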

NeurIPS Conference 2022 Conference Paper

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

  • Jian Liang
  • Chenfei Wu
  • Xiaowei Hu
  • Zhe Gan
  • Jianfeng Wang
  • Lijuan Wang
  • Zicheng Liu
  • Yuejian Fang

Infinite visual synthesis aims to generate high-resolution images, long-duration videos, and even visual generation of infinite size. Some recent work tried to solve this task by first dividing data into processable patches and then training the models on them without considering the dependencies between patches. However, since they fail to model global dependencies between patches, the quality and consistency of the generation can be limited. To address this issue, we propose NUWA-Infinity, a patch-level "render-and-optimize" strategy for infinite visual synthesis. Given a large image or a long video, NUWA-Infinity first splits it into non-overlapping patches and uses the ordered patch chain as a complete training instance; a rendering model autoregressively predicts each patch based on its contexts. Once a patch is predicted, it is optimized immediately and its hidden states are saved as contexts for the next "render-and-optimize" process. This brings two advantages: (i) the autoregressive rendering process with information transfer between contexts provides an implicit global probabilistic distribution modeling; (ii) the timely optimization process alleviates the optimization stress of the model and helps convergence. Based on the above designs, NUWA-Infinity shows a strong synthesis ability on high-resolution images and long-duration videos. The homepage link is https://nuwa-infinity.microsoft.com.

NeurIPS Conference 2022 Conference Paper

Sparse2Dense: Learning to Densify 3D Features for 3D Object Detection

  • Tianyu Wang
  • Xiaowei Hu
  • Zhengzhe Liu
  • Chi-Wing Fu

LiDAR-produced point clouds are the major source for most state-of-the-art 3D object detectors. Yet, small, distant, and incomplete objects with sparse or few points are often hard to detect. We present Sparse2Dense, a new framework to efficiently boost 3D detection performance by learning to densify point clouds in latent space. Specifically, we first train a dense point 3D detector (DDet) with a dense point cloud as input and design a sparse point 3D detector (SDet) with a regular point cloud as input. Importantly, we formulate the lightweight plug-in S2D module and the point cloud reconstruction module in SDet to densify 3D features and train SDet to produce 3D features, following the dense 3D features in DDet. So, in inference, SDet can simulate dense 3D features from regular (sparse) point cloud inputs without requiring dense inputs. We evaluate our method on the large-scale Waymo Open Dataset and the Waymo Domain Adaptation Dataset, showing its high performance and efficiency over the state of the art.
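
The core training signal can be sketched as a feature-imitation loss that pushes the sparse-input detector's features toward the dense-input detector's features; the tensors below stand in for the two detectors' feature maps, and the optional foreground mask is an assumption about where the loss is applied.

```python
import torch
import torch.nn.functional as F

# Sketch of a feature-densification (imitation) loss: SDet's features after the
# S2D module are trained to match DDet's dense features. Detector internals are
# omitted; the foreground-mask restriction is an illustrative assumption.

def densification_loss(s2d_feats, dense_feats, fg_mask=None):
    """s2d_feats, dense_feats: (B, C, H, W); fg_mask: optional (B, H, W) in {0, 1}."""
    diff = F.mse_loss(s2d_feats, dense_feats.detach(), reduction="none")
    if fg_mask is not None:
        diff = diff * fg_mask.unsqueeze(1)            # broadcast mask over channels
        return diff.sum() / fg_mask.sum().clamp(min=1)
    return diff.mean()

s2d = torch.randn(2, 64, 32, 32)
dense = torch.randn(2, 64, 32, 32)
mask = (torch.rand(2, 32, 32) > 0.7).float()
print(densification_loss(s2d, dense, mask).item())
```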

AAAI Conference 2021 Conference Paper

Learning Semantic Context from Normal Samples for Unsupervised Anomaly Detection

  • Xudong Yan
  • Huaidong Zhang
  • Xuemiao Xu
  • Xiaowei Hu
  • Pheng-Ann Heng

Unsupervised anomaly detection aims to identify data samples that have low probability density from a set of input samples, and only the normal samples are provided for model training. The inference of abnormal regions on the input image requires an understanding of the surrounding semantic context. This work presents a Semantic Context based Anomaly Detection Network, SCADN, for unsupervised anomaly detection by learning the semantic context from the normal samples. To achieve this, we first generate multi-scale striped masks to remove a part of regions from the normal samples, and then train a generative adversarial network to reconstruct the unseen regions. Note that the masks are designed in multiple scales and stripe directions, and various training examples are generated to obtain the rich semantic context. In testing, we obtain an error map by computing the difference between the reconstructed image and the input image for all samples, and infer the abnormal samples based on the error maps. Finally, we perform various experiments on three public benchmark datasets and a new dataset LaceAD collected by us, and show that our method clearly outperforms the current state-of-the-art methods.
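
Two of the mechanics above are easy to sketch: generating striped masks that hide parts of a normal image, and computing the per-pixel reconstruction error map used at test time. The reconstruction network itself is a placeholder here.

```python
import numpy as np

# Sketch of (1) striped masking of a normal image and (2) the reconstruction
# error map used to localize anomalies. The GAN generator is a placeholder.

def striped_mask(h, w, stripe_width, horizontal=True, phase=0):
    """Binary mask with alternating stripes (1 = keep, 0 = masked out)."""
    idx = np.arange(h if horizontal else w)
    keep = ((idx // stripe_width) + phase) % 2 == 0
    stripes = keep[:, None] if horizontal else keep[None, :]
    return np.broadcast_to(stripes, (h, w)).astype(np.float32)

def error_map(image, reconstruction):
    """Per-pixel reconstruction error; large values indicate anomalous regions."""
    return np.abs(image - reconstruction)

img = np.random.rand(64, 64)
mask = striped_mask(64, 64, stripe_width=8, horizontal=False)
masked_input = img * mask                      # stripes removed, to be reconstructed
recon = masked_input                           # placeholder for the generator output
print(error_map(img, recon).mean())
```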

AAAI Conference 2021 Conference Paper

VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

  • Xiaowei Hu
  • Xi Yin
  • Kevin Lin
  • Lei Zhang
  • Jianfeng Gao
  • Lijuan Wang
  • Zicheng Liu

It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations. By breaking the dependency of paired image-caption training data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training. We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.
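
The order-insensitive tag loss can be sketched with SciPy's Hungarian solver: predictions at masked positions are matched one-to-one to the unordered ground-truth tags before the cross-entropy is computed. This is an illustrative reading of the loss, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

# Simplified Hungarian-matching tag loss: match M masked-position predictions to
# M unordered ground-truth tags so the loss does not depend on tag order.

def hungarian_tag_loss(logits: torch.Tensor, gt_tags: torch.Tensor) -> torch.Tensor:
    """logits: (M, V) predictions at M masked positions; gt_tags: (M,) tag ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    # cost[i, j] = negative log-likelihood of ground-truth tag j under prediction i
    cost = -log_probs[:, gt_tags]                         # (M, M)
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    matched = gt_tags[torch.as_tensor(col)]
    return F.cross_entropy(logits[torch.as_tensor(row)], matched)

logits = torch.randn(3, 100, requires_grad=True)
gt = torch.tensor([5, 42, 17])
print(hungarian_tag_loss(logits, gt).item())
```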

AAAI Conference 2018 Conference Paper

Recurrently Aggregating Deep Features for Salient Object Detection

  • Xiaowei Hu
  • Lei Zhu
  • Jing Qin
  • Chi-Wing Fu
  • Pheng-Ann Heng

Salient object detection is a fundamental yet challenging problem in computer vision, aiming to highlight the most visually distinctive objects or regions in an image. Recent works benefit from the development of fully convolutional neural networks (FCNs) and achieve great success by integrating features from multiple layers of FCNs. However, the integrated features tend to include non-salient regions (due to low-level features of the FCN) or lose details of salient objects (due to high-level features of the FCN) when producing the saliency maps. In this paper, we develop a novel deep saliency network equipped with recurrently aggregated deep features (RADF) to more accurately detect salient objects from an image by fully exploiting the complementary saliency information captured in different layers. The RADF utilizes the multi-level features integrated from different layers of a FCN to recurrently refine the features at each layer, suppressing the non-salient noise at the low levels of the FCN and adding more salient details to the features at the high layers. We perform experiments to evaluate the effectiveness of the proposed network on 5 famous saliency detection benchmarks and compare it with 15 state-of-the-art methods. Our method ranks first on 4 of the 5 datasets and second on the remaining dataset.

IJCAI Conference 2018 Conference Paper

R³Net: Recurrent Residual Refinement Network for Saliency Detection

  • Zijun Deng
  • Xiaowei Hu
  • Lei Zhu
  • Xuemiao Xu
  • Jing Qin
  • Guoqiang Han
  • Pheng-Ann Heng

Saliency detection is a fundamental yet challenging task in computer vision, aiming at highlighting the most visually distinctive objects in an image. We propose a novel recurrent residual refinement network (R³Net) equipped with residual refinement blocks (RRBs) to more accurately detect salient regions of an input image. Our RRBs learn the residual between the intermediate saliency prediction and the ground truth by alternately leveraging the low-level integrated features and the high-level integrated features of a fully convolutional network (FCN). While the low-level integrated features are capable of capturing more saliency details, the high-level integrated features can reduce non-salient regions in the intermediate prediction. Furthermore, the RRBs can obtain complementary saliency information of the intermediate prediction, and add the residual into the intermediate prediction to refine the saliency maps. We evaluate the proposed R³Net on five widely used saliency detection benchmarks by comparing it with 16 state-of-the-art saliency detectors. Experimental results show that our network outperforms its competitors on all the benchmark datasets.