Author name cluster

Weidi Xie

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers

2 author rows

AAAI Conference 2026 Conference Paper

Versatile Vision-Language Model for 3D Computed Tomography

Jiayu Lei
Ziqing Fan
Yanyong Zhang
Weidi Xie
Ya Zhang
Yanfeng Wang

Representation learning serves as a foundational component of medical vision-language models (MVLMs), enabling cross-modal alignment, semantic consistency, and enhanced generalization capabilities for downstream tasks. As generalist models rapidly evolve, there is a pressing need to unify diverse downstream tasks, such as diagnosis, segmentation, report generation, and multiple choice within a cohesive framework, demanding more efficient and versatile visual representation learning. However, current MVLMs predominately follow CLIP-style vision pretraining, failing to leverage heterogeneous data resources with multi-dimensional imaging and diverse annotation forms. And there lacks systematic analysis of efficient vision encoder design across varied downstream applications, including diagnosis, segmentation, and text generation tasks, particularly for volumetric imaging like Computed Tomography (CT). Besides, current MVLMs exhibit constrained voxel-level capabilities, lacking effective multi-task instruction tuning framework capable of achieving robust performance across various downstream tasks. To address these challenges, we propose CTInstruct, a novel MVLM employing a hybrid ResNet-ViT encoder with multi-granular vision-language pretraining for efficient heterogeneous data modeling, and unified instruction tuning that jointly optimizes discriminative, generative, and voxel-level reasoning for volumetric medical imaging. CTInstruct achieves SOTA performance across 8 CT benchmarks, setting a new standard for data-efficient multimodal learning in medical imaging.

PDF Details DOI

ICLR Conference 2025 Conference Paper

A Sanity Check for AI-generated Image Detection

Shilin Yan
Ouxiang Li
Jiayin Cai
Yanbin Hao
Xiaolong Jiang
Yao Hu 0002
Weidi Xie

With the rapid development of generative models, discerning AI-generated content has evoked increasing attention from both industry and academia. In this paper, we conduct a sanity check on whether the task of AI-generated image detection has been solved. To start with, we present Chameleon dataset, consisting of AI-generated images that are genuinely challenging for human perception. To quantify the generalization of existing methods, we evaluate 9 off-the-shelf AI-generated image detectors on Chameleon dataset. Upon analysis, almost all models misclassify AI-generated images as real ones. Later, we propose AIDE AI-generated Image DEtector with Hybrid Features, which leverages multiple experts to simultaneously extract visual artifacts and noise patterns. Specifically, to capture the high-level semantics, we utilize CLIP to compute the visual embedding. This effectively enables the model to discern AI-generated images based on semantics and contextual information. Secondly, we select the highest and lowest frequency patches in the image, and compute the low-level patchwise features, aiming to detect AI-generated images by low-level artifacts, for example, noise patterns, anti-aliasing effects. While evaluating on existing benchmarks, for example, AIGCDetectBenchmark and GenImage, AIDE achieves +3.5% and +4.6% improvements to state-of-the-art methods, and on our proposed challenging Chameleon benchmarks, it also achieves promising results, despite the problem of detecting AI-generated images remains far from being solved.

Details

ICLR Conference 2025 Conference Paper

EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos

Jilan Xu
Yifei Huang 0002
Baoqi Pei
Junlin Hou
Qingqiu Li
Guo Chen 0006
Yuejie Zhang
Rui Feng 0001

Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop a fully automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the public Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.

Details

AAAI Conference 2025 Conference Paper

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Qirui Chen
Shangzhe Di
Weidi Xie

This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task not only requires to answer visual questions, but also to localize multiple relevant time intervals within the video as visual evidences. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling to construct a large-scale dataset for instruction-tuning. To monitor the progress of this new task, we further curate a high-quality benchmark, MULTIHOP-EGOQA, with careful manual verification and refinement. Experimental results reveal that existing multimodal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed as Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction-tuning data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a baseline for this new task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.

PDF Details DOI

JBHI Journal 2025 Journal Article

Interpretable Brain MRI Report Generation Anchored by Lesion Topography

Jiayu Lei
Xiaoman Zhang
Chaoyi Wu
Lisong Dai
Ya Zhang
Yanyong Zhang
Yanfeng Wang
Weidi Xie

Radiologists face increasing workloads that make accurate and timely report generation both critical and challenging. This paper presents a novel system for grounded automatic brain MRI report generation, with contributions in three key areas: First, we release RadGenome-Brain MRI, a benchmark dataset featuring multi-modal scans, expert-annotated abnormality masks, and radiology reports with region-level grounding to support fine-grained, explainable report generation. Second, we propose AutoRG-Brain, the first brain MRI report generation framework that combines automatic anomaly segmentation with a visual prompting-based language model to produce structured, anatomically grounded findings. Third, we conduct extensive quantitative and expert evaluations across segmentation and reporting tasks, and demonstrate in real clinical settings that our system significantly enhances junior radiologists' ability to detect subtle abnormalities and compose high-quality reports, narrowing the gap with senior doctors. All code, models, and datasets will be publicly released to facilitate future research and development.

Details DOI

ICLR Conference 2025 Conference Paper

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

Baoqi Pei
Yifei Huang 0002
Jilan Xu
Guo Chen 0006
Yuping He
Lijin Yang
Yali Wang 0001
Weidi Xie

In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline employing a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks.

Details

ICLR Conference 2025 Conference Paper

Track-On: Transformer-based Online Point Tracking with Memory

Görkay Aydemir
Xiongyi Cai
Weidi Xie
Fatma Güney

In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules —spatial memory and context memory— to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive experiments, we demonstrate that Track-On sets a new state-of-the-art for online models and delivers superior or competitive results compared to offline approaches on seven datasets, including the TAP-Vid benchmark. Our method offers a robust and scalable solution for real-time tracking in diverse applications. Project page: https://kuis-ai.github.io/track_on

Details

NeurIPS Conference 2025 Conference Paper

Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

Zeqian Li
Shangzhe Di
Zhonghua Zhai
Weilin Huang
Yanfeng Wang
Weidi Xie

This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e. g. , questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.

PDF Details

NeurIPS Conference 2024 Conference Paper

A General Protocol to Probe Large Vision Models for 3D Physical Understanding

Guanqi Zhan
Chuanxia Zheng
Weidi Xie
Andrew Zisserman

Our objective in this paper is to probe large vision models to determine to what extent they ‘understand’ different physical properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a general and lightweight protocol to evaluate whether features of an off-the-shelf large vision model encode a number of physical ‘properties’ of the 3D scene, by training discriminative classifiers on the features for these properties. The probes are applied on datasets of real images with annotations for the property. (ii) We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view-dependent measures, and large vision models including CLIP, DINOv1, DINOv2, VQGAN, Stable Diffusion. (iii) We find that features from Stable Diffusion and DINOv2 are good for discriminative learning of a number of properties, including scene geometry, support relations, shadows and depth, but less performant for occlusion and material, while outperforming DINOv1, CLIP and VQGAN for all properties. (iv) It is observed that different time steps of Stable Diffusion features, as well as different transformer layers of DINO/CLIP/VQGAN, are good at different properties, unlocking potential applications of 3D physical understanding.

PDF Details DOI

ICML Conference 2023 Conference Paper

Multi-Modal Classifiers for Open-Vocabulary Object Detection

Prannay Kaul
Weidi Xie
Andrew Zisserman

The goal of this paper is open-vocabulary object detection (OVOD) — building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two- stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yield- ing a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary bench- mark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) using multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.

Details

NeurIPS Conference 2023 Conference Paper

Self-supervised Object-Centric Learning for Videos

Görkay Aydemir
Weidi Xie
Fatma Guney

Unsupervised multi-object segmentation has shown impressive results on images by utilizing powerful semantics learned from self-supervised pretraining. An additional modality such as depth or motion is often used to facilitate the segmentation in video sequences. However, the performance improvements observed in synthetic sequences, which rely on the robustness of an additional cue, do not translate to more challenging real-world scenarios. In this paper, we propose the first fully unsupervised method for segmenting multiple objects in real-world sequences. Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames. From these temporally-aware slots, the training objective is to reconstruct the middle frame in a high-level semantic feature space. We propose a masking strategy by dropping a significant portion of tokens in the feature space for efficiency and regularization. Additionally, we address over-clustering by merging slots based on similarity. Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.

PDF Details

JBHI Journal 2023 Journal Article

Self-Supervised Tumor Segmentation With Sim2Real Adaptation

Xiaoman Zhang
Weidi Xie
Chaoqin Huang
Ya Zhang
Xin Chen
Qi Tian
Yanfeng Wang

This paper targets on self-supervised tumor segmentation. We make the following contributions: (i) we take inspiration from the observation that tumors are often characterised independently of their contexts, we propose a novel proxy task “layer-decomposition”, that closely matches the goal of the downstream task, and design a scalable pipeline for generating synthetic tumor data for pre-training; (ii) we propose a two-stage Sim2Real training regime for unsupervised tumor segmentation, where we first pre-train a model with simulated tumors, and then adopt a self-training strategy for downstream data adaptation; (iii) when evaluating on different tumor segmentation benchmarks, e. g. BraTS2018 for brain tumor segmentation and LiTS2017 for liver tumor segmentation, our approach achieves state-of-the-art segmentation performance under the unsupervised setting. While transferring the model for tumor segmentation under a low-annotation regime, the proposed approach also outperforms all existing self-supervised approaches; (iv) we conduct extensive ablation studies to analyse the critical components in data simulation, and validate the necessity of different proxy tasks. We demonstrate that, with sufficient texture randomization in simulation, model trained on synthetic data can effortlessly generalise to datasets with real tumors.

Details DOI

NeurIPS Conference 2022 Conference Paper

Associating Objects and Their Effects in Video through Coordination Games

Erika Lu
Forrester Cole
Weidi Xie
Tali Dekel
Bill Freeman
Andrew Zisserman
Michael Rubinstein

We explore a feed-forward approach for decomposing a video into layers, where each layer contains an object of interest along with its associated shadows, reflections, and other visual effects. This problem is challenging since associated effects vary widely with the 3D geometry and lighting conditions in the scene, and ground-truth labels for visual effects are difficult (and in some cases impractical) to collect. We take a self-supervised approach and train a neural network to produce a foreground image and alpha matte from a rough object segmentation mask under a reconstruction and sparsity loss. Under reconstruction loss, the layer decomposition problem is underdetermined: many combinations of layers may reconstruct the input video. Inspired by the game theory concept of focal points---or \emph{Schelling points}---we pose the problem as a coordination game, where each player (network) predicts the effects for a single object without knowledge of the other players' choices. The players learn to converge on the ``natural'' layer decomposition in order to maximize the likelihood of their choices aligning with the other players'. We train the network to play this game with itself, and show how to design the rules of this game so that the focal point lies at the correct layer decomposition. We demonstrate feed-forward results on a challenging synthetic dataset, then show that pretraining on this dataset significantly reduces optimization time for real videos.

PDF Details

NeurIPS Conference 2022 Conference Paper

ReCo: Retrieve and Co-segment for Zero-shot Transfer

Gyungin Shin
Weidi Xie
Samuel Albanie

Semantic segmentation has a broad range of applications, but its real-world impact has been significantly limited by the prohibitive annotation costs necessary to enable deployment. Segmentation methods that forgo supervision can side-step these costs, but exhibit the inconvenient requirement to provide labelled examples from the target distribution to assign concept names to predictions. An alternative line of work in language-image pre-training has recently demonstrated the potential to produce models that can both assign names across large vocabularies of concepts and enable zero-shot transfer for classification, but do not demonstrate commensurate segmentation abilities. We leverage the retrieval abilities of one such language-image pre-trained model, CLIP, to dynamically curate training sets from unlabelled images for arbitrary collections of concept names, and leverage the robust correspondences offered by modern image representations to co-segment entities among the resulting collections. The synthetic segment collections are then employed to construct a segmentation model (without requiring pixel labels) whose knowledge of concepts is inherited from the scalable pre-training process of CLIP. We demonstrate that our approach, termed Retrieve and Co-segment (ReCo) performs favourably to conventional unsupervised segmentation approaches while inheriting the convenience of nameable predictions and zero-shot transfer. We also demonstrate ReCo’s ability to generate specialist segmenters for extremely rare objects.

PDF Details

NeurIPS Conference 2022 Conference Paper

Segmenting Moving Objects via an Object-Centric Layered Representation

Junyu Xie
Weidi Xie
Andrew Zisserman

The objective of this paper is a model that is able to discover, track and segment multiple moving objects in a video. We make four contributions: First, we introduce an object-centric segmentation model with a depth-ordered layer representation. This is implemented using a variant of the transformer architecture that ingests optical flow, where each query vector specifies an object and its layer for the entire video. The model can effectively discover multiple moving objects and handle mutual occlusions; Second, we introduce a scalable pipeline for generating multi-object synthetic training data via layer compositions, that is used to train the proposed model, significantly reducing the requirements for labour-intensive annotations, and supporting Sim2Real generalisation; Third, we conduct thorough ablation studies, showing that the model is able to learn object permanence and temporal shape consistency, and is able to predict amodal segmentation masks; Fourth, we evaluate our model, trained only on synthetic data, on standard video segmentation benchmarks, DAVIS, MoCA, SegTrack, FBMS-59, and achieve state-of-the-art performance among existing methods that do not rely on any manual annotations. With test-time adaptation, we observe further performance boosts.

PDF Details

YNIMG Journal 2022 Journal Article

Subcortical segmentation of the fetal brain in 3D ultrasound using deep learning

Linde S. Hesse
Moska Aliasi
Felipe Moser
Monique C. Haak
Weidi Xie
Mark Jenkinson
Ana I.L. Namburete

The quantification of subcortical volume development from 3D fetal ultrasound can provide important diagnostic information during pregnancy monitoring. However, manual segmentation of subcortical structures in ultrasound volumes is time-consuming and challenging due to low soft tissue contrast, speckle and shadowing artifacts. For this reason, we developed a convolutional neural network (CNN) for the automated segmentation of the choroid plexus (CP), lateral posterior ventricle horns (LPVH), cavum septum pellucidum et vergae (CSPV), and cerebellum (CB) from 3D ultrasound. As ground-truth labels are scarce and expensive to obtain, we applied few-shot learning, in which only a small number of manual annotations (n = 9) are used to train a CNN. We compared training a CNN with only a few individually annotated volumes versus many weakly labelled volumes obtained from atlas-based segmentations. This showed that segmentation performance close to intra-observer variability can be obtained with only a handful of manual annotations. Finally, the trained models were applied to a large number (n = 278) of ultrasound image volumes of a diverse, healthy population, obtaining novel US-specific growth curves of the respective structures during the second trimester of gestation.

Details DOI

JBHI Journal 2020 Journal Article

Low-Memory CNNs Enabling Real-Time Ultrasound Segmentation Towards Mobile Deployment

Sagar Vaze
Weidi Xie
Ana I. L. Namburete

Convolutional Neural Networks (CNNs), which are currently state-of-the-art for most image analysis tasks, are ill suited to leveraging the key benefits of ultrasound imaging - specifically, ultrasound's portability and real-time capabilities. CNNs have large memory footprints, which obstructs their implementation on mobile devices, and require numerous floating point operations, which results in slow CPU inference times. In this article, we propose three approaches to training efficient CNNs that can operate in real-time on a CPU (catering to a clinical setting), with a low memory footprint, for minimal compromise in accuracy. We first demonstrate the power of `thin' CNNs (with very few feature channels) for fast medical image segmentation. We then leverage separable convolutions to further speed up inference, reduce parameter count and facilitate mobile deployment. Lastly, we propose a novel knowledge distillation technique to boost the accuracy of light-weight models, while maintaining inference speed-up. For a negligible sacrifice in test set Dice performance on the challenging ultrasound analysis task of nerve segmentation, our final proposed model processes images at 30 fps on a CPU, which is 9× faster than the standard U-Net, while requiring 420× less space in memory. Code for this work is available at: https://github.com/sagarvaze96/lightweight_unet.

Details DOI

NeurIPS Conference 2020 Conference Paper

Self-supervised Co-Training for Video Representation Learning

Tengda Han
Weidi Xie
Andrew Zisserman

The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular infoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i. e. requiring far less training data to achieve similar performance.

PDF Details