Arrow Research search

Author name cluster

Ziwei Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

50 papers
1 author row

Possible papers

50

AAAI Conference 2026 Conference Paper

Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models

  • Ziwei Liu
  • Borui Kang
  • Wei Li
  • Hangjie Yuan
  • Yanbing Yang
  • Wenbin Li
  • Yifan Zhu
  • Tao Feng

Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies enables these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially within the limited exploration subspace of PEFT. To overcome this challenge, this paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first identify the incompatibility of naive full-ZO adoption in VLCL due to instability in the optimization process. We then investigate the application of ZO optimization across various training units, from modality branch-wise to fine-grained layer-wise, to identify an optimal strategy. Furthermore, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart during ZO optimization in VLCL, and we propose a modality-aware stabilized ZO strategy, which adopts gradient sign normalization in ZO and constrains the perturbation of the vision modality to further improve performance. Benefiting from ZO optimization, PEFT-based VLCL gains a better ability to escape local minima during optimization, and extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
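For readers unfamiliar with the core mechanism the abstract mentions, the sketch below shows a generic two-point zeroth-order gradient estimate combined with gradient sign normalization. It is a minimal, hypothetical illustration; the loss function, perturbation scale, and learning rate are stand-ins and do not reflect the paper's actual VLCL training setup.

```python
# Illustrative two-point zeroth-order (ZO) update with sign normalization.
# Assumptions: a scalar loss_fn, a single flat parameter tensor, toy hyperparameters.
import torch

def zo_sign_step(params, loss_fn, mu=1e-3, lr=1e-4):
    """Estimate the gradient from two forward evaluations along a random
    direction, then apply only its sign (sign normalization) for stability."""
    u = torch.randn_like(params)                         # random perturbation direction
    loss_plus = loss_fn(params + mu * u)                 # f(theta + mu * u)
    loss_minus = loss_fn(params - mu * u)                # f(theta - mu * u)
    grad_est = (loss_plus - loss_minus) / (2 * mu) * u   # two-point ZO gradient estimate
    return params - lr * grad_est.sign()                 # signed update

# Toy usage: minimize a quadratic without ever calling backward().
theta = torch.randn(16)
for _ in range(100):
    theta = zo_sign_step(theta, lambda p: (p ** 2).sum())
```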

NeurIPS Conference 2025 Conference Paper

3EED: Ground Everything Everywhere in 3D

  • Rong Li
  • Yuhao Dong
  • Tianshuai Hu
  • Alan Liang
  • Youquan Liu
  • Dongyue Lu
  • Liang Pan
  • Lingdong Kong

Visual grounding in 3D is key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited by an indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.

NeurIPS Conference 2025 Conference Paper

Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos

  • hanxue liang
  • Jiawei Ren
  • Ashkan Mirzaei
  • Antonio Torralba
  • Ziwei Liu
  • Igor Gilitschenski
  • Sanja Fidler
  • Cengiz Oztireli

Recent advancements in static feed-forward scene reconstruction have demonstrated significant progress in high-quality novel view synthesis. However, these models often struggle with generalizability across diverse environments and fail to effectively handle dynamic content. We present BTimer (short for Bullet Timer), the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes. Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target (‘bullet’) timestamp by aggregating information from all the context frames. Such a formulation allows BTimer to gain scalability and generalization by leveraging both static and dynamic scene datasets. Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150ms while reaching state-of-the-art performance on both static and dynamic scene datasets, even compared with optimization-based approaches.

NeurIPS Conference 2025 Conference Paper

GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

  • Wentao Wang
  • Hang Ye
  • Fangzhou Hong
  • Xue Yang
  • Jianfu Zhang
  • Yizhou Wang
  • Ziwei Liu
  • Liang Pan

Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN's 2D and 3D human priors for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline is leveraged to recover high-quality 3D human geometry given a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and the pixel spaces. Extensive experimental results demonstrate that GeneMAN can generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN reveals much better generalizability in dealing with in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.

TMLR Journal 2025 Journal Article

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

  • Atsuyuki Miyai
  • Jingkang Yang
  • Jingyang Zhang
  • Yifei Ming
  • Yueqian Lin
  • Qing Yu
  • Go Irie
  • Shafiq Joty

Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these problems, a generalized OOD detection framework was proposed, taxonomically categorizing these five problems. However, Vision Language Models (VLMs) such as CLIP have significantly changed the paradigm and blurred the boundaries between these fields, again confusing researchers. In this survey, we first present a generalized OOD detection v2, encapsulating the evolution of these fields in the VLM era. Our framework reveals that, with some field inactivity and integration, the demanding challenges have become OOD detection and AD. Then, we highlight the significant shift in the definition, problem settings, and benchmarks; we thus feature a comprehensive review of the methodology for OOD detection and related tasks to clarify their relationship to OOD detection. Finally, we explore the advancements in the emerging Large Vision Language Model (LVLM) era, such as GPT-4V. We conclude with open challenges and future directions. The resource is available at https://github.com/AtsuMiyai/Awesome-OOD-VLM.

NeurIPS Conference 2025 Conference Paper

GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection

  • Xin Gao
  • Jiyao Liu
  • Guanghao Li
  • Yueming LYU
  • Jianxiong Gao
  • Weichen Yu
  • Ningsheng Xu
  • Liang Wang

Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD samples. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance, based on the gradient of the log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier's latent space, promotes sampling in feature-sparse regions. This dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.
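To make the feature-level signal concrete, the following hypothetical sketch computes a k-NN distance score in a classifier's feature space, the kind of quantity the abstract describes for steering sampling toward feature-sparse regions. The feature extractor, bank, and k are placeholder assumptions, not the paper's implementation.

```python
# k-NN distance in feature space: larger average distance to the in-distribution
# bank indicates a feature-sparse (more OOD-like) region.
import torch
import torch.nn.functional as F

def knn_distance_score(query_feats, id_bank, k=10):
    """Mean distance of each query feature to its k nearest ID features."""
    q = F.normalize(query_feats, dim=-1)
    bank = F.normalize(id_bank, dim=-1)
    dists = torch.cdist(q, bank)                    # (num_queries, bank_size)
    knn, _ = dists.topk(k, largest=False, dim=-1)   # k smallest distances per query
    return knn.mean(dim=-1)

# Toy usage with random vectors standing in for real classifier embeddings.
scores = knn_distance_score(torch.randn(4, 128), torch.randn(1000, 128))
```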

NeurIPS Conference 2025 Conference Paper

GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

  • Penghao Wu
  • Shengnan Ma
  • Bo Wang
  • Jiaheng Yu
  • Lewei Lu
  • Ziwei Liu

Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories, thus lacking reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models throughout dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables self-reflection behavior to emerge through fully automated data generation and learning processes without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding ability, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm leveraging the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation, with all data, models, environments, and tools to be released publicly.

NeurIPS Conference 2025 Conference Paper

Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity

  • Yuhan Zhang
  • Long Zhuo
  • Ziyang Chu
  • Tong Wu
  • Zhibing Li
  • Liang Pan
  • Dahua Lin
  • Ziwei Liu

Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations.

NeurIPS Conference 2025 Conference Paper

Imagine360: Immersive 360 Video Generation from Perspective Anchor

  • Jing Tan
  • Shuai Yang
  • Tong Wu
  • Jingwen He
  • Yuwei Guo
  • Ziwei Liu
  • Dahua Lin

$360^\circ$ videos offer a hyper-immersive experience that allows the viewers to explore a dynamic scene from full 360 degrees. To achieve more accessible and personalized content creation in $360^\circ$ video format, we seek to lift standard perspective videos into $360^\circ$ equirectangular videos. To this end, we introduce **Imagine360**, the first perspective-to-$360^\circ$ video generation framework that creates high-quality $360^\circ$ videos with rich and diverse motion patterns from video anchors. Imagine360 learns fine-grained spherical visual and motion patterns from limited $360^\circ$ video data with several key designs. **1)** First, we adopt a dual-branch design, including a perspective and a panorama video denoising branch to provide local and global constraints for $360^\circ$ video generation, with motion module and spatial LoRA layers fine-tuned on $360^\circ$ videos. **2)** Additionally, an antipodal mask is devised to capture long-range motion dependencies, enhancing the reversed camera motion between antipodal pixels across hemispheres. **3)** To handle diverse perspective video inputs, we propose rotation-aware designs that adapt to varying video masking due to changing camera poses across frames. **4)** Lastly, we introduce a new 360 video dataset featuring 10K high-quality, trimmed 360 video clips with structured motion to facilitate training. Extensive experiments show Imagine360 achieves superior graphics quality and motion coherence with our curated dataset among state-of-the-art $360^\circ$ video generation methods. We believe Imagine360 holds promise for advancing personalized, immersive $360^\circ$ video creation.
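The antipodal mask above relies on a simple geometric fact about equirectangular panoramas: the antipode of a pixel sits half the image width away in longitude and mirrored about the equator in latitude. The sketch below illustrates only that coordinate relation; image size and indexing conventions are illustrative assumptions, not the paper's code.

```python
# Antipodal pixel coordinates in an equirectangular panorama (width = 360 deg
# of longitude, height = 180 deg of latitude).
def antipodal_coords(u, v, width, height):
    """Longitude shifts by half the width (wrapped); latitude flips about the equator."""
    u_ap = (u + width // 2) % width
    v_ap = (height - 1) - v
    return u_ap, v_ap

# Example: the antipode of the top-left pixel in a 1024x512 panorama.
print(antipodal_coords(0, 0, 1024, 512))  # -> (512, 511)
```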

TMLR Journal 2025 Journal Article

Latte: Latent Diffusion Transformer for Video Generation

  • Xin Ma
  • Yaohui Wang
  • Xinyuan Chen
  • Gengyun Jia
  • Ziwei Liu
  • Yuan-Fang Li
  • Cunjian Chen
  • Yu Qiao

We propose Latte, a novel Latent Diffusion Transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, \textit{i.e.}, FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results that are competitive with recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

TMLR Journal 2025 Journal Article

LLaVA-OneVision: Easy Visual Task Transfer

  • Bo Li
  • Yuanhan Zhang
  • Dong Guo
  • Renrui Zhang
  • Feng Li
  • Hao Zhang
  • Kaichen Zhang
  • Peiyuan Zhang

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

TMLR Journal 2025 Journal Article

LLaVA-Video: Video Instruction Tuning With Synthetic Data

  • Yuanhan Zhang
  • Jinming Wu
  • Wei Li
  • Bo Li
  • Zejun Ma
  • Ziwei Liu
  • Chunyuan Li

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

TMLR Journal 2025 Journal Article

Long Context Transfer from Language to Vision

  • Peiyuan Zhang
  • Kaichen Zhang
  • Bo Li
  • Guangtao Zeng
  • Jingkang Yang
  • Yuanhan Zhang
  • Ziyue Wang
  • Haoran Tan

Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon \textit{long context transfer} and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME and MLVU among 7B-scale models by densely sampling more input frames.
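As background for the context-extrapolation idea, the sketch below shows rotary-position interpolation, one common way to stretch a language backbone's trained context window. It is a generic illustration under stated assumptions, not necessarily the recipe used in this paper; the dimension, base, and scale factor are made up.

```python
# Rotary-position interpolation: compress position indices by a scale factor so
# a longer sequence reuses the positional range the model was trained on.
import torch

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary angles for the given token positions; scale > 1 interpolates positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    pos = positions.float() / scale            # positional interpolation
    return torch.outer(pos, inv_freq)          # shape: (seq_len, dim // 2)

# Example: stretching a 4k-trained range to 16k positions with a 4x factor.
angles = rope_angles(torch.arange(16384), dim=128, scale=4.0)
```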

NeurIPS Conference 2025 Conference Paper

PhysX-3D: Physical-Grounded 3D Asset Generation

  • Ziang Cao
  • Zhaoxi Chen
  • Liang Pan
  • Ziwei Liu

3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose \textbf{PhysX}, an end-to-end paradigm for physical-grounded 3D asset generation. \textbf{1)} To bridge the critical gap in physics-annotated 3D datasets, we present the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: \textbf{absolute scale}, \textbf{material}, \textbf{affordance}, \textbf{kinematics}, and \textbf{function description}. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. \textbf{2)} Furthermore, we propose \textbf{PhysXGen}, a feed-forward framework for physics-grounded 3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.

NeurIPS Conference 2025 Conference Paper

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

  • Hongbo Liu
  • Jingwen He
  • Yi Jin
  • Dian Zheng
  • Yuhao Dong
  • Fan Zhang
  • Ziqi Huang
  • Yinan He

Recent Vision-Language Models (VLMs) have shown strong performance in general-purpose visual understanding and reasoning, but their ability to comprehend the visual grammar of movie shots remains underexplored and insufficiently evaluated. To bridge this gap, we present \textbf{ShotBench}, a dedicated benchmark for assessing VLMs' understanding of cinematic language. ShotBench includes 3,049 still images and 500 video clips drawn from more than 200 films, with each sample annotated by trained annotators or curated from professional cinematography resources, resulting in 3,608 high-quality question-answer pairs. We conduct a comprehensive evaluation of over 20 state-of-the-art VLMs across eight core cinematography dimensions. Our analysis reveals clear limitations in the fine-grained perception and cinematic reasoning of current VLMs. To improve VLMs' capability in cinematography understanding, we construct a large-scale multimodal dataset, named ShotQA, which contains about 70k question-answer pairs derived from movie shots. In addition, we propose ShotVL, a VLM trained with a two-stage strategy integrating both supervised fine-tuning and Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our model achieves substantial improvements, surpassing all existing strongest open-source and proprietary models evaluated on ShotBench and establishing a new state-of-the-art.

AAAI Conference 2025 Conference Paper

SIGMA: Selective Gated Mamba for Sequential Recommendation

  • Ziwei Liu
  • Qidong Liu
  • Yejing Wang
  • Wanyu Wang
  • Pengyue Jia
  • Maolin Wang
  • Zitao Liu
  • Yi Chang

Sequential Recommender Systems (SRS) have stood out as a highly promising technique in numerous domains due to their impressive capability of capturing complex user preferences. Current SRS employ transformer-based models for next-item prediction. Nevertheless, their quadratic computational complexity often results in notable inefficiencies, posing a significant obstacle to real-time recommendation processes. Recently, Mamba has demonstrated its exceptional effectiveness in time series prediction, delivering substantial improvements in both efficiency and effectiveness. However, directly applying Mamba to SRS poses certain challenges. Its unidirectional structure may impede the ability to capture contextual information in user-item interactions, while its instability in state estimation may hinder the ability to capture short-term patterns in interaction sequences. To address these issues, we propose a novel framework called Selective Gated Mamba for Sequential Recommendation (SIGMA). By introducing the Partially Flipped Mamba (PF-Mamba), we construct a special bi-directional structure to address the context modeling challenge. Then, to consolidate PF-Mamba's performance, we employ an input-dependent Dense Selective Gate (DS Gate) to allocate the weights of the two directions and further filter the sequential information. Moreover, for short sequence modeling, we devise a Feature Extract GRU (FE-GRU) to capture the short-term dependencies. Experimental results demonstrate that SIGMA significantly outperforms existing baselines across five real-world datasets. Our implementation code is available in the Supplementary Material to ease reproducibility.

NeurIPS Conference 2025 Conference Paper

Video World Models with Long-term Spatial Memory

  • Tong Wu
  • Shuai Yang
  • Ryan Po
  • Yinghao Xu
  • Ziwei Liu
  • Dahua Lin
  • Gordon Wetzstein

Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

NeurIPS Conference 2025 Conference Paper

VideoLucy: Deep Memory Backtracking for Long Video Understanding

  • Jialong Zuo
  • Yongtai Deng
  • Lingdong Kong
  • Jingkang Yang
  • Rui Jin
  • Yiwei Zhang
  • Nong Sang
  • Liang Pan

Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly available.

JBHI Journal 2024 Journal Article

A Two-Stage Generative Model with CycleGAN and Joint Diffusion for MRI-based Brain Tumor Detection

  • Wenxin Wang
  • Zhuo-Xu Cui
  • Guanxun Cheng
  • Chentao Cao
  • Xi Xu
  • Ziwei Liu
  • Haifeng Wang
  • Yulong Qi

Accurate detection and segmentation of brain tumors is critical for medical diagnosis. However, current supervised learning methods require extensively annotated images, and the state-of-the-art generative models used in unsupervised methods often have limitations in covering the whole data distribution. In this paper, we propose a novel framework, Two-Stage Generative Model (TSGM), that combines Cycle Generative Adversarial Network (CycleGAN) and Variance Exploding stochastic differential equation using joint probability (VE-JP) to improve brain tumor detection and segmentation. The CycleGAN is trained on unpaired data to generate abnormal images from healthy images as a data prior. Then VE-JP is implemented to reconstruct healthy images using synthetic paired abnormal images as a guide, which alters only pathological regions but not healthy regions. Notably, our method directly learns the joint probability distribution for conditional generation. The residual between input and reconstructed images suggests the abnormalities, and a thresholding method is subsequently applied to obtain segmentation results. Furthermore, the multimodal results are weighted with different weights to improve the segmentation accuracy further. We validated our method on three datasets and compared it with other unsupervised methods for anomaly detection and segmentation. DSC scores of 0.8590 on the BraTS2020 dataset, 0.6226 on the ITCS dataset, and 0.7403 on the in-house dataset show that our method achieves better segmentation performance and has better generalization.
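The final segmentation step described above (residual between input and reconstruction, then thresholding) is simple enough to sketch. The example below is a minimal illustration under stated assumptions; the percentile threshold and random arrays are placeholders, not the paper's actual configuration.

```python
# Residual-based anomaly segmentation: |input - healthy reconstruction| > threshold.
import numpy as np

def residual_segmentation(image, reconstruction, percentile=95):
    """Binary anomaly mask obtained by thresholding the reconstruction residual."""
    residual = np.abs(image - reconstruction)
    threshold = np.percentile(residual, percentile)   # illustrative threshold choice
    return (residual > threshold).astype(np.uint8)

# Toy usage with random arrays standing in for an MRI slice and its reconstruction.
mask = residual_segmentation(np.random.rand(256, 256), np.random.rand(256, 256))
```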

NeurIPS Conference 2024 Conference Paper

AID: Attention Interpolation of Text-to-Image Diffusion

  • Qiyuan He
  • Jinghao Wang
  • Ziwei Liu
  • Angela Yao

Conditional diffusion models can create unseen images in various settings, aiding image interpolation. Interpolation in latent spaces is well-studied, but interpolation with specific conditions like text or image is less understood. Common approaches interpolate linearly in the conditioning space but tend to result in inconsistent images with poor fidelity. This work introduces a novel training-free technique named \textbf{Attention Interpolation via Diffusion (AID)}. AID has two key contributions: \textbf{1)} a fused inner/outer interpolated attention layer to boost image consistency and fidelity; and \textbf{2)} selection of interpolation coefficients via a beta distribution to increase smoothness. Additionally, we present an AID variant called \textbf{Prompt-guided Attention Interpolation via Diffusion (PAID)}, which \textbf{3)} treats interpolation as a condition-dependent generative process. Experiments demonstrate that our method achieves greater consistency, smoothness, and efficiency in condition-based interpolation, aligning closely with human preferences. Furthermore, PAID offers substantial benefits for compositional generation, controlled image editing, image morphing and image-controlled generation, all while remaining training-free.
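To illustrate the coefficient-selection idea in the abstract, the sketch below draws interpolation coefficients from a Beta distribution and linearly blends two condition embeddings. It is a hedged toy example: the alpha/beta values, embedding shapes, and linear blend are illustrative assumptions, not the AID/PAID attention mechanism itself.

```python
# Beta-distributed interpolation coefficients plus a simple linear blend of two
# condition embeddings (the attention-level interpolation in AID is more involved).
import torch

def beta_interpolation_coeffs(n, alpha=2.0, beta=2.0):
    """n coefficients in [0, 1], sampled from Beta(alpha, beta) and sorted."""
    return torch.distributions.Beta(alpha, beta).sample((n,)).sort().values

def blend(cond_a, cond_b, t):
    """Linear blend of two condition embeddings at coefficient t."""
    return (1 - t) * cond_a + t * cond_b

cond_a, cond_b = torch.randn(77, 768), torch.randn(77, 768)  # toy text embeddings
frames = [blend(cond_a, cond_b, t) for t in beta_interpolation_coeffs(8)]
```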

JBHI Journal 2024 Journal Article

Exploiting Hierarchical Interactions for Protein Surface Learning

  • Yiqun Lin
  • Liang Pan
  • Yi Li
  • Ziwei Liu
  • Xiaomeng Li

Predicting interactions between proteins is one of the most important yet challenging problems in structural bioinformatics. Intrinsically, potential function sites in protein surfaces are determined by both geometric and chemical features. However, existing works only consider handcrafted or individually learned chemical features from the atom type and extract geometric features independently. Here, we identify two key properties of effective protein surface learning: 1) relationship among atoms: atoms are linked with each other by covalent bonds to form biomolecules instead of appearing alone, leading to the significance of modeling the relationship among atoms in chemical feature learning. 2) hierarchical feature interaction: the neighboring residue effect validates the significance of hierarchical feature interaction among atoms and between surface points and atoms (or residues). In this paper, we present a principled framework based on deep learning techniques, namely Hierarchical Chemical and Geometric Feature Interaction Network (HCGNet), for protein surface analysis by bridging chemical and geometric features with hierarchical interactions. Extensive experiments demonstrate that our method outperforms the prior state-of-the-art method by 2.3% in the site prediction task and 3.2% in the interaction matching task, respectively.

NeurIPS Conference 2024 Conference Paper

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

  • Tong Wu
  • Yinghao Xu
  • Ryan Po
  • Mengchen Zhang
  • Guandao Yang
  • Jiaqi Wang
  • Ziwei Liu
  • Dahua Lin

Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes like lighting and dynamics. Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, letting users apply characteristics like lighting, texture, and dynamics from different images. To achieve this goal, to the best of our knowledge, we constructed the first fine-grained visual attributes dataset (FiVA). This FiVA dataset features a well-organized taxonomy for visual attributes and includes 1M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attributes adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.

NeurIPS Conference 2024 Conference Paper

L4GM: Large 4D Gaussian Reconstruction Model

  • Jiawei Ren
  • Kevin Xie
  • Ashkan Mirzaei
  • hanxue liang
  • Xiaohui Zeng
  • Karsten Kreis
  • Ziwei Liu
  • Antonio Torralba

We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input -- in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep L4GM simple for scalability and build directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian splat representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and utilize a per-timestep multiview rendering loss to train the model. The representation is upsampled to a higher framerate by training an interpolation model which produces intermediate 3D Gaussian representations. We showcase that L4GM, trained only on synthetic data, generalizes well to in-the-wild videos, producing high-quality animated 3D assets.

NeurIPS Conference 2024 Conference Paper

Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials

  • Ye Fang
  • Zeyi Sun
  • Tong Wu
  • Jiaqi Wang
  • Ziwei Liu
  • Gordon Wetzstein
  • Dahua Lin

Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library. 2) Utilizing a combination of visual cues and hierarchical text prompts, GPT-4V precisely identifies and aligns materials with the corresponding components of 3D objects. 3) The correctly matched materials are then meticulously applied as reference for the new SVBRDF material generation according to the original albedo map, significantly enhancing their visual authenticity. Make-it-Real offers a streamlined integration into the 3D content creation workflow, showcasing its utility as an essential tool for developers of 3D assets.

NeurIPS Conference 2023 Conference Paper

4D Panoptic Scene Graph Generation

  • Jingkang Yang
  • Jun CEN
  • Wenxuan Peng
  • Shuai Liu
  • Fangzhou Hong
  • Xiangtai Li
  • Kaiyang Zhou
  • Qifeng Chen

We are living in a three-dimensional space while moving forward through a fourth dimension: time. To allow artificial intelligence to develop a comprehensive understanding of such a 4D environment, we introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding. Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations. To facilitate research in this new area, we build a richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of 1M frames, each of which is labeled with 4D panoptic segmentation masks as well as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs via a relation component. Extensive experiments on the new dataset show that our method can serve as a strong baseline for future research on PSG-4D. In the end, we provide a real-world application example to demonstrate how we can achieve dynamic scene understanding by integrating a large language model into our PSG-4D system.

NeurIPS Conference 2023 Conference Paper

FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing

  • Mingyuan Zhang
  • Huirong Li
  • Zhongang Cai
  • Jiawei Ren
  • Lei Yang
  • Ziwei Liu

Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions, depicting detailed and accurate spatio-temporal actions. This lack of fine controllability limits the usage of motion generation to a larger audience. To tackle these challenges, we present FineMoGen, a diffusion-based motion generation and editing framework that can synthesize fine-grained motions, with spatio-temporal composition according to the user instructions. Specifically, FineMoGen builds upon a diffusion model with a novel transformer architecture dubbed Spatio-Temporal Mixture Attention (SAMI). SAMI optimizes the generation of the global attention template from two perspectives: 1) explicitly modeling the constraints of spatio-temporal composition; and 2) utilizing sparsely-activated mixture-of-experts to adaptively extract fine-grained features. To facilitate a large-scale study on this new fine-grained motion generation task, we contribute the HuMMan-MoGen dataset, which consists of 2,968 videos and 102,336 fine-grained spatio-temporal descriptions. Extensive experiments validate that FineMoGen exhibits superior motion generation quality over state-of-the-art methods. Notably, FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models (LLMs), which faithfully manipulate motion sequences with fine-grained instructions.

NeurIPS Conference 2023 Conference Paper

InsActor: Instruction-driven Physics-based Characters

  • Jiawei Ren
  • Mingyuan Zhang
  • Cunjun Yu
  • Xiao Ma
  • Liang Pan
  • Ziwei Liu

Generating animation of physics-based characters with intuitive control has long been a desirable task with numerous applications. However, generating physically simulated animations that reflect high-level human instructions remains a difficult problem due to the complexity of physical environments and the richness of human language. In this paper, we present $\textbf{InsActor}$, a principled generative framework that leverages recent advancements in diffusion-based human motion models to produce instruction-driven animations of physics-based characters. Our framework empowers InsActor to capture complex relationships between high-level human instructions and character motions by employing diffusion policies for flexibly conditioned motion planning. To overcome invalid states and infeasible state transitions in planned motions, InsActor discovers low-level skills and maps plans to latent skill sequences in a compact latent space. Extensive experiments demonstrate that InsActor achieves state-of-the-art results on various tasks, including instruction-driven motion generation and instruction-driven waypoint heading. Notably, the ability of InsActor to generate physically simulated animations using high-level human instructions makes it a valuable tool, particularly in executing long-horizon tasks with a rich set of instructions. Our project page is available at [jiawei-ren.github.io/projects/insactor/index.html](https://jiawei-ren.github.io/projects/insactor/index.html)

NeurIPS Conference 2023 Conference Paper

Large Language Models are Visual Reasoning Coordinators

  • Liangyu Chen
  • Bo Li
  • Sheng Shen
  • Jingkang Yang
  • Chunyuan Li
  • Kurt Keutzer
  • Trevor Darrell
  • Ziwei Liu

Visual reasoning requires multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods like ensemble still struggle to aggregate these models with the desired higher-order communications. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a large language model (LLM) can efficiently coordinate multiple VLMs by facilitating natural language communication that leverages their distinct and complementary capabilities. Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities.

NeurIPS Conference 2023 Conference Paper

PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation

  • Zhaoxi Chen
  • Fangzhou Hong
  • Haiyi Mei
  • Guangcong Wang
  • Lei Yang
  • Ziwei Liu

We present PrimDiffusion, the first diffusion-based framework for 3D human generation. Devising diffusion models for 3D human generation is difficult due to the intensive computational cost of 3D representations and the articulated topology of 3D humans. To tackle these challenges, our key insight is operating the denoising diffusion process directly on a set of volumetric primitives, which models the human body as a number of small volumes with radiance and kinematic information. This volumetric primitives representation marries the capacity of volumetric representations with the efficiency of primitive-based rendering. Our PrimDiffusion framework has three appealing properties: **1)** compact and expressive parameter space for the diffusion model, **2)** flexible representation that incorporates human prior, and **3)** decoder-free rendering for efficient novel-view and novel-pose synthesis. Extensive experiments validate that PrimDiffusion outperforms state-of-the-art methods in 3D human generation. Notably, compared to GAN-based methods, our PrimDiffusion supports real-time rendering of high-quality 3D humans at a resolution of $512\times512$ once the denoising process is done. We also demonstrate the flexibility of our framework on training-free conditional generation such as texture transfer and 3D inpainting.

NeurIPS Conference 2023 Conference Paper

RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

  • Dongwei Pan
  • Long Zhuo
  • Jingtan Piao
  • Huiwen Luo
  • Wei Cheng
  • Yuxin Wang
  • Siming Fan
  • Shengqi Liu

Synthesizing high-fidelity head avatars is a central problem for computer vision and graphics. While head avatar synthesis algorithms have advanced rapidly, the best ones still face great obstacles in real-world scenarios. One of the vital causes is the inadequate datasets -- 1) current public datasets can only support researchers to explore high-fidelity head avatars in one or two task directions; 2) these datasets usually contain digital head assets with limited data volume, and narrow distribution over different attributes, such as expressions, ages, and accessories. In this paper, we present RenderMe-360, a comprehensive 4D human head dataset to drive advances in head avatar algorithms across different scenarios. It contains massive data assets, with 243+ million complete head frames and over 800k video sequences from 500 different identities captured by multi-view cameras at 30 FPS. It is a large-scale digital library for head avatars with three key attributes: 1) High Fidelity: all subjects are captured in 360 degrees via 60 synchronized, high-resolution 2K cameras. 2) High Diversity: The collected subjects vary from different ages, eras, ethnicities, and cultures, providing abundant materials with distinctive styles in appearance and geometry. Moreover, each subject is asked to perform various dynamic motions, such as expressions and head rotations, which further extend the richness of assets. 3) Rich Annotations: the dataset provides annotations with different granularities: cameras' parameters, background matting, scan, 2D/3D facial landmarks, FLAME fitting, and text description. Based on the dataset, we build a comprehensive benchmark for head avatar research, with 16 state-of-the-art methods performed on five main tasks: novel view synthesis, novel expression synthesis, hair rendering, hair editing, and talking head generation. Our experiments uncover the strengths and flaws of state-of-the-art methods. RenderMe-360 opens the door for future exploration in modern head avatars. All of the data, code, and models will be publicly available at https://renderme-360.github.io/.

AAAI Conference 2023 Conference Paper

Robust Video Portrait Reenactment via Personalized Representation Quantization

  • Kaisiyuan Wang
  • Changcheng Liang
  • Hang Zhou
  • Jiaxiang Tang
  • Qianyi Wu
  • Dongliang He
  • Zhibin Hong
  • Jingtuo Liu

While progress has been made in the field of portrait reenactment, the problem of how to produce high-fidelity and robust videos remains. Recent studies normally find it challenging to handle rarely seen target poses due to the limitation of source data. This paper proposes the Video Portrait via Non-local Quantization Modeling (VPNQ) framework, which produces pose- and disturbance-robust reenactable video portraits. Our key insight is to learn position-invariant quantized local patch representations and build a mapping between simple driving signals and local textures with non-local spatial-temporal modeling. Specifically, instead of learning a universal quantized codebook, we identify that a personalized one can be trained to preserve desired position-invariant local details better. Then, a simple representation of projected landmarks can be used as sufficient driving signals to avoid 3D rendering. Next, we employ a carefully designed Spatio-Temporal Transformer to predict reasonable and temporally consistent quantized tokens from the driving signal. The predicted codes can be decoded back to robust and high-quality videos. Comprehensive experiments have been conducted to validate the effectiveness of our approach.

NeurIPS Conference 2023 Conference Paper

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

  • Youquan Liu
  • Lingdong Kong
  • Jun CEN
  • Runnan Chen
  • Wenwei Zhang
  • Liang Pan
  • Kai Chen
  • Ziwei Liu

Recent advancements in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a novel framework that harnesses VFMs for segmenting diverse automotive point cloud sequences. Seal exhibits three appealing properties: i) Scalability: VFMs are directly distilled into point clouds, obviating the need for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial and temporal relationships are enforced at both the camera-to-LiDAR and point-to-segment regularization stages, facilitating cross-modal representation learning. iii) Generalizability: Seal enables knowledge transfer in an off-the-shelf manner to downstream tasks involving diverse point clouds, including those from real/synthetic, low/high-resolution, large/small-scale, and clean/corrupted datasets. Extensive experiments conducted on eleven different point cloud datasets showcase the effectiveness and superiority of Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear probing, surpassing random initialization by 36.9% mIoU and outperforming prior arts by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all eleven tested point cloud datasets. The code is available at this link.

NeurIPS Conference 2023 Conference Paper

SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

  • Zhongang Cai
  • Wanqi Yin
  • Ailing Zeng
  • Chen Wei
  • Qingping SUN
  • Wang Yanjun
  • Hui En Pang
  • Haiyi Mei

Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For the data scaling, we perform a systematic investigation on 32 EHPS datasets, including a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. 2) For the model scaling, we take advantage of vision transformers to study the scaling law of model sizes in EHPS. Moreover, our finetuning strategy turns SMPLer-X into specialist models, allowing them to achieve further performance boosts. Notably, our foundation model SMPLer-X consistently delivers state-of-the-art results on seven benchmarks such as AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning).

NeurIPS Conference 2023 Conference Paper

Towards Robust and Expressive Whole-body Human Pose and Shape Estimation

  • Hui En Pang
  • Zhongang Cai
  • Lei Yang
  • Qingyi Tao
  • Zhonghua Wu
  • Tianwei Zhang
  • Ziwei Liu

Whole-body pose and shape estimation aims to jointly predict different behaviors (e.g., pose, hand gesture, facial expression) of the entire human body from a monocular image. Existing methods often exhibit suboptimal performance due to the complexity of in-the-wild scenarios. We argue that the prediction accuracy of these models is significantly affected by the quality of the bounding box, e.g., scale, alignment. The natural discrepancy between the ideal bounding box annotations and model detection results is particularly detrimental to the performance of whole-body pose and shape estimation. In this paper, we propose a novel framework to enhance the robustness of whole-body pose and shape estimation. Our framework incorporates three new modules to address the above challenges from three perspectives: (1) a Localization Module enhances the model's awareness of the subject's location and semantics within the image space; (2) a Contrastive Feature Extraction Module encourages the model to be invariant to robust augmentations by incorporating a contrastive loss and positive samples; (3) a Pixel Alignment Module ensures the reprojected mesh from the predicted camera and body model parameters are more accurate and pixel-aligned. We perform comprehensive experiments to demonstrate the effectiveness of our proposed framework on body, hands, face and whole-body benchmarks.

NeurIPS Conference 2023 Conference Paper

What Makes Good Examples for Visual In-Context Learning?

  • Yuanhan Zhang
  • Kaiyang Zhou
  • Ziwei Liu

Large vision models with billions of parameters and trained on broad data have great potential in numerous downstream applications. However, these models are typically difficult to adapt due to their large parameter size and sometimes lack of access to their weights---entities able to develop large vision models often provide APIs only. In this paper, we study how to better utilize large vision models through the lens of in-context learning, a concept that has been well-known in natural language processing but has only been studied very recently in computer vision. In-context learning refers to the ability to perform inference on tasks never seen during training by simply conditioning on in-context examples (i.e., input-output pairs) without updating any internal model parameters. To demystify in-context learning in computer vision, we conduct extensive research and identify a critical problem: downstream performance is highly sensitive to the choice of visual in-context examples. To address this problem, we propose a prompt retrieval framework specifically for large vision models, allowing the selection of in-context examples to be fully automated. Concretely, we provide two implementations: (i) an unsupervised prompt retrieval method based on nearest example search using an off-the-shelf model, and (ii) a supervised prompt retrieval method, which trains a neural network to choose examples that directly maximize in-context learning performance. Both methods do not require access to the internal weights of large vision models. Our results demonstrate that our methods can bring non-trivial improvements to visual in-context learning in comparison to the commonly-used random selection. Code and models will be released.
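The unsupervised variant described above amounts to nearest-neighbor retrieval in an off-the-shelf feature space. The sketch below illustrates that idea under stated assumptions; the cosine-similarity metric and random features are placeholders for a real feature extractor and example pool, not the paper's exact implementation.

```python
# Nearest-neighbor prompt retrieval: pick the pool examples most similar to the
# query in some off-the-shelf feature space and use them as in-context examples.
import numpy as np

def retrieve_in_context_examples(query_feat, pool_feats, top_k=1):
    """Indices of the pool examples with highest cosine similarity to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    p = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    sims = p @ q
    return np.argsort(-sims)[:top_k]

# Toy usage: choose the single most similar training example as the visual prompt.
idx = retrieve_in_context_examples(np.random.rand(512), np.random.rand(1000, 512))
```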

NeurIPS Conference 2022 Conference Paper

AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies

  • Li Siyao
  • Yuhang Li
  • Bo Li
  • Chao Dong
  • Ziwei Liu
  • Chen Change Loy

Visual correspondence of 2D animation is the core of many applications and deserves careful study. Existing correspondence datasets for 2D cartoons suffer from simple frame composition and monotonic movements, making them insufficient to simulate real animations. In this work, we present a new 2D animation visual correspondence dataset, AnimeRun, by converting open source 3D movies to full scenes in 2D style, including simultaneous moving background and interactions of multiple subjects. Statistics show that our proposed dataset not only resembles real anime more in image composition, but also possesses richer and more complex motion patterns compared to existing datasets. With this dataset, we establish a comprehensive benchmark by evaluating several existing optical flow and segment matching methods, and analyze shortcomings of these methods on animation data. Data are available at https://lisiyao21.github.io/projects/AnimeRun.

NeurIPS Conference 2022 Conference Paper

Audio-Driven Co-Speech Gesture Video Generation

  • Xian Liu
  • Qianyi Wu
  • Hang Zhou
  • Yuanqi Du
  • Wayne Wu
  • Dahua Lin
  • Ziwei Liu

Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequence driven by speech audio. Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representation to codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found in: https://alvinliu0.github.io/projects/ANGIE
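The VQ-Motion Extractor builds on vector quantisation; as a point of reference, the core nearest-codeword lookup is sketched below with placeholder shapes. This is the generic operation only, not the module's actual architecture or training.

import torch

def vq_lookup(z, codebook):
    # z: (B, D) continuous motion features; codebook: (K, D) learned codes.
    # Each feature is replaced by its nearest codebook entry.
    idx = torch.cdist(z, codebook).argmin(dim=-1)
    return codebook[idx], idx

# Toy usage with random features and a random codebook.
codes, idx = vq_lookup(torch.randn(4, 64), torch.randn(512, 64))
print(idx)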

NeurIPS Conference 2022 Conference Paper

Benchmarking and Analyzing 3D Human Pose and Shape Estimation Beyond Algorithms

  • Hui En Pang
  • Zhongang Cai
  • Lei Yang
  • Tianwei Zhang
  • Ziwei Liu

3D human pose and shape estimation (a.k.a. "human mesh recovery") has achieved substantial progress. Researchers mainly focus on the development of novel algorithms, while less attention has been paid to other critical factors involved. This could lead to less optimal baselines, hindering the fair and faithful evaluations of newly designed methodologies. To address this problem, this work presents the first comprehensive benchmarking study from three under-explored perspectives beyond algorithms. 1) Datasets. An analysis of 31 datasets reveals the distinct impacts of data samples: datasets featuring critical attributes (i.e., diverse poses, shapes, camera characteristics, backbone features) are more effective. Strategic selection and combination of high-quality datasets can yield a significant boost to model performance. 2) Backbones. Experiments with 10 backbones, ranging from CNNs to transformers, show that knowledge learnt from a proximity task is readily transferable to human mesh recovery. 3) Training strategies. Proper augmentation techniques and loss designs are crucial. With the above findings, we achieve a PA-MPJPE of 47.3 mm on the 3DPW test set with a relatively simple model. More importantly, we provide strong baselines for fair comparisons of algorithms, and recommendations for building effective training configurations in the future. Codebase is available at https://github.com/smplbody/hmr-benchmarks.
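The headline number (47.3 mm PA-MPJPE on 3DPW) uses the standard Procrustes-aligned mean per-joint position error. A minimal sketch of that metric follows, assuming joints in metres; it implements the common formulation, not the released codebase.

import numpy as np

def pa_mpjpe(pred, gt):
    # Rigidly align (scale, rotation, translation) predicted joints to the
    # ground truth, then take the mean per-joint error in millimetres.
    # pred, gt: (J, 3) arrays in metres.
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)       # SVD of the cross-covariance
    if np.linalg.det((U @ Vt).T) < 0:       # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
    R = (U @ Vt).T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return float(np.linalg.norm(aligned - gt, axis=-1).mean() * 1000.0)

# Toy usage: a rotated copy of the ground truth should score ~0 mm.
gt = np.random.rand(24, 3)
c, s = np.cos(0.4), np.sin(0.4)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
print(pa_mpjpe(gt @ Rz.T, gt))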

NeurIPS Conference 2022 Conference Paper

OpenOOD: Benchmarking Generalized Out-of-Distribution Detection

  • Jingkang Yang
  • Pengyun Wang
  • Dejian Zou
  • Zitang Zhou
  • Kunyuan Ding
  • Wenxuan Peng
  • Haoqi Wang
  • Guangyao Chen

Out-of-distribution (OOD) detection is vital to safety-critical machine learning applications and has thus been extensively studied, with a plethora of methods developed in the literature. However, the field currently lacks a unified, strictly formulated, and comprehensive benchmark, which often results in unfair comparisons and inconclusive results. From the problem setting perspective, OOD detection is closely related to neighboring fields including anomaly detection (AD), open set recognition (OSR), and model uncertainty, since methods developed for one domain are often applicable to the others. To help the community improve evaluation and advance the field, we build a unified, well-structured codebase called OpenOOD, which implements over 30 methods developed in relevant fields and provides a comprehensive benchmark under the recently proposed generalized OOD detection framework. With a comprehensive comparison of these methods, we are gratified that the field has progressed significantly over the past few years, where both preprocessing methods and the orthogonal post-hoc methods show strong potential.
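As a concrete example of the post-hoc methods such a benchmark covers, below is a minimal sketch of the maximum-softmax-probability score (the classic Hendrycks and Gimpel baseline) applied to a stand-in classifier; the OpenOOD codebase exposes its methods through its own interfaces, which this does not reproduce.

import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_score(model, x):
    # Maximum softmax probability: higher score means "more in-distribution".
    logits = model(x)
    return F.softmax(logits, dim=-1).max(dim=-1).values

# Toy usage with a stand-in linear classifier over CIFAR-sized inputs.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(msp_score(model, torch.randn(4, 3, 32, 32)))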

AAAI Conference 2022 Conference Paper

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

  • Dongzhan Zhou
  • Xinchi Zhou
  • Di Hu
  • Hang Zhou
  • Lei Bai
  • Ziwei Liu
  • Wanli Ouyang

Multiple modalities can provide rich semantic information, and exploiting such information will normally lead to better performance compared with the single-modality counterpart. However, it is not easy to devise an effective cross-modal fusion structure due to the variations of feature dimensions and semantics, especially when the inputs even come from different sensors, as in the field of audio-visual learning. In this work, we propose SepFusion, a novel framework that can smoothly produce optimal fusion structures for visual sound separation. The framework is composed of two components, namely the model generator and the evaluator. To construct the generator, we devise a lightweight architecture space that can adapt to different input modalities. In this way, we can easily obtain audio-visual fusion structures according to our demands. For the evaluator, we adopt the idea of neural architecture search to select superior networks effectively. This automatic process can significantly save human effort while achieving competitive performance. Moreover, since our SepFusion provides a series of strong models, we can utilize the model family for broader applications, such as further promoting performance via model assembly, or providing suitable architectures for the separation of certain instrument classes. These potential applications further enhance the competitiveness of our approach.

AAAI Conference 2022 Conference Paper

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

  • Xian Liu
  • Rui Qian
  • Hang Zhou
  • Di Hu
  • Weiyao Lin
  • Ziwei Liu
  • Bolei Zhou
  • Xiaowei Zhou

The task of audiovisual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. However, in real-world scenarios, audios are usually contaminated by off-screen sound and background noise. They will interfere with the procedure of identifying desired sources and building visual sound connections, making previous studies inapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audiovisual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals. We thus extend the audio representation with our Audio Instance Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. Then we erase the influence of the audible but off-screen sounds and the silent but visible objects by a Cross-modal Referrer module with cross-modality distillation. Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially under real-world scenarios.

NeurIPS Conference 2021 Conference Paper

Balanced Chamfer Distance as a Comprehensive Metric for Point Cloud Completion

  • Tong Wu
  • Liang Pan
  • Junzhe Zhang
  • Tai WANG
  • Ziwei Liu
  • Dahua Lin

Chamfer Distance (CD) and Earth Mover's Distance (EMD) are two broadly adopted metrics for measuring the similarity between two point sets. However, CD is usually insensitive to mismatched local density, and EMD is usually dominated by global distribution while overlooking the fidelity of detailed structures. Besides, their unbounded value range induces a heavy influence from outliers. These defects prevent them from providing a consistent evaluation. To tackle these problems, we propose a new similarity measure named Density-aware Chamfer Distance (DCD). It is derived from CD and benefits from several desirable properties: 1) it can detect disparity of density distributions and is thus a more intensive measure of similarity compared to CD; 2) it is stricter with detailed structures and significantly more computationally efficient than EMD; 3) the bounded value range encourages a more stable and reasonable evaluation over the whole test set. We adopt DCD to evaluate the point cloud completion task, where experimental results show that DCD pays attention to both the overall structure and local geometric details and provides a more reliable evaluation even when CD and EMD contradict each other. We can also use DCD as the training loss, which outperforms the same model trained with CD loss on all three metrics. In addition, we propose a novel point discriminator module that estimates the priority for another guided down-sampling step, and it achieves noticeable improvements under DCD together with competitive results for both CD and EMD. We hope our work could pave the way for a more comprehensive and practical point cloud similarity evaluation. Our code will be available at https://github.com/wutong16/Density_aware_Chamfer_Distance.
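DCD is derived from the vanilla Chamfer Distance; as a baseline for comparison, the sketch below computes plain CD between two point sets. The density-aware weighting and bounded range that distinguish DCD are defined in the paper and are not reproduced here.

import torch

def chamfer_distance(p1, p2):
    # p1: (N, 3), p2: (M, 3). Average nearest-neighbour distance in both
    # directions; unbounded and density-insensitive, which is exactly what
    # DCD is designed to fix.
    d = torch.cdist(p1, p2)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Toy usage with random point clouds.
print(chamfer_distance(torch.rand(1024, 3), torch.rand(2048, 3)).item())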

NeurIPS Conference 2021 Conference Paper

Few-Shot Object Detection via Association and DIscrimination

  • Yuhang Cao
  • Jiaqi Wang
  • Ying Jin
  • Tong Wu
  • Kai Chen
  • Ziwei Liu
  • Dahua Lin

Object detection has achieved substantial progress in the last decade. However, detecting novel classes with only a few samples remains challenging, since deep learning under a low-data regime usually leads to a degraded feature space. Existing works employ a holistic fine-tuning paradigm to tackle this problem, where the model is first pre-trained on all base classes with abundant samples, and then it is used to carve the novel class feature space. Nonetheless, this paradigm is still imperfect. During fine-tuning, a novel class may implicitly leverage the knowledge of multiple base classes to construct its feature space, which induces a scattered feature space, hence violating the inter-class separability. To overcome these obstacles, we propose a two-step fine-tuning framework, Few-shot object detection via Association and DIscrimination (FADI), which builds up a discriminative feature space for each novel class with two integral steps. 1) In the association step, in contrast to implicitly leveraging multiple base classes, we construct a compact novel class feature space via explicitly imitating a specific base class feature space. Specifically, we associate each novel class with a base class according to their semantic similarity. After that, the feature space of a novel class can readily imitate the well-trained feature space of the associated base class. 2) In the discrimination step, to ensure the separability between the novel classes and associated base classes, we disentangle the classification branches for base and novel classes. To further enlarge the inter-class separability between all classes, a set-specialized margin loss is imposed. Extensive experiments on the standard Pascal VOC and MS-COCO datasets demonstrate that FADI achieves new state-of-the-art performance, significantly improving the baseline in any shot/split by +18.7. Notably, the advantage of FADI is most pronounced in extremely few-shot scenarios (e.g., 1- and 3-shot).
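The association step pairs each novel class with its most semantically similar base class. A minimal sketch of that pairing, assuming class-name embeddings from some word or text encoder (random stand-ins below); the paper's actual similarity measure and training pipeline are not reproduced.

import torch
import torch.nn.functional as F

def associate_classes(novel_embs, base_embs):
    # novel_embs: (Nn, D), base_embs: (Nb, D) class-name embeddings.
    # Returns, for each novel class, the index of the most similar base class.
    sims = F.normalize(novel_embs, dim=-1) @ F.normalize(base_embs, dim=-1).T
    return sims.argmax(dim=-1).tolist()

# Toy usage: 5 novel classes associated with 60 base classes.
print(associate_classes(torch.randn(5, 300), torch.randn(60, 300)))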

NeurIPS Conference 2021 Conference Paper

Garment4D: Garment Reconstruction from Point Cloud Sequences

  • Fangzhou Hong
  • Liang Pan
  • Zhongang Cai
  • Ziwei Liu

Learning to reconstruct 3D garments is important for dressing 3D human bodies of different shapes in different poses. Previous works typically rely on 2D images as input, which however suffer from scale and pose ambiguities. To circumvent the problems caused by 2D images, we propose a principled framework, Garment4D, that uses 3D point cloud sequences of dressed humans for garment reconstruction. Garment4D has three dedicated steps: sequential garments registration, canonical garment estimation, and posed garment reconstruction. The main challenges are two-fold: 1) effective 3D feature learning for fine details, and 2) capture of garment dynamics caused by the interaction between garments and the human body, especially for loose garments like skirts. To unravel these problems, we introduce a novel Proposal-Guided Hierarchical Feature Network and Iterative Graph Convolution Network, which integrate both high-level semantic features and low-level geometric features for fine detail reconstruction. Furthermore, we propose a Temporal Transformer for smooth garment motion capture. Unlike non-parametric methods, the garment meshes reconstructed by our method are separable from the human body and have strong interpretability, which is desirable for downstream tasks. As the first attempt at this task, high-quality reconstruction results are qualitatively and quantitatively illustrated through extensive experiments. Codes are available at https://github.com/hongfz16/Garment4D.

IJCAI Conference 2021 Conference Paper

Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation

  • Yasheng Sun
  • Hang Zhou
  • Ziwei Liu
  • Hideki Koike

What can we picture solely from a clip of speech? Previous research has shown the possibility of directly inferring the appearance of a person's face by listening to a voice. However, within human speech lies not only the biometric identity signal but also the identity-irrelevant information such as the talking content. Our goal is to extract as much information from a clip of speech as possible. In particular, we aim at not only inferring the face of a person but also animating it. Our key insight is to synchronize audio and visual representations from two perspectives in a style-based generative framework. Specifically, contrastive learning is leveraged to map both the identity and speech content information within the speech to visual representation spaces. Furthermore, the identity space is strengthened with class centroids. Through curriculum learning, the style-based generator is capable of automatically balancing the information from the two latent spaces. Extensive experiments show that our approach encourages better speech-identity correlation learning while generating vivid faces whose identities are consistent with given speech samples. Moreover, by leveraging the same model, these inferred faces can be driven to talk by the audio.

NeurIPS Conference 2021 Conference Paper

Unsupervised Object-Level Representation Learning from Scene Images

  • Jiahao Xie
  • Xiaohang Zhan
  • Ziwei Liu
  • Yew Soon Ong
  • Chen Change Loy

Contrastive self-supervised learning has largely narrowed the gap to supervised pre-training on ImageNet. However, its success highly relies on the object-centric priors of ImageNet, i.e., different augmented views of the same image correspond to the same object. Such a heavily curated constraint becomes immediately infeasible when pre-trained on more complex scene images with many objects. To overcome this limitation, we introduce Object-level Representation Learning (ORL), a new self-supervised learning framework towards scene images. Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence, thus realizing object-level representation learning from scene images. Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. Furthermore, ORL improves the downstream performance when more unlabeled scene images are available, demonstrating its great potential for harnessing unlabeled data in the wild. We hope our approach can motivate future research on more general-purpose unsupervised representation learning from scene data.

AAAI Conference 2019 Conference Paper

Instance-Level Facial Attributes Transfer with Geometry-Aware Flow

  • Weidong Yin
  • Ziwei Liu
  • Chen Change Loy

We address the problem of instance-level facial attribute transfer without paired training data, e.g., faithfully transferring the exact mustache from a source face to a target face. This is a more challenging task than the conventional semantic-level attribute transfer, which only preserves the generic attribute style instead of instance-level traits. We propose the use of geometry-aware flow, which serves as a well-suited representation for modeling the transformation between instance-level facial attributes. Specifically, we leverage the facial landmarks as the geometric guidance to learn the differentiable flows automatically, despite the large pose gap that exists. Geometry-aware flow is able to warp the source face attribute into the target face context and generate a warp-and-blend result. To compensate for the potential appearance gap between source and target faces, we propose a hallucination sub-network that produces an appearance residual to further refine the warp-and-blend result. Finally, a cycle-consistency framework consisting of both an attribute transfer module and an attribute removal module is designed, so that abundant unpaired face images can be used as training data. Extensive evaluations validate the capability of our approach in transferring instance-level facial attributes faithfully across large pose and appearance gaps. Thanks to the flow representation, our approach can readily be applied to generate realistic details on high-resolution images.
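The warp half of the warp-and-blend step amounts to resampling the source image with a dense flow field. A minimal sketch of such a warp follows (the flow itself would be predicted from landmarks, which is not shown); this is a generic bilinear warp, not the paper's network.

import torch
import torch.nn.functional as F

def warp_with_flow(src, flow):
    # src: (B, C, H, W) image; flow: (B, 2, H, W) pixel offsets, channels (x, y).
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()[None].expand(b, -1, -1, -1)
    coords = grid + flow
    # Normalise to [-1, 1] as grid_sample expects, (x, y) in the last dim.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(src, torch.stack((gx, gy), dim=-1), align_corners=True)

# Toy usage: zero flow returns the original image (up to interpolation error).
img = torch.rand(1, 3, 64, 64)
print(torch.allclose(warp_with_flow(img, torch.zeros(1, 2, 64, 64)), img, atol=1e-5))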

AAAI Conference 2019 Conference Paper

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

  • Hang Zhou
  • Yu Liu
  • Ziwei Liu
  • Ping Luo
  • Xiaogang Wang

Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct specific face appearance models for specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning disentangled audio-visual representation. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has an advantage where both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate that the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.

AAAI Conference 2018 Conference Paper

Mix-and-Match Tuning for Self-Supervised Semantic Segmentation

  • Xiaohang Zhan
  • Ziwei Liu
  • Ping Luo
  • Xiaoou Tang
  • Chen Loy

Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation is recently proposed to pre-train a network without any human-provided labels. The key of this new form of learning is to design a proxy task (e.g., image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representation for the target image segmentation task. Thus self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a ‘mix-and-match’ (M&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable into many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of the target image segmentation task to surpass the fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the ‘mix’ stage, which sparsely samples and mixes patches from the target set to reflect rich and diverse local patch statistics of target images. A ‘match’ stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies and no extra data or label is required. With the proposed M&M approach, for the first time, a self-supervision method can achieve comparable or even better performance compared to its ImageNet-pretrained counterpart on both the PASCAL VOC 2012 and Cityscapes datasets.
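The ‘match’ stage derives a triplet-based discriminative loss from its class-wise graph. Below is a minimal sketch of a standard margin-based triplet loss on patch features, as one plausible instantiation; the paper's graph construction and exact loss are not reproduced.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor, positive, negative: (B, D) patch features; same-class patches
    # are pulled together, different-class patches pushed apart.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random patch features.
a, p, n = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(a, p, n).item())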

AAAI Conference 2016 Conference Paper

Face Model Compression by Distilling Knowledge from Neurons

  • Ping Luo
  • Zhenyao Zhu
  • Ziwei Liu
  • Xiaogang Wang
  • Xiaoou Tang

The recent advanced face recognition systems were built on large Deep Neural Networks (DNNs) or their ensembles, which have millions of parameters. However, the expensive computation of DNNs makes their deployment difficult on mobile and embedded devices. This work addresses model compression for face recognition, where the learned knowledge of a large teacher network or its ensemble is utilized as supervision to train a compact student network. Unlike previous works that represent the knowledge by the softened label probabilities, which are difficult to fit, we represent the knowledge by using the neurons at the higher hidden layer, which preserve as much information as the label probabilities but are more compact. By leveraging the essential characteristics (domain knowledge) of the learned face representation, a neuron selection method is proposed to choose neurons that are most relevant to face recognition. Using the selected neurons as supervision to mimic the single networks of DeepID2+ and DeepID3, which are state-of-the-art face recognition systems, a compact student with a simple network structure achieves better verification accuracy on LFW than its respective teachers. When using an ensemble of DeepID2+ as the teacher, a mimicked student is able to outperform it and achieves a 51.6× compression ratio and a 90× speed-up in inference, making this cumbersome model applicable on portable devices.
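The central idea of supervising the student with a teacher's hidden neurons can be summarised as a regression onto a selected subset of teacher activations. A minimal sketch under that reading, with random stand-in activations and placeholder indices; the paper's neuron-selection criterion itself is not reproduced.

import torch
import torch.nn.functional as F

def neuron_mimic_loss(student_feats, teacher_feats, selected_idx):
    # student_feats: (B, Ds); teacher_feats: (B, Dt); selected_idx: Ds indices
    # of the teacher neurons the student is trained to reproduce.
    return F.mse_loss(student_feats, teacher_feats[:, selected_idx])

# Toy usage: a teacher layer of 512 units, a student mimicking 128 of them.
teacher = torch.randn(16, 512)
student = torch.randn(16, 128, requires_grad=True)
loss = neuron_mimic_loss(student, teacher, torch.randperm(512)[:128])
loss.backward()
print(loss.item())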