Arrow Research search

Author name cluster

Xian Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

ICLR Conference 2025 Conference Paper

3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

  • Xiao Fu
  • Xian Liu
  • Xintao Wang 0002
  • Sida Peng
  • Menghan Xia
  • Xiaoyu Shi 0002
  • Ziyang Yuan
  • Pengfei Wan 0001

This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results. However, 2D control signals are inherently limited in expressing the 3D nature of object motions. To overcome this problem, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism. In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability. To mitigate video quality degradation, we introduce a domain adaptor during training and employ an annealed sampling strategy during inference. To address the lack of suitable training data, we construct a 360-Motion Dataset, which first correlates collected 3D human and animal assets with GPT-generated trajectories and then captures their motion with 12 evenly surrounding cameras on diverse 3D UE platforms. Extensive experiments show that 3DTrajMaster sets a new state of the art in both accuracy and generalization for controlling multi-entity 3D motions. Project page: http://fuxiao0719.github.io/projects/3dtrajmaster

ICLR Conference 2025 Conference Paper

Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding

  • Yao Teng
  • Han Shi
  • Xian Liu
  • Xuefei Ning
  • Guohao Dai 0001
  • Yu Wang 0002
  • Zhenguo Li
  • Xihui Liu

The current large auto-regressive models can generate high-quality, high-resolution images, but these models require hundreds or even thousands of steps of next-token prediction during inference, resulting in substantial time consumption. In existing studies, Jacobi decoding, an iterative parallel decoding algorithm, has been used to accelerate the auto-regressive generation and can be executed without training. However, the Jacobi decoding relies on a deterministic criterion to determine the convergence of iterations. Thus, it works for greedy decoding but is incompatible with sampling-based decoding which is crucial for visual quality and diversity in the current auto-regressive text-to-image generation. In this paper, we propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation. By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding and allowing the model to generate diverse images. Specifically, SJD facilitates the model to predict multiple tokens at each step and accepts tokens based on the probabilistic criterion, enabling the model to generate images with fewer steps than the conventional next-token-prediction paradigm. We also investigate the token initialization strategies that leverage the spatial locality of visual data to further improve the acceleration ratio under specific scenarios. We conduct experiments for our proposed SJD on multiple auto-regressive text-to-image generation models, showing the effectiveness of model acceleration without sacrificing the visual quality. The code of our work is available here: https://github.com/tyshiwo1/Accelerating-T2I-AR-with-SJD/.
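The deterministic Jacobi decoding that this abstract contrasts with SJD can be illustrated with a toy sketch. It is not the authors' code: `toy_next_token` is a hypothetical stand-in for a greedy auto-regressive model, and the all-zero initial guess is an arbitrary assumption. Every position is updated in parallel from the previous iterate, and iteration stops once the sequence no longer changes, which is the deterministic convergence criterion the abstract mentions.

```python
from typing import Callable, List, Sequence, Tuple


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical deterministic 'model': maps a prefix to its greedy next token."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % 11


def sequential_decode(prompt: Sequence[int], n: int,
                      step: Callable[[Sequence[int]], int] = toy_next_token) -> List[int]:
    """Conventional next-token decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(step(out))
    return out[len(prompt):]


def jacobi_decode(prompt: Sequence[int], n: int,
                  step: Callable[[Sequence[int]], int] = toy_next_token) -> Tuple[List[int], int]:
    """Jacobi decoding: refine all n tokens at once until a fixed point is reached."""
    guess = [0] * n  # arbitrary initial guess (an assumption of this sketch)
    iterations = 0
    while True:
        iterations += 1
        # Each position i is recomputed from the PREVIOUS iterate's tokens 0..i-1,
        # so all n updates are independent and could run in one batched forward pass.
        new = [step(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:  # deterministic convergence criterion
            return new, iterations
        guess = new


tokens, iterations = jacobi_decode([1, 2], 8)
print(tokens == sequential_decode([1, 2], 8), iterations)
```

Because position i depends only on positions before it, at least one more leading token becomes final per iteration, so the loop stops after at most n + 1 iterations and returns the same tokens as sequential greedy decoding. The exact-equality test is what breaks down under sampling-based decoding, where repeated evaluations need not agree.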

ICLR Conference 2025 Conference Paper

EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

  • Jiaxiang Tang
  • Zhaoshuo Li
  • Zekun Hao
  • Xian Liu
  • Gang Zeng
  • Ming-Yu Liu 0001
  • Qinsheng Zhang

Current auto-regressive mesh generation methods suffer from issues such as incompleteness, insufficient detail, and poor generalization. In this paper, we propose an Auto-regressive Auto-encoder (ArAE) model capable of generating high-quality 3D meshes with up to 4,000 faces at a spatial resolution of $512^3$. We introduce a novel mesh tokenization algorithm that efficiently compresses triangular meshes into 1D token sequences, significantly enhancing training efficiency. Furthermore, our model compresses variable-length triangular meshes into a fixed-length latent space, enabling training latent diffusion models for better generalization. Extensive experiments demonstrate the superior quality, diversity, and generalization capabilities of our model in both point cloud and image-conditioned mesh generation tasks.

ICLR Conference 2025 Conference Paper

High-Quality Joint Image and Video Tokenization with Causal VAE

  • Dawit Mureja Argaw
  • Xian Liu
  • Qinsheng Zhang
  • Joon Son Chung
  • Ming-Yu Liu 0001
  • Fitsum Reda

Generative modeling has seen significant advancements in image and video synthesis. However, the curse of dimensionality remains a significant obstacle, especially for video generation, given its inherently complex and high-dimensional nature. Many existing works rely on low-dimensional latent spaces from pretrained image autoencoders. However, this approach overlooks temporal redundancy in videos and often leads to temporally incoherent decoding. To address this issue, we propose a video compression network that reduces the dimensionality of visual data both spatially and temporally. Our model, based on a variational autoencoder, employs causal 3D convolution to handle images and videos jointly. The key contributions of our work include a scale-agnostic encoder for preserving video fidelity, a novel spatio-temporal down/upsampling block for robust long-sequence modeling, and a flow regularization loss for accurate motion decoding. Our approach outperforms competitors in video quality and compression rates across various datasets. Experimental analyses also highlight its potential as a robust autoencoder for video generation training.

AAAI Conference 2025 Conference Paper

MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

  • Yuxuan Bian
  • Ailing Zeng
  • Xuan Ju
  • Xian Liu
  • Zhaoyang Zhang
  • Wei Liu
  • Qiang Xu

Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to process different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the text-to-motion semantic pre-training, followed by the multimodal low-level control adaptation. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.

NeurIPS Conference 2025 Conference Paper

Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

  • Yao Teng
  • Fu-Yun Wang
  • Xian Liu
  • Zhekai Chen
  • Han Shi
  • Yu Wang
  • Zhenguo Li
  • Weiyang Liu

As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
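The abstract does not spell out its probabilistic criterion. As an illustration only, the sketch below assumes it resembles the standard speculative-sampling acceptance rule: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p - q). The function name `accept_or_resample` and the dictionary-based distributions are hypothetical.

```python
import random
from typing import Dict, Tuple


def accept_or_resample(token: int, p: Dict[int, float], q: Dict[int, float],
                       rng: random.Random) -> Tuple[int, bool]:
    """Verify one drafted token.

    p: distribution from the current iteration (the target).
    q: distribution the token was drafted from, so q[token] > 0.
    Returns (token_to_keep, was_accepted).
    """
    ratio = p.get(token, 0.0) / q[token]
    if rng.random() < min(1.0, ratio):
        return token, True
    # Rejected: resample from the residual distribution max(0, p - q), renormalized.
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(residual.values())
    candidates = list(residual)
    weights = [residual[t] / total for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0], False


rng = random.Random(0)
p = {0: 0.7, 1: 0.3}
q = {0: 0.4, 1: 0.6}
draft = rng.choices(list(q), weights=list(q.values()), k=1)[0]
print(accept_or_resample(draft, p, q, rng))
```

Under this rule the kept token is distributed according to p regardless of q, which is why a criterion of this kind can preserve sampling randomness while still accepting several tokens per step.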

ICML Conference 2024 Conference Paper

E2GAN: Efficient Training of Efficient GANs for Image-to-Image Translation

  • Yifan Gong 0004
  • Zheng Zhan 0001
  • Qing Jin
  • Yanyu Li
  • Yerlan Idelbayev
  • Xian Liu
  • Andrey Zharkov
  • Kfir Aberman

One highly promising direction for enabling flexible real-time on-device image editing is utilizing data distillation by leveraging large-scale text-to-image diffusion models to generate paired datasets used for training generative adversarial networks (GANs). This approach notably alleviates the stringent requirements typically imposed by high-end commercial GPUs for performing image editing with diffusion models. However, unlike text-to-image diffusion models, each distilled GAN is specialized for a specific image editing task, necessitating costly training efforts to obtain models for various concepts. In this work, we introduce and address a novel research direction: can the process of distilling GANs from diffusion models be made significantly more efficient? To achieve this goal, we propose a series of innovative techniques. First, we construct a base GAN model with generalized features, adaptable to different concepts through fine-tuning, eliminating the need for training from scratch. Second, we identify crucial layers within the base GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet effective rank search process, rather than fine-tuning the entire base model. Third, we investigate the minimal amount of data necessary for fine-tuning, further reducing the overall training time. Extensive experiments show that we can efficiently empower GANs with the ability to perform real-time high-quality image editing on mobile devices with remarkably reduced training and storage costs for each concept.
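To make the storage argument for Low-Rank Adaptation concrete, here is a minimal sketch. It assumes a single dense layer with weight W of shape d_out x d_in, not the GAN layers or rank search used in the paper; all function names are hypothetical. LoRA keeps W frozen and trains only a rank-r update B·A, so the per-concept parameter count drops from d_in·d_out to r·(d_in + d_out).

```python
from typing import List

Matrix = List[List[float]]


def full_params(d_in: int, d_out: int) -> int:
    """Parameters trained when the whole dense layer is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters trained with LoRA: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: List[float]) -> List[float]:
    """y = W x + B (A x): frozen base output plus the low-rank correction."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [u + v for u, v in zip(base, update)]


print(full_params(512, 512), lora_params(512, 512, 4))
```

For a 512 x 512 layer at rank 4, the adapter holds 4,096 values against 262,144 for the full weight, which gives a sense of the per-concept storage reduction the abstract refers to.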

ICLR Conference 2024 Conference Paper

HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion

  • Xian Liu
  • Jian Ren 0005
  • Aliaksandr Siarohin
  • Ivan Skorokhodov
  • Yanyu Li
  • Dahua Lin
  • Xihui Liu
  • Ziwei Liu 0002

Despite significant advances in large-scale text-to-image models, achieving hyper-realistic human image generation remains a desirable yet unsolved task. Existing models like Stable Diffusion and DALL·E 2 tend to generate human images with incoherent parts or unnatural poses. To tackle these challenges, our key insight is that the human image is inherently structural over multiple granularities, from the coarse-level body skeleton to fine-grained spatial geometry. Therefore, capturing such correlations between the explicit appearance and latent structure in one model is essential to generate coherent and natural human images. To this end, we propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts. Specifically, 1) we first build a large-scale human-centric dataset, named HumanVerse, which consists of 340M images with comprehensive annotations like human pose, depth, and surface normal. 2) Next, we propose a Latent Structural Diffusion Model that simultaneously denoises the depth and surface normal along with the synthesized RGB image. Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network, where each branch in the model complements the others with both structural awareness and textural richness. 3) Finally, to further boost the visual quality, we propose a Structure-Guided Refiner to compose the predicted conditions for more detailed generation of higher resolution. Extensive experiments demonstrate that our framework yields state-of-the-art performance, generating hyper-realistic human images under diverse scenarios.

TMLR Journal 2023 Journal Article

ChemSpacE: Interpretable and Interactive Chemical Space Exploration

  • Yuanqi Du
  • Xian Liu
  • Nilay Mahesh Shah
  • Shengchao Liu
  • Jieyu Zhang
  • Bolei Zhou

Discovering meaningful molecules in the vast combinatorial chemical space has been a long-standing challenge in many fields, from materials science to drug design. Recent progress in machine learning, especially with generative models, shows great promise for automated molecule synthesis. Nevertheless, most molecule generative models remain black boxes, whose utilities are limited by a lack of interpretability and human participation in the generation process. In this work, we propose Chemical Space Explorer (ChemSpacE), a simple yet effective method for exploring the chemical space with pre-trained deep generative models. Our method enables users to interact with existing generative models and steer the molecule generation process. We demonstrate the efficacy of ChemSpacE on the molecule optimization task and the latent molecule manipulation task in single-property and multi-property settings. On the molecule optimization task, the performance of ChemSpacE is on par with previous black-box optimization methods yet is considerably faster and more sample efficient. Furthermore, the interface from ChemSpacE facilitates human-in-the-loop chemical space exploration and interactive molecule design. Code and demo are available at https://github.com/yuanqidu/ChemSpacE.

NeurIPS Conference 2022 Conference Paper

Audio-Driven Co-Speech Gesture Video Generation

  • Xian Liu
  • Qianyi Wu
  • Hang Zhou
  • Yuanqi Du
  • Wayne Wu
  • Dahua Lin
  • Ziwei Liu

Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequence driven by speech audio. Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representation to codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found at: https://alvinliu0.github.io/projects/ANGIE

AAAI Conference 2022 Conference Paper

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

  • Xian Liu
  • Rui Qian
  • Hang Zhou
  • Di Hu
  • Weiyao Lin
  • Ziwei Liu
  • Bolei Zhou
  • Xiaowei Zhou

The task of audiovisual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. However, in real-world scenarios, audio is usually contaminated by off-screen sound and background noise. These interfere with the procedure of identifying desired sources and building visual-sound connections, making previous studies inapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audiovisual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals. We thus extend the audio representation with our Audio-Instance-Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. Then we erase the influence of the audible but off-screen sounds and the silent but visible objects by a Cross-modal Referrer module with cross-modality distillation. Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially under real-world scenarios.

YNICL Journal 2019 Journal Article

Transcutaneous auricular vagus nerve stimulation at 1 Hz modulates locus coeruleus activity and resting state functional connectivity in patients with migraine: An fMRI study

  • Yue Zhang
  • Jiao Liu
  • Hui Li
  • Zhaoxian Yan
  • Xian Liu
  • Jin Cao
  • Joel Park
  • Georgia Wilson

BACKGROUND: Migraine is a common episodic neurological disorder. Literature has shown that transcutaneous auricular vagus nerve stimulation (taVNS) at 1 Hz can significantly relieve migraine symptoms. However, its underlying mechanism remains unclear. This study aims to investigate the neural pathways associated with taVNS treatment of migraine. METHODS: Twenty-nine patients with migraine were recruited from outpatient neurology clinics. Each patient attended two magnetic resonance imaging/functional magnetic resonance imaging (MRI/fMRI) scan sessions separated by one week. Each session included a pre-stimulation resting state fMRI scan, fMRI scans during real or sham 1 Hz taVNS (with block design), and a post-stimulation resting state fMRI scan. RESULTS: Twenty-six patients were included in the final analyses. Real taVNS evoked fMRI signal decreases in brain areas belonging to the default mode network (DMN) and brain stem areas including the locus coeruleus (LC), raphe nuclei, parabrachial nucleus, and solitary nucleus. Sham taVNS evoked fMRI signal decreases in brain areas belonging to the DMN. Compared to sham taVNS, real taVNS produced greater deactivation at the bilateral LC. Resting state functional connectivity (rsFC) analysis showed that after taVNS, LC rsFC with the right temporoparietal junction and left secondary somatosensory cortex (S2) significantly increased compared to sham taVNS. The increased rsFC of the left LC-left S2 was significantly negatively associated with the frequency of migraine attacks during the preceding month. CONCLUSION: Our results suggest that taVNS at 1 Hz can significantly modulate activity/connectivity of brain regions associated with the vagus nerve central pathway and pain modulation system, which may shed light on the neural mechanisms underlying taVNS treatment of migraine.

YNICL Journal 2016 Journal Article

Repeated acupuncture treatments modulate amygdala resting state functional connectivity of depressive patients

  • Xiaoyun Wang
  • Zengjian Wang
  • Jian Liu
  • Jun Chen
  • Xian Liu
  • Guangning Nie
  • Joon-Seok Byun
  • Yilin Liang

As a widely applied alternative therapy, acupuncture is gaining popularity in Western society. One challenge that remains, however, is incorporating it into mainstream medicine. One solution is to combine acupuncture with other conventional, mainstream treatments. In this study, we investigated the combination effect of acupuncture and the antidepressant fluoxetine, as well as its underlying mechanism using resting state functional connectivity (rsFC) in patients with major depressive disorders. Forty-six female depressed patients were randomized into a verum acupuncture plus fluoxetine or a sham acupuncture plus fluoxetine group for eight weeks. Resting-state fMRI data was collected before the first and last treatments. Results showed that compared with those in the sham acupuncture treatment, verum acupuncture treatment patients showed 1) greater clinical improvement as indicated by Montgomery-Åsberg Depression Rating Scale (MADRS) and Self-Rating Depression Scale (SDS) scores; 2) increased rsFC between the left amygdala and subgenual anterior cingulate cortex (sgACC)/pregenual anterior cingulate cortex (pgACC); 3) increased rsFC between the right amygdala and left parahippocampus (Para)/putamen (Pu). The strength of the amygdala-sgACC/pgACC rsFC was positively associated with corresponding clinical improvement (as indicated by a negative correlation with MADRS and SDS scores). Our findings demonstrate the additive effect of acupuncture to antidepressant treatment and suggest that this effect may be achieved through the limbic system, especially the amygdala and the ACC.