Arrow Research search

Author name cluster

Hang Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers
2 author rows

Possible papers

34

AAAI Conference 2026 Conference Paper

GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning

  • Chenglong Wang
  • Yongyu Mu
  • Hang Zhou
  • Yifu Huo
  • Ziming Zhu
  • Jiali Zeng
  • Murun Yang
  • Bei Li

Major progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs to generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short in instilling explicit reasoning capabilities into reward models. To bridge this gap, we propose a self-training approach that can leverage unlabeled data to scale up reward reasoning in reward models. Based on this approach, we develop GRAM-R², a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R² can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as policy optimization and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R² consistently delivers strong performance, outperforming several strong discriminative and generative baselines.

AAAI Conference 2026 Conference Paper

Inter-Client Dependency Recovery with Hidden Global Components for Federated Traffic Prediction

  • Hang Zhou
  • Wentao Yu
  • Yang Wei
  • Guangyu Li
  • Sha Xu
  • Chen Gong

Traffic prediction plays an important role in urban management. However, existing methods rely on centralized traffic data, which may raise privacy concerns. Federated traffic prediction offers a promising solution for clients (e.g., traffic management administrations) in different regions to collaboratively train models in a distributed manner without exposing private data. Nonetheless, data isolation inherently breaks the correlations between nodes (i.e., traffic sensors collecting data) from different regions, which leads to the missing inter-client dependency. Consequently, current works either fail to capture the missing inter-client dependency or compromise data privacy to recover it. To address this issue, we propose a novel Federated method which recovers the inter-client dependency with HIdden global componeNTs (FedHINT). We find that the traffic data from different local regions actually contain hidden global components that reflect cross-regional traffic changes. Therefore, our FedHINT aims to extract hidden global components from each client to generate proxy nodes that represent global information, which are then utilized to recover the inter-client dependency. To be specific, we employ an attention module, guided by shared global queries to capture hidden global components from local traffic data, to generate proxy nodes. Subsequently, our FedHINT adaptively learns the correlations between proxy nodes and local nodes through a global encoder. During this process, the global information in proxy nodes compensates for the loss of information from cross-regional nodes, thereby recovering the missing inter-client dependency. Extensive experiments on multiple datasets demonstrate that our FedHINT significantly outperforms the state-of-the-art methods, with an average decrease of 3.73 and 4.81 in MAE and RMSE, respectively.
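As a rough illustration of the shared-global-query idea described above, the sketch below builds proxy-node embeddings by letting globally shared query vectors attend over one client's local node features; the module name, dimensions, and the absence of any federated synchronization logic are simplifying assumptions, not the authors' released code.

import torch
import torch.nn as nn

class GlobalComponentExtractor(nn.Module):
    """Toy cross-attention with shared global queries (illustrative only)."""
    def __init__(self, dim: int, num_proxies: int):
        super().__init__()
        # In a federated setting these queries would be kept identical across
        # clients by the server; here they are simply a learnable parameter.
        self.global_queries = nn.Parameter(torch.randn(num_proxies, dim))
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, local_nodes: torch.Tensor) -> torch.Tensor:
        # local_nodes: (num_local_nodes, dim) features from one client's sensors
        k, v = self.key(local_nodes), self.value(local_nodes)
        attn = torch.softmax(self.global_queries @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # (num_proxies, dim) proxy-node embeddings

proxies = GlobalComponentExtractor(dim=64, num_proxies=8)(torch.randn(207, 64))
print(proxies.shape)  # torch.Size([8, 64])

In FedHINT the proxy nodes would then be passed, together with the local nodes, to the global encoder that learns their correlations; that part is omitted here.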

ICML Conference 2025 Conference Paper

LLM Data Selection and Utilization via Dynamic Bi-level Optimization

  • Yang Yu 0056
  • Kai Han 0002
  • Hang Zhou
  • Yehui Tang 0001
  • Kaiqi Huang
  • Yunhe Wang 0001
  • Dacheng Tao

While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic interactions between model training and data. In this paper, we propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model’s data preferences evolve throughout training, providing new insights into the data preference of the model during training.
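The bi-level update can be sketched as follows; the tiny linear model stands in for the LLM, the per-example weights come from a small weighting network, and the one-step differentiable inner update is a common simplification rather than the paper's exact procedure (assumes PyTorch 2.x for torch.func).

import torch
import torch.nn as nn
from torch.func import functional_call

model = nn.Linear(10, 1)                                    # stand-in for the LLM
weight_net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
opt_model = torch.optim.SGD(model.parameters(), lr=1e-2)
opt_w = torch.optim.Adam(weight_net.parameters(), lr=1e-3)
per_example = nn.MSELoss(reduction="none")

def weighted_loss(params, x, y, w):
    pred = functional_call(model, params, (x,))
    return (w * per_example(pred, y).squeeze(-1)).mean()

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)          # training batch
    xm, ym = torch.randn(32, 10), torch.randn(32, 1)        # held-out meta batch
    params = dict(model.named_parameters())
    w = torch.softmax(weight_net(x).squeeze(-1), dim=0) * x.shape[0]  # batch weights

    # Inner step: one differentiable SGD update of the model under the weights.
    inner = weighted_loss(params, x, y, w)
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    updated = {k: p - 1e-2 * g for (k, p), g in zip(params.items(), grads)}

    # Outer step: update the weighting network through the inner update.
    meta_loss = per_example(functional_call(model, updated, (xm,)), ym).mean()
    opt_w.zero_grad()
    meta_loss.backward()
    opt_w.step()

    # Real (non-differentiable) model step with the weights detached.
    opt_model.zero_grad()
    weighted_loss(params, x, y, w.detach()).backward()
    opt_model.step()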

NeurIPS Conference 2025 Conference Paper

MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization

  • Chenglong Wang
  • Yang Gan
  • Hang Zhou
  • Chi Hu
  • Yongyu Mu
  • Kai Song
  • Murun Yang
  • Bei Li

Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture the token correlation. In this paper, we define two types of token correlation: intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process. More specifically, our MRO approach leverages test-time scaling, rejection sampling, and reinforcement learning to directly optimize the token correlation with multiple elaborate rewards. Additionally, we introduce group-step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.

NeurIPS Conference 2025 Conference Paper

PhySense: Sensor Placement Optimization for Accurate Physics Sensing

  • Yuezhou Ma
  • Haixu Wu
  • Hang Zhou
  • Huikun Weng
  • Jianmin Wang
  • Mingsheng Long

Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placements, leaving the mutual enhancement between reconstruction and placement unexploited. To change this suboptimal practice, we propose PhySense, a synergistic two-stage framework that learns to jointly reconstruct physical fields and to optimize sensor placements, both aiming for accurate physics sensing. The first stage involves a flow-based generative model enhanced by cross-attention to adaptively fuse sparse observations. Leveraging the reconstruction feedback, the second stage performs sensor placement via projected gradient descent to satisfy spatial constraints. We further prove that the learning objectives of the two stages are consistent with classical variance-minimization principles, providing theoretical guarantees. Extensive experiments across three challenging benchmarks, especially a 3D geometry dataset, indicate PhySense achieves state-of-the-art physics sensing accuracy and discovers informative sensor placements previously unconsidered. Code is available at this repository: https://github.com/thuml/PhySense.
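The placement stage's projected gradient descent can be illustrated in a few lines; the loss below is only a placeholder for the stage-one model's reconstruction error, and the unit box is an assumed spatial constraint.

import torch

def placement_loss(positions):
    # Placeholder for the reconstruction error reported back by the stage-one
    # flow-based model; any differentiable function of the positions works here.
    field = torch.sin(4 * torch.pi * positions).sum(dim=-1)
    return (field ** 2).mean()

positions = torch.rand(16, 2, requires_grad=True)   # 16 sensors in a 2D unit box
opt = torch.optim.SGD([positions], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    placement_loss(positions).backward()
    opt.step()
    with torch.no_grad():
        positions.clamp_(0.0, 1.0)                   # projection onto the admissible domain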

ICRA Conference 2025 Conference Paper

Real-World Automated Vehicle Longitudinal Stability Analysis: Controller Design and Field Test

  • Ke Ma
  • Yuqin Zhang
  • Hang Zhou
  • Zhaohui Liang
  • Xiaopeng Li 0020

Although extensive research has been conducted on modeling the stable longitudinal controller of automated vehicles (AVs) to dampen traffic oscillations, the real-world performance of these controllers in actual vehicles remains uncertain. In the operation of real-world AVs, the delay between the actual dynamics and the commands prevents the controller's command from being effectively implemented to dampen traffic oscillations. Thus, this study adapts the designed controllers within an AV test platform to compare the theoretically stable conditions with the actual oscillation-dampening performance. Initially, we compute the stable conditions for both the traditional car-following controller, which assumes no delay, and the longitudinal controller that accounts for the dynamic response of the vehicle. Through empirical experiments, we demonstrate that the longitudinal controller predicts vehicle stability more accurately than the conventional car-following controller, showing an improvement from an average prediction accuracy rate of 0.59 to 0.91. Also, the experiments uncover specific delays inherent in the dynamics system, with a response delay of 0.34 seconds. Our work makes two principal contributions to the field of AV control systems. First, it empirically validates that the longitudinal model, which accounts for the vehicle's dynamic responses, offers a more precise representation of vehicular behavior. Second, the relatively brief response delay identified expands the stability region, thereby enhancing vehicle control and safety. The longitudinal controller is critical for enhancing AV performance and reliability in dampening traffic oscillations.
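To make the role of the response delay concrete, the toy simulation below passes a linear car-following command through a first-order actuator lag; the gains and desired spacing are illustrative assumptions, and only the 0.34 s delay is taken from the abstract.

import numpy as np

dt, T = 0.05, 60.0
tau = 0.34                      # first-order response delay reported above (s)
k_s, k_v = 0.4, 0.6             # illustrative car-following gains (assumption)
s_star = 20.0                   # desired spacing in metres (assumption)

t = np.arange(0, T, dt)
v_lead = 15.0 + 2.0 * np.sin(0.2 * t)        # oscillating leader speed
x_lead = np.cumsum(v_lead) * dt

x, v, a = x_lead[0] - s_star, v_lead[0], 0.0
gaps = []
for i in range(len(t)):
    spacing = x_lead[i] - x
    u = k_s * (spacing - s_star) + k_v * (v_lead[i] - v)   # commanded acceleration
    a += dt * (u - a) / tau                                # actuator lag dynamics
    v += dt * a
    x += dt * v
    gaps.append(spacing)

print(f"late-run spacing oscillation amplitude: {np.ptp(gaps[len(t)//2:]):.2f} m")

A stability analysis that ignores tau would judge damping from the command u alone; the abstract's point is that the extra lag changes which gain combinations actually dampen oscillations.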

ICML Conference 2025 Conference Paper

Transolver++: An Accurate Neural Solver for PDEs on Million-Scale Geometries

  • Huakun Luo
  • Haixu Wu
  • Hang Zhou
  • Lanxiang Xing
  • Yichen Di
  • Jianmin Wang 0001
  • Mingsheng Long

Although deep models have been widely explored in solving partial differential equations (PDEs), previous works are primarily limited to data with only up to tens of thousands of mesh points, far from the million-point scale required by industrial simulations that involve complex geometries. In the spirit of advancing neural PDE solvers to real industrial applications, we present Transolver++, a highly parallel and efficient neural solver that can accurately solve PDEs on million-scale geometries. Building upon previous advancements in solving PDEs by learning physical states via Transolver, Transolver++ is further equipped with an extremely optimized parallelism framework and a local adaptive mechanism to efficiently capture eidetic physical states from massive mesh points, successfully tackling the thorny challenges in computation and physics learning when scaling up the input mesh size. Transolver++ increases the single-GPU input capacity to million-scale points for the first time and is capable of continuously scaling the input size with linear complexity by adding more GPUs. Experimentally, Transolver++ yields a 13% relative improvement across six standard PDE benchmarks and achieves over 20% performance gain in million-scale high-fidelity industrial simulations, whose sizes are 100× larger than previous benchmarks, covering car and 3D aircraft designs.

ICML Conference 2025 Conference Paper

Unisolver: PDE-Conditional Transformers Towards Universal Neural PDE Solvers

  • Hang Zhou
  • Yuezhou Ma
  • Haixu Wu
  • Haowen Wang
  • Mingsheng Long

Deep models have recently emerged as promising tools to solve partial differential equations (PDEs), known as neural PDE solvers. While neural solvers trained from either simulation data or physics-informed loss can solve PDEs reasonably well, they are mainly restricted to a few instances of PDEs, e.g., a certain equation with a limited set of coefficients. This limits their generalization to diverse PDEs, preventing them from being practical surrogate models of numerical solvers. In this paper, we present Unisolver, a novel Transformer model trained on diverse data and conditioned on diverse PDEs, aiming towards a universal neural PDE solver capable of solving a wide scope of PDEs. Instead of purely scaling up data and parameters, Unisolver stems from the theoretical analysis of the PDE-solving process. Inspired by the mathematical structure of PDEs, in which a PDE solution is fundamentally governed by a series of PDE components such as equation symbols and boundary conditions, we define a complete set of PDE components and flexibly embed them as domain-wise and point-wise deep conditions for Transformer PDE solvers. Integrating physical insights with recent Transformer advances, Unisolver achieves consistent state-of-the-art on three challenging large-scale benchmarks, showing impressive performance and generalizability. Code is available at https://github.com/thuml/Unisolver.

AAAI Conference 2025 Conference Paper

Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model

  • Hang Zhou
  • Jiale Cai
  • Yuteng Ye
  • Yonghui Feng
  • Chenxing Gao
  • Junqing Yu
  • Zikai Song
  • Wei Yang

A recent endeavor in one class of video anomaly detection is to leverage diffusion models and cast the task as a generation problem, where the diffusion model is trained to recover normal patterns exclusively, thus reporting abnormal patterns as outliers. Yet, existing attempts neglect the various forms of anomalies and predict normal samples at the feature level, even though abnormal objects in surveillance videos are often relatively small. To address this, a novel patch-based diffusion model is proposed, specifically engineered to capture fine-grained local information. We further observe that anomalies in videos manifest themselves as deviations in both appearance and motion. Therefore, we argue that a comprehensive solution must consider both of these aspects simultaneously to achieve accurate frame prediction. To this end, we introduce innovative motion and appearance conditions that are seamlessly integrated into our patch diffusion model. These conditions are designed to guide the model in generating coherent and contextually appropriate predictions for both semantic content and motion relations. Experimental results on four challenging video anomaly detection datasets empirically substantiate the efficacy of our proposed approach, demonstrating that it consistently outperforms most existing methods in detecting abnormal behaviors.

AAAI Conference 2024 Conference Paper

Attacking Transformers with Feature Diversity Adversarial Perturbation

  • Chenxing Gao
  • Hang Zhou
  • Junqing Yu
  • Yuteng Ye
  • Jiale Cai
  • Junle Wang
  • Wei Yang

Understanding the mechanisms behind Vision Transformer (ViT), particularly its vulnerability to adversarial perturbations, is crucial for addressing challenges in its real-world applications. Existing ViT adversarial attackers rely on labels to calculate the gradient for perturbation, and exhibit low transferability to other structures and tasks. In this paper, we present a label-free white-box attack approach for ViT-based models that exhibits strong transferability to various black-box models, including most ViT variants, CNNs, and MLPs, even for models developed for other modalities. Our inspiration comes from the feature collapse phenomenon in ViTs, where the critical attention mechanism overly depends on the low-frequency component of features, causing the features in middle-to-end layers to become increasingly similar and eventually collapse. We propose the feature diversity attacker to naturally accelerate this process and achieve remarkable performance and transferability.

NeurIPS Conference 2024 Conference Paper

Coupled Mamba: Enhanced Multimodal Fusion with Coupled State Space Model

  • Wenbing Li
  • Hang Zhou
  • Junqing Yu
  • Zikai Song
  • Wei Yang

The essence of multi-modal fusion lies in exploiting the complementary information inherent in diverse modalities. However, most prevalent fusion methods rely on traditional neural architectures and are inadequately equipped to capture the dynamics of interactions across modalities, particularly in the presence of complex intra- and inter-modality correlations. Recent advancements in State Space Models (SSMs), notably exemplified by the Mamba model, have emerged as promising contenders. In particular, the state evolution process of SSMs implies a stronger modality-fusion paradigm, making multi-modal fusion on SSMs an appealing direction. However, fusing multiple modalities is challenging for SSMs due to their hardware-aware parallelism designs. To this end, this paper proposes the Coupled SSM model for coupling the state chains of multiple modalities while maintaining the independence of intra-modality state processes. Specifically, in our coupled scheme, we devise an inter-modal hidden-state transition scheme in which the current state depends on the states of its own chain and those of the neighbouring chains at the previous time step. To fully comply with hardware-aware parallelism, we obtain the global convolution kernel by deriving the state equation while introducing the historical state. Extensive experiments on CMU-MOSEI, CH-SIMS, and CH-SIMSV2 with multi-domain inputs verify the effectiveness of our model compared to current state-of-the-art methods, improving F1-Score by 0.4%, 0.9%, and 2.3% on the three datasets respectively, with 49% faster inference and 83.7% lower GPU memory usage. The results demonstrate that the Coupled Mamba model is capable of enhanced multi-modal fusion.
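The inter-modal hidden-state transition can be written as a simple coupled linear recurrence, where each modality's state depends on its own previous state and the previous states of the other chains; this sequential sketch uses random stand-in matrices and ignores the global-convolution formulation the paper derives to preserve hardware-aware parallelism.

import numpy as np

rng = np.random.default_rng(0)
M, d, T = 3, 8, 50                                  # modalities, state dim, sequence length
A = [0.9 * np.eye(d) for _ in range(M)]             # intra-modality transition
C = 0.05 * rng.standard_normal((M, M, d, d))        # inter-modality coupling (assumption)
B = [rng.standard_normal((d, d)) for _ in range(M)]
x = rng.standard_normal((M, T, d))                  # per-modality input sequences

h = np.zeros((M, d))
for t in range(T):
    h_prev = h.copy()
    for m in range(M):
        coupling = sum(C[m, n] @ h_prev[n] for n in range(M) if n != m)
        h[m] = A[m] @ h_prev[m] + coupling + B[m] @ x[m, t]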

AAAI Conference 2024 Conference Paper

Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification

  • Yuteng Ye
  • Hang Zhou
  • Jiale Cai
  • Chenxing Gao
  • Youjia Zhang
  • Junle Wang
  • Qiang Hu
  • Junqing Yu

Occluded person re-identification (ReID) is a challenging problem due to contamination from occluders. Existing approaches address the issue with prior-knowledge cues, such as human body key points and semantic segmentations, which easily fail in the presence of heavy occlusion and other humans as occluders. In this paper, we propose a feature pruning and consolidation (FPC) framework to circumvent explicit human structure parsing. The framework mainly consists of a sparse encoder, a multi-view feature matching module, and a feature consolidation decoder. Specifically, the sparse encoder drops less important image tokens, mostly related to background noise and occluders, solely based on correlation within the class token attention. Subsequently, the matching stage relies on the preserved tokens produced by the sparse encoder to identify k-nearest neighbors in the gallery by measuring the combined image- and patch-level similarity. Finally, we use the feature consolidation module to compensate for pruned features using the identified neighbors, recovering essential information while disregarding disturbance from noise and occlusion. Experimental results demonstrate the effectiveness of our proposed framework on occluded, partial, and holistic Re-ID datasets. In particular, our method outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset.
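A minimal sketch of pruning tokens by class-token attention, assuming a standard ViT layout with the class token first; the keep ratio and the averaging over heads are illustrative choices, not the authors' implementation.

import torch

def prune_by_cls_attention(tokens, attn, keep_ratio=0.5):
    # tokens: (B, 1 + N, D) with the class token first
    # attn:   (B, heads, 1 + N, 1 + N) attention weights from a ViT block
    cls_to_patch = attn[:, :, 0, 1:].mean(dim=1)            # (B, N) importance scores
    keep = max(1, int(keep_ratio * cls_to_patch.shape[1]))
    idx = cls_to_patch.topk(keep, dim=1).indices            # indices of retained patches
    patches = torch.gather(tokens[:, 1:], 1,
                           idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return torch.cat([tokens[:, :1], patches], dim=1)       # class token + kept patches

tokens = torch.randn(2, 1 + 196, 768)
attn = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)
print(prune_by_cls_attention(tokens, attn).shape)           # torch.Size([2, 99, 768])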

NeurIPS Conference 2024 Conference Paper

EGODE: An Event-attended Graph ODE Framework for Modeling Rigid Dynamics

  • Jingyang Yuan
  • Gongbo Sun
  • Zhiping Xiao
  • Hang Zhou
  • Xiao Luo
  • Junyu Luo
  • Yusheng Zhao
  • Wei Ju

This paper studies the problem of rigid dynamics modeling, which has a wide range of applications in robotics, graphics, and mechanical design. The problem is partly solved by graph neural network (GNN) simulators. However, these approaches cannot effectively handle the relationship between intrinsic continuity and instantaneous changes in rigid dynamics. Moreover, they usually neglect hierarchical structures across mesh nodes and objects in systems. In this paper, we propose a novel approach named Event-attended Graph ODE (EGODE) for effective rigid dynamics modeling. In particular, we describe the rigid system using both mesh node representations and object representations. To model continuous dynamics across hierarchical structures, we use a coupled graph ODE framework for the evolution of both types of representations over a long period. In addition, to capture instantaneous changes during collisions, we introduce an event module, which can effectively estimate the occurrence of a collision and update the states of both mesh node and object representations during evolution. Extensive experiments on a range of benchmark datasets validate the superiority of the proposed EGODE compared to various state-of-the-art baselines. The source code can be found at https://github.com/yuanjypku/EGODE.

AAAI Conference 2024 Conference Paper

ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation

  • Chenglong Wang
  • Hang Zhou
  • Yimin Hu
  • Yifu Huo
  • Bei Li
  • Tongran Liu
  • Tong Xiao
  • Jingbo Zhu

Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (e.g., BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences. This is a computational challenge in practical sequence generation problems, such as machine translation, where we often deal with a large action space (e.g., a vocabulary) and a long action sequence (e.g., a translation). In this work, we introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL. We experiment with our approaches on traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) by training a large language model using the reward model. Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption. Notably, ESRL yields consistent performance gains over the strong REINFORCE, minimum risk training, and proximal policy optimization methods. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/examples/esrl.

AAAI Conference 2024 Conference Paper

Progressive Text-to-Image Diffusion with Soft Latent Direction

  • Yuteng Ye
  • Jiale Cai
  • Hang Zhou
  • Guanwen Li
  • Youjia Zhang
  • Zikai Song
  • Chenxing Gao
  • Junqing Yu

In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations—namely insertion, editing, and erasing—we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards.

NeurIPS Conference 2024 Conference Paper

ShowMaker: Creating High-Fidelity 2D Human Video via Fine-Grained Diffusion Modeling

  • Quanwei Yang
  • Jiazhi Guan
  • Kaisiyuan Wang
  • Lingyun Yu
  • Wenqing Chu
  • Hang Zhou
  • Zhiqiang Feng
  • Haocheng Feng

Although significant progress has been made in human video generation, most previous studies focus on either human facial animation or full-body animation, which cannot be directly applied to produce realistic conversational human videos with frequent hand gestures and various facial movements simultaneously. To address these limitations, we propose a 2D human video generation framework, named ShowMaker, capable of generating high-fidelity half-body conversational videos via fine-grained diffusion modeling. We leverage dual-stream diffusion models as the backbone of our framework and carefully design two novel components for crucial local regions (i.e., hands and face) that can be easily integrated into our backbone. Specifically, to handle the challenging hand generation caused by sparse motion guidance, we propose a novel Key Point-based Fine-grained Hand Modeling module by amplifying positional information from raw hand key points and constructing a corresponding key point-based codebook. Moreover, to restore richer facial details in generated results, we introduce a Face Recapture module, which extracts facial texture features and global identity features from the aligned human face and integrates them into the diffusion process for face enhancement. Extensive quantitative and qualitative experiments demonstrate the superior visual quality and temporal consistency of our method.

NeurIPS Conference 2024 Conference Paper

Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning

  • Hang Zhou
  • Yehui Tang
  • Haochen Qin
  • Yujie Yang
  • Renren Jin
  • Deyi Xiong
  • Kai Han
  • Yunhe Wang

The efficacy of large language models (LLMs) on downstream tasks usually hinges on instruction tuning, which relies critically on the quality of training data. Unfortunately, collecting high-quality and diverse data is both expensive and time-consuming. To mitigate this issue, we propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets through multi-agent collaboration and assessment. The framework adopts a three-pronged strategy. It initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. Subsequently, the generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality. Finally, the above process evolves in a dynamic refinement phase, where more effective LLMs are prioritized, enhancing the overall data quality. Our empirical studies, including instruction tuning experiments with models such as Pythia and LLaMA, demonstrate the effectiveness of the proposed framework. Optimized datasets have achieved substantial improvements, with an average increase of 12% and notable gains in specific metrics, such as a 40% improvement in Fermi, as evidenced by benchmarks like MT-bench, Vicuna bench, and the WizardLM testset. Codes will be released soon.

AAAI Conference 2023 Short Paper

Disentangling the Benefits of Self-Supervised Learning to Deployment-Driven Downstream Tasks of Satellite Images (Student Abstract)

  • Zhuo Deng
  • Yibing Wei
  • Mingye Zhu
  • Xueliang Wang
  • Junchi Zhou
  • Zhicheng Yang
  • Hang Zhou
  • Zhenjie Cao

In this paper, we investigate the benefits of self-supervised learning (SSL) to downstream tasks of satellite images. Unlike common student academic projects, this work focuses on the advantages of SSL for deployment-driven tasks with specific scenarios involving low- or high-spatial-resolution images. Our preliminary experiments demonstrate the robust benefits of SSL trained on medium-resolution (10m) images for both a low-resolution (100m) scene classification case (4.25%↑) and a very high-resolution (5cm) aerial image segmentation case (1.96%↑).

AAAI Conference 2023 Conference Paper

Dual Memory Units with Uncertainty Regulation for Weakly Supervised Video Anomaly Detection

  • Hang Zhou
  • Junqing Yu
  • Wei Yang

Learning discriminative features for effectively separating abnormal events from normality is crucial for weakly supervised video anomaly detection (WS-VAD) tasks. Existing approaches, both video- and segment-level label oriented, mainly focus on extracting representations for anomaly data while neglecting the implication of normal data. We observe that such a scheme is sub-optimal, i.e., to better distinguish anomalies one needs to understand what a normal state is, and it may yield a higher false alarm rate. To address this issue, we propose an Uncertainty Regulated Dual Memory Units (UR-DMU) model to learn both the representations of normal data and discriminative features of abnormal data. To be specific, inspired by the traditional global and local structure of graph convolutional networks, we introduce a Global and Local Multi-Head Self Attention (GL-MHSA) module for the Transformer network to obtain more expressive embeddings for capturing associations in videos. Then, we use two memory banks, with one additional abnormal memory for tackling hard samples, to store and separate abnormal and normal prototypes and maximize the margins between the two representations. Finally, we propose an uncertainty learning scheme to learn a normal-data latent space that is robust to noise from camera switching, object changing, scene transforming, etc. Extensive experiments on the XD-Violence and UCF-Crime datasets demonstrate that our method outperforms the state-of-the-art methods by a sizable margin.

AAAI Conference 2023 Conference Paper

Robust Video Portrait Reenactment via Personalized Representation Quantization

  • Kaisiyuan Wang
  • Changcheng Liang
  • Hang Zhou
  • Jiaxiang Tang
  • Qianyi Wu
  • Dongliang He
  • Zhibin Hong
  • Jingtuo Liu

While progress has been made in the field of portrait reenactment, the problem of how to produce high-fidelity and robust videos remains. Recent studies normally find it challenging to handle rarely seen target poses due to the limitation of source data. This paper proposes the Video Portrait via Non-local Quantization Modeling (VPNQ) framework, which produces pose- and disturbance-robust reenactable video portraits. Our key insight is to learn position-invariant quantized local patch representations and build a mapping between simple driving signals and local textures with non-local spatial-temporal modeling. Specifically, instead of learning a universal quantized codebook, we identify that a personalized one can be trained to better preserve desired position-invariant local details. Then, a simple representation of projected landmarks can be used as sufficient driving signals to avoid 3D rendering. Next, we employ a carefully designed Spatio-Temporal Transformer to predict reasonable and temporally consistent quantized tokens from the driving signal. The predicted codes can be decoded back to robust and high-quality videos. Comprehensive experiments have been conducted to validate the effectiveness of our approach.

ICLR Conference 2023 Conference Paper

TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

  • Haixu Wu
  • Tengge Hu
  • Yong Liu 0007
  • Hang Zhou
  • Jianmin Wang 0001
  • Mingsheng Long

Time series analysis is of immense importance in extensive applications, such as weather forecasting, anomaly detection, and action recognition. This paper focuses on temporal variation modeling, which is the common key problem of extensive analysis tasks. Previous methods attempt to accomplish this directly from the 1D time series, which is extremely challenging due to the intricate temporal patterns. Based on the observation of multi-periodicity in time series, we disentangle the complex temporal variations into multiple intraperiod- and interperiod-variations. To tackle the limitations of 1D time series in representation capability, we extend the analysis of temporal variations into 2D space by transforming the 1D time series into a set of 2D tensors based on multiple periods. This transformation embeds the intraperiod- and interperiod-variations into the columns and rows of the 2D tensors respectively, making the 2D-variations easy to model with 2D kernels. Technically, we propose TimesNet with TimesBlock as a task-general backbone for time series analysis. TimesBlock can discover the multi-periodicity adaptively and extract the complex temporal variations from the transformed 2D tensors by a parameter-efficient inception block. Our proposed TimesNet achieves consistent state-of-the-art in five mainstream time series analysis tasks, including short- and long-term forecasting, imputation, classification, and anomaly detection. Code is available at this repository: https://github.com/thuml/TimesNet.
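The period-based 1D-to-2D transformation can be sketched in a few lines: pick dominant frequencies with an FFT and fold the series so that columns index positions within a period and rows index successive periods. The padding and multi-channel handling used in TimesNet are omitted here.

import torch

def to_2d_variations(x, k=2):
    # x: (T,) univariate series; returns one 2D view per dominant period.
    T = x.shape[0]
    amp = torch.fft.rfft(x).abs()
    amp[0] = 0                                     # ignore the DC component
    freqs = amp.topk(k).indices.clamp(min=1)       # top-k frequencies
    views = []
    for f in freqs.tolist():
        period = T // f
        n = T // period
        views.append(x[: n * period].reshape(n, period))   # rows: periods, cols: in-period
    return views

series = torch.sin(torch.linspace(0, 8 * torch.pi, 96)) + 0.1 * torch.randn(96)
for v in to_2d_variations(series):
    print(v.shape)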

NeurIPS Conference 2022 Conference Paper

Audio-Driven Co-Speech Gesture Video Generation

  • Xian Liu
  • Qianyi Wu
  • Hang Zhou
  • Yuanqi Du
  • Wayne Wu
  • Dahua Lin
  • Ziwei Liu

Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequences driven by speech audio. Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representations to codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos. Demo video and more resources can be found at: https://alvinliu0.github.io/projects/ANGIE

NeurIPS Conference 2022 Conference Paper

Delving into Sequential Patches for Deepfake Detection

  • Jiazhi Guan
  • Hang Zhou
  • Zhibin Hong
  • Errui Ding
  • Jingdong Wang
  • Chengbin Quan
  • Youjian Zhao

Recent advances in face forgery techniques produce nearly visually untraceable deepfake videos, which could be leveraged with malicious intentions. As a result, researchers have been devoted to deepfake detection. Previous studies have identified the importance of local low-level cues and temporal information for generalizing well across deepfake methods; however, they still suffer from robustness problems under post-processing. In this work, we propose the Local- & Temporal-aware Transformer-based Deepfake Detection (LTTD) framework, which adopts a local-to-global learning protocol with a particular focus on the valuable temporal information within local sequences. Specifically, we propose a Local Sequence Transformer (LST), which models the temporal consistency on sequences of restricted spatial regions, where low-level information is hierarchically enhanced with shallow layers of learned 3D filters. Based on the local temporal embeddings, we then achieve the final classification in a global contrastive way. Extensive experiments on popular datasets validate that our approach effectively spots local forgery cues and achieves state-of-the-art performance.

IROS Conference 2022 Conference Paper

IMU Dead-Reckoning Localization with RNN-IEKF Algorithm

  • Hang Zhou
  • Yibo Zhao
  • Xiaogang Xiong
  • Yunjiang Lou
  • Shyam Kamal

In complex urban environments, the Inertial Navigation System (INS) is important for navigating unmanned ground vehicles (UGVs) owing to its environment-independence and the reliability of its real-time localization. It is usually employed as the baseline in the case of other sensor failures, such as the GPS, Lidar, or cameras. However, one problem with the INS is that its localization estimation error accumulates over time, and thus the estimated trajectories of the UGVs continue to drift away from their ground truths. To solve this problem, this paper proposes an improved algorithm based on the Invariant Extended Kalman Filter (IEKF) for dead-reckoning of autonomous vehicles, which dynamically adjusts the process-noise and observation-noise covariance matrices through an attention mechanism and a Recurrent Neural Network (RNN). The algorithm achieves more robust and accurate dead-reckoning localization in experiments conducted on the KITTI dataset, reducing the translational error by about 45% compared to the baseline.
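The adaptive-noise idea can be conveyed with a scalar Kalman filter whose covariances are rescaled from recent innovation magnitudes; this crude heuristic merely stands in for the paper's RNN-and-attention module, and the filter below is a plain KF, not the invariant EKF.

import numpy as np

def adaptive_kf(zs, q0=1e-3, r0=1e-1):
    x, p = 0.0, 1.0
    innovations, est = [], []
    for z in zs:
        recent = np.abs(innovations[-10:]) if innovations else [0.0]
        scale = 1.0 + float(np.mean(recent))
        q, r = q0 * scale, r0 / scale     # large innovations: inflate Q, trust measurements more
        p += q                            # predict
        k = p / (p + r)                   # Kalman gain
        innov = z - x
        x += k * innov                    # update
        p *= (1.0 - k)
        innovations.append(innov)
        est.append(x)
    return np.array(est)

zs = 1.0 + 0.3 * np.random.default_rng(0).standard_normal(200)
print(adaptive_kf(zs)[-5:])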

AAAI Conference 2022 Conference Paper

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

  • Dongzhan Zhou
  • Xinchi Zhou
  • Di Hu
  • Hang Zhou
  • Lei Bai
  • Ziwei Liu
  • Wanli Ouyang

Multiple modalities can provide rich semantic information, and exploiting such information will normally lead to better performance compared with the single-modality counterpart. However, it is not easy to devise an effective cross-modal fusion structure due to the variations of feature dimensions and semantics, especially when the inputs even come from different sensors, as in the field of audio-visual learning. In this work, we propose SepFusion, a novel framework that can smoothly produce optimal fusion structures for visual-sound separation. The framework is composed of two components, namely the model generator and the evaluator. To construct the generator, we devise a lightweight architecture space that can adapt to different input modalities. In this way, we can easily obtain audio-visual fusion structures according to our demands. For the evaluator, we adopt the idea of neural architecture search to select superior networks effectively. This automatic process can significantly save human effort while achieving competitive performance. Moreover, since our SepFusion provides a series of strong models, we can utilize the model family for broader applications, such as further promoting performance via model assembly, or providing suitable architectures for the separation of certain instrument classes. These potential applications further enhance the competitiveness of our approach.

AAAI Conference 2022 Conference Paper

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

  • Xian Liu
  • Rui Qian
  • Hang Zhou
  • Di Hu
  • Weiyao Lin
  • Ziwei Liu
  • Bolei Zhou
  • Xiaowei Zhou

The task of audio-visual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. However, in real-world scenarios, the audio is usually contaminated by off-screen sound and background noise. These interfere with the procedure of identifying desired sources and building visual-sound connections, making previous studies inapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals. We thus extend the audio representation with our Audio Instance Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. Then we erase the influence of the audible but off-screen sounds and the silent but visible objects through a Cross-modal Referrer module with cross-modality distillation. Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially under real-world scenarios.

IJCAI Conference 2021 Conference Paper

Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation

  • Yasheng Sun
  • Hang Zhou
  • Ziwei Liu
  • Hideki Koike

What can we picture solely from a clip of speech? Previous research has shown the possibility of directly inferring the appearance of a person's face by listening to a voice. However, within human speech lies not only the biometric identity signal but also the identity-irrelevant information such as the talking content. Our goal is to extract as much information from a clip of speech as possible. In particular, we aim at not only inferring the face of a person but also animating it. Our key insight is to synchronize audio and visual representations from two perspectives in a style-based generative framework. Specifically, contrastive learning is leveraged to map both the identity and speech content information within the speech to visual representation spaces. Furthermore, the identity space is strengthened with class centroids. Through curriculum learning, the style-based generator is capable of automatically balancing the information from the two latent spaces. Extensive experiments show that our approach encourages better speech-identity correlation learning while generating vivid faces whose identities are consistent with given speech samples. Moreover, by leveraging the same model, these inferred faces can be driven to talk by the audio.

ECAI Conference 2020 Conference Paper

ET-GAN: Cross-Language Emotion Transfer Based on Cycle-Consistent Generative Adversarial Networks

  • Xiaoqi Jia
  • Jianwei Tai
  • Hang Zhou
  • Yakai Li
  • Weijuan Zhang
  • Haichao Du
  • Qingjia Huang

Despite the remarkable progress made in synthesizing emotional speech from text, it is still challenging to add emotion information to existing speech segments. Previous methods mainly rely on parallel data, and few works have studied the ability of one model to generalize emotion transfer across different languages. To cope with such problems, we propose an emotion transfer system named ET-GAN for learning language-independent emotion transfer from one emotion to another without parallel training samples. Based on a cycle-consistent generative adversarial network, our method ensures the transfer of only emotion information across speeches with simple loss designs. In addition, we introduce an approach for migrating emotion information across different languages by using transfer learning. The experiment results show that our method can efficiently generate high-quality emotional speech for any given emotion category, without aligned speech pairs.

AAAI Conference 2019 Conference Paper

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

  • Hang Zhou
  • Yu Liu
  • Ziwei Liu
  • Ping Luo
  • Xiaogang Wang

Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and the semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct a specific face appearance model for specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning disentangled audio-visual representations. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has the advantage that both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate that the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.

JBHI Journal 2016 Journal Article

Registration of Pre- and Postresection Ultrasound Volumes With Noncorresponding Regions in Neurosurgery

  • Hang Zhou
  • Hassan Rivaz

Brain tissue deforms significantly after opening the dura and during tumor resection, invalidating preoperative imaging data. Ultrasound is a popular imaging modality for providing the neurosurgeon with real-time updated images of brain tissue. Interpretation of postresection ultrasound images is difficult due to large brain shift and tissue resection. Furthermore, several factors degrade the quality of postresection ultrasound images such as the strong reflection of waves at the interface of saline water and brain tissue in resection cavities, air bubbles, and the application of blood-clotting agents around the edges of resection. Image registration allows the comparison of postresection ultrasound images with higher quality preresection images, assists in interpretation of postresection images and may help identify residual tumor, and, as such, is of significant clinical importance. In this paper, we propose a nonrigid symmetric registration (NSR) framework for accurate alignment of pre- and postresection volumetric ultrasound images in near real time. We first formulate registration as minimization of a regularized cost function and analytically derive its derivative to efficiently optimize the cost function. An outlier detection algorithm is proposed and utilized in this framework to identify noncorresponding regions (outliers) and therefore improve the robustness and accuracy of registration. We use an efficient second-order minimization method for fast and robust optimization. Furthermore, we exploit a symmetric and inverse-consistent method to generate realistic deformation fields. The results show that NSR significantly improves the quality of the alignment between pre- and postresection ultrasound images.
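In generic form, the registration objective described above combines a weighted dissimilarity term with a deformation regularizer; this is a standard formulation consistent with the abstract, not necessarily the paper's exact cost function:

E(\phi) = \sum_{x \in \Omega} w(x)\, D\big(I_{\text{pre}}(x),\, I_{\text{post}}(\phi(x))\big) + \lambda\, R(\phi)

where \phi is the deformation field, D is a local dissimilarity measure, R is a smoothness regularizer, and w(x) ∈ [0, 1] down-weights regions flagged as non-corresponding by the outlier detector; inverse-consistency is obtained by estimating \phi and its inverse jointly.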