
Author name cluster

Zejun Yang

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches; it is not a full identity-disambiguation profile.

3 papers
1 author row

Possible papers (3)

AAAI 2025 · Conference Paper

RealPortrait: Realistic Portrait Animation with Diffusion Transformers

  • Zejun Yang
  • Huawei Wei
  • Zhisheng Wang

We introduce RealPortrait, a framework based on Diffusion Transformers (DiT), designed to generate highly expressive and visually appealing portrait animations. Given a static portrait image, our method transfers complex facial expressions and head-pose movements extracted from a driving video onto the portrait, transforming it into a lifelike video. Specifically, we exploit the robust spatial-temporal modeling capabilities of DiT to generate portrait videos that maintain high-fidelity visual details and temporal coherence. In contrast to conventional image-to-video generation frameworks that require a separate reference network, we incorporate an efficient reference attention mechanism within the DiT backbone, eliminating that network's computational overhead while achieving superior preservation of the reference appearance. Concurrently, we integrate a parallel ControlNet to precisely regulate intricate facial expressions and head poses. Diverging from prior methods that rely on explicit sparse motion representations, such as facial landmarks or 3DMM coefficients, we adopt a dense implicit motion representation as the control guidance. This implicit representation excels at capturing nuanced emotional expressions and the subtle non-rigid dynamics of the lips. To further enhance the model's generalization, we augment the training dataset with a substantial volume of facial image data via random-crop augmentation, ensuring robustness across a wide variety of facial appearances and expressions. Empirical evaluations demonstrate that RealPortrait generates portrait animations with highly realistic quality, exceptional temporal coherence, and strong appearance retention.
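The abstract's central architectural claim is that reference conditioning can live inside the DiT backbone's own attention rather than in a separate reference network. A minimal sketch of that idea follows; the layer name, shapes, and layout are assumptions for illustration, not the paper's actual implementation. It concatenates reference-image tokens into the key/value sequence so the video tokens can attend to the portrait's appearance.

```python
import torch
import torch.nn as nn

class ReferenceAttention(nn.Module):
    """Hypothetical sketch: reference conditioning folded into DiT attention.

    Reference-image tokens are concatenated into the key/value sequence, so
    video tokens can attend to the portrait's appearance without running a
    separate reference network.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T*H*W, C) latent tokens of the clip being denoised
        # ref_tokens:   (B, H*W, C)   tokens encoding the static portrait
        kv = torch.cat([ref_tokens, video_tokens], dim=1)
        out, _ = self.attn(query=video_tokens, key=kv, value=kv)
        return out
```

Because queries come only from the video tokens, the added cost over plain self-attention grows linearly with the number of reference tokens, rather than requiring a full second network pass.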

AAAI 2024 · Conference Paper

3D Visibility-Aware Generalizable Neural Radiance Fields for Interacting Hands

  • Xuan Huang
  • Hanhui Li
  • Zejun Yang
  • Zhisheng Wang
  • Xiaodan Liang

Neural radiance fields (NeRFs) are promising 3D representations for scenes, objects, and humans. However, most existing methods require multi-view inputs and per-scene training, which limits their real-life applications. Moreover, current methods focus on single-subject cases, leaving scenes of interacting hands, which involve severe inter-hand occlusions and challenging view variations, unsolved. To tackle these issues, this paper proposes a generalizable visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, given an image of interacting hands as input, our VA-NeRF first obtains a mesh-based representation of the hands and extracts their corresponding geometric and textural features. Subsequently, a feature fusion module that exploits the visibility of query points and mesh vertices is introduced to adaptively merge features of both hands, enabling the recovery of features in unseen areas. Additionally, our VA-NeRF is optimized together with a novel discriminator within an adversarial learning paradigm. In contrast to conventional discriminators that predict a single real/fake label for the synthesized image, the proposed discriminator generates a pixel-wise visibility map, providing fine-grained supervision for unseen areas and encouraging the VA-NeRF to improve the visual quality of synthesized images. Experiments on the InterHand2.6M dataset demonstrate that our VA-NeRF significantly outperforms conventional NeRFs. Project page: https://github.com/XuanHuang0/VANeRF.
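The fusion step described above, merging features of both hands according to the visibility of query points, can be sketched as follows. This is a hypothetical illustration, not the released VA-NeRF code: it assumes per-query visibility scores in [0, 1] are available (e.g. from rasterizing the hand meshes) and blends the two hands' features with a learned gate so occluded regions can borrow from the visible hand.

```python
import torch
import torch.nn as nn

class VisibilityAwareFusion(nn.Module):
    """Hypothetical sketch of visibility-weighted feature merging.

    Assumes per-query visibility scores in [0, 1] for each hand. A learned
    gate blends the two hands' features so that query points occluded for
    one hand borrow features from the other.
    """

    def __init__(self, feat_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim + 2, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, feat_l, feat_r, vis_l, vis_r):
        # feat_l, feat_r: (N, C) query-point features from left/right hand
        # vis_l, vis_r:   (N, 1) visibility of each query point per hand
        w = self.gate(torch.cat([feat_l, feat_r, vis_l, vis_r], dim=-1))
        return w * feat_l + (1.0 - w) * feat_r  # channel-wise blend
```

Conditioning the gate on both visibility scores, not just the features, is what lets the module down-weight whichever hand cannot actually see the query point.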

AAAI 2024 · Conference Paper

Monocular 3D Hand Mesh Recovery via Dual Noise Estimation

  • Hanhui Li
  • Xiaojian Lin
  • Xuan Huang
  • Zejun Yang
  • Zhisheng Wang
  • Xiaodan Liang

Current parametric models have made notable progress in 3D hand pose and shape estimation. However, due to the fixed hand topology and complex hand poses, it is difficult for current models to generate meshes that align well with the image. To tackle this issue, we introduce a dual noise estimation method. Given a single-view image as input, we first adopt a baseline parametric regressor to obtain coarse hand meshes. We assume that the mesh vertices and their image-plane projections are noisy and can be associated in a unified probabilistic model. We then learn the distributions of the noise to refine the mesh vertices and their projections. The refined vertices are further utilized to refine the camera parameters in a closed-form manner. Consequently, our method obtains well-aligned, high-quality 3D hand meshes. Extensive experiments on the large-scale InterHand2.6M dataset demonstrate that the proposed method not only improves the performance of its baseline by more than 10% but also achieves state-of-the-art performance. Project page: https://github.com/hanhuili/DNE4Hand.
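The closed-form camera refinement mentioned in the abstract can be illustrated with a weak-perspective camera model, a common choice in hand mesh recovery; the function below is a sketch under that assumption, not the paper's code. Given refined 3D vertices and their refined 2D projections, the scale and translation that minimize the least-squares reprojection error have a direct solution via centroid alignment.

```python
import torch

def refine_weak_perspective_camera(verts3d: torch.Tensor, proj2d: torch.Tensor):
    """Hypothetical sketch of a closed-form camera refinement.

    Fits a weak-perspective camera p_i ≈ s * v_i[:2] + t to refined vertices
    and refined projections by least squares; centering both point sets
    decouples the scale s from the translation t.

    verts3d: (N, 3) refined mesh vertices
    proj2d:  (N, 2) refined image-plane projections
    """
    v = verts3d[:, :2]
    v_c = v - v.mean(dim=0, keepdim=True)
    p_c = proj2d - proj2d.mean(dim=0, keepdim=True)
    # Optimal scale: project the centered 2D targets onto the centered vertices.
    s = (v_c * p_c).sum() / (v_c * v_c).sum().clamp(min=1e-8)
    # Optimal translation then matches the two centroids.
    t = proj2d.mean(dim=0) - s * v.mean(dim=0)
    return s, t
```

Reprojecting with `s * verts3d[:, :2] + t` then yields image-aligned vertices; the same least-squares pattern extends to a full similarity transform if rotation also needs refining.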