Arrow Research search

Author name cluster

Jingyi Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
1 author row

Possible papers

24

AAAI Conference 2026 Conference Paper

Unsupervised Multi-Parameter Inverse Solving for Reducing Ring Artifacts in 3D X-Ray CBCT

  • Qing Wu
  • Hongjiang Wei
  • Jingyi Yu
  • Yuyao Zhang

Ring artifacts are prevalent in 3D cone-beam computed tomography (CBCT) due to non-ideal responses of X-ray detectors, substantially affecting image quality and diagnostic reliability. Existing state-of-the-art (SOTA) ring artifact reduction (RAR) methods rely on supervised learning with large-scale paired CT datasets. While effective in-domain, supervised methods tend to struggle to fully capture the physical characteristics of ring artifacts, leading to pronounced performance drops in complex real-world acquisitions. Moreover, their scalability to 3D CBCT is limited by high memory demands. In this work, we propose Riner, a new unsupervised RAR method. Based on a theoretical analysis of ring artifact formation, we reformulate RAR as a multi-parameter inverse problem, where the non-ideal responses of X-ray detectors are parameterized as solvable physical variables. Using a new differentiable forward model, Riner can jointly learn the implicit neural representation of artifact-free images and estimate the physical parameters directly from CT measurements, without external training data. Additionally, Riner is memory-friendly due to its ray-based optimization, enhancing its usability in large-scale 3D CBCT. Experiments on both simulated and real-world datasets show Riner outperforms existing SOTA supervised methods.
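For intuition, here is a hedged sketch of the kind of multi-parameter, ray-based objective the abstract describes: an implicit image network supplies attenuation samples along each ray, a learnable per-detector response modulates the ideal measurement, and both are fit jointly to the raw projections. All names (image_net, detector_gain, the simple discretized line integral) are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def riner_style_loss(image_net, detector_gain, ray_points, step_size,
                     detector_ids, measured_projections):
    """ray_points: (R, S, 3) sample points per ray; detector_gain: (num_detectors,) learnable;
    measured_projections: (R,) post-log CT measurements for the sampled rays."""
    mu = image_net(ray_points).squeeze(-1)               # (R, S) attenuation at each sample
    ideal = mu.sum(dim=1) * step_size                    # discretized line integral per ray
    # A non-ideal multiplicative detector gain shifts the post-log measurement.
    predicted = ideal - torch.log(detector_gain[detector_ids])
    return F.mse_loss(predicted, measured_projections)
```

Because the loss is evaluated per ray, only a small batch of rays needs to be in memory at a time, which is the memory-friendly property the abstract highlights.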

AAAI Conference 2025 Conference Paper

Capturing the Unseen: Vision-Free Facial Motion Capture Using Inertial Measurement Units

  • Youjia Wang
  • Yiwen Wu
  • Hengan Zhou
  • Hongyang Lin
  • Xingyue Peng
  • Jingyan Zhang
  • Yingsheng Zhu
  • YingWenQi Jiang

We present Capturing the Unseen (CAPUS), a novel facial motion capture (MoCap) technique that operates without visual signals. CAPUS leverages miniaturized Inertial Measurement Units (IMUs) as a new sensing modality for facial motion capture. While IMUs have become essential in full-body MoCap for their portability and independence from environmental conditions, their application in facial MoCap remains underexplored. We address this by customizing micro-IMUs, small enough to be placed on the face, and strategically positioning them in alignment with key facial muscles to capture expression dynamics. CAPUS introduces the first facial IMU dataset, encompassing both IMU and visual signals from participants engaged in diverse activities such as multilingual speech, facial expressions, and emotionally intoned auditions. We train a Transformer Diffusion-based neural network to infer Blendshape parameters directly from IMU data. Our experimental results demonstrate that CAPUS reliably captures facial motion in conditions where visual-based methods struggle, including facial occlusions, rapid movements, and low-light environments. Additionally, by eliminating the need for visual inputs, CAPUS offers enhanced privacy protection, making it a robust solution for various applications.

NeurIPS Conference 2025 Conference Paper

LithoSim: A Large, Holistic Lithography Simulation Benchmark for AI-Driven Semiconductor Manufacturing

  • Hongquan He
  • Zhen Wang
  • Jingya Wang
  • Tao Wu
  • Xuming He
  • Bei Yu
  • Jingyi Yu
  • Hao GENG

Lithography orchestrates a symphony of light, mask, and photochemicals to transfer integrated circuit patterns onto the wafer. Lithography simulation serves as the critical nexus between circuit design and manufacturing, where its speed and accuracy fundamentally govern the optimization quality of downstream resolution enhancement techniques (RET). While machine learning promises to circumvent the computational limitations of the lithography process through data-driven or physics-informed approximations of computational lithography, existing simulators suffer from inadequate lithographic awareness due to insufficient training data capturing essential process variations and mask correction rules. We present LithoSim, the most comprehensive lithography simulation benchmark to date, featuring over 4 million high-resolution input-output pairs with rigorous physical correspondence. The dataset systematically incorporates alterable optical source distributions, metal and via mask topologies with optical proximity correction (OPC) variants, and process windows reflecting fab-realistic variations. By integrating domain-specific metrics spanning AI performance and lithographic fidelity, LithoSim establishes a unified evaluation framework for data-driven and physics-informed computational lithography. The data (https://huggingface.co/datasets/grandiflorum/LithoSim), code (https://dw-hongquan.github.io/LithoSim), and pre-trained models (https://huggingface.co/grandiflorum/LithoSim) are released openly to support the development of hybrid ML-based and high-fidelity lithography simulation for the benefit of semiconductor manufacturing.

NeurIPS Conference 2025 Conference Paper

PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding

  • Penghao Wang
  • Yiyang He
  • Xin Lv
  • Yukai Zhou
  • Lan Xu
  • Jingyi Yu
  • Jiayuan Gu

Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset's superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.

AAAI Conference 2025 Conference Paper

SCOPE: Sign Language Contextual Processing with Embedding from LLMs

  • Yuqi Liu
  • Wenqian Zhang
  • Sihan Ren
  • Chengyu Huang
  • Jingyi Yu
  • Lan Xu

Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language COntextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.

NeurIPS Conference 2025 Conference Paper

TokMan: Tokenize Manhattan Mask Optimization for Inverse Lithography

  • Yiwen Wu
  • Yuyang Chen
  • Ye Xia
  • Yao Zhao
  • Jingya Wang
  • Xuming He
  • Hao GENG
  • Jingyi Yu

Manhattan representations, defined by axis-aligned, orthogonal structures, are widely used in vision, robotics, and semiconductor design for their geometric regularity and algorithmic simplicity. In integrated circuit (IC) design, Manhattan geometry is key for routing, design rule checking, and lithographic manufacturability. However, as feature sizes shrink, optical system distortions lead to inconsistencies between the intended layout and the printed wafer. Although Inverse Lithography Technology (ILT) has been proposed to compensate for these effects, learning-based ILT methods, while achieving high simulation fidelity, often generate curvilinear masks on continuous pixel grids, violating Manhattan constraints. Therefore, we propose TokMan, the first framework to formulate mask optimization as a discrete, structure-aware sequence modeling task. Our method leverages a Diffusion Transformer to tokenize layouts into discrete geometric primitives with polygon-wise dependencies and denoise Manhattan-aligned point sequences corrupted by optical proximity effects, while ensuring binary, manufacturable masks. Trained with self-supervised lithographic feedback through differentiable simulation and refined with ILT post-processing, TokMan achieves state-of-the-art fidelity, runtime efficiency, and strict manufacturing compliance on a large-scale dataset of IC layouts.

NeurIPS Conference 2024 Conference Paper

CryoGEM: Physics-Informed Generative Cryo-Electron Microscopy

  • Jiakai Zhang
  • Qihe Chen
  • Yan Zeng
  • Wenyuan Gao
  • Xuming He
  • Zhijie Liu
  • Jingyi Yu

In the past decade, deep conditional generative models have revolutionized the generation of realistic images, extending their application from entertainment to scientific domains. Single-particle cryo-electron microscopy (cryo-EM) is crucial in resolving near-atomic resolution 3D structures of proteins, such as the SARS-CoV-2 spike protein. To achieve high-resolution reconstruction, a comprehensive data processing pipeline has been adopted. However, its performance is still limited because it lacks high-quality annotated datasets for training. To address this, we introduce physics-informed generative cryo-electron microscopy (CryoGEM), which for the first time integrates physics-based cryo-EM simulation with generative unpaired noise translation to generate physically correct synthetic cryo-EM datasets with realistic noise. Initially, CryoGEM simulates the cryo-EM imaging process based on a virtual specimen. To generate realistic noise, we leverage unpaired noise translation via contrastive learning with a novel mask-guided sampling scheme. Extensive experiments show that CryoGEM is capable of generating authentic cryo-EM images. The generated dataset can be used as training data for particle picking and pose estimation models, eventually improving the reconstruction resolution.

NeurIPS Conference 2024 Conference Paper

Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization

  • Shutong Ding
  • Ke Hu
  • Zhenhao Zhang
  • Kan Ren
  • Weinan Zhang
  • Jingyi Yu
  • Jingya Wang
  • Ye Shi

Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. It has been verified that utilizing diffusion policies can significantly improve the performance of RL algorithms in continuous control tasks by overcoming the limitations of unimodal policies, such as Gaussian policies. Furthermore, the multimodality of diffusion policies also shows the potential of providing the agent with enhanced exploration capabilities. However, existing works mainly focus on applying diffusion policies in offline RL, while their incorporation into online RL has been less investigated. The diffusion model's training objective, known as the variational lower bound, cannot be applied directly in online RL due to the unavailability of 'good' samples (actions). To harmonize the diffusion model with online RL, we propose a novel model-free diffusion-based online RL algorithm named Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss and its approximate implementation in practice. Notably, this loss is shown to be a tight lower bound of the policy objective. To further enhance the exploration capability of the diffusion policy, we design a special entropy regularization term. Unlike Gaussian policies, the log-likelihood in diffusion policies is inaccessible; thus this entropy term is nontrivial. Moreover, to reduce the large variance of diffusion policies, we also develop an efficient behavior policy through action selection. This can further improve its sample efficiency during online interaction. Consequently, the QVPO algorithm leverages the exploration capabilities and multimodality of diffusion policies, preventing the RL agent from converging to a sub-optimal policy. To verify the effectiveness of QVPO, we conduct comprehensive experiments on MuJoCo continuous control benchmarks. The final results demonstrate that QVPO achieves state-of-the-art performance in terms of both cumulative reward and sample efficiency.
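As a rough illustration of the Q-weighted idea described above (not the authors' released code), the sketch below weights a standard DDPM noise-prediction loss on sampled actions by critic-derived, non-negative weights, so actions the critic scores highly contribute more to the policy update. The names policy_eps_model, critic, alphas_bar, and the above-average clipping rule are all placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def q_weighted_loss(policy_eps_model, critic, states, actions, alphas_bar, T):
    """states: (B, S), actions: (B, A) sampled from the current diffusion policy,
    alphas_bar: (T,) cumulative noise schedule. Returns a scalar policy loss."""
    B = states.shape[0]
    t = torch.randint(0, T, (B,), device=states.device)           # random diffusion step per sample
    noise = torch.randn_like(actions)
    a_bar = alphas_bar[t].unsqueeze(-1)
    noisy_actions = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise

    eps_pred = policy_eps_model(noisy_actions, t, states)          # predict the injected noise
    per_sample = F.mse_loss(eps_pred, noise, reduction="none").mean(dim=-1)

    with torch.no_grad():                                          # the critic only provides weights
        q = critic(states, actions).squeeze(-1)
        w = torch.clamp(q - q.mean(), min=0.0)                     # keep only above-average actions
        w = w / (w.sum() + 1e-8)
    return (w * per_sample).sum()
```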

NeurIPS Conference 2024 Conference Paper

DRACO: A Denoising-Reconstruction Autoencoder for Cryo-EM

  • Yingjun Shen
  • Haizhao Dai
  • Qihe Chen
  • Yan Zeng
  • Jiakai Zhang
  • Yuan Pei
  • Jingyi Yu

Foundation models in computer vision have demonstrated exceptional performance in zero-shot and few-shot tasks by extracting multi-purpose features from large-scale datasets through self-supervised pre-training methods. However, these models often overlook the severe corruption of cryogenic electron microscopy (cryo-EM) images by high levels of noise. We introduce DRACO, a Denoising-Reconstruction Autoencoder for CryO-EM, inspired by the Noise2Noise (N2N) approach. By processing cryo-EM movies into odd and even images and treating them as independent noisy observations, we apply a denoising-reconstruction hybrid training scheme. We mask both images to create denoising and reconstruction tasks. Because the quality of the pre-training dataset is essential for DRACO, we build a high-quality, diverse dataset from an uncurated public database, including over 270,000 movies or micrographs. After pre-training, DRACO naturally serves as a generalizable cryo-EM image denoiser and a foundation model for various cryo-EM downstream tasks. DRACO demonstrates the best performance in denoising, micrograph curation, and particle picking tasks compared to state-of-the-art baselines.
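A loose sketch of the odd/even pairing described above: the two half-frame averages of a cryo-EM movie are independent noisy views of the same signal, so predicting one masked view yields a reconstruction signal, while matching the other view yields a Noise2Noise denoising signal. The masking scheme, single-model setup, and loss weighting here are illustrative assumptions, not DRACO's actual recipe.

```python
import torch

def odd_even_hybrid_loss(model, odd_img, even_img, mask_ratio=0.6):
    """odd_img, even_img: (B, 1, H, W) averages of odd/even movie frames."""
    mask = (torch.rand_like(odd_img) < mask_ratio).float()   # 1 where pixels are hidden
    pred = model(odd_img * (1 - mask))                        # reconstruct from the visible part
    recon = ((pred - odd_img) ** 2 * mask).mean()             # masked reconstruction on the same view
    denoise = ((pred - even_img) ** 2).mean()                 # Noise2Noise target: the other view
    return recon + denoise
```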

NeurIPS Conference 2024 Conference Paper

MeshXL: Neural Coordinate Field for Generative 3D Foundation Models

  • Sijin Chen
  • Xin Chen
  • Anqi Pang
  • Xianfang Zeng
  • Wei Cheng
  • Yijun Fu
  • Fukun Yin
  • Zhibin Wang

The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed, and storage efficiency, and is widely preferred in various applications. However, given its unstructured graph representation, the direct generation of high-fidelity 3D meshes is challenging. Fortunately, with a pre-defined ordering strategy, 3D meshes can be represented as sequences, and the generation process can be seamlessly treated as an auto-regressive problem. In this paper, we validate that Neural Coordinate Field (NeurCF), an explicit coordinate representation with implicit neural embeddings, is a simple yet effective representation for large-scale sequential mesh modeling. After that, we present MeshXL, a family of generative pre-trained auto-regressive models that addresses 3D mesh generation with modern large language model approaches. Extensive experiments show that MeshXL is able to generate high-quality 3D meshes, and can also serve as foundation models for various downstream applications.

IJCAI Conference 2024 Conference Paper

RealDex: Towards Human-like Grasping for Robotic Dexterous Hand

  • Yumeng Liu
  • Yaxun Yang
  • Youzhuo Wang
  • Xiaofei Wu
  • Jiamin Wang
  • Yichen Yao
  • Sören Schwertfeger
  • Sibei Yang

In this paper, we introduce RealDex, a pioneering dataset capturing authentic dexterous hand grasping motions infused with human behavioral patterns, enriched by multi-view and multimodal visual data. Utilizing a teleoperation system, we seamlessly synchronize human-robot hand poses in real time. This collection of human-like motions is crucial for training dexterous hands to mimic human movements more naturally and precisely. RealDex holds immense promise in advancing humanoid robots toward automated perception, cognition, and manipulation in real-world scenarios. Moreover, we introduce a cutting-edge dexterous grasping motion generation framework, which aligns with human experience and enhances real-world applicability by effectively utilizing Multimodal Large Language Models. Extensive experiments have demonstrated the superior performance of our method on RealDex and other open datasets. The dataset and associated code are available at https://4dvlab.github.io/RealDex_page/.

JBHI Journal 2023 Journal Article

An Arbitrary Scale Super-Resolution Approach for 3D MR Images via Implicit Neural Representation

  • Qing Wu
  • Yuwei Li
  • Yawen Sun
  • Yan Zhou
  • Hongjiang Wei
  • Jingyi Yu
  • Yuyao Zhang

High Resolution (HR) medical images provide rich anatomical structure details to facilitate early and accurate diagnosis. In magnetic resonance imaging (MRI), restricted by hardware capacity, scan time, and patient cooperation ability, isotropic 3-dimensional (3D) HR image acquisition typically requires long scan times and results in small spatial coverage and low signal-to-noise ratio (SNR). Recent studies showed that, with deep convolutional neural networks, isotropic HR MR images can be recovered from low-resolution (LR) input via single image super-resolution (SISR) algorithms. However, most existing SISR methods learn a scale-specific projection between LR and HR images and thus can only deal with fixed up-sampling rates. In this paper, we propose ArSSR, an Arbitrary Scale Super-Resolution approach for recovering 3D HR MR images. In the ArSSR model, the LR image and the HR image are represented using the same implicit neural voxel function with different sampling rates. Due to the continuity of the learned implicit function, a single ArSSR model is able to achieve arbitrary and infinite up-sampling rate reconstructions of HR images from any input LR image. The SR task is then converted to approximating the implicit voxel function via deep neural networks from a set of paired HR and LR training examples. The ArSSR model consists of an encoder network and a decoder network: the convolutional encoder network extracts feature maps from the LR input images, and the fully-connected decoder network approximates the implicit voxel function. Experimental results on three datasets show that the ArSSR model can achieve state-of-the-art SR performance for 3D HR MR image reconstruction while using a single trained model to achieve arbitrary up-sampling scales.
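To make the arbitrary-scale mechanism concrete, here is a minimal sketch under assumed shapes: a convolutional encoder produces a feature volume from the LR image, features are interpolated at arbitrary continuous coordinates, and an MLP decoder maps (local feature, coordinate) to intensity, so any output grid can be queried. The layer sizes and the grid_sample-based lookup are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordDecoder(nn.Module):
    """MLP that maps a local feature plus a continuous 3D coordinate to an intensity."""
    def __init__(self, feat_dim=32, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats, coords):
        # feats: (N, feat_dim) local features, coords: (N, 3) in [-1, 1]
        return self.mlp(torch.cat([feats, coords], dim=-1))

def query_hr(encoder, decoder, lr_volume, coords):
    """Render intensities at arbitrary continuous coordinates (any up-sampling rate)."""
    feat_vol = encoder(lr_volume)                               # assumed output: (1, C, D, H, W)
    grid = coords.view(1, -1, 1, 1, 3)                          # grid_sample expects (N, d, h, w, 3)
    feats = F.grid_sample(feat_vol, grid, align_corners=True)   # trilinear feature interpolation
    feats = feats.view(feat_vol.shape[1], -1).t()               # (N, C)
    return decoder(feats, coords)
```

Because the decoder takes continuous coordinates, the same trained pair (encoder, decoder) can be queried on a 2x, 3.5x, or any other output grid.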

NeurIPS Conference 2023 Conference Paper

Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

  • Hanzhuo Huang
  • Yufan Feng
  • Cheng Shi
  • Lan Xu
  • Jingyi Yu
  • Sibei Yang

Text-to-video is a rapidly growing research area that aims to generate a semantically coherent, identity-consistent, and temporally coherent sequence of frames that accurately aligns with the input text prompt. This study focuses on zero-shot text-to-video generation with data and cost efficiency in mind. To generate a semantically coherent video that exhibits a rich portrayal of temporal semantics, such as the whole process of a flower blooming rather than a set of "moving images", we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantically coherent prompt sequence, while pre-trained latent diffusion models (LDMs) act as the animator to generate high-fidelity frames. Furthermore, to ensure temporal and identity coherence while maintaining semantic coherence, we propose a series of annotative modifications for adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data or training requirements, Free-Bloom generates vivid and high-quality videos and excels at generating complex scenes with semantically meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDM-based extensions.

AAAI Conference 2023 Conference Paper

HybridCap: Inertia-Aid Monocular Capture of Challenging Human Motions

  • Han Liang
  • Yannan He
  • Chengfeng Zhao
  • Mutian Li
  • Jingya Wang
  • Jingyi Yu
  • Lan Xu

Monocular 3D motion capture (mocap) is beneficial to many applications. The use of a single camera, however, often fails to handle occlusions of different body parts and hence is limited to capturing relatively simple movements. We present a lightweight, hybrid mocap technique called HybridCap that augments the camera with only 4 Inertial Measurement Units (IMUs) in a novel learning-and-optimization framework. We first employ a weakly-supervised and hierarchical motion inference module based on cooperative pure residual recurrent blocks that serve as limb, body, and root trackers as well as an inverse kinematics solver. Our network effectively narrows the search space of plausible motions via coarse-to-fine pose estimation and manages to tackle challenging movements with high efficiency. We further develop a hybrid optimization scheme that combines inertial feedback and visual cues to improve tracking accuracy. Extensive experiments on various datasets demonstrate that HybridCap can robustly handle challenging movements ranging from fitness actions to Latin dance. It also achieves real-time performance up to 60 fps with state-of-the-art accuracy.

AAAI Conference 2023 Conference Paper

IKOL: Inverse Kinematics Optimization Layer for 3D Human Pose and Shape Estimation via Gauss-Newton Differentiation

  • Juze Zhang
  • Ye Shi
  • Yuexin Ma
  • Lan Xu
  • Jingyi Yu
  • Jingya Wang

This paper presents an inverse kinematics optimization layer (IKOL) for 3D human pose and shape estimation that leverages the strength of both optimization- and regression-based methods within an end-to-end framework. IKOL involves a nonconvex optimization that establishes an implicit mapping from an image's 3D keypoints and body shapes to the relative body-part rotations. The 3D keypoints and the body shapes are the inputs and the relative body-part rotations are the solutions. However, this procedure is implicit and hard to make differentiable. To overcome this issue, we design a Gauss-Newton differentiation (GN-Diff) procedure to differentiate IKOL. GN-Diff iteratively linearizes the nonconvex objective function to obtain Gauss-Newton directions with closed-form solutions. Then, an automatic differentiation procedure is directly applied to generate a Jacobian matrix for end-to-end training. Notably, the GN-Diff procedure is fast because it does not rely on a time-consuming implicit differentiation procedure. The twist rotation and shape parameters are learned from the neural networks and, as a result, IKOL has a much lower computational overhead than most existing optimization-based methods. Additionally, compared to existing regression-based methods, IKOL provides a more accurate mesh-image correspondence because it iteratively reduces the distance between the keypoints and also enhances the reliability of the pose structures. Extensive experiments demonstrate the superiority of our proposed framework over a wide range of 3D human pose and shape estimation methods. Code is available at https://github.com/Juzezhang/IKOL.
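The sketch below illustrates the general pattern of differentiating through unrolled Gauss-Newton steps (closed-form linearized solves followed by ordinary autograd), with a placeholder residual function; it is not the IKOL code and ignores the twist and shape details mentioned above.

```python
import torch

def gauss_newton_unrolled(residual_fn, theta0, num_iters=5, damping=1e-4):
    """theta0: (P,) initial parameters; residual_fn(theta) -> (R,) residual vector.
    Each step solves the linearized least-squares problem in closed form; keeping
    create_graph=True lets gradients flow through the unrolled iterations."""
    theta = theta0
    for _ in range(num_iters):
        r = residual_fn(theta)
        J = torch.autograd.functional.jacobian(residual_fn, theta,
                                               create_graph=True)        # (R, P)
        JtJ = J.t() @ J + damping * torch.eye(theta.shape[0], device=theta.device)
        delta = torch.linalg.solve(JtJ, J.t() @ r)                        # closed-form GN step
        theta = theta - delta
    return theta
```

Since gradients come from the unrolled steps themselves, no separate implicit-function differentiation is needed, which is the speed argument made in the abstract.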

NeurIPS Conference 2023 Conference Paper

MotionGPT: Human Motion as a Foreign Language

  • Biao Jiang
  • Xin Chen
  • Wen Liu
  • Jingyi Yu
  • Gang Yu
  • Tao Chen

Though pre-trained large language models continue to advance, the exploration of building a unified model for language and other multimodal data, such as motion, remains challenging and largely untouched. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ discrete vector quantization for human motion and convert 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
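A toy sketch of the "motion vocabulary" idea: per-frame motion features are snapped to their nearest codebook entry, and the resulting indices serve as discrete motion tokens that a language model can consume alongside word tokens. The codebook size and feature dimension are placeholder assumptions, not MotionGPT's tokenizer.

```python
import torch

def motion_to_tokens(motion_feats, codebook):
    """motion_feats: (T, D) per-frame motion features; codebook: (K, D) learned code vectors.
    Returns one discrete token index per frame."""
    dists = torch.cdist(motion_feats, codebook)   # (T, K) pairwise distances to all codes
    return dists.argmin(dim=-1)                   # nearest-code index per frame

# Example usage with random placeholders:
# tokens = motion_to_tokens(torch.randn(120, 64), torch.randn(512, 64))
```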

IJCAI Conference 2023 Conference Paper

StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset

  • Chaofan Huo
  • Ye Shi
  • Yuexin Ma
  • Lan Xu
  • Jingyi Yu
  • Jingya Wang

Modeling and capturing the 3D spatial arrangement of the human and the object is the key to perceiving 3D human-object interaction from monocular images. In this work, we propose to use the Human-Object Offset between anchors, which are densely sampled from the surfaces of the human mesh and the object mesh, to represent the human-object spatial relation. Compared with previous works which use a contact map or an implicit distance field to encode 3D human-object spatial relations, our method is a simple and efficient way to encode the highly detailed spatial correlation between the human and the object. Based on this representation, we propose Stacked Normalizing Flow (StackFLOW) to infer the posterior distribution of human-object spatial relations from the image. During the optimization stage, we finetune the human body pose and object 6D pose by maximizing the likelihood of samples based on this posterior distribution and minimizing the 2D-3D correspondence reprojection loss. Extensive experimental results show that our method achieves impressive results on two challenging benchmarks, the BEHAVE and InterCap datasets. Our code is publicly available at https://github.com/MoChen-bop/StackFLOW.

NeurIPS Conference 2023 Conference Paper

Unsupervised Polychromatic Neural Representation for CT Metal Artifact Reduction

  • Qing Wu
  • Lixuan Chen
  • Ce Wang
  • Hongjiang Wei
  • S. Kevin Zhou
  • Jingyi Yu
  • Yuyao Zhang

Emerging neural reconstruction techniques based on tomography (e.g., NeRF, NeAT, and NeRP) have started showing unique capabilities in medical imaging. In this work, we present a novel Polychromatic neural representation (Polyner) to tackle the challenging problem of CT imaging when metallic implants exist within the human body. CT metal artifacts arise from the drastic variation of metal's attenuation coefficients at various energy levels of the X-ray spectrum, leading to a nonlinear metal effect in CT measurements. Recovering CT images from metal-affected measurements hence poses a complicated nonlinear inverse problem, where the empirical models adopted in previous metal artifact reduction (MAR) approaches lead to signal loss and strongly aliased reconstructions. Polyner instead models the MAR problem from a nonlinear inverse problem perspective. Specifically, we first derive a polychromatic forward model to accurately simulate the nonlinear CT acquisition process. Then, we incorporate our forward model into the implicit neural representation to accomplish reconstruction. Lastly, we adopt a regularizer to preserve the physical properties of the CT images across different energy levels while effectively constraining the solution space. Polyner is an unsupervised method and does not require any external training data. Experiments on multiple datasets show that Polyner achieves comparable or better performance than supervised methods on in-domain datasets while demonstrating significant performance improvements on out-of-domain datasets. To the best of our knowledge, Polyner is the first unsupervised MAR method that outperforms its supervised counterparts. The code for this work is available at https://github.com/iwuqing/Polyner.
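For intuition about why the metal problem is nonlinear, here is a hedged sketch of a polychromatic forward model in the spirit described above: intensities from all energy bins are attenuated independently and mixed before the log, so the post-log measurement is no longer a single linear line integral. The shapes and the normalized spectrum are assumed inputs, not the paper's exact model.

```python
import torch

def polychromatic_projection(mu_along_ray, spectrum, delta_s):
    """
    mu_along_ray: (num_rays, num_samples, num_energies) attenuation samples on each ray
    spectrum:     (num_energies,) normalized source spectrum (sums to 1)
    delta_s:      ray sampling step length
    Returns the post-log measurement per ray.
    """
    line_integrals = mu_along_ray.sum(dim=1) * delta_s               # (num_rays, num_energies)
    intensity = (spectrum * torch.exp(-line_integrals)).sum(dim=-1)  # Beer-Lambert per energy, mixed
    return -torch.log(intensity + 1e-12)
```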

AAAI Conference 2023 Conference Paper

Weakly Supervised 3D Multi-Person Pose Estimation for Large-Scale Scenes Based on Monocular Camera and Single LiDAR

  • Peishan Cong
  • Yiteng Xu
  • Yiming Ren
  • Juze Zhang
  • Lan Xu
  • Jingya Wang
  • Jingyi Yu
  • Yuexin Ma

Depth estimation is usually ill-posed and ambiguous for monocular camera-based 3D multi-person pose estimation. Since LiDAR can capture accurate depth information in long-range scenes, it can benefit both the global localization of individuals and 3D pose estimation by providing rich geometry features. Motivated by this, we propose a monocular camera and single LiDAR-based method for 3D multi-person pose estimation in large-scale scenes, which is easy to deploy and insensitive to light. Specifically, we design an effective fusion strategy to take advantage of multi-modal input data, including images and point clouds, and make full use of temporal information to guide the network to learn natural and coherent human motions. Without relying on any 3D pose annotations, our method exploits the inherent geometry constraints of point clouds for self-supervision and utilizes 2D keypoints on images for weak supervision. Extensive experiments on public datasets and our newly collected dataset demonstrate the superiority and generalization capability of our proposed method. The project homepage is at https://github.com/4DVLab/FusionPose.git.

AAAI Conference 2022 Conference Paper

Anisotropic Fourier Features for Neural Image-Based Rendering and Relighting

  • Huangjie Yu
  • Anpei Chen
  • Xin Chen
  • Lan Xu
  • Ziyu Shao
  • Jingyi Yu

Recent neural rendering techniques have greatly benefited image-based modeling and relighting tasks. They provide a continuous, compact, and parallelizable representation by modeling the plenoptic function as multilayer perceptrons (MLPs). However, vanilla MLPs suffer from spectral biases on multidimensional datasets. Recent remedies based on isotropic Fourier feature mapping mitigate the problem but still fall short of handling heterogeneity across different dimensions, causing imbalanced regression and visual artifacts such as excessive blurs. We present an anisotropic random Fourier features (RFF) mapping scheme to tackle spectral biases. We first analyze the influence of bandwidth from a different perspective: we show that the optimal bandwidth exhibits strong correlations with the frequency spectrum of the training data across various dimensions. We then introduce an anisotropic feature mapping scheme with multiple bandwidths to model the multidimensional signal characteristics. We further propose an efficient bandwidth searching scheme through iterative golden-section search that significantly reduces the training overhead from polynomial to logarithmic time. Our anisotropic scheme directly applies to neural surface light-field rendering and image-based relighting. Comprehensive experiments show that our scheme can more faithfully model lighting conditions and object features as well as preserve fine texture details and smooth view transitions even when angular and spatial samples are highly imbalanced.
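A minimal sketch of an anisotropic random Fourier feature mapping consistent with the description above: each input dimension is assigned its own bandwidth, so, for example, spatial and angular coordinates can be embedded at different frequency scales before entering the MLP. The bandwidth values passed in are illustrative, not the paper's golden-section-searched optima.

```python
import torch

def anisotropic_rff(x, bandwidths, num_features=256, generator=None):
    """x: (N, D) input coordinates; bandwidths: length-D per-dimension standard deviations."""
    D = x.shape[1]
    # Frequencies drawn per dimension with its own bandwidth (anisotropic Gaussian).
    B = torch.randn(num_features, D, generator=generator) * torch.as_tensor(bandwidths, dtype=x.dtype)
    proj = 2.0 * torch.pi * x @ B.t()                              # (N, num_features)
    return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)   # (N, 2 * num_features)
```

Setting all entries of bandwidths equal recovers the isotropic mapping; giving angular dimensions a different scale than spatial ones is the anisotropic case the abstract motivates.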

IJCAI Conference 2021 Conference Paper

Few-shot Neural Human Performance Rendering from Sparse RGBD Videos

  • Anqi Pang
  • Xin Chen
  • Haimin Luo
  • Minye Wu
  • Jingyi Yu
  • Lan Xu

Recent neural rendering approaches for human activities achieve remarkable view synthesis results, but still rely on dense input views or dense training with all the captured frames, leading to deployment difficulty and inefficient training overhead. Moreover, existing advances become ill-posed if the input is both spatially and temporally sparse. To fill this gap, in this paper we propose a few-shot neural human rendering approach (FNHR) from only sparse RGBD inputs, which exploits temporal and spatial redundancy to generate photo-realistic free-view output of human activities. Our FNHR is trained only on the key-frames which expand the motion manifold in the input sequences. We introduce a two-branch neural blending to combine the neural point renderer and the classical graphics texturing pipeline, which integrates reliable observations over sparse key-frames. Furthermore, we adopt a patch-based adversarial training process to make use of the local redundancy and avoid over-fitting to the key-frames, which generates fine-detailed rendering results. Extensive experiments demonstrate the effectiveness of our approach in generating high-quality free-viewpoint results for challenging human performances under the sparse setting.

IJCAI Conference 2021 Conference Paper

PIANO: A Parametric Hand Bone Model from Magnetic Resonance Imaging

  • Yuwei Li
  • Minye Wu
  • Yuyao Zhang
  • Lan Xu
  • Jingyi Yu

Hand modeling is critical for immersive VR/AR, action understanding, or human healthcare. Existing parametric models account only for hand shape, pose, or texture, without modeling the anatomical attributes like bone, which is essential for realistic hand biomechanics analysis. In this paper, we present PIANO, the first parametric bone model of human hands from MRI data. Our PIANO model is biologically correct, simple to animate, and differentiable, achieving more anatomically precise modeling of the inner hand kinematic structure in a data-driven manner than the traditional hand models based on the outer surface only. Furthermore, our PIANO model can be applied in neural network layers to enable training with a fine-grained semantic loss, which opens up the new task of data-driven fine-grained hand bone anatomic and semantic understanding from MRI or even RGB images. We make our model publicly available.

AAAI Conference 2019 Conference Paper

RGBD Based Gaze Estimation via Multi-Task CNN

  • Dongze Lian
  • Ziheng Zhang
  • Weixin Luo
  • Lina Hu
  • Minye Wu
  • Zechao Li
  • Jingyi Yu
  • Shenghua Gao

This paper tackles RGBD-based gaze estimation with Convolutional Neural Networks (CNNs). Specifically, we propose to decompose gaze point estimation into eyeball pose, head pose, and 3D eye position estimation. Compared with RGB image-based gaze tracking, having the depth modality helps to facilitate head pose estimation and 3D eye position estimation. The captured depth image, however, usually contains noise and black holes which noticeably hamper gaze tracking. Thus we propose a CNN-based multi-task learning framework to simultaneously refine depth images and predict gaze points. We utilize a Generative Adversarial Network (GAN) for depth image generation, where the generator network is partially shared by the gaze tracking network and the GAN-based depth synthesis. By optimizing the whole network simultaneously, depth image synthesis improves gaze point estimation and vice versa. Since the only existing RGBD dataset (EYEDIAP) is too small, we build a large-scale RGBD gaze tracking dataset for performance evaluation. As far as we know, it is the largest RGBD gaze dataset in terms of the number of participants. Comprehensive experiments demonstrate that our method outperforms existing methods by a large margin on both our dataset and the EYEDIAP dataset.

IJCAI Conference 2017 Conference Paper

Beyond Universal Saliency: Personalized Saliency Prediction with Multi-task CNN

  • Yanyu Xu
  • Nianyi Li
  • Junru Wu
  • Jingyi Yu
  • Shenghua Gao

Saliency detection is a long-standing problem in computer vision. Tremendous efforts have been focused on exploring a universal saliency model across users despite their differences in gender, race, age, etc. Yet recent psychology studies suggest that saliency is highly specific rather than universal: individuals exhibit heterogeneous gaze patterns when viewing an identical scene containing multiple salient objects. In this paper, we first show that such heterogeneity is common and critical for reliable saliency prediction. Our study also produces the first database of personalized saliency maps (PSMs). We model a PSM based on the universal saliency map (USM) shared by different participants and adopt a multi-task CNN framework to estimate the discrepancy between the PSM and the USM. Comprehensive experiments demonstrate that our new PSM model and prediction scheme are effective and reliable.