Author name cluster

Ping Tan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

SpatialLM: Training Large Language Models for Structured Indoor Modeling

Yongsen Mao
Junhao Zhong
Chuan Fang
Jia Zheng
Rui Tang
Hao Zhu
Ping Tan
Zihan Zhou

SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12, 328 indoor scenes (54, 778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.

PDF Details

AAAI Conference 2025 Conference Paper

SymmCompletion: High-Fidelity and High-Consistency Point Cloud Completion with Symmetry Guidance

Hongyu Yan
Zijun Li
Kunming Luo
Li Lu
Ping Tan

Point cloud completion aims to recover a complete point shape from a partial point cloud. Although existing methods can form satisfactory point clouds in global completeness, they often lose the original geometry details and face the problem of geometric inconsistency between existing point clouds and reconstructed missing parts. To tackle this problem, we introduce SymmCompletion, a highly effective completion method based on symmetry guidance. Our method comprises two primary components: a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer). First, LSTNet efficiently estimates point-wise local symmetry transformation to transform key geometries of partial inputs into missing regions, thereby generating geometry-align partial-missing pairs and initial point clouds. Second, SGFormer leverages the geometric features of partial-missing pairs as the explicit symmetric guidance that can constrain the refinement process for initial point clouds. As a result, SGFormer can exploit provided priors to form high-fidelity and geometry-consistency final point clouds. Qualitative and quantitative evaluations on several benchmark datasets demonstrate that our method outperforms state-of-the-art completion networks.

PDF Details DOI

AAAI Conference 2025 Conference Paper

Universal Features Guided Zero-Shot Category-Level Object Pose Estimation

Wentian Qu
Chenyu Meng
Heng Li
Jian Cheng
Cuixia Ma
Hongan Wang
Xiao Zhou
Xiaoming Deng

Object pose estimation, crucial in computer vision and robotics applications, faces challenges with the diversity of unseen categories. We propose a zero-shot method to achieve category-level 6-DOF object pose estimation, which exploits both 2D and 3D universal features of input RGB-D image to establish semantic similarity-based correspondences and can be extended to unseen categories without additional model fine-tuning. Our method begins with combining efficient 2D universal features to find sparse correspondences between intra-category objects and gets initial coarse pose. To handle the correspondence degradation of 2D universal features if the pose deviates much from the target pose, we use an iterative strategy to optimize the pose. Subsequently, to resolve pose ambiguities due to shape differences between intra-category objects, the coarse pose is refined by optimizing with dense alignment constraint of 3D universal features. Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.

PDF Details DOI

ICRA Conference 2024 Conference Paper

DVI-SLAM: A Dual Visual Inertial SLAM Network

Xiongfeng Peng
Zhihua Liu
Weiming Li
Ping Tan
SoonYong Cho
Qiang Wang

Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information as well as better integrate with inertial measurement unit (IMU) in visual SLAM has potential research value. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both photometric factor and re-projection factor into the end-to-end differentiable structure through multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors and it can be further extended to include the IMU factors as well. Extensive experiments validate that our proposed method significantly outperforms the state-of-the-art methods on several public datasets, including TartanAir, EuRoC and ETH3D-SLAM. Specifically, when dynamically fusing the three factors together, the absolute trajectory error for both monocular and stereo configurations on EuRoC dataset has reduced by 45. 3% and 36. 2% respectively.

Details

NeurIPS Conference 2024 Conference Paper

Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Peng Li
Yuan Liu
Xiaoxiao Long
Feihu Zhang
Cheng Lin
Mengfei Li
Xingqun Qi
Shanghang Zhang

In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficacy, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images should comply with a predefined camera type, e. g. a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails. Moreover, the full-image or dense multiview attention they employ leads to a dramatic explosion of computational complexity as image resolution increases, resulting in prohibitively expensive training costs. To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module to estimate the focal length and elevation of the input image, which allows our method to generate images without shape distortions. Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion. Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images with up to a 512×512 resolution while reducing computation complexity of multiview attention by 12x times. Comprehensive experiments demonstrate the superior generation power of Era3D- it can reconstruct high-quality and detailed 3D meshes from diverse single-view input images, significantly outperforming baseline multiview diffusion methods.

PDF Details DOI

ICLR Conference 2024 Conference Paper

SweetDreamer: Aligning Geometric Priors in 2D diffusion for Consistent Text-to-3D

Weiyu Li
Rui Chen
Xuelin Chen
Ping Tan

It is inherently ambiguous to lift 2D results from pre-trained diffusion models to a 3D world for text-to-3D generation. 2D diffusion models solely learn view-agnostic priors and thus lack 3D knowledge during the lifting, leading to the multi-view inconsistency problem. We find that this problem primarily stems from geometric inconsistency, and avoiding misplaced geometric structures substantially mitigates the problem in the final outputs. Therefore, we improve the consistency by aligning the 2D geometric priors in diffusion models with well-defined 3D shapes during the lifting, addressing the vast majority of the problem. This is achieved by fine-tuning the 2D diffusion model to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects. In our process, only coarse 3D information is used for aligning. This “coarse” alignment not only resolves the multi-view inconsistency in geometries but also retains the ability in 2D diffusion models to generate detailed and diversified high-quality ob-jects unseen in the 3D datasets. Furthermore, our aligned geometric priors (AGP) are generic and can be seamlessly integrated into various state-of-the-art pipelines, obtaining high generalizability in terms of unseen shapes and visual appearance while greatly alleviating the multi-view inconsistency problem.

Details

IJCAI Conference 2022 Conference Paper

A Speech-driven Sign Language Avatar Animation System for Hearing Impaired Applications

Li Hu
Jiahui Li
Jiashuo Zhang
Qi Wang
Bang Zhang
Ping Tan

Sign language is the communication language used in hearing impaired community. Recently, the research of sign language production has made great progress but still need to cope with some critical challenges. In this paper, we propose a system-level scheme and push forward the implementation of sign language production for practical usage. We build a system capable of translating speech into sign language avatar. Different from previous approach only focusing on single technology, we systematically combine algorithms of language translation, body gesture animation and facial avatar generation. We also develop two applications: Sign Language Interpretation APP and Virtual Sign Language Anchor, to facilitate easy and clear communication for hearing impaired people.

PDF Details DOI

NeurIPS Conference 2022 Conference Paper

DART: Articulated Hand Model with Diverse Accessories and Rich Textures

Daiheng Gao
Yuliang Xiu
Kailin Li
Lixin Yang
Feng Wang
Peng Zhang
Bang Zhang
Cewu Lu

Hand, the bearer of human productivity and intelligence, is receiving much attention due to the recent fever of digital twins. Among different hand morphable models, MANO has been widely used in vision and graphics community. However, MANO disregards textures and accessories, which largely limits its power to synthesize photorealistic hand data. In this paper, we extend MANO with Diverse Accessories and Rich Textures, namely DART. DART is composed of 50 daily 3D accessories which varies in appearance and shape, and 325 hand-crafted 2D texture maps covers different kinds of blemishes or make-ups. Unity GUI is also provided to generate synthetic hand data with user-defined settings, e. g. , pose, camera, background, lighting, textures, and accessories. Finally, we release DARTset, which contains large-scale (800K), high-fidelity synthetic hand images, paired with perfect-aligned 3D labels. Experiments demonstrate its superiority in diversity. As a complement to existing hand datasets, DARTset boosts the generalization in both hand pose estimation and mesh recovery tasks. Raw ingredients (textures, accessories), Unity GUI, source code and DARTset are publicly available at dart2022. github. io.

PDF Details

AAAI Conference 2022 Conference Paper

Efficient Virtual View Selection for 3D Hand Pose Estimation

Jian Cheng
Yanguang Wan
Dexin Zuo
Cuixia Ma
Jian Gu
Ping Tan
Hongan Wang
Xiaoming Deng

3D hand pose estimation from single depth is a fundamental problem in computer vision, and has wide applications. However, the existing methods still can not achieve satisfactory hand pose estimation results due to view variation and occlusion of human hand. In this paper, we propose a new virtual view selection and fusion module for 3D hand pose estimation from single depth. We propose to automatically select multiple virtual viewpoints for pose estimation and fuse the results of all and find this empirically delivers accurate and robust pose estimation. In order to select most effective virtual views for pose fusion, we evaluate the virtual views based on the confidence of virtual views using a light-weight network via network distillation. Experiments on three main benchmark datasets including NYU, ICVL and Hands2019 demonstrate that our method outperforms the state-of-the-arts on NYU and ICVL, and achieves very competitive performance on Hands2019-Task1, and our proposed virtual view selection and fusion module is both effective for 3D hand pose estimation.

PDF Details

NeurIPS Conference 2022 Conference Paper

Streaming Radiance Fields for 3D Video Synthesis

Lingzhi Li
Zhen Shen
Zhongshu Wang
Li Shen
Ping Tan

We present an explicit-grid based method for efficiently reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes. Instead of training a single model that combines all the frames, we formulate the dynamic modeling problem with an incremental learning paradigm in which per-frame model difference is trained to complement the adaption of a base model on the current frame. By exploiting the simple yet effective tuning strategy with narrow bands, the proposed method realizes a feasible framework for handling video sequences on-the-fly with high training efficiency. The storage overhead induced by using explicit grid representations can be significantly reduced through the use of model difference based compression. We also introduce an efficient strategy to further accelerate model optimization for each frame. Experiments on challenging video sequences demonstrate that our approach is capable of achieving a training speed of 15 seconds per-frame with competitive rendering quality, which attains $1000 \times$ speedup over the state-of-the-art implicit methods.

PDF Details

IJCAI Conference 2022 Conference Paper

Text/Speech-Driven Full-Body Animation

Wenlin Zhuang
Jinwei Qi
Peng Zhang
Bang Zhang
Ping Tan

Due to the increasing demand in films and games, synthesizing 3D avatar animation has attracted much attention recently. In this work, we present a production-ready text/speech-driven full-body animation synthesis system. Given the text and corresponding speech, our system synthesizes face and body animations simultaneously, which are then skinned and rendered to obtain a video stream output. We adopt a learning-based approach for synthesizing facial animation and a graph-based approach to animate the body, which generates high-quality avatar animation efficiently and robustly. Our results demonstrate the generated avatar animations are realistic, diverse and highly text/speech-correlated.

PDF Details DOI

AAAI Conference 2020 Conference Paper

Leveraging Multi-View Image Sets for Unsupervised Intrinsic Image Decomposition and Highlight Separation

Renjiao Yi
Ping Tan
Stephen Lin

We present an unsupervised approach for factorizing object appearance into highlight, shading, and albedo layers, trained by multi-view real images. To do so, we construct a multiview dataset by collecting numerous customer product photos online, which exhibit large illumination variations that make them suitable for training of reﬂectance separation and can facilitate object-level decomposition. The main contribution of our approach is a proposed image representation based on local color distributions that allows training to be insensitive to the local misalignments of multi-view images. In addition, we present a new guidance cue for unsupervised training that exploits synergy between highlight separation and intrinsic image decomposition. Over a broad range of objects, our technique is shown to yield state-of-the-art results for both of these tasks.

PDF Details

ICLR Conference 2019 Conference Paper

BA-Net: Dense Bundle Adjustment Networks

Chengzhou Tang
Ping Tan

This paper introduces a network architecture to solve the structure-from-motion (SfM) problem via feature-metric bundle adjustment (BA), which explicitly enforces multi-view geometry constraints in the form of feature-metric error. The whole pipeline is differentiable, so that the network can learn suitable features that make the BA problem more tractable. Furthermore, this work introduces a novel depth parameterization to recover dense per-pixel depth. The network first generates several basis depth maps according to the input image, and optimizes the final depth as a linear combination of these basis depth maps via feature-metric BA. The basis depth maps generator is also learned via end-to-end training. The whole system nicely combines domain knowledge (i.e. hard-coded multi-view geometry constraints) and deep learning (i.e. feature learning and basis depth maps learning) to address the challenging dense SfM problem. Experiments on large scale real data prove the success of the proposed method.

Details

IJCAI Conference 2018 Conference Paper

Active Recurrence of Lighting Condition for Fine-Grained Change Detection

Qian Zhang
Wei Feng
Liang Wan
Fei-Peng Tian
Ping Tan

This paper addresses active lighting recurrence (ALR), a new problem that actively relocalizes a light source to physically reproduce the lighting condition for a same scene from single reference image. ALR is of great importance for fine-grained visual monitoring and change detection, because some phenomena or minute changes can only be clearly observed under particular lighting conditions. Hence, effective ALR should be able to online navigate a light source toward the target pose, which is challenging due to the complexity and diversity of real-world lighting \& imaging processes. We propose to use the simple parallel lighting as an analogy model and based on Lambertian law to compose an instant navigation ball for this purpose. We theoretically prove the feasibility of this ALR strategy for realistic near point light sources and its invariance to the ambiguity of normal \& lighting decomposition. Extensive quantitative experiments and challenging real-world tasks on fine-grained change monitoring of cultural heritages verify the effectiveness of our approach. We also validate its generality to non-Lambertian scenes.

PDF Details

TIST Journal 2015 Journal Article

Where2Stand

Yinting Wang
Mingli Song
Dacheng Tao
Yong Rui
Jiajun Bu
Ah Chung Tsoi
Shaojie Zhuo
Ping Tan

People often take photographs at tourist sites and these pictures usually have two main elements: a person in the foreground and scenery in the background. This type of “souvenir photo” is one of the most common photos clicked by tourists. Although algorithms that aid a user-photographer in taking a well-composed picture of a scene exist [Ni et al. 2013], few studies have addressed the issue of properly positioning human subjects in photographs. In photography, the common guidelines of composing portrait images exist. However, these rules usually do not consider the background scene. Therefore, in this article, we investigate human-scenery positional relationships and construct a photographic assistance system to optimize the position of human subjects in a given background scene, thereby assisting the user in capturing high-quality souvenir photos. We collect thousands of well-composed portrait photographs to learn human-scenery aesthetic composition rules. In addition, we define a set of negative rules to exclude undesirable compositions. Recommendation results are achieved by combining the first learned positive rule with our proposed negative rules. We implement the proposed system on an Android platform in a smartphone. The system demonstrates its efficacy by producing well-composed souvenir photos.

Details DOI