Arrow Research search

Author name cluster

Shenlong Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers (18)

NeurIPS Conference 2025 Conference Paper

HoloScene: Simulation‑Ready Interactive 3D Worlds from a Single Video

  • Hongchi Xia
  • Chih-Hao Lin
  • Hao-Yu Hsu
  • Quentin Leboutet
  • Katelyn Gao
  • Michael Paulitsch
  • Benjamin Ummenhofer
  • Shenlong Wang

Digitizing the physical world into accurate simulation‑ready virtual environments offers significant opportunities in a variety of fields such as augmented and virtual reality, gaming, and robotics. However, current 3D reconstruction and scene-understanding methods commonly fall short in one or more critical aspects, such as geometry completeness, object interactivity, physical plausibility, photorealistic rendering, or realistic physical properties for reliable dynamic simulation. To address these limitations, we introduce HoloScene, a novel interactive 3D reconstruction framework that simultaneously achieves these requirements. HoloScene leverages a comprehensive interactive scene-graph representation, encoding object geometry, appearance, and physical properties alongside hierarchical and inter-object relationships. Reconstruction is formulated as an energy-based optimization problem, integrating observational data, physical constraints, and generative priors into a unified, coherent objective. Optimization is efficiently performed via a hybrid approach combining sampling-based exploration with gradient-based refinement. The resulting digital twins exhibit complete and precise geometry, physical stability, and realistic rendering from novel viewpoints. Evaluations conducted on multiple benchmark datasets demonstrate superior performance, while practical use-cases in interactive gaming and real-time digital-twin manipulation illustrate HoloScene's broad applicability and effectiveness.
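
As a rough illustration of the hybrid optimization strategy described above (sampling-based exploration followed by gradient-based refinement), the sketch below minimizes a toy stand-in energy; the actual HoloScene objective combines observation, physics, and generative-prior terms over an interactive scene graph.

```python
# Toy sketch of the hybrid optimization pattern: sample candidate states, keep the
# lowest-energy one, then refine it with a gradient-based solver. The quadratic
# "energy" here is only a stand-in for HoloScene's actual objective.
import numpy as np
from scipy.optimize import minimize

def energy(x):
    # Placeholder energy: an observation term plus a simple "physical" penalty.
    obs = np.sum((x - np.array([1.0, -2.0, 0.5])) ** 2)
    phys = 0.1 * np.sum(np.maximum(0.0, -x) ** 2)   # penalize negative values
    return obs + phys

rng = np.random.default_rng(0)

# Sampling-based exploration: draw random candidates, keep the best.
candidates = rng.normal(scale=3.0, size=(256, 3))
best = min(candidates, key=energy)

# Gradient-based refinement starting from the best sample.
result = minimize(energy, best, method="L-BFGS-B")
print("refined state:", np.round(result.x, 3), "energy:", round(result.fun, 4))
```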

ICRA Conference 2025 Conference Paper

LidarDM: Generative LiDAR Simulation in a Generated World

  • Vlas Zyrianov
  • Henry Che
  • Zhijian Liu
  • Shenlong Wang

We present LidarDM, a novel LiDAR generative model capable of producing realistic, layout-aware, physically plausible, and temporally coherent LiDAR videos. LidarDM stands out with two unprecedented capabilities in LiDAR generative modeling: (i) LiDAR generation guided by driving scenarios, offering significant potential for autonomous driving simulations, and (ii) 4D LiDAR point cloud generation, enabling the creation of realistic and temporally coherent sequences. At the heart of our model is a novel integrated 4D world generation framework. Specifically, we employ latent diffusion models to generate the 3D scene, combine it with dynamic actors to form the underlying 4D world, and subsequently produce realistic sensory observations within this virtual environment. Our experiments indicate that our approach outperforms competing algorithms in realism, temporal coherency, and layout consistency. We additionally show that LidarDM can be used as a generative world model simulator for training and testing perception models. We release our source code and checkpoints at https://github.com/vzyrianov/LidarDM

ICLR Conference 2025 Conference Paper

LIFe-GoM: Generalizable Human Rendering with Learned Iterative Feedback Over Multi-Resolution Gaussians-on-Mesh

  • Jing Wen
  • Alexander G. Schwing
  • Shenlong Wang

Generalizable rendering of an animatable human avatar from sparse inputs relies on data priors and inductive biases extracted from training on large data to avoid scene-specific optimization and to enable fast reconstruction. This raises two main challenges: First, unlike iterative gradient-based adjustment in scene-specific optimization, generalizable methods must reconstruct the human shape representation in a single pass at inference time. Second, rendering is preferably computationally efficient yet of high resolution. To address both challenges we augment the recently proposed dual shape representation, which combines the benefits of a mesh and Gaussian points, in two ways. To improve reconstruction, we propose an iterative feedback update framework, which successively improves the canonical human shape representation during reconstruction. To achieve computationally efficient yet high-resolution rendering, we study a coupled-multi-resolution Gaussians-on-Mesh representation. We evaluate the proposed approach on the challenging THuman2.0, XHuman, and AIST++ data. Our approach reconstructs an animatable representation from sparse inputs in less than 1 s, renders views at 95.1 FPS at $1024 \times 1024$, and achieves PSNR/LPIPS*/FID of 24.65/110.82/51.27 on THuman2.0, outperforming the state-of-the-art in rendering quality.

NeurIPS Conference 2025 Conference Paper

NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses

  • Jing Wen
  • Alex Schwing
  • Shenlong Wang

We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate “ground-truth” camera poses and human poses as input to guide reconstruction at test-time. We show that pose‑dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground‑truth poses) and delivers comparable results in lab settings (with ground‑truth poses).

NeurIPS Conference 2025 Conference Paper

Visual Sync: Multi‑Camera Synchronization via Cross‑View Object Motion

  • Shaowei Liu
  • David Yao
  • Saurabh Gupta
  • Shenlong Wang

Today, people can easily record memorable moments, from concerts and sports events to lectures, family gatherings, and birthday parties, with multiple consumer cameras. However, synchronizing these cross‑camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi‑view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co‑visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off‑the‑shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross‑view correspondences. It then jointly minimizes the epipolar error to estimate each camera’s time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving an average synchronization error below 130 ms.
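
The core synchronization idea lends itself to a compact sketch: a moving point visible in two calibrated cameras satisfies the epipolar constraint only when the two streams are paired at the correct time offset. The setup below (cameras, trajectory, a 130 ms lag) is synthetic and illustrative, not the paper's pipeline.

```python
# Scan candidate time offsets and pick the one minimizing the epipolar residual
# x2^T F x1 between camera-1 observations and time-shifted camera-2 observations.
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Relative pose of camera 2 w.r.t. camera 1 (calibrated cameras, camera 1 at origin).
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.1, 0.0])
F = skew(t) @ R                                # essential matrix (identity intrinsics)

def point_at(w):                               # one 3D point moving through the scene
    return np.stack([0.5 * w, 0.2 * np.sin(4.0 * w), 3.0 + 0.3 * w], axis=-1)

def project(P, R, t):
    Pc = P @ R.T + t
    return Pc[..., :2] / Pc[..., 2:3]

true_offset = 0.13                             # camera 2's clock lags by 130 ms

t1 = np.linspace(0.2, 0.8, 61)                 # camera 1 timestamps (= world time)
obs1 = project(point_at(t1), np.eye(3), np.zeros(3))
s2 = np.linspace(0.0, 1.0, 201)                # camera 2 timestamps (its own clock)
obs2 = project(point_at(s2 + true_offset), R, t)

def residual(offset):
    # Pair camera-1 frames at time tau with the camera-2 track interpolated at
    # tau - offset; at the correct offset the epipolar residual vanishes.
    q = t1 - offset
    x2 = np.stack([np.interp(q, s2, obs2[:, 0]),
                   np.interp(q, s2, obs2[:, 1]),
                   np.ones_like(q)], axis=1)
    x1 = np.concatenate([obs1, np.ones((len(t1), 1))], axis=1)
    return np.mean(np.abs(np.einsum("ij,jk,ik->i", x2, F, x1)))

offsets = np.linspace(-0.5, 0.5, 1001)
best = offsets[int(np.argmin([residual(o) for o in offsets]))]
print(f"estimated offset: {best * 1000:.0f} ms (true: {true_offset * 1000:.0f} ms)")
```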

ICRA Conference 2024 Conference Paper

On the Overconfidence Problem in Semantic 3D Mapping

  • João Marcos Correia Marques
  • Albert J. Zhai
  • Shenlong Wang
  • Kris Hauser

Semantic 3D mapping, the process of fusing depth and image segmentation information between multiple views to build 3D maps annotated with object classes in real-time, is a recent topic of interest. This paper highlights the fusion overconfidence problem, in which conventional mapping methods assign high confidence to the entire map even when they are incorrect, leading to miscalibrated outputs. Several methods to improve uncertainty calibration at different stages in the fusion pipeline are presented and compared on the ScanNet dataset. We show that the most widely used Bayesian fusion strategy is among the worst calibrated, and propose a learned pipeline that combines fusion and calibration, GLFS, which simultaneously achieves higher accuracy and better 3D map calibration while retaining real-time capability and adding only 525 learned parameters to the pipeline. We further illustrate the importance of map calibration on a downstream task by showing that incorporating proper semantic fusion into an indoor object search agent improves its success rates.
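
A small numerical illustration of the overconfidence effect: multiplying per-view class likelihoods as if views were independent (the standard Bayesian fusion rule) pushes the fused posterior toward certainty as views accumulate, even when each view is only mildly confident. The tempered variant is a generic recalibration baseline shown for contrast, not the proposed GLFS pipeline.

```python
import numpy as np

def bayesian_fuse(probs):
    # probs: (num_views, num_classes) per-view softmax outputs for one voxel.
    log_post = np.sum(np.log(probs), axis=0)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

def tempered_fuse(probs, T=4.0):
    # Down-weight each view's log-likelihood before fusing (a simple recalibration).
    log_post = np.sum(np.log(probs), axis=0) / T
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

rng = np.random.default_rng(0)
# 20 views, 3 classes; each view is ~60% confident in class 0. In a real system the
# views are correlated, so multiplying them as independent evidence overstates confidence.
probs = rng.dirichlet([6, 2, 2], size=20)

print("naive Bayesian fusion:", np.round(bayesian_fuse(probs), 3))
print("tempered fusion      :", np.round(tempered_fuse(probs), 3))
```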

NeurIPS Conference 2023 Conference Paper

Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects

  • Tianhang Cheng
  • Wei-Chiu Ma
  • Kaiyu Guan
  • Antonio Torralba
  • Shenlong Wang

Our world is full of identical objects (e.g., cans of Coke, cars of the same model). These duplicates, when seen together, provide additional and strong cues for us to effectively reason about 3D. Inspired by this observation, we introduce Structure from Duplicates (SfD), a novel inverse graphics framework that reconstructs geometry, material, and illumination from a single image containing multiple identical objects. SfD begins by identifying multiple instances of an object within an image, and then jointly estimates the 6DoF pose for all instances. An inverse graphics pipeline is subsequently employed to jointly reason about the shape, material of the object, and the environment light, while adhering to the shared geometry and material constraint across instances. Our primary contributions involve utilizing object duplicates as a robust prior for single-image inverse graphics and proposing an in-plane rotation-robust Structure from Motion (SfM) formulation for joint 6-DoF object pose estimation. By leveraging multi-view cues from a single image, SfD generates more realistic and detailed 3D reconstructions, significantly outperforming existing single image reconstruction models and multi-view reconstruction approaches with a similar or greater number of observations.

NeurIPS Conference 2022 Conference Paper

CASA: Category-agnostic Skeletal Animal Reconstruction

  • Yuefan Wu
  • Zeyuan Chen
  • Shaowei Liu
  • Zhongzheng Ren
  • Shenlong Wang

Recovering a skeletal shape from a monocular video is a longstanding challenge. Prevailing nonrigid animal reconstruction methods often adopt a control-point driven animation model and optimize bone transforms individually without considering skeletal topology, yielding unsatisfactory shape and articulation. In contrast, humans can easily infer the articulation structure of an unknown character by associating it with a seen articulated object in their memory. Inspired by this fact, we present CASA, a novel category-agnostic articulated animal reconstruction method. Our method consists of two components, a video-to-shape retrieval process and a neural inverse graphics framework. During inference, CASA first retrieves a matched articulated shape from a 3D character asset bank, one whose rendered images score highly against the input video according to a pretrained image-language model. It then integrates the retrieved character into an inverse graphics framework and jointly infers the shape deformation, skeleton structure, and skinning weights through optimization. Experiments validate the efficacy of our method in shape reconstruction and articulation. We further show that we can use the resulting skeletal-animated character for re-animation.
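
A minimal sketch of the retrieval step, assuming precomputed embeddings: renders of each candidate asset are scored against the video frames by cosine similarity, and the best-scoring asset is kept. The random embeddings below stand in for the output of a pretrained image-language model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_assets, num_renders, num_frames, dim = 50, 8, 16, 512

frame_emb = rng.normal(size=(num_frames, dim))               # video frame embeddings
render_emb = rng.normal(size=(num_assets, num_renders, dim)) # per-asset render embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

frame_emb, render_emb = normalize(frame_emb), normalize(render_emb)

# Cosine similarity of every render against every frame: (assets, renders, frames).
sim = np.einsum("ard,fd->arf", render_emb, frame_emb)

# Per asset: best-matching render for each frame, averaged over frames.
scores = sim.max(axis=1).mean(axis=1)
best_asset = int(np.argmax(scores))
print("retrieved asset index:", best_asset, "score:", round(float(scores[best_asset]), 3))
```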

NeurIPS Conference 2022 Conference Paper

SGAM: Building a Virtual 3D World through Simultaneous Generation and Mapping

  • Yuan Shen
  • Wei-Chiu Ma
  • Shenlong Wang

We present simultaneous generation and mapping (SGAM), a novel 3D scene generation algorithm. Our goal is to produce a realistic, globally consistent 3D world on a large scale. Achieving this goal is challenging and goes beyond the capacities of existing 3D generation or video generation approaches, which fail to scale up to create large, globally consistent 3D scene structures. Towards tackling the challenges, we take a hybrid approach that integrates generative sensor modeling with 3D reconstruction. Our proposed approach is an autoregressive generative framework that simultaneously generates sensor data at novel viewpoints and builds a 3D map at each timestamp. Given an arbitrary camera trajectory, our method repeatedly applies this generation-and-mapping process for thousands of steps, allowing us to create a gigantic virtual world. Our model can be trained from RGB-D sequences without having access to the complete 3D scene structure. The generated scenes are readily compatible with various interactive environments and rendering engines. Experiments on the CLEVR and GoogleEarth datasets demonstrate that our method can generate consistent, realistic, and geometrically plausible scenes that compare favorably to existing view synthesis methods. Our project page is available at https://yshen47.github.io/sgam.
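
The generation-and-mapping loop can be sketched at a purely structural level. The stubs below (render_from_map, generate_observation, fuse_into_map are illustrative names, not the paper's API) stand in for the learned autoregressive generator and the 3D reconstruction module; only the control flow is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_from_map(world_map, pose):
    # Stub: project the current partial 3D map into the queried viewpoint.
    return {"partial_rgbd": rng.normal(size=(64, 64, 4)),
            "mask": rng.random((64, 64)) > 0.5}

def generate_observation(partial):
    # Stub: an autoregressive generative model would complete the unobserved pixels.
    full = partial["partial_rgbd"].copy()
    missing = ~partial["mask"]
    full[missing] = rng.normal(size=(missing.sum(), 4))
    return full

def fuse_into_map(world_map, obs, pose):
    # Stub: back-project the generated RGB-D frame and append it to the map.
    world_map.append((pose, obs))
    return world_map

world_map = []
trajectory = [np.array([0.0, 0.0, float(i)]) for i in range(10)]
for pose in trajectory:
    partial = render_from_map(world_map, pose)       # map -> partial view at new pose
    obs = generate_observation(partial)              # generate missing sensor data
    world_map = fuse_into_map(world_map, obs, pose)  # update the 3D map

print("frames fused into map:", len(world_map))
```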

ICRA Conference 2021 Conference Paper

Asynchronous Multi-View SLAM

  • Anqi Joyce Yang
  • Can Cui
  • Ioan Andrei Bârsan
  • Raquel Urtasun
  • Shenlong Wang

Existing multi-camera SLAM systems assume synchronized shutters for all cameras, which is often not the case in practice. In this work, we propose a generalized multi-camera SLAM formulation which accounts for asynchronous sensor observations. Our framework integrates a continuous-time motion model to relate information across asynchronous multi-frames during tracking, local mapping, and loop closing. For evaluation, we collected AMV-Bench, a challenging new SLAM dataset covering 482 km of driving recorded using our asynchronous multi-camera robotic platform. AMV-Bench is over an order of magnitude larger than previous multi-view HD outdoor SLAM datasets, and covers diverse and challenging motions and environments. Our experiments emphasize the necessity of asynchronous sensor modeling, and show that the use of multiple cameras is critical towards robust and accurate SLAM in challenging outdoor scenes. The supplementary material is located at: https://www.cs.toronto.edu/~ajyang/amv-slam
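
A minimal continuous-time motion model, assuming keyframe poses with linear translation and SLERP rotation interpolation (the paper's formulation is more general), shows how observations from cameras with different shutter times can all be related to poses on a single trajectory:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# Keyframe poses on the trajectory timeline (timestamps in seconds).
key_times = np.array([0.0, 0.5, 1.0])
key_rot = Rotation.from_euler("z", [0.0, 15.0, 40.0], degrees=True)
key_pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.2, 0.0], [4.0, 1.0, 0.0]])

slerp = Slerp(key_times, key_rot)

def pose_at(t):
    """Interpolated (R, p) at an arbitrary timestamp within the keyframe window."""
    i = int(np.clip(np.searchsorted(key_times, t) - 1, 0, len(key_times) - 2))
    alpha = (t - key_times[i]) / (key_times[i + 1] - key_times[i])
    p = (1 - alpha) * key_pos[i] + alpha * key_pos[i + 1]
    return slerp([t])[0], p

# Three cameras with unsynchronized shutters all query the same trajectory.
for cam, stamp in [("front", 0.23), ("left", 0.26), ("rear", 0.31)]:
    R, p = pose_at(stamp)
    yaw = R.as_euler("zyx", degrees=True)[0]
    print(f"{cam:>5} camera @ {stamp:.2f}s -> yaw {yaw:5.1f} deg, pos {np.round(p, 2)}")
```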

NeurIPS Conference 2020 Conference Paper

MuSCLE: Multi Sweep Compression of LiDAR using Deep Entropy Models

  • Sourav Biswas
  • Jerry Liu
  • Kelvin Wong
  • Shenlong Wang
  • Raquel Urtasun

We present a novel compression algorithm for reducing the storage of LiDAR sensory data streams. Our model exploits spatio-temporal relationships across multiple LiDAR sweeps to reduce the bitrate of both geometry and intensity values. Towards this goal, we propose a novel conditional entropy model that models the probabilities of the octree symbols, by considering both coarse level geometry and previous sweeps’ geometric and intensity information. We then exploit the learned probability to encode the full data-stream into a compact one. Our experiments demonstrate that our method significantly reduces the joint geometry and intensity bitrate over prior state-of-the-art LiDAR compression methods, with a reduction of 7–17% and 15–35% on the UrbanCity and SemanticKITTI datasets respectively.
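
The role of the entropy model is easy to see numerically: an entropy coder's bitrate is close to the negative log-likelihood of the encoded symbols under the model, so conditioning on more context lowers the bits per octree symbol. The two toy models below are illustrative, not the learned networks from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_symbols = 10_000   # octree occupancy symbols, one byte (8 child bits) per node

def bits_per_symbol(prob_of_true_symbol):
    # Ideal entropy-coder cost: negative log-likelihood of the encoded symbols.
    return float(-np.mean(np.log2(prob_of_true_symbol)))

# Context-free baseline: uniform over the 256 possible child configurations.
uniform_p = np.full(num_symbols, 1.0 / 256)

# "Conditional" model: suppose context from coarse geometry and the previous sweep
# lets the model place 60% of its mass on the correct symbol, 60% of the time.
hit = rng.random(num_symbols) < 0.6
conditional_p = np.where(hit, 0.6, 0.4 / 255)

print(f"uniform model    : {bits_per_symbol(uniform_p):.2f} bits/symbol")
print(f"conditional model: {bits_per_symbol(conditional_p):.2f} bits/symbol")
```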

IROS Conference 2020 Conference Paper

Pit30M: A Benchmark for Global Localization in the Age of Self-Driving Cars

  • Julieta Martinez 0001
  • Sasha Doubov
  • Jack Fan
  • Ioan Andrei Bârsan
  • Shenlong Wang
  • Gellért Máttyus
  • Raquel Urtasun

We are interested in understanding whether retrieval-based localization approaches are good enough in the context of self-driving vehicles. Towards this goal, we introduce Pit30M, a new image and LiDAR dataset with over 30 million frames, which is 10 to 100 times larger than those used in previous work. Pit30M is captured under diverse conditions (i.e., season, weather, time of day, traffic), and provides accurate localization ground truth. We also automatically annotate our dataset with historical weather and astronomical data, as well as with image and LiDAR semantic segmentation as a proxy measure for occlusion. We benchmark multiple existing methods for image and LiDAR retrieval and, in the process, introduce a simple, yet effective convolutional network-based LiDAR retrieval method that is competitive with the state of the art. Our work provides, for the first time, a benchmark for sub-metre retrieval-based localization at city scale. The dataset, additional experimental results, as well as more information about the sensors, calibration, and metadata, are available on the project website: https://uber.com/atg/datasets/pit30m.
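
At its simplest, retrieval-based localization of the kind benchmarked here reduces to nearest-neighbor search over embedded map data with known poses. The sketch below uses random placeholder embeddings in place of a trained image/LiDAR retrieval network.

```python
import numpy as np

rng = np.random.default_rng(0)
db_size, dim = 50_000, 128

db_emb = rng.normal(size=(db_size, dim))
db_emb /= np.linalg.norm(db_emb, axis=1, keepdims=True)
db_pose = rng.uniform(-5000, 5000, size=(db_size, 2))   # map-frame x, y in meters

# Query: a perturbed copy of one database entry, standing in for a new observation
# embedded by the same network.
gt_index = 4242
query = db_emb[gt_index] + 0.05 * rng.normal(size=dim)
query /= np.linalg.norm(query)

scores = db_emb @ query                                  # cosine similarity to the database
top1 = int(np.argmax(scores))
error = float(np.linalg.norm(db_pose[top1] - db_pose[gt_index]))
print(f"top-1 match: {top1} (ground truth {gt_index}), position error: {error:.1f} m")
```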

NeurIPS Conference 2019 Conference Paper

Efficient Graph Generation with Graph Recurrent Attention Networks

  • Renjie Liao
  • Yujia Li
  • Yang Song
  • Shenlong Wang
  • Will Hamilton
  • David Duvenaud
  • Raquel Urtasun
  • Richard Zemel

We propose a new family of efficient and expressive deep generative models of graphs, called Graph Recurrent Attention Networks (GRANs). Our model generates graphs one block of nodes and associated edges at a time. The block size and sampling stride allow us to trade off sample quality for efficiency. Compared to previous RNN-based graph generative models, our framework better captures the auto-regressive conditioning between the already-generated and to-be-generated parts of the graph using Graph Neural Networks (GNNs) with attention. This not only reduces the dependency on node ordering but also bypasses the long-term bottleneck caused by the sequential nature of RNNs. Moreover, we parameterize the output distribution per block using a mixture of Bernoulli, which captures the correlations among generated edges within the block. Finally, we propose to handle node orderings in generation by marginalizing over a family of canonical orderings. On standard benchmarks, we achieve state-of-the-art time efficiency and sample quality compared to previous models. Additionally, we show our model is capable of generating large graphs of up to 5K nodes with good quality. Our code is released at: https://github.com/lrjconan/GRAN.
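
The per-block output distribution can be sketched concretely: sample one mixture component for the block, then sample its candidate edges from that component's Bernoulli probabilities, so edges within a block become correlated once the component is marginalized out. The logits below are random placeholders for the outputs of the GNN-with-attention decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
num_candidate_edges = 12   # edges between the new block and the existing graph
num_components = 3         # mixture components

# These would come from the decoder network; random placeholders here.
mixture_logits = rng.normal(size=num_components)
edge_logits = rng.normal(size=(num_components, num_candidate_edges))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sample one mixture component for the whole block, then sample edges i.i.d.
# under that component.
k = rng.choice(num_components, p=softmax(mixture_logits))
edge_probs = sigmoid(edge_logits[k])
edges = rng.random(num_candidate_edges) < edge_probs

print("chosen component:", int(k))
print("sampled edges   :", edges.astype(int))
```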

IROS Conference 2019 Conference Paper

Exploiting Sparse Semantic HD Maps for Self-Driving Vehicle Localization

  • Wei-Chiu Ma
  • Raquel Urtasun
  • Ignacio Tartavull
  • Ioan Andrei Bârsan
  • Shenlong Wang
  • Min Bai
  • Gellért Máttyus
  • Namdar Homayounfar

In this paper we propose a novel semantic localization algorithm that exploits multiple sensors and has precision on the order of a few centimeters. Our approach does not require detailed knowledge about the appearance of the world, and our maps require orders of magnitude less storage than maps utilized by traditional geometry- and LiDAR intensity-based localizers. This is important as self-driving cars need to operate in large environments. Towards this goal, we formulate the problem in a Bayesian filtering framework, and exploit lanes, traffic signs, as well as vehicle dynamics to localize robustly with respect to a sparse semantic map. We validate the effectiveness of our method on a new highway dataset consisting of 312 km of roads. Our experiments show that the proposed approach is able to achieve 0.05 m lateral accuracy and 1.12 m longitudinal accuracy on average while taking up only 0.3% of the storage required by previous LiDAR intensity-based approaches.
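
A stripped-down version of the filtering idea, assuming a single lateral-offset state with toy Gaussian motion and lane-observation models (the full system also fuses signs and vehicle dynamics over a richer state):

```python
import numpy as np

cells = np.linspace(-2.0, 2.0, 401)           # lateral offset hypotheses (meters)
belief = np.ones_like(cells) / len(cells)     # uniform prior

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def predict(belief, motion, motion_sigma=0.05):
    # Shift belief by the odometry-reported lateral motion and blur by motion noise.
    shifted = np.interp(cells - motion, cells, belief, left=0.0, right=0.0)
    kernel = gaussian(cells, 0.0, motion_sigma)
    blurred = np.convolve(shifted, kernel / kernel.sum(), mode="same")
    return blurred / blurred.sum()

def update(belief, observed_offset, obs_sigma=0.1):
    # A detected lane gives a noisy measurement of the offset to the mapped lane.
    posterior = belief * gaussian(cells, observed_offset, obs_sigma)
    return posterior / posterior.sum()

rng = np.random.default_rng(1)
true_offset = 0.0
for _ in range(20):                           # vehicle drifts 0.02 m per step
    true_offset += 0.02
    belief = predict(belief, motion=0.02)
    belief = update(belief, observed_offset=true_offset + rng.normal(0.0, 0.1))

estimate = cells[int(np.argmax(belief))]
print(f"estimated lateral offset: {estimate:.2f} m (true: {true_offset:.2f} m)")
```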

IROS Conference 2018 Conference Paper

Deep Multi-Sensor Lane Detection

  • Min Bai
  • Gellért Máttyus
  • Namdar Homayounfar
  • Shenlong Wang
  • Shrinidhi Kowshika Lakshmikanth
  • Raquel Urtasun

Reliable and accurate lane detection has been a long-standing problem in the field of autonomous driving. In recent years, many approaches have been developed that use images (or videos) as input and reason in image space. In this paper we argue that accurate image estimates do not translate to precise 3D lane boundaries, which are the input required by modern motion planning algorithms. To address this issue, we propose a novel deep neural network that takes advantage of both LiDAR and camera sensors and produces very accurate estimates directly in 3D space. We demonstrate the performance of our approach on both highways and in cities, and show very accurate estimates in complex scenarios such as heavy traffic (which produces occlusion), forks, merges, and intersections.

ICRA Conference 2017 Conference Paper

Find your way by observing the sun and other semantic cues

  • Wei-Chiu Ma
  • Shenlong Wang
  • Marcus A. Brubaker
  • Sanja Fidler
  • Raquel Urtasun

In this paper we present a robust, efficient and affordable approach to self-localization which requires neither GPS nor knowledge about the appearance of the world. Towards this goal, we utilize freely available cartographic maps and derive a probabilistic model that exploits semantic cues in the form of sun direction, presence of an intersection, road type, speed limit and ego-car trajectory to produce very reliable localization results. Our experimental evaluation shows that our approach can localize much faster (in terms of driving time) with less computation and more robustly than competing approaches, which ignore semantic information.

NeurIPS Conference 2016 Conference Paper

Proximal Deep Structured Models

  • Shenlong Wang
  • Sanja Fidler
  • Raquel Urtasun

Many problems in real-world applications involve predicting continuous-valued random variables that are statistically related. In this paper, we propose a powerful deep structured model that is able to learn complex non-linear functions which encode the dependencies between continuous output variables. We show that inference in our model using proximal methods can be efficiently solved as a feed-forward pass of a special type of deep recurrent neural network. We demonstrate the effectiveness of our approach in the tasks of image denoising, depth refinement, and optical flow estimation.
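
The "inference as a feed-forward pass" idea is the same unrolling trick used in ISTA-style networks: a fixed number of proximal-gradient steps is itself a recurrent network whose step sizes and thresholds could be learned. The sketch below uses hand-set constants on a generic sparse-recovery objective, not the paper's learned model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 60
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=8, replace=False)] = 3.0 * rng.normal(size=8)
y = A @ x_true + 0.01 * rng.normal(size=m)

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def unrolled_ista(y, A, num_layers=100, lam=0.05):
    """A fixed number of proximal-gradient (ISTA) steps run as one feed-forward pass.
    In a learned variant, the per-layer step sizes and thresholds would be trained."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of the data term
    x = np.zeros(A.shape[1])
    for _ in range(num_layers):                # each iteration = one "layer"
        x = soft_threshold(x - step * A.T @ (A @ x - y), step * lam)
    return x

x_hat = unrolled_ista(y, A)
print("nonzeros recovered:", int(np.sum(np.abs(x_hat) > 1e-3)),
      "| reconstruction error:", round(float(np.linalg.norm(x_hat - x_true)), 3))
```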

NeurIPS Conference 2014 Conference Paper

Efficient Inference of Continuous Markov Random Fields with Polynomial Potentials

  • Shenlong Wang
  • Alex Schwing
  • Raquel Urtasun

In this paper, we prove that every multivariate polynomial with even degree can be decomposed into a sum of convex and concave polynomials. Motivated by this property, we exploit the concave-convex procedure to perform inference on continuous Markov random fields with polynomial potentials. In particular, we show that the concave-convex decomposition of polynomials can be expressed as a sum-of-squares optimization, which can be efficiently solved via semidefinite programming. We demonstrate the effectiveness of our approach in the context of 3D reconstruction, shape from shading and image denoising, and show that our approach significantly outperforms existing approaches in terms of efficiency as well as the quality of the retrieved solution.
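
A one-dimensional toy instance of the procedure: split an even-degree polynomial into convex and concave parts, then repeatedly minimize the convex part plus a linearization of the concave part. The paper obtains such decompositions for multivariate polynomials via sum-of-squares optimization (an SDP); here the split is chosen by hand.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(x):                 # target polynomial (even degree)
    return x**4 - 2.0 * x**2 + 0.5 * x

def f_convex(x):          # convex component: x^4 plus a linear term
    return x**4 + 0.5 * x

def f_concave_grad(x):    # derivative of the concave component -2x^2
    return -4.0 * x

x = 2.0                   # initialization
for _ in range(20):
    g = f_concave_grad(x)
    # Convex surrogate: convex part + linearized concave part (majorizes f up to a constant).
    surrogate = lambda z, g=g: f_convex(z) + g * z
    x = minimize_scalar(surrogate, bounds=(-10.0, 10.0), method="bounded").x

print(f"CCCP stationary point: x = {x:.4f}, f(x) = {f(x):.4f}")
```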