Arrow Research

Author name cluster

Felix Heide

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI 2026 · Conference Paper

LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding

  • Julian Ost
  • Andrea Ramazzina
  • Amogh Joshi
  • Maximilian Bömer
  • Mario Bijelic
  • Felix Heide

Large-scale scene data is essential for training and testing in robot learning. Neural reconstruction methods have promised the capability of reconstructing large, physically grounded outdoor scenes from captured sensor data. However, these methods bake in static environments and allow only limited scene control -- they are constrained in scene and trajectory diversity by the captures from which they are reconstructed. In contrast, generating driving data with recent image or video diffusion models offers control, but at the cost of geometry grounding and causality. In this work, we aim to bridge this gap and present a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal novel view synthesis with object permanence and explicit 3D geometry estimation. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling prompt-guided geometry and high-fidelity texture and structure that can be conditioned on map layouts -- producing realistic and geometrically consistent 3D generations of complex driving scenes.
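
The score-distillation component referenced above can be pictured with a generic Score Distillation Sampling (SDS) update. The sketch below is illustrative only; `render`, `diffusion_eps`, `alphas_cumprod`, and `prompt_emb` are hypothetical stand-ins, not the authors' API.

```python
# Generic SDS update: distill a frozen 2D diffusion prior into 3D scene
# parameters via a differentiable renderer. Not the LSD-3D implementation.
import torch

def sds_step(scene_params, render, diffusion_eps, alphas_cumprod, prompt_emb, optimizer):
    img = render(scene_params)                       # differentiable render of the proxy scene
    t = torch.randint(20, 980, (1,))                 # random diffusion timestep
    a = alphas_cumprod[t]
    noise = torch.randn_like(img)
    noisy = a.sqrt() * img + (1 - a).sqrt() * noise  # forward-diffuse the rendering
    with torch.no_grad():
        eps_hat = diffusion_eps(noisy, t, prompt_emb)  # frozen 2D prior predicts the noise
    w = 1 - a                                        # a common SDS weighting w(t)
    # Standard SDS trick: gradient w(t) * (eps_hat - eps) flows only through the renderer.
    loss = (w * (eps_hat - noise).detach() * img).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```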

IROS 2025 · Conference Paper

A Multi-Modal Benchmark for Long-Range Depth Evaluation in Adverse Weather Conditions

  • Stefanie Walz
  • Andrea Ramazzina
  • Dominik Scheuble
  • Samuel Brucker
  • Alexander Zuber
  • Werner Ritter
  • Mario Bijelic
  • Felix Heide

Depth estimation is a cornerstone computer vision task that is critical for scene understanding and autonomous driving. In real-world scenarios, achieving reliable depth perception under adverse weather (e.g., fog and rain) is crucial to ensure safety and system robustness. However, quantitatively evaluating the performance of depth estimation methods in these scenarios is challenging due to the difficulty of obtaining ground-truth data. A promising approach is to use weather chambers to simulate diverse weather conditions in a controlled environment; however, current datasets are limited in distance and lack dense ground truth. To address this gap, we introduce a novel evaluation benchmark that extends depth evaluation up to 200 meters under clear, foggy, and rainy conditions. To this end, we employ a multimodal sensor setup, including state-of-the-art stereo RGB, RCCB, and gated camera systems, and a long-range LiDAR sensor. Moreover, we record a digital twin of the test facility sampled at millimeter scale using a high-end geodesic laser scanner. This comprehensive benchmark allows different models and multiple sensing modalities to be evaluated more precisely and accurately, as well as at far distances. Data and code will be released upon publication.
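
As a concrete picture of what such long-range evaluation enables, a range-binned depth metric out to 200 m might look like the sketch below; the function and the 50 m bins are illustrative assumptions, not the benchmark's released evaluation code.

```python
# Hypothetical range-binned evaluation: compare predicted depth against dense
# ground truth in distance bins out to 200 m, computed per weather condition.
import numpy as np

def binned_depth_errors(pred, gt, bin_edges=(0, 50, 100, 150, 200)):
    """Mean absolute relative error per distance bin; gt <= 0 marks invalid pixels."""
    results = {}
    valid = gt > 0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = valid & (gt >= lo) & (gt < hi)
        if mask.any():
            abs_rel = np.abs(pred[mask] - gt[mask]) / gt[mask]
            results[f"{lo}-{hi}m"] = abs_rel.mean()
    return results

# e.g., per condition: {c: binned_depth_errors(pred[c], gt) for c in ("clear", "fog", "rain")}
```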

NeurIPS 2025 · Conference Paper

HEIR: Learning Graph-Based Motion Hierarchies

  • Cheng Zheng
  • William Koch
  • Baiang Li
  • Felix Heide

Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks.
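
The parent-inherited-plus-residual decomposition can be made concrete with a small sketch. Here motions are 2D homogeneous transforms, and `parents` and `residuals` are assumed inputs standing in for the quantities the graph neural network infers in the actual method.

```python
# Toy version of the decomposition: a vertex's global motion is its parent's
# global motion composed with a learned local residual. Names are illustrative.
import numpy as np

def compose(parent_T, residual_T):
    return parent_T @ residual_T  # 3x3 homogeneous 2D transforms

def global_motions(parents, residuals):
    """parents[i]: index of vertex i's parent (-1 for roots);
    residuals[i]: local 3x3 transform attached to vertex i."""
    T = [None] * len(parents)
    def resolve(i):
        if T[i] is None:
            T[i] = residuals[i] if parents[i] < 0 else compose(resolve(parents[i]), residuals[i])
        return T[i]
    for i in range(len(parents)):
        resolve(i)
    return T
```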

NeurIPS 2025 · Conference Paper

Neural Atlas Graphs for Dynamic Scene Decomposition and Editing

  • Jan Philipp Schneider
  • Pratik S. Bisht
  • Ilya Chugunov
  • Andreas Kolb
  • Michael Moeller
  • Felix Heide

Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications across domains from autonomous driving to creative editing. The most successful approaches today trade off editability against supported scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D but break down when multiple objects occlude and interact. In contrast, scene graph models use annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation in which every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset, with a 5 dB PSNR increase over existing methods, and enable environmental editing in high resolution and visual quality, creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes and compares favorably, by more than 7 dB in PSNR, to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human- and animal-centric scenes. Project page: https://princeton-computational-imaging.github.io/nag/
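
One way to read the 3D-ordering idea is as back-to-front alpha compositing of per-node atlas layers. The sketch below assumes a hypothetical `node.render(view)` returning RGB and alpha for a view-dependent neural atlas; it is a reading of the abstract, not the authors' implementation.

```python
# Hedged sketch: composite ordered atlas nodes for one view by alpha-blending
# each node's RGBA layer back to front. `view.h`/`view.w` are assumed fields.
import numpy as np

def composite(nodes, view):
    """nodes: graph nodes sorted back-to-front along the view direction."""
    out = np.zeros((view.h, view.w, 3))
    for node in nodes:                       # back to front
        rgb, alpha = node.render(view)       # per-node, view-dependent atlas output
        out = (1 - alpha[..., None]) * out + alpha[..., None] * rgb
    return out
```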

AAAI 2025 · Conference Paper

Separating the Wheat from the Chaff: Spatio-Temporal Transformer with View-interweaved Attention for Photon-Efficient Depth Sensing

  • Letian Yu
  • Jiaxi Yang
  • Bo Dong
  • Qirui Bao
  • Yuanbo Wang
  • Felix Heide
  • Xiaopeng Wei
  • Xin Yang

Time-resolved imaging is an emerging sensing modality that has been shown to enable advanced applications, including remote sensing, fluorescence lifetime imaging, and even non-line-of-sight sensing. Single-photon avalanche diodes (SPADs) outperform other time-resolved imaging technologies thanks to their excellent photon sensitivity and superior temporal resolution on the order of tens of picoseconds. The ability of SPADs to exceed the sensing limits of conventional cameras has also drawn attention to photon-efficient imaging. However, photon-efficient imaging under degraded conditions with low photon counts and low signal-to-background ratio (SBR) remains a challenge. In this paper, we propose a spatio-temporal transformer network for photon-efficient imaging under low-flux scenarios. In particular, we introduce a view-interweaved attention mechanism (VIAM) to extract both spatial-view and temporal-view self-attention in each transformer block. We also design an adaptive-weighting scheme to dynamically adjust the weights between the different views of self-attention in VIAM for different signal-to-background levels. We extensively validate the effectiveness of our approach on the simulated Middlebury dataset and on a self-collected dataset with real-world SPAD measurements and well-annotated ground-truth depth maps.
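
A plausible PyTorch reading of the view-interweaved block is below: self-attention along the spatial axis and along the temporal axis of the photon-count volume, blended by an adaptive gate. All module shapes and the pooling-based gate are assumptions, not the authors' implementation.

```python
# Illustrative view-interweaved attention block: spatial-view and temporal-view
# self-attention on a (batch, time, space, dim) tensor, mixed by a learned gate.
import torch
import torch.nn as nn

class ViewInterweavedBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # adaptive weighting

    def forward(self, x):              # x: (batch, time, space, dim)
        b, t, s, d = x.shape
        xs = x.reshape(b * t, s, d)    # attend across spatial positions
        xs, _ = self.spatial(xs, xs, xs)
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)  # attend across time bins
        xt, _ = self.temporal(xt, xt, xt)
        xs = xs.reshape(b, t, s, d)
        xt = xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        w = self.gate(x.mean(dim=(1, 2)))  # one mixing weight per sample from pooled features
        return w[:, None, None, :] * xs + (1 - w[:, None, None, :]) * xt
```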

NeurIPS 2023 · Conference Paper

Kissing to Find a Match: Efficient Low-Rank Permutation Representation

  • Hannah Dröge
  • Zorah Lähner
  • Yuval Bahat
  • Onofre Martorell Nadal
  • Felix Heide
  • Michael Moeller

Permutation matrices play a key role in matching and assignment problems across fields, especially in computer vision and robotics. However, the memory needed to explicitly represent a permutation matrix grows quadratically with the size of the problem, prohibiting large problem instances. In this work, we propose to tackle the curse of dimensionality of large permutation matrices by approximating them using a low-rank matrix factorization followed by a nonlinearity. To this end, we rely on kissing-number theory to infer the minimal rank required to represent a permutation matrix of a given size, which is significantly smaller than the problem size. This leads to a drastic reduction in computation and memory costs, e.g., up to $3$ orders of magnitude less memory for a problem of size $n=20000$, represented using $8.4\times10^5$ elements in two small matrices instead of a single huge matrix with $4\times 10^8$ elements. The proposed representation allows for accurate representations of large permutation matrices, which in turn enables handling large problems that would otherwise be infeasible. We demonstrate the applicability and merits of the proposed approach through a series of experiments on a range of problems that involve predicting permutation matrices, from linear and quadratic assignment to shape matching.
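
The factorized representation is easy to sketch. The row-wise softmax nonlinearity and rank d = 21 below are illustrative choices consistent with the quoted element counts, not necessarily the paper's exact construction.

```python
# Low-rank idea: approximate an n x n permutation matrix with two n x d factors
# plus a nonlinearity. With n = 20000 and d = 21, the factors hold
# 2 * 20000 * 21 = 8.4e5 numbers versus n^2 = 4e8 for the explicit matrix.
import torch

n, d = 20000, 21
U = torch.randn(n, d, requires_grad=True)
V = torch.randn(n, d, requires_grad=True)

def rows(idx):
    """Materialize only the requested rows of the implicit (soft) permutation."""
    logits = U[idx] @ V.T             # (len(idx), n); never form the full n x n product
    return torch.softmax(logits, dim=-1)

soft_perm_rows = rows(torch.arange(0, 64))  # a 64-row slab, e.g. for a minibatch loss
```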

NeurIPS 2022 · Conference Paper

Biologically Inspired Dynamic Thresholds for Spiking Neural Networks

  • Jianchuan Ding
  • Bo Dong
  • Felix Heide
  • Yufei Ding
  • Yunduo Zhou
  • Baocai Yin
  • Xin Yang

The dynamic membrane potential threshold, one of the essential properties of a biological neuron, is a spontaneous regulation mechanism that maintains neuronal homeostasis, i.e., a constant overall firing rate of a neuron. As such, the neuron firing rate is regulated by a dynamic spiking threshold, which has been extensively studied in biology. Existing work in the machine learning community does not employ bioinspired spiking threshold schemes. This work aims to bridge this gap by introducing a novel bioinspired dynamic energy-temporal threshold (BDETT) scheme for spiking neural networks (SNNs). The proposed BDETT scheme mirrors two bioplausible observations: a dynamic threshold has 1) a positive correlation with the average membrane potential and 2) a negative correlation with the preceding rate of depolarization. We validate the effectiveness of the proposed BDETT on robot obstacle avoidance and continuous control tasks under both normal conditions and various degraded conditions, including noisy observations, noisy weights, and dynamic environments. We find that BDETT outperforms existing static and heuristic threshold approaches by significant margins in all tested conditions, and we confirm that the proposed bioinspired dynamic threshold scheme offers homeostasis to SNNs in complex real-world tasks.
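
The two correlations can be illustrated with a toy leaky integrate-and-fire neuron whose threshold rises with the running-mean membrane potential and falls with the most recent depolarization step. The coefficients below are made up for illustration; the actual BDETT formulation differs.

```python
# Toy LIF neuron with a dynamic threshold mirroring the two observations:
# threshold up with average membrane potential, down with depolarization rate.
def lif_dynamic_threshold(inputs, tau=0.9, a=0.5, b=0.3, c=1.0):
    v, v_prev, v_avg = 0.0, 0.0, 0.0
    spikes = []
    for t, x in enumerate(inputs):
        v = tau * v + x                         # leaky integration of input current
        v_avg = (t * v_avg + v) / (t + 1)       # running mean membrane potential
        depol_rate = v - v_prev                 # preceding rate of depolarization
        theta = c + a * v_avg - b * depol_rate  # dynamic threshold (illustrative form)
        spikes.append(1 if v >= theta else 0)
        v_prev = v
        if spikes[-1]:
            v = 0.0                             # reset after a spike
    return spikes
```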

NeurIPS 2022 · Conference Paper

GenSDF: Two-Stage Learning of Generalizable Signed Distance Functions

  • Gene Chou
  • Ilya Chugunov
  • Felix Heide

We investigate the generalization capabilities of neural signed distance functions (SDFs) for learning 3D object representations for unseen and unlabeled point clouds. Existing methods can fit SDFs to a handful of object classes and boast fine detail or fast inference speeds, but do not generalize well to unseen shapes. We introduce a two-stage semi-supervised meta-learning approach that transfers shape priors from labeled to unlabeled data to reconstruct unseen object categories. The first stage uses an episodic training scheme to simulate training on unlabeled data and meta-learns initial shape priors. The second stage then introduces unlabeled data with disjoint classes in a semi-supervised scheme to diversify these priors and achieve generalization. We assess our method on both synthetic data and real collected point clouds. Experimental results and analysis validate that our approach outperforms existing neural SDF methods and is capable of robust zero-shot inference on 100+ unseen classes. Code can be found at https://github.com/princeton-computational-imaging/gensdf
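
The episodic flavor of stage one can be sketched with a toy coordinate MLP and a Reptile-style meta-update. This is a deliberate simplification under stated assumptions; GenSDF's actual inner and outer loops differ, and all names here are illustrative.

```python
# Toy episode: fit a copy of the SDF network on a "support" split (simulating
# adaptation to unlabeled data), evaluate on a "query" split, and nudge the
# meta-parameters toward the adapted weights (Reptile-style, illustrative).
import copy
import torch
import torch.nn as nn

sdf_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))

def episode(points, sdf_vals, inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
    idx = torch.randperm(len(points))
    support, query = idx[: len(idx) // 2], idx[len(idx) // 2 :]
    fast = copy.deepcopy(sdf_net)                     # inner-loop copy of the prior
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                      # adapt on the support split
        loss = ((fast(points[support]).squeeze(-1) - sdf_vals[support]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                             # meta-update toward adapted weights
        for p, q in zip(sdf_net.parameters(), fast.parameters()):
            p += meta_lr * (q - p)
    return ((fast(points[query]).squeeze(-1) - sdf_vals[query]) ** 2).mean().item()
```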

ICLR 2022 · Conference Paper

Latent Variable Sequential Set Transformers for Joint Multi-Agent Motion Prediction

  • Roger Girgis
  • Florian Golemo
  • Felipe Codevilla
  • Martin Weiss
  • Jim Aldon D'Souza
  • Samira Ebrahimi Kahou
  • Felix Heide
  • Christopher Pal

Robust multi-agent trajectory prediction is essential for the safe control of robotic systems. A major challenge is to efficiently learn a representation that approximates the true joint distribution of contextual, social, and temporal information to enable planning. We propose Latent Variable Sequential Set Transformers, encoder-decoder architectures that generate scene-consistent multi-agent trajectories; we refer to these architectures as "AutoBots". The encoder is a stack of interleaved temporal and social multi-head self-attention (MHSA) modules which alternately perform equivariant processing across the temporal and social dimensions. The decoder employs learnable seed parameters in combination with temporal and social MHSA modules, allowing it to efficiently perform inference over the entire future scene in a single forward pass. AutoBots can produce either the trajectory of one ego-agent or a distribution over the future trajectories of all agents in the scene. For single-agent prediction, our model achieves top results on the global nuScenes vehicle motion prediction leaderboard and strong results on the Argoverse vehicle prediction challenge. In the multi-agent setting, we evaluate on the synthetic partition of the TrajNet++ dataset to showcase the model's socially consistent predictions. We also demonstrate our model on general sequences of sets and provide illustrative experiments modelling the sequential structure of the multiple strokes that make up symbols in the Omniglot data. A distinguishing feature of AutoBots is that all models are trainable on a single desktop GPU (1080 Ti) in under 48 hours.
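
The interleaved temporal/social encoder admits a compact sketch: alternate multi-head self-attention over the time axis and over the agent axis of a (batch, agents, time, dim) tensor. This is a plausible reading of the abstract, not the released AutoBots code.

```python
# Illustrative interleaved encoder: temporal MHSA per agent, then social MHSA
# per timestep, repeated for a few layers. Shapes and layer count are assumptions.
import torch
import torch.nn as nn

class TemporalSocialEncoder(nn.Module):
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        self.temporal = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)])
        self.social = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)])

    def forward(self, x):                  # x: (batch, agents, time, dim)
        b, a, t, d = x.shape
        for temp_attn, soc_attn in zip(self.temporal, self.social):
            xt = x.reshape(b * a, t, d)    # attend over time, independently per agent
            x = temp_attn(xt, xt, xt)[0].reshape(b, a, t, d)
            xs = x.permute(0, 2, 1, 3).reshape(b * t, a, d)  # attend over agents, per timestep
            x = soc_attn(xs, xs, xs)[0].reshape(b, t, a, d).permute(0, 2, 1, 3)
        return x
```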