Arrow Research search

Author name cluster

Igor Gilitschenski

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

41 papers
2 author rows

Possible papers (41)

IROS Conference 2025 Conference Paper

AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework

  • Yu Yao
  • Salil Bhatnagar
  • Markus Mazzola
  • Vasileios Belagiannis
  • Igor Gilitschenski
  • Luigi Palmieri
  • Simon Razniewski
  • Marcel Hallgarten

Rare, yet critical, scenarios pose a significant challenge in testing and evaluating autonomous driving planners. Relying solely on real-world driving scenes requires collecting massive datasets to capture these scenarios. While automatic generation of traffic scenarios appears promising, data-driven models require extensive training data and often lack fine-grained control over the output. Moreover, generating novel scenarios from scratch can introduce a distributional shift from the original training scenes, which undermines the validity of evaluations, especially for learning-based planners. To sidestep this, recent work proposes to generate challenging scenarios by augmenting original scenarios from the test set. However, this involves the manual augmentation of scenarios by domain experts, an approach that cannot meet the demands for scale in the evaluation of self-driving systems. Therefore, this paper introduces a novel LLM-agent-based framework for augmenting real-world traffic scenarios using natural language descriptions, addressing the limitations of existing methods. A key innovation is the use of an agentic design, enabling fine-grained control over the output and maintaining high performance even with smaller, cost-effective LLMs. Extensive human expert evaluation demonstrates our framework’s ability to accurately adhere to user intent, generating high-quality augmented scenarios comparable to those created manually.

ICML Conference 2025 Conference Paper

Calibrated Value-Aware Model Learning with Probabilistic Environment Models

  • Claas Voelcker
  • Anastasiia Pedan
  • Arash Ahmadian
  • Romina Abachi
  • Igor Gilitschenski
  • Amir Massoud Farahmand

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model’s value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
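
Below is a minimal sketch of the MuZero-style value-aware objective the abstract analyzes, assuming hypothetical encoder, latent_model, and value_fn callables; it illustrates the loss family only and is neither the authors' implementation nor their proposed calibration correction.

```python
import torch
import torch.nn.functional as F

def value_aware_loss(encoder, latent_model, value_fn, obs, action, target_value):
    """Illustrative MuZero-style value-aware loss: the model is judged by the value
    predicted from its latent rollout, not by reconstructing the next observation."""
    z = encoder(obs)                        # latent state for the current observation
    z_next_pred = latent_model(z, action)   # model-predicted next latent state
    v_pred = value_fn(z_next_pred)          # value read off the predicted latent
    return F.mse_loss(v_pred, target_value)  # penalize against the value target
```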

IROS Conference 2025 Conference Paper

Delving into Mapping Uncertainty for Mapless Trajectory Prediction

  • Zongzheng Zhang
  • Xuchong Qiu
  • Boran Zhang
  • Guantian Zheng
  • Xunjiang Gu
  • Guoxuan Chi
  • Huan-ang Gao
  • Leichen Wang

Recent advances in autonomous driving are moving towards mapless approaches, where High-Definition (HD) maps are generated online directly from sensor data, reducing the need for expensive labeling and maintenance. However, the reliability of these online-generated maps remains uncertain. While incorporating map uncertainty into downstream trajectory prediction tasks has shown potential for performance improvements, current strategies provide limited insights into the specific scenarios where this uncertainty is beneficial. In this work, we first analyze the driving scenarios in which mapping uncertainty has the greatest positive impact on trajectory prediction and identify a critical, previously overlooked factor: the agent’s kinematic state. Building on these insights, we propose a novel Proprioceptive Scenario Gating that adaptively integrates map uncertainty into trajectory prediction based on forecasts of the ego vehicle’s future kinematics. This lightweight, self-supervised approach enhances the synergy between online mapping and trajectory prediction, providing interpretability around where uncertainty is advantageous and outperforming previous integration methods. Additionally, we introduce a Covariance-based Map Uncertainty approach that better aligns with map geometry, further improving trajectory prediction. Extensive ablation studies confirm the effectiveness of our approach, achieving up to 23.6% improvement in mapless trajectory prediction performance over the state-of-the-art method using the real-world nuScenes driving dataset. Our code, data, and models are publicly available at https://github.com/Ethan-Zheng136/Map-Uncertainty-for-Trajectory-Prediction.
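
The gating idea can be pictured with a toy sketch like the following; the paper learns the gate in a self-supervised way, whereas this illustration (with hypothetical predictor callables) simply thresholds a forecast of ego acceleration.

```python
import numpy as np

def gated_prediction(ego_future_velocity, predict_with_uncertainty, predict_plain,
                     dt=0.1, accel_threshold=1.5):
    """Hypothetical hard-threshold gate: route to the uncertainty-aware predictor
    only when the forecast ego motion is dynamic (large acceleration).
    ego_future_velocity is a [horizon, 2] array of predicted ego velocities."""
    accel = np.linalg.norm(np.diff(ego_future_velocity, axis=0), axis=-1) / dt
    use_uncertainty = accel.max() > accel_threshold
    return predict_with_uncertainty() if use_uncertainty else predict_plain()
```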

NeurIPS Conference 2025 Conference Paper

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

  • Ziyi Wu
  • Anil Kag
  • Ivan Skorokhodov
  • Willi Menapace
  • Ashkan Mirzaei
  • Igor Gilitschenski
  • Sergey Tulyakov
  • Aliaksandr Siarohin

Direct Preference Optimization (DPO) has recently been applied as a post‑training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one‑third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
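
A minimal sketch of what a segment-level DPO objective can look like, assuming per-segment log-probability differences between the policy and the reference model are already computed; this is an illustration of the idea, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def segment_dpo_loss(logratio_preferred, logratio_rejected, beta=0.1):
    """Segment-level DPO: the standard DPO objective applied independently to each
    temporally aligned segment pair rather than once per whole clip. Inputs are
    per-segment log-probability ratios (policy minus reference), shape [num_segments]."""
    margins = beta * (logratio_preferred - logratio_rejected)
    return -F.logsigmoid(margins).mean()
```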

NeurIPS Conference 2025 Conference Paper

Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos

  • Hanxue Liang
  • Jiawei Ren
  • Ashkan Mirzaei
  • Antonio Torralba
  • Ziwei Liu
  • Igor Gilitschenski
  • Sanja Fidler
  • Cengiz Oztireli

Recent advancements in static feed-forward scene reconstruction have demonstrated significant progress in high-quality novel view synthesis. However, these models often struggle with generalizability across diverse environments and fail to effectively handle dynamic content. We present BTimer (short for Bullet Timer), the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes. Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target (‘bullet’) timestamp by aggregating information from all the context frames. Such a formulation allows BTimer to gain scalability and generalization by leveraging both static and dynamic scene datasets. Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150ms while reaching state-of-the-art performance on both static and dynamic scene datasets, even compared with optimization-based approaches.

NeurIPS Conference 2025 Conference Paper

LuxDiT: Lighting Estimation with Video Diffusion Transformer

  • Ruofan Liang
  • Kai He
  • Zan Gojcic
  • Igor Gilitschenski
  • Sanja Fidler
  • Nandita Vijaykumar
  • Zian Wang

Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.

ICLR Conference 2025 Conference Paper

MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL

  • Claas Voelcker
  • Marcel Hussing
  • Eric Eaton
  • Amir Massoud Farahmand
  • Igor Gilitschenski

Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches need to rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for TD Learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD's ability to combat value overestimation, and its practical stability gains for continued learning.
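
A rough sketch of the data-mixing step described above, using hypothetical replay_buffer, world_model, and policy interfaces; it only illustrates folding a small fraction of model-generated transitions into each TD batch.

```python
import numpy as np

def mixed_td_batch(replay_buffer, world_model, policy, batch_size=256, model_fraction=0.05):
    """Assemble a TD batch that is mostly real off-policy data plus a small share of
    one-step transitions generated by the learned world model under the current policy.
    The .sample() and .rollout() interfaces are illustrative placeholders."""
    n_model = int(batch_size * model_fraction)
    real = replay_buffer.sample(batch_size - n_model)        # real transitions (dict of arrays)
    start = replay_buffer.sample(n_model)["obs"]             # states to roll the model from
    synthetic = world_model.rollout(start, policy(start))    # model-generated transitions
    return {key: np.concatenate([real[key], synthetic[key]]) for key in real}
```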

IROS Conference 2025 Conference Paper

MORE: Mobile Manipulation Rearrangement Through Grounded Language Reasoning

  • Mohammad Mohammadi
  • Daniel Honerkamp
  • Martin Büchner
  • Matteo Cassinelli
  • Tim Welschehold
  • Fabien Despinoy
  • Igor Gilitschenski
  • Abhinav Valada

Autonomous long-horizon mobile manipulation encompasses a multitude of challenges, including scene dynamics, unexplored areas, and error recovery. Recent works have leveraged foundation models for scene-level robotic reasoning and planning. However, the performance of these methods degrades when dealing with a large number of objects and large-scale environments. To address these limitations, we propose MORE, a novel approach for enhancing the capabilities of language models to solve zero-shot mobile manipulation planning for rearrangement tasks. MORE leverages scene graphs to represent environments, incorporates instance differentiation, and introduces an active filtering scheme that extracts task-relevant subgraphs of object and region instances. These steps yield a bounded planning problem, effectively mitigating hallucinations and improving reliability. Additionally, we introduce several enhancements that enable planning across both indoor and outdoor environments. We evaluate MORE on 81 diverse rearrangement tasks from the BEHAVIOR-1K benchmark, where it becomes the first approach to successfully solve a significant share of the benchmark, outperforming recent foundation model-based approaches. Furthermore, we demonstrate the capabilities of our approach in several complex real-world tasks, mimicking everyday activities. We make the code publicly available at https://more-model.cs.uni-freiburg.de.

NeurIPS Conference 2025 Conference Paper

SAFE: Multitask Failure Detection for Vision-Language-Action Models

  • Qiao Gu
  • Yuanliang Ju
  • Shengxiang Sun
  • Igor Gilitschenski
  • Haruki Nishimura
  • Masha Itkina
  • Florian Shkurti

While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while generalist VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We extensively test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results and code can be found at the project webpage: https://vla-safe.github.io/
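
The accuracy/detection-time trade-off via conformal prediction can be illustrated with a generic split-conformal threshold calibrated on successful rollouts; this sketch is a textbook recipe, not the paper's exact procedure.

```python
import numpy as np

def conformal_failure_threshold(scores_on_successful_rollouts, alpha=0.05):
    """Split-conformal calibration of a failure-score threshold: with roughly
    (1 - alpha) coverage, a successful rollout stays below the threshold, so
    crossing it at test time triggers a failure alert."""
    scores = np.sort(np.asarray(scores_on_successful_rollouts))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1  # index of the conformal quantile
    return scores[min(k, n - 1)]

# Usage sketch (hypothetical variable names):
# threshold = conformal_failure_threshold(calibration_scores)
# failure_detected = any(score > threshold for score in test_rollout_scores)
```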

ICLR Conference 2025 Conference Paper

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

  • Koichi Namekata
  • Sherwin Bahmani
  • Ziyi Wu 0002
  • Yash Kant
  • Igor Gilitschenski
  • David B. Lindell

Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided—offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while significantly narrowing down the performance gap with supervised models in terms of visual quality and motion fidelity. Additional details and video results are available on our project page: https://kmcode1.github.io/Projects/SG-I2V

NeurIPS Conference 2025 Conference Paper

Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling

  • Shuhong Zheng
  • Ashkan Mirzaei
  • Igor Gilitschenski

Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.

NeurIPS Conference 2025 Conference Paper

UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting

  • Kai He
  • Ruofan Liang
  • Jacob Munkberg
  • Jon Hasselgren
  • Nandita Vijaykumar
  • Alexander Keller
  • Sanja Fidler
  • Igor Gilitschenski

We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency. Our project page is https://research.nvidia.com/labs/toronto-ai/UniRelight/.

RLJ Journal 2024 Journal Article

Dissecting Deep RL with High Update Ratios: Combatting Value Divergence

  • Marcel Hussing
  • Claas A Voelcker
  • Igor Gilitschenski
  • Amir-massoud Farahmand
  • Eric Eaton

We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples by combatting value function divergence. Under large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we investigate the phenomena leading to the primacy bias. We inspect the early stages of training that were conjectured to cause the failure to learn and find that one fundamental challenge is a long-standing acquaintance: value function divergence. Overinflated Q-values are found not only on out-of-distribution but also on in-distribution data and can be linked to overestimation on unseen action prediction propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in parts, the prior explanation for sub-optimal learning due to overfitting early data.
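
A minimal sketch of unit-ball feature normalization as described above; the exact placement and formulation used in the paper may differ.

```python
import torch

def unit_ball_normalize(features):
    """Illustrative unit-ball normalization: rescale penultimate-layer features so
    their norm is at most one, bounding the output scale of the value head and
    thereby limiting Q-value blow-up under large update ratios."""
    norm = features.norm(dim=-1, keepdim=True)
    return features / torch.clamp(norm, min=1.0)
```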

RLC Conference 2024 Conference Paper

Dissecting Deep RL with High Update Ratios: Combatting Value Divergence

  • Marcel Hussing
  • Claas A Voelcker
  • Igor Gilitschenski
  • Amir-massoud Farahmand
  • Eric Eaton

We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples by combatting value function divergence. Under large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we investigate the phenomena leading to the primacy bias. We inspect the early stages of training that were conjectured to cause the failure to learn and find that one fundamental challenge is a long-standing acquaintance: value function divergence. Overinflated Q-values are found not only on out-of-distribution but also on in-distribution data and can be linked to overestimation on unseen action prediction propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in parts, the prior explanation for sub-optimal learning due to overfitting early data.

NeurIPS Conference 2024 Conference Paper

GaussianCut: Interactive segmentation via graph cut for 3D Gaussian Splatting

  • Umangi Jain
  • Ashkan Mirzaei
  • Igor Gilitschenski

We introduce GaussianCut, a new method for interactive multiview segmentation of scenes represented as 3D Gaussians. Our approach allows for selecting the objects to be segmented by interacting with a single view. It accepts intuitive user input, such as point clicks, coarse scribbles, or text. Using 3D Gaussian Splatting (3DGS) as the underlying scene representation simplifies the extraction of objects of interest which are considered to be a subset of the scene's Gaussians. Our key idea is to represent the scene as a graph and use the graph-cut algorithm to minimize an energy function to effectively partition the Gaussians into foreground and background. To achieve this, we construct a graph based on scene Gaussians and devise a segmentation-aligned energy function on the graph to combine user inputs with scene properties. To obtain an initial coarse segmentation, we leverage 2D image/video segmentation models and further refine these coarse estimates using our graph construction. Our empirical evaluations show the adaptability of GaussianCut across a diverse set of scenes. GaussianCut achieves competitive performance with state-of-the-art approaches for 3D segmentation without requiring any additional segmentation-aware training.
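
A toy version of the graph-cut step, assuming per-Gaussian foreground/background affinities and pairwise similarity edges have already been derived (e.g. from user clicks and coarse 2D masks); it illustrates the energy-minimization idea rather than the authors' implementation.

```python
import networkx as nx

def cut_gaussians(unary_foreground, unary_background, pairwise_edges):
    """Partition scene Gaussians into foreground/background with a minimum s-t cut.
    unary_* are per-Gaussian affinities, pairwise_edges are (i, j, weight) tuples
    between neighboring Gaussians encoding smoothness."""
    graph = nx.DiGraph()
    for i, (fg, bg) in enumerate(zip(unary_foreground, unary_background)):
        graph.add_edge("src", i, capacity=fg)    # cutting this edge assigns i to background
        graph.add_edge(i, "sink", capacity=bg)   # cutting this edge assigns i to foreground
    for i, j, w in pairwise_edges:               # smoothness between neighboring Gaussians
        graph.add_edge(i, j, capacity=w)
        graph.add_edge(j, i, capacity=w)
    _, (source_side, _) = nx.minimum_cut(graph, "src", "sink")
    return sorted(i for i in source_side if i != "src")  # Gaussians on the foreground side
```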

NeurIPS Conference 2024 Conference Paper

NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking

  • Daniel Dauner
  • Marcel Hallgarten
  • Tianyu Li
  • Xinshuo Weng
  • Zhiyu Huang
  • Zetong Yang
  • Hongyang Li
  • Igor Gilitschenski

Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird's eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements, such as TransFuser, can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at https://github.com/autonomousvision/navsim.
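
A tiny illustration of the kind of simulation-based metric mentioned above (time to collision from unrolled positions); NAVSIM's actual metrics and bird's eye view unrolling are considerably more involved.

```python
import numpy as np

def time_to_collision(ego_xy, agent_xy, dt=0.1, collision_radius=2.0):
    """Return the first time (in seconds) at which the unrolled ego and agent
    positions come within a collision radius, or infinity if they never do.
    Both inputs are arrays of shape [horizon, 2]."""
    distances = np.linalg.norm(np.asarray(ego_xy) - np.asarray(agent_xy), axis=-1)
    hits = np.nonzero(distances < collision_radius)[0]
    return hits[0] * dt if hits.size else np.inf
```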

NeurIPS Conference 2024 Conference Paper

Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

  • Ziyi Wu
  • Yulia Rubanova
  • Rishabh Kabra
  • Drew A. Hudson
  • Igor Gilitschenski
  • Yusuf Aytar
  • Sjoerd van Steenkiste
  • Kelsey R. Allen

We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame in the video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame, which enables learning disentangled appearance and position features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image interface of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model with this information, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on both synthetic 3D scene datasets, as well as two real-world video datasets (Objectron, Waymo Open).

RLJ Journal 2024 Journal Article

When does Self-Prediction help? Understanding Auxiliary Tasks in Reinforcement Learning

  • Claas A Voelcker
  • Tyler Kastner
  • Igor Gilitschenski
  • Amir-massoud Farahmand

We investigate the impact of auxiliary learning tasks such as observation reconstruction and latent self-prediction on the representation learning problem in reinforcement learning. We also study how they interact with distractions and observation functions in the MDP. We provide a theoretical analysis of the learning dynamics of observation reconstruction, latent self-prediction, and TD learning in the presence of distractions and observation functions under linear model assumptions. With this formalization, we are able to explain why latent self-prediction is a helpful auxiliary task, while observation reconstruction can provide more useful features when used in isolation. Our empirical analysis shows that the insights obtained from our learning dynamics framework predict the behavior of these loss functions beyond the linear model assumption in non-linear neural networks. This reinforces the usefulness of the linear model framework not only for theoretical analysis, but also for its practical benefit in applied problems.
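
The two auxiliary tasks under study, written as minimal losses with hypothetical encoder, predictor, and decoder modules; this only fixes notation and is not the paper's analysis setup.

```python
import torch
import torch.nn.functional as F

def auxiliary_losses(encoder, predictor, decoder, obs, next_obs):
    """The two auxiliary tasks in minimal form: latent self-prediction (predict the
    next latent from the current one, with a stop-gradient target) and observation
    reconstruction (decode the observation back from its latent)."""
    z, z_next = encoder(obs), encoder(next_obs)
    self_prediction = F.mse_loss(predictor(z), z_next.detach())
    reconstruction = F.mse_loss(decoder(z), obs)
    return self_prediction, reconstruction
```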

RLC Conference 2024 Conference Paper

When does Self-Prediction help? Understanding Auxiliary Tasks in Reinforcement Learning

  • Claas A Voelcker
  • Tyler Kastner
  • Igor Gilitschenski
  • Amir-massoud Farahmand

We investigate the impact of auxiliary learning tasks such as observation reconstruction and latent self-prediction on the representation learning problem in reinforcement learning. We also study how they interact with distractions and observation functions in the MDP. We provide a theoretical analysis of the learning dynamics of observation reconstruction, latent self-prediction, and TD learning in the presence of distractions and observation functions under linear model assumptions. With this formalization, we are able to explain why latent self-prediction is a helpful auxiliary task, while observation reconstruction can provide more useful features when used in isolation. Our empirical analysis shows that the insights obtained from our learning dynamics framework predict the behavior of these loss functions beyond the linear model assumption in non-linear neural networks. This reinforces the usefulness of the linear model framework not only for theoretical analysis, but also for its practical benefit in applied problems.

NeurIPS Conference 2023 Conference Paper

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

  • Ziyi Wu
  • Jingyu Hu
  • Wuyue Lu
  • Igor Gilitschenski
  • Animesh Garg

Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.

ICLR Conference 2023 Conference Paper

Solving Continuous Control via Q-learning

  • Tim Seyde
  • Peter Werner
  • Wilko Schwarting
  • Igor Gilitschenski
  • Martin A. Riedmiller
  • Daniela Rus
  • Markus Wulfmeier

While there has been substantial success for solving continuous control with actor-critic methods, simpler critic-only methods such as Q-learning find limited application in the associated high-dimensional action spaces. However, most actor-critic methods come at the cost of added complexity: heuristics for stabilisation, compute requirements and wider hyperparameter search spaces. We show that a simple modification of deep Q-learning largely alleviates these issues. By combining bang-bang action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches performance of state-of-the-art continuous actor-critic methods when learning from features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a variety of continuous control tasks.
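
A minimal sketch of the decoupled bang-bang action readout described above, assuming a critic that outputs two Q estimates per action dimension; details such as the value-decomposition mixing network are omitted.

```python
import torch

def bang_bang_action(q_values):
    """Read out a bang-bang action from a decoupled critic: q_values has shape
    [batch, action_dims, 2], one Q estimate per action dimension for the two
    extreme actions; each dimension is argmaxed independently and mapped to {-1, +1}."""
    index = q_values.argmax(dim=-1)   # [batch, action_dims], 0 or 1 per dimension
    return index.float() * 2.0 - 1.0  # 0 -> -1.0, 1 -> +1.0
```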

NeurIPS Conference 2023 Conference Paper

trajdata: A Unified Interface to Multiple Human Trajectory Datasets

  • Boris Ivanovic
  • Guanyu Song
  • Igor Gilitschenski
  • Marco Pavone

The field of trajectory forecasting has grown significantly in recent years, partially owing to the release of numerous large-scale, real-world human trajectory datasets for autonomous vehicles (AVs) and pedestrian motion tracking. While such datasets have been a boon for the community, they each use custom and unique data formats and APIs, making it cumbersome for researchers to train and evaluate methods across multiple datasets. To remedy this, we present trajdata: a unified interface to multiple human trajectory datasets. At its core, trajdata provides a simple, uniform, and efficient representation and API for trajectory and map data. As a demonstration of its capabilities, in this work we conduct a comprehensive empirical evaluation of existing trajectory datasets, providing users with a rich understanding of the data underpinning much of current pedestrian and AV motion forecasting research, and proposing suggestions for future datasets from these insights. trajdata is permissively licensed (Apache 2.0) and can be accessed online at https://github.com/NVlabs/trajdata.

ICRA Conference 2022 Conference Paper

A Deep Concept Graph Network for Interaction-Aware Trajectory Prediction

  • Yutong Ban
  • Xiao Li 0025
  • Guy Rosman
  • Igor Gilitschenski
  • Ozanan R. Meireles
  • Sertac Karaman
  • Daniela Rus

Temporal patterns (how vehicles behave in our observed past) underline our reasoning of how people drive on the road, and can explain why we make certain predictions about interactions among road agents. In this paper we propose the ConceptNet trajectory predictor - a novel prediction framework that is able to incorporate agent interactions as explicit edges in a temporal knowledge graph. We demonstrate the sample efficiency and the overall accuracy of the proposed approach, and show that using the graphical structure to explicitly model interactions enables better detection of agent interactions and improved trajectory predictions on a large real-world driving dataset.

ICRA Conference 2022 Conference Paper

HYPER: Learned Hybrid Trajectory Prediction via Factored Inference and Adaptive Sampling

  • Xin Huang 0018
  • Guy Rosman
  • Igor Gilitschenski
  • Ashkan Jasour
  • Stephen G. McGill
  • John J. Leonard
  • Brian Williams 0001

Modeling multi-modal high-level intent is important for ensuring diversity in trajectory prediction. Existing approaches explore the discrete nature of human intent before predicting continuous trajectories, to improve accuracy and support explainability. However, these approaches often assume the intent to remain fixed over the prediction horizon, which is problematic in practice, especially over longer horizons. To overcome this limitation, we introduce HYPER, a general and expressive hybrid prediction framework that models evolving human intent. By modeling traffic agents as a hybrid discrete-continuous system, our approach is capable of predicting discrete intent changes over time. We learn the probabilistic hybrid model via a maximum likelihood estimation problem and leverage neural proposal distributions to sample adaptively from the exponentially growing discrete space. The overall approach affords a better trade-off between accuracy and coverage. We train and validate our model on the Argoverse dataset, and demonstrate its effectiveness through comprehensive ablation studies and comparisons with state-of-the-art models.

ICRA Conference 2022 Conference Paper

Learning Interactive Driving Policies via Data-driven Simulation

  • Tsun-Hsuan Wang
  • Alexander Amini
  • Wilko Schwarting
  • Igor Gilitschenski
  • Sertac Karaman
  • Daniela Rus

Data-driven simulators promise high data-efficiency for driving policy learning. When used for modelling interactions, this data-efficiency becomes a bottleneck: small underlying datasets often lack interesting and challenging edge cases for learning interactive driving. We address this challenge by proposing a data-driven simulation engine that uses inpainted ado vehicles for learning robust driving policies. Thus, our approach can be used to learn policies that involve multi-agent interactions and allows for training via state-of-the-art policy learning methods. We evaluate the approach for learning standard interaction scenarios in driving. In extensive experiments, our work demonstrates that the resulting policies can be directly transferred to a full-scale autonomous vehicle without making use of any traditional sim-to-real transfer techniques such as domain randomization.

ICRA Conference 2022 Conference Paper

VISTA 2.0: An Open, Data-driven Simulator for Multimodal Sensing and Policy Learning for Autonomous Vehicles

  • Alexander Amini
  • Tsun-Hsuan Wang
  • Igor Gilitschenski
  • Wilko Schwarting
  • Zhijian Liu
  • Song Han 0003
  • Sertac Karaman
  • Daniela Rus

Simulation has the potential to transform the development of robust algorithms for mobile agents deployed in safety-critical scenarios. However, the poor photorealism and lack of diverse sensor modalities of existing simulation engines remain key hurdles towards realizing this potential. Here, we present VISTA, an open-source, data-driven simulator that integrates multiple types of sensors for autonomous vehicles (full code release for the VISTA data-driven simulation engine is available at vista.csail.mit.edu). Using high-fidelity, real-world datasets, VISTA represents and simulates RGB cameras, 3D LiDAR, and event-based cameras, enabling the rapid generation of novel viewpoints in simulation and thereby enriching the data available for policy learning with corner cases that are difficult to capture in the physical world. Using VISTA, we demonstrate the ability to train and test perception-to-control policies across each of the sensor types and showcase the power of this approach via deployment on a full-scale autonomous vehicle. The policies learned in VISTA exhibit sim-to-real transfer without modification and greater robustness than those trained exclusively on real-world data.

NeurIPS Conference 2021 Conference Paper

Is Bang-Bang Control All You Need? Solving Continuous Control with Bernoulli Policies

  • Tim Seyde
  • Igor Gilitschenski
  • Wilko Schwarting
  • Bartolomeo Stellato
  • Martin Riedmiller
  • Markus Wulfmeier
  • Daniela Rus

Reinforcement learning (RL) for continuous control typically employs distributions whose support covers the entire action space. In this work, we investigate the colloquially known phenomenon that trained agents often prefer actions at the boundaries of that space. We draw theoretical connections to the emergence of bang-bang behavior in optimal control, and provide extensive empirical evaluation across a variety of recent RL algorithms. We replace the normal Gaussian by a Bernoulli distribution that solely considers the extremes along each action dimension - a bang-bang controller. Surprisingly, this achieves state-of-the-art performance on several continuous control benchmarks - in contrast to robotic hardware, where energy and maintenance cost affect controller choices. Since exploration, learning, and the final solution are entangled in RL, we provide additional imitation learning experiments to reduce the impact of exploration on our analysis. Finally, we show that our observations generalize to environments that aim to model real-world challenges and evaluate factors to mitigate the emergence of bang-bang solutions. Our findings emphasise challenges for benchmarking continuous control algorithms, particularly in light of potential real-world applications.

ICLR Conference 2020 Conference Paper

Deep Orientation Uncertainty Learning based on a Bingham Loss

  • Igor Gilitschenski
  • Roshni Sahoo
  • Wilko Schwarting
  • Alexander Amini
  • Sertac Karaman
  • Daniela Rus

Reasoning about uncertain orientations is one of the core problems in many perception tasks such as object pose estimation or motion estimation. In these scenarios, poor illumination conditions, sensor limitations, or appearance invariance may result in highly uncertain estimates. In this work, we propose a novel learning-based representation for orientation uncertainty. By characterizing uncertainty over unit quaternions with the Bingham distribution, we formulate a loss that naturally captures the antipodal symmetry of the representation. We discuss the interpretability of the learned distribution parameters and demonstrate the feasibility of our approach on several challenging real-world pose estimation tasks involving uncertain orientations.
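
A sketch of the antipodal-symmetry property that motivates the Bingham loss: the unnormalized log-density is a quadratic form in the quaternion and is unchanged under q -> -q. The nontrivial log normalizing constant is omitted, so this is not the full loss used in the paper.

```python
import torch

def bingham_unnormalized_nll(M, Z, q_true):
    """Unnormalized Bingham negative log-likelihood. The log-density is the quadratic
    form q^T (M Z M^T) q, invariant to q -> -q, which respects the antipodal symmetry
    of unit quaternions. M is an orthogonal 4x4 matrix, Z a diagonal 4x4 matrix of
    concentrations, q_true a [batch, 4] tensor of ground-truth unit quaternions.
    The (expensive) log normalizing constant is deliberately left out of this sketch."""
    A = M @ Z @ M.transpose(-1, -2)
    quad = torch.einsum("bi,ij,bj->b", q_true, A, q_true)
    return -quad.mean()
```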

IROS Conference 2020 Conference Paper

Exploiting Semantic and Public Prior Information in MonoSLAM

  • Chenxi Ye
  • Yiduo Wang 0001
  • Ziwen Lu
  • Igor Gilitschenski
  • Martin P. Parsley
  • Simon Julier

In this paper, we propose a method to use semantic information to improve the use of map priors in a sparse, feature-based MonoSLAM system. To incorporate the priors, the features in the prior and SLAM maps must be associated with one another. Most existing systems build a map using SLAM and then align it with the prior map. However, this approach assumes that the local map is accurate, and the majority of the features within it can be constrained by the prior. We use the intuition that many prior maps are created to provide semantic information. Therefore, valid associations only exist if the features in the SLAM map arise from the same kind of semantic object as the prior map. Using this intuition, we extend ORB-SLAM2 using an open source pre-trained semantic segmentation network (DeepLabV3+) to incorporate prior information from Open Street Map building footprint data. We show that the amount of drift, before loop closing, is significantly smaller than that for original ORB-SLAM2. Furthermore, we show that when ORB-SLAM2 is used as a prior-aided visual odometry system, the tracking accuracy is equal to or better than the full ORB-SLAM2 system without the need for global mapping or loop closure.

IROS Conference 2019 Conference Paper

Infrastructure-free NLoS Obstacle Detection for Autonomous Cars

  • Felix Naser
  • Igor Gilitschenski
  • Alexander Amini
  • Christina Liao
  • Guy Rosman
  • Sertac Karaman
  • Daniela Rus

Current perception systems mostly require direct line of sight to anticipate and ultimately prevent potential collisions at intersections with other road users. We present a fully integrated autonomous system capable of detecting shadows or weak illumination changes on the ground caused by a dynamic obstacle in NLoS scenarios. This additional virtual sensor, “ShadowCam”, extends the signal range utilized so far by computer-vision ADASs. We show that (1) our algorithm maintains a mean classification accuracy of around 70% even when it does not rely on infrastructure, such as AprilTags, for image registration. We validate (2) in real-world experiments that our autonomous car, driving in nighttime conditions, detects a hidden approaching car earlier with our virtual sensor than with the front-facing 2D LiDAR.

IROS Conference 2018 Conference Paper

LandmarkBoost: Efficient Visual Context Classifiers for Robust Localization

  • Marcin Dymczyk
  • Igor Gilitschenski
  • Juan I. Nieto 0001
  • Simon Lynen
  • Bernhard Zeisl
  • Roland Siegwart

The growing popularity of autonomous systems creates a need for reliable and efficient metric pose retrieval algorithms. Currently used approaches tend to rely on nearest neighbor search of binary descriptors to perform the 2D-3D matching and guarantee realtime capabilities on mobile platforms. These methods struggle, however, with the growing size of the map, changes in viewpoint or appearance, and visual aliasing present in the environment. The rigidly defined descriptor patterns only capture a limited neighborhood of the keypoint and completely ignore the overall visual context. We propose LandmarkBoost - an approach that, in contrast to the conventional 2D-3D matching methods, casts the search problem as a landmark classification task. We use a boosted classifier to classify landmark observations and directly obtain correspondences as classifier scores. We also introduce a formulation of visual context that is flexible, efficient to compute, and can capture relationships in the entire image plane. The original binary descriptors are augmented with contextual information and informative features are selected by the boosting framework. Through detailed experiments, we evaluate the retrieval quality and performance of LandmarkBoost, demonstrating that it outperforms common state-of-the-art descriptor matching methods.

IROS Conference 2017 Conference Paper

A low-cost system for high-rate, high-accuracy temporal calibration for LIDARs and cameras

  • Hannes Sommer
  • Raghav Khanna
  • Igor Gilitschenski
  • Zachary Taylor
  • Roland Siegwart
  • Juan I. Nieto 0001

Deployment of camera and laser based motion estimation systems for controlling platforms operating at high speeds, such as cars or trains, is posing increasingly challenging precision requirements on the temporal calibration of these sensors. In this work, we demonstrate a simple, low-cost system for calibrating any combination of cameras and time-of-flight LIDARs with respect to the CPU clock (and therefore, also to each other). The newly proposed device is based on widely available off-the-shelf components, such as the Raspberry Pi 3, which is synchronized using the Precision Time Protocol (PTP) with respect to the CPU of the sensor-carrying system. The obtained accuracy can be shown to be below 0.1 ms per measurement for LIDARs and below minimal exposure time per image for cameras. It outperforms state-of-the-art approaches also not relying on hardware synchronization by more than a factor of 10 in precision. Moreover, the entire process can be carried out at a high rate, allowing the study of how offsets evolve over time. In our analysis, we demonstrate how each building block of the system contributes to this accuracy and validate the obtained results using real-world data.

ICRA Conference 2017 Conference Paper

Efficient descriptor learning for large scale localization

  • Antonio Loquercio
  • Marcin Dymczyk
  • Bernhard Zeisl
  • Simon Lynen
  • Igor Gilitschenski
  • Roland Siegwart

Many robotics and Augmented Reality (AR) systems that use sparse keypoint-based visual maps operate in large and highly repetitive environments, where pose tracking and localization are challenging tasks. Additionally, these systems usually face further challenges, such as limited computational power or insufficient memory for storing large maps of the entire environment. Thus, developing compact map representations and improving retrieval is of considerable interest for enabling large-scale visual place recognition and loop-closure. In this paper, we propose a novel approach to compress descriptors while increasing their discriminability and matchability, based on recent advances in neural networks. At the same time, we target resource-constrained robotics applications in our design choices. The main contributions of this work are twofold. First, we propose a linear projection from descriptor space to a lower-dimensional Euclidean space, based on a novel supervised learning strategy employing a triplet loss. Second, we show the importance of including contextual appearance information in the visual feature in order to improve matching under strong viewpoint, illumination and scene changes. Through detailed experiments on three challenging datasets, we demonstrate significant gains in performance over state-of-the-art methods.
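
A minimal sketch of the first contribution (a supervised linear projection trained with a triplet loss), assuming the contextual augmentation of descriptors happens upstream; names and the margin value are illustrative.

```python
import torch
import torch.nn.functional as F

def projected_triplet_loss(W, anchor, positive, negative, margin=0.5):
    """Project raw (context-augmented) descriptors with a learned linear map W into
    a lower-dimensional Euclidean space and apply a triplet margin loss so that
    matching descriptors end up closer together than non-matching ones."""
    a, p, n = anchor @ W, positive @ W, negative @ W
    return F.triplet_margin_loss(a, p, n, margin=margin)
```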

ICRA Conference 2017 Conference Paper

Map quality evaluation for visual localization

  • Hamza Merzic
  • Elena Stumm
  • Marcin Dymczyk
  • Roland Siegwart
  • Igor Gilitschenski

A variety of end-user devices involving keypoint-based mapping systems are about to hit the market, e.g., as part of smartphones, cars, robotic platforms, or virtual and augmented reality applications. Thus, the generated map data requires automated evaluation procedures that do not require experienced personnel or ground truth knowledge of the underlying environment. A particularly important question enabling commercial applications is whether a given map is of sufficient quality for localization. This paper proposes a framework for predicting localization performance in the context of visual landmark-based mapping. Specifically, we propose an algorithm for predicting performance of vision-based localization systems from different poses within the map. To achieve this, a metric is defined that assigns a score to a given query pose based on the underlying map structure. The algorithm is evaluated on two challenging datasets involving indoor data generated using a handheld device and outdoor data from an autonomous fixed-wing unmanned aerial vehicle (UAV). Using these, we are able to show that the score provided by our method is highly correlated to the true localization performance. Furthermore, we demonstrate how the predicted map quality can be used within a belief based path planning framework in order to provide reliable trajectories through high-quality areas of the map.

IROS Conference 2017 Conference Paper

Onboard real-time dense reconstruction of large-scale environments for UAV

  • Anurag Sai Vempati
  • Igor Gilitschenski
  • Juan I. Nieto 0001
  • Paul A. Beardsley
  • Roland Siegwart

In this paper, we propose a GPU-parallelized SLAM system capable of using photometric and inertial data together with depth data from an active RGB-D sensor to build accurate dense 3D maps of indoor environments. We describe several extensions to existing dense SLAM techniques that allow us to operate in real time onboard memory-constrained robotic platforms. Our primary contribution is a memory management algorithm that scales to large scenes without being limited by GPU memory resources. Moreover, by integrating a visual-inertial odometry system, we robustly track the camera pose even on an agile platform such as a quadrotor UAV. Our robust camera tracking framework can deal with fast camera motions and varying environments by relying on depth, color and inertial motion cues. Global consistency is achieved via regular checking for loop closures in conjunction with a pose graph, as a basis for corrective deformation of the 3D map. Our efficient SLAM system is capable of producing highly dense meshes up to 5 mm resolution at rates close to 60 Hz fully onboard a UAV. Experimental validations, both in simulation and on a real-world platform, show that our approach is fast, more robust and more memory efficient than state-of-the-art techniques, while obtaining better or comparable accuracy.

ICRA Conference 2017 Conference Paper

TSDF-based change detection for consistent long-term dense reconstruction and dynamic object discovery

  • Marius Fehr
  • Fadri Furrer
  • Ivan Dryanovski
  • Jürgen Sturm
  • Igor Gilitschenski
  • Roland Siegwart
  • Cesar Cadena 0001

Robots that are operating for extended periods of time need to be able to deal with changes in their environment and represent them adequately in their maps. In this paper, we present a novel 3D reconstruction algorithm based on an extended Truncated Signed Distance Function (TSDF) that enables continuous refinement of the static map while simultaneously obtaining 3D reconstructions of dynamic objects in the scene. This is a challenging problem because map updates happen incrementally and are often incomplete. Previous work typically performs change detection on point clouds, surfels or maps, which are not able to distinguish between unexplored and empty space. In contrast, our TSDF-based representation naturally contains this information and thus allows us to more robustly solve the scene differencing problem. We demonstrate the algorithm's performance as part of a system for unsupervised object discovery and class recognition. We evaluated our algorithm on challenging datasets that we recorded over several days with RGB-D enabled tablets. To stimulate further research in this area, all of our datasets are publicly available.
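
A toy voxel-wise illustration of why a TSDF with integration weights simplifies change detection compared to point clouds or surfels: unobserved space can be masked out explicitly. This is the core intuition only, not the paper's algorithm.

```python
import numpy as np

def tsdf_change_mask(tsdf_old, tsdf_new, weight_old, weight_new, threshold=0.2):
    """Flag a voxel as changed only if it was observed in both maps (nonzero
    integration weight), which is how a TSDF separates empty from unexplored space,
    and if its signed distance moved by more than a threshold."""
    observed_in_both = (weight_old > 0) & (weight_new > 0)
    return observed_in_both & (np.abs(tsdf_new - tsdf_old) > threshold)
```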

ICRA Conference 2017 Conference Paper

Visual-inertial self-calibration on informative motion segments

  • Thomas Schneider 0007
  • Mingyang Li 0001
  • Michael Burri
  • Juan I. Nieto 0001
  • Roland Siegwart
  • Igor Gilitschenski

Environmental conditions and external effects, such as shocks, have a significant impact on the calibration parameters of visual-inertial sensor systems. Thus long-term operation of these systems cannot fully rely on factory calibration. Since the observability of certain parameters is highly dependent on the motion of the device, using short data segments at device initialization may yield poor results. When such systems are additionally subject to energy constraints, it is also infeasible to use full-batch approaches on a big dataset and careful selection of the data is of high importance. In this paper, we present a novel approach for resource efficient self-calibration of visual-inertial sensor systems. This is achieved by casting the calibration as a segment-based optimization problem that can be run on a small subset of informative segments. Consequently, the computational burden is limited as only a predefined number of segments is used. We also propose an efficient information-theoretic selection to identify such informative motion segments. In evaluations on a challenging dataset, we show our approach to significantly outperform state-of-the-art in terms of computational burden while maintaining a comparable accuracy.

IROS Conference 2016 Conference Paper

Appearance-based landmark selection for efficient long-term visual localization

  • Mathias Bürki
  • Igor Gilitschenski
  • Elena Stumm
  • Roland Siegwart
  • Juan I. Nieto 0001

In this paper, we present an online landmark selection method for distributed long-term visual localization systems in bandwidth-constrained environments. Sharing a common map for online localization provides a fleet of autonomous vehicles with the possibility to maintain and access a consistent map source, and therefore reduce redundancy while increasing efficiency. However, connectivity over a mobile network imposes strict bandwidth constraints and thus the need to minimize the amount of exchanged data. The wide range of varying appearance conditions encountered during long-term visual localization offers the potential to reduce data usage by extracting only those visual cues which are relevant at the given time. Motivated by this, we propose an unsupervised method of adaptively selecting landmarks according to how likely these landmarks are to be observable under the prevailing appearance condition. The ranking function this selection is based upon exploits landmark co-observability statistics collected in past traversals through the mapped area. Evaluation is performed over different outdoor environments, large time-scales and varying appearance conditions, including the extreme transition from day-time to night-time, demonstrating that with our appearance-dependent selection method, we can significantly reduce the amount of landmarks used for localization while maintaining or even improving the localization performance.

IROS Conference 2016 Conference Paper

Erasing bad memories: Agent-side summarization for long-term mapping

  • Marcin Dymczyk
  • Thomas Schneider 0007
  • Igor Gilitschenski
  • Roland Siegwart
  • Elena Stumm

Precisely estimating the pose of an agent in a global reference frame is a crucial goal that unlocks a multitude of robotic applications, including autonomous navigation and collaboration. In order to achieve this, current state-of-the-art localization approaches collect data provided by one or more agents and create a single, consistent localization map, maintained over time. However, with the introduction of lengthier sorties and the growing size of the environments, data transfers between the backend server where the global map is stored and the agents are becoming prohibitively large. While some existing methods partially address this issue by building compact summary maps, the data transfer from the agents to the backend can still easily become unmanageable. In this paper, we propose a method that is designed to reduce the amount of data that needs to be transferred from the agent to the backend, functioning in large-scale, multi-session mapping scenarios. Our approach is based upon a landmark selection method that exploits information coming from multiple, possibly weak and correlated, landmark utility predictors, fused using learned feature coefficients. Such a selection yields a drastic reduction in data transfer while maintaining localization performance and the ability to efficiently summarize environments over time. We evaluate our approach on a data set that was autonomously collected in a dynamic indoor environment over a period of several months.

IROS Conference 2016 Conference Paper

Generalized information filtering for MAV parameter estimation

  • Michael Burri
  • Michael Bloesch
  • Dominik Schindler
  • Igor Gilitschenski
  • Zachary Taylor
  • Roland Siegwart

In this paper we present a new estimation algorithm that allows for the combination of information from any number of process and measurement models. This adds more flexibility to the design of the estimator and in our case avoids the need for state augmentation. We achieve this by adapting the maximum likelihood formulation of the Kalman Filter, and thereby represent all measurement models as residuals. Posing the problem in this form allows for the straightforward integration of any number of (nonlinear) constraints between two subsequent states. To solve the optimization, we present a closed-form recursive set of equations that directly marginalizes out information that is not required, which leads to an efficient and generic implementation. The new algorithm is applied to parameter estimation on MAVs, which have two dynamic models: the MAV dynamic model and the IMU-driven model. We show the benefits and limitations of the new filtering approach on a simplified simulation example and on a real MAV system.

IROS Conference 2016 Conference Paper

Robust map generation for fixed-wing UAVs with low-cost highly-oblique monocular cameras

  • Timo Hinzmann
  • Thomas Schneider 0007
  • Marcin Dymczyk
  • Amir Melzer
  • Thomas Mantel
  • Roland Siegwart
  • Igor Gilitschenski

Accurate and robust real-time map generation onboard of a fixed-wing UAV is essential for obstacle avoidance, path planning, and critical maneuvers such as autonomous take-off and landing. Due to the computational constraints, the required robustness and reliability, it remains a challenge to deploy a fixed-wing UAV with an online-capable, accurate and robust map generation framework. While photogrammetric approaches have underlying assumptions on the structure and the view of the camera, generic simultaneous localization and mapping (SLAM) approaches are computationally demanding. This paper presents a framework that uses the autopilot's state estimate as a prior for sliding window bundle adjustment and map generation. Our approach outputs an accurate geo-referenced dense point-cloud which was validated in simulation on a synthetic dataset and on two real-world scenarios based on ground control points.