Arrow Research search

Author name cluster

Torsten Sattler

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

NeurIPS 2025 Conference Paper

LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

  • Jonas Kulhanek
  • Marie-Julie Rakotosaona
  • Fabian Manhardt
  • Christina Tsalicoglou
  • Michael Niemeyer
  • Torsten Sattler
  • Songyou Peng
  • Federico Tombari

In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, substantially reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
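
To make the core idea concrete, a minimal sketch of distance-based level selection and boundary blending is shown below; the thresholds, margin, and function names are illustrative assumptions, not LODGE's actual implementation (which builds its hierarchy offline via smoothing, pruning, and fine-tuning).

```python
import numpy as np

def select_lod_level(centers, camera_pos, thresholds=(10.0, 30.0, 100.0)):
    """Assign each Gaussian an LOD level by camera distance:
    level 0 is the finest (closest); higher levels are coarser subsets."""
    dists = np.linalg.norm(centers - camera_pos, axis=1)
    return np.digitize(dists, thresholds)  # index of first threshold exceeded

def boundary_blend_weight(point, chunk_min, chunk_max, margin=1.0):
    """Linear opacity ramp near a chunk border, so Gaussians fade out
    instead of popping when a neighboring chunk is loaded or unloaded."""
    d = np.minimum(point - chunk_min, chunk_max - point)  # distance to faces
    return float(np.clip(d.min() / margin, 0.0, 1.0))
```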

NeurIPS 2025 Conference Paper

NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods

  • Jonas Kulhanek
  • Torsten Sattler

Novel view synthesis is an important problem with many applications, including AR/VR, gaming, and robotic simulations. With the recent rapid development of Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods, it is becoming difficult to keep track of the current state of the art (SoTA) due to methods using different evaluation protocols, codebases being difficult to install and use, and methods not generalizing well to novel 3D scenes. In our experiments, we show that even tiny differences in the evaluation protocols of various methods can artificially boost the performance of these methods. This raises questions about the validity of quantitative comparisons performed in the literature. To address these questions, we propose NerfBaselines, an evaluation framework which provides consistent benchmarking tools, ensures reproducibility, and simplifies the installation and use of various methods. We validate our implementation experimentally by reproducing the numbers reported in the original papers. For improved accessibility, we release a web platform that compares commonly used methods on standard benchmarks. We strongly believe NerfBaselines is a valuable contribution to the community as it ensures that quantitative results are comparable and thus truly measure progress in the field of novel view synthesis.
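
As a concrete illustration of how tiny protocol differences move numbers, the sketch below shows two aggregation schemes that are both reported as "PSNR" in the literature yet generally disagree; this is a hedged example, not NerfBaselines' API.

```python
import numpy as np

def psnr(pred, gt):
    """PSNR in dB for images with values in [0, 1]."""
    return float(-10.0 * np.log10(np.mean((pred - gt) ** 2)))

def mean_of_per_image_psnr(preds, gts):
    """Protocol A: average the per-image PSNRs."""
    return float(np.mean([psnr(p, g) for p, g in zip(preds, gts)]))

def psnr_of_pooled_mse(preds, gts):
    """Protocol B: one PSNR from the MSE pooled over all images.
    By Jensen's inequality A >= B, so the two protocols disagree."""
    mse = np.mean([np.mean((p - g) ** 2) for p, g in zip(preds, gts)])
    return float(-10.0 * np.log10(mse))
```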

IROS 2024 Conference Paper

Camera Pose Estimation from Bounding Boxes

  • Václav Vávra
  • Torsten Sattler
  • Zuzana Kukelova

Visual localization is an important part of many interesting applications, including robotics. The dominant localization strategy is to estimate the camera pose from 2D-3D matches between 2D pixel positions and 3D points. Yet, such approaches can be quite memory intensive and can lead to privacy risks. An interesting alternative to point-based matches is to use higher-level primitives for pose estimation. Consequently, this work investigates using correspondences between 2D and 3D bounding boxes for camera pose estimation. The resulting scene representation is compact and poses fewer privacy risks. In this setting, there are typically orders of magnitude fewer matches available compared to classical feature-based methods. In addition, the available correspondences are significantly noisier. We investigate multiple strategies based on converting bounding box correspondences to point correspondences and propose a novel and simple 2-point camera absolute pose solver (DP2P) that exploits the fact that the depths of the objects can be approximated from the sizes of their bounding boxes.
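
The depth approximation underlying DP2P is just the pinhole projection relation; a minimal sketch, assuming the focal length in pixels and a rough physical object size are known:

```python
def depth_from_bbox(object_height_m, bbox_height_px, focal_px):
    """Pinhole model: an object of physical height H at depth Z projects to
    h = f * H / Z pixels, so the depth can be approximated as Z = f * H / h."""
    return focal_px * object_height_m / bbox_height_px

# Example: a 1.5 m tall object spanning 100 px with f = 1000 px
# depth_from_bbox(1.5, 100, 1000) -> 15.0 (meters)
```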

NeurIPS 2024 Conference Paper

WildGaussians: 3D Gaussian Splatting In the Wild

  • Jonas Kulhanek
  • Songyou Peng
  • Zuzana Kukelova
  • Marc Pollefeys
  • Torsten Sattler

While the field of 3D scene reconstruction is dominated by NeRFs due to their photorealistic quality, 3D Gaussian Splatting (3DGS) has recently emerged, offering similar quality with real-time rendering speeds. However, both methods primarily excel with well-controlled 3D scenes, while in-the-wild data - characterized by occlusions, dynamic objects, and varying illumination - remains challenging. NeRFs can adapt to such conditions easily through per-image embedding vectors, but 3DGS struggles due to its explicit representation and lack of shared parameters. To address this, we introduce WildGaussians, a novel approach to handle occlusions and appearance changes with 3DGS. By leveraging robust DINO features and integrating an appearance modeling module within 3DGS, our method achieves state-of-the-art results. We demonstrate that WildGaussians matches the real-time rendering speed of 3DGS while surpassing both 3DGS and NeRF baselines in handling in-the-wild data, all within a simple architectural framework.
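
One way to read "appearance modeling module" in code is a per-image embedding that, together with a per-Gaussian feature, predicts an affine color correction. The sketch below uses assumed dimensions and layer choices; the paper's actual module differs in detail.

```python
import torch
import torch.nn as nn

class AppearanceModule(nn.Module):
    """Per-image embedding -> affine color correction per Gaussian.
    All dimensions here are illustrative assumptions."""
    def __init__(self, num_images, embed_dim=32, feat_dim=16, hidden=64):
        super().__init__()
        self.embeddings = nn.Embedding(num_images, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),  # 3 color gains + 3 color biases
        )

    def forward(self, image_idx, gauss_feats, base_colors):
        # one embedding per training image, shared by all Gaussians it sees
        e = self.embeddings(torch.as_tensor(image_idx))
        e = e.expand(gauss_feats.shape[0], -1)
        gain_bias = self.mlp(torch.cat([e, gauss_feats], dim=-1))
        gain, bias = gain_bias[:, :3], gain_bias[:, 3:]
        return base_colors * (1.0 + gain) + bias  # appearance-corrected colors
```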

NeurIPS 2022 Conference Paper

MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction

  • Zehao Yu
  • Songyou Peng
  • Michael Niemeyer
  • Torsten Sattler
  • Andreas Geiger

In recent years, neural implicit surface reconstruction methods have become popular for multi-view 3D reconstruction. In contrast to traditional multi-view stereo methods, these approaches tend to produce smoother and more complete reconstructions due to the inductive smoothness bias of neural networks. State-of-the-art neural implicit methods allow for high-quality reconstructions of simple scenes from many input views. Yet, their performance drops significantly for larger and more complex scenes and scenes captured from sparse viewpoints. This is caused primarily by the inherent ambiguity in the RGB reconstruction loss that does not provide enough constraints, in particular in less-observed and textureless areas. Motivated by recent advances in the area of monocular geometry prediction, we systematically explore the utility these cues provide for improving neural implicit surface reconstruction. We demonstrate that depth and normal cues, predicted by general-purpose monocular estimators, significantly improve reconstruction quality and optimization time. Further, we analyse multiple design choices for representing neural implicit surfaces, ranging from monolithic MLP models to single-grid and multi-resolution grid representations. We observe that geometric monocular priors improve performance both for small-scale single-object as well as large-scale multi-object scenes, independent of the choice of representation.
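
In code, the monocular cues typically enter as additional loss terms; the sketch below aligns the (scale- and shift-ambiguous) monocular depth to the rendered depth in closed form and adds an L1-plus-angular normal consistency term. The paper's exact weightings and details are omitted.

```python
import torch

def align_scale_shift(mono, pred):
    """Least-squares (scale, shift) aligning monocular depth to rendered
    depth, since monocular predictions carry an affine ambiguity."""
    A = torch.stack([mono, torch.ones_like(mono)], dim=-1)    # [N, 2]
    sol = torch.linalg.lstsq(A, pred.unsqueeze(-1)).solution  # [2, 1]
    return (A @ sol).squeeze(-1)

def monocular_cue_losses(pred_depth, mono_depth, pred_normal, mono_normal):
    """pred_*: rendered depth [N] / unit normals [N, 3]; mono_*: cues from a
    general-purpose monocular predictor for the same pixels."""
    aligned = align_scale_shift(mono_depth.flatten(), pred_depth.flatten())
    depth_loss = torch.mean((aligned - pred_depth.flatten()) ** 2)
    normal_loss = ((pred_normal - mono_normal).abs().sum(-1).mean()
                   + (1.0 - (pred_normal * mono_normal).sum(-1)).abs().mean())
    return depth_loss, normal_loss
```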

ICRA 2020 Conference Paper

To Learn or Not to Learn: Visual Localization from Essential Matrices

  • Qunjie Zhou
  • Torsten Sattler
  • Marc Pollefeys
  • Laura Leal-Taixé

Visual localization is the problem of estimating the pose of a camera within a scene and a key technology for autonomous robots. State-of-the-art approaches for accurate visual localization use scene-specific representations, resulting in the overhead of constructing these models when applying the techniques to new scenes. Recently, learned approaches based on relative pose estimation have been proposed, carrying the promise of easily adapting to new scenes. However, they are currently significantly less accurate than state-of-the-art approaches. In this paper, we are interested in analyzing this behavior. To this end, we propose a novel framework for visual localization from relative poses. Using a classical feature-based approach within this framework, we show state-of-the-art performance. Replacing the classical approach with learned alternatives at various levels, we then identify the reasons why deep-learned approaches do not perform well. Based on our analysis, we make recommendations for future work.
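
The geometric building block of such a framework, recovering a relative pose from matched points via the essential matrix, is a standard recipe (shown below with OpenCV; this is not the paper's full framework). Note that the translation comes out only as a direction, which is why localizing a query requires relative poses to at least two database images.

```python
import cv2
import numpy as np

def relative_pose(pts_query, pts_db, K):
    """Relative pose between a query and a database image from 2D-2D matches.
    pts_*: [N, 2] float arrays of matched pixel coordinates; K: 3x3 intrinsics."""
    E, inliers = cv2.findEssentialMat(pts_query, pts_db, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_query, pts_db, K, mask=inliers)
    return R, t  # rotation and unit-length translation direction
```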

ICRA 2019 Conference Paper

Efficient 2D-3D Matching for Multi-Camera Visual Localization

  • Marcel Geppert
  • Peidong Liu 0001
  • Zhaopeng Cui
  • Marc Pollefeys
  • Torsten Sattler

Visual localization, i.e., determining the position and orientation of a vehicle with respect to a map, is a key problem in autonomous driving. We present a multi-camera visual inertial localization algorithm for large scale environments. To efficiently and effectively match features against a pre-built global 3D map, we propose a prioritized feature matching scheme for multi-camera systems. In contrast to existing works, designed for monocular cameras, we (1) tailor the prioritization function to the multi-camera setup and (2) run feature matching and pose estimation in parallel. This significantly accelerates the matching and pose estimation stages and allows us to dynamically adapt the matching efforts based on the surrounding environment. In addition, we show how pose priors can be integrated into the localization system to increase efficiency and robustness. Finally, we extend our algorithm by fusing the absolute pose estimates with motion estimates from a multi-camera visual inertial odometry pipeline (VIO). This results in a system that provides reliable and drift-free pose estimation. Extensive experiments show that our localization runs fast and robustly under varying conditions, and that our extended algorithm enables reliable real-time pose estimation.
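
A toy version of prioritized matching with early termination is sketched below; the priority scores are left as an input precisely because the paper's contribution is how to compute them for a multi-camera rig.

```python
import heapq
import numpy as np

def prioritized_matching(cam_descs, cam_priorities, map_descs,
                         ratio=0.8, max_matches=50):
    """cam_descs: per-camera [Ni, D] descriptors; cam_priorities: per-feature
    scores (higher = matched sooner); map_descs: [M, D] map descriptors."""
    heap = [(-p, cam, i)
            for cam, prios in enumerate(cam_priorities)
            for i, p in enumerate(prios)]
    heapq.heapify(heap)                         # highest priority first
    matches = []
    while heap and len(matches) < max_matches:  # early termination bounds cost
        _, cam, i = heapq.heappop(heap)
        d = np.linalg.norm(map_descs - cam_descs[cam][i], axis=1)
        n1, n2 = np.partition(d, 1)[:2]         # two smallest distances
        if n1 < ratio * n2:                     # Lowe's ratio test
            matches.append((cam, i, int(np.argmin(d))))
    return matches
```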

ICRA 2019 Conference Paper

Incremental Visual-Inertial 3D Mesh Generation with Structural Regularities

  • Antoni Rosinol
  • Torsten Sattler
  • Marc Pollefeys
  • Luca Carlone

Visual-Inertial Odometry (VIO) algorithms typically rely on a point cloud representation of the scene that does not model the topology of the environment. A 3D mesh instead offers a richer, yet lightweight, model. Nevertheless, building a 3D mesh out of the sparse and noisy 3D landmarks triangulated by a VIO algorithm often results in a mesh that does not fit the real scene. In order to regularize the mesh, previous approaches decouple state estimation from the 3D mesh regularization step, and either limit the 3D mesh to the current frame [1], [2] or let the mesh grow indefinitely [3], [4]. We propose instead to tightly couple mesh regularization and state estimation by detecting and enforcing structural regularities in a novel factor-graph formulation. We also propose to incrementally build the mesh by restricting its extent to the time-horizon of the VIO optimization; the resulting 3D mesh covers a larger portion of the scene than a per-frame approach while its memory usage and computational complexity remain bounded. We show that our approach successfully regularizes the mesh, while improving localization accuracy, when structural regularities are present, and remains operational in scenes without regularities.
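
The per-keyframe mesh such systems start from is typically a 2D Delaunay triangulation of the tracked features whose connectivity is lifted to the 3D landmarks; a minimal sketch (the paper's regularity detection and factor-graph coupling go well beyond this):

```python
import numpy as np
from scipy.spatial import Delaunay

def mesh_from_tracked_features(pix_uv, landmarks_3d):
    """pix_uv: [N, 2] pixel coordinates of tracked features in a keyframe;
    landmarks_3d: [N, 3] VIO-triangulated landmarks. Triangulating in 2D is
    cheap, and the connectivity is reused for the 3D mesh."""
    tri = Delaunay(pix_uv)              # 2D Delaunay in image space
    return landmarks_3d, tri.simplices  # mesh vertices and triangle indices
```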

ICRA 2019 Conference Paper

Night-to-Day Image Translation for Retrieval-based Localization

  • Asha Anoosheh
  • Torsten Sattler
  • Radu Timofte
  • Marc Pollefeys
  • Luc Van Gool

Visual localization is a key step in many robotics pipelines, allowing the robot to (approximately) determine its position and orientation in the world. An efficient and scalable approach to visual localization is to use image retrieval techniques. These approaches identify the image most similar to a query photo in a database of geo-tagged images and approximate the query’s pose via the pose of the retrieved database image. However, image retrieval across drastically different illumination conditions, e.g., day and night, is still a problem with unsatisfactory results, even in this age of powerful neural models. This is due to a lack of a suitably diverse dataset with true correspondences to perform end-to-end learning. A recent class of neural models allows for realistic translation of images among visual domains with relatively little training data and, most importantly, without ground-truth pairings. In this paper, we explore the task of accurately localizing images captured from two traversals of the same area in both day and night. We propose ToDayGAN – a modified image-translation model to alter nighttime driving images to a more useful daytime representation. We then compare the daytime and translated night images to obtain a pose estimate for the night image using the known 6-DOF position of the closest day image. Our approach improves localization performance by over 250% compared to the current state-of-the-art, in the context of standard metrics in multiple categories.
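
In outline the retrieval pipeline is short, as sketched below; `generator` (the trained translation model) and `describe` (a global image descriptor) are stand-ins, not fixed APIs.

```python
import numpy as np

def localize_night_image(night_img, generator, describe, db_descs, db_poses):
    """db_descs: [M, D] descriptors of geo-tagged day images; db_poses: their
    known 6-DOF poses. The query pose is approximated by a nearest neighbor."""
    day_like = generator(night_img)          # night -> day translation
    q = describe(day_like)                   # global descriptor of the query
    sims = (db_descs @ q) / (np.linalg.norm(db_descs, axis=1)
                             * np.linalg.norm(q))
    return db_poses[int(np.argmax(sims))]    # pose of most similar day image
```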

ICRA 2019 Conference Paper

Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System

  • Lionel Heng
  • Benjamin Choi
  • Zhaopeng Cui
  • Marcel Geppert
  • Sixing Hu
  • Benson Kuan
  • Peidong Liu 0001
  • Rang M. H. Nguyen

Project AutoVision aims to develop localization and 3D scene perception capabilities for a self-driving vehicle. Such capabilities will enable autonomous navigation in urban and rural environments, in day and night, and with cameras as the only exteroceptive sensors. The sensor suite employs many cameras for both 360-degree coverage and accurate multi-view stereo; the use of low-cost cameras keeps the cost of this sensor suite to a minimum. In addition, the project seeks to extend the operating envelope to include GNSS-less conditions, which are typical for environments with tall buildings, foliage, and tunnels. Emphasis is placed on leveraging multi-view geometry and deep learning to enable the vehicle to localize and perceive in 3D space. This paper presents an overview of the project, and describes the sensor suite and current progress in the areas of calibration, localization, and perception.

ICRA 2019 Conference Paper

Real-Time Dense Mapping for Self-Driving Vehicles using Fisheye Cameras

  • Zhaopeng Cui
  • Lionel Heng
  • Ye Chuan Yeo
  • Andreas Geiger 0001
  • Marc Pollefeys
  • Torsten Sattler

We present a real-time dense geometric mapping algorithm for large-scale environments. Unlike existing methods which use pinhole cameras, our implementation is based on fisheye cameras whose large field of view benefits various computer vision applications for self-driving vehicles such as visual-inertial odometry, visual localization, and object detection. Our algorithm runs on in-vehicle PCs at approximately 15 Hz, enabling vision-only 3D scene perception for self-driving vehicles. For each synchronized set of images captured by multiple cameras, we first compute a depth map for a reference camera using plane-sweeping stereo. To maintain both accuracy and efficiency, while accounting for the fact that fisheye images have a lower angular resolution, we recover the depths using multiple image resolutions. We adopt the fast object detection framework, YOLOv3, to remove potentially dynamic objects. At the end of the pipeline, we fuse the fisheye depth images into the truncated signed distance function (TSDF) volume to obtain a 3D map. We evaluate our method on large-scale urban datasets, and results show that our method works well in complex dynamic environments.
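
The fusion step at the end of the pipeline is the standard truncated signed distance update, a per-voxel running weighted mean; a sketch with illustrative constants:

```python
import numpy as np

def tsdf_update(tsdf, weights, sdf_obs, trunc=0.3, max_weight=64.0):
    """Fuse one depth observation into the volume. sdf_obs holds per-voxel
    signed distances to the observed surface; truncation confines updates to
    a band around the surface, and the running mean converges as
    observations accumulate."""
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)
    new_w = np.minimum(weights + 1.0, max_weight)
    tsdf = (tsdf * weights + d) / new_w
    return tsdf, new_w
```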

IROS 2018 Conference Paper

Incremental Object Database: Building 3D Models from Multiple Partial Observations

  • Fadri Furrer
  • Tonci Novkovic
  • Marius Fehr
  • Abel Gawel
  • Margarita Grinvald
  • Torsten Sattler
  • Roland Siegwart
  • Juan I. Nieto 0001

Collecting 3D object data sets involves a large amount of manual work and is time consuming. Getting complete models of objects either requires a 3D scanner that covers all the surfaces of an object or one needs to rotate it to completely observe it. We present a system that incrementally builds a database of objects as a mobile agent traverses a scene. Our approach requires no prior knowledge of the shapes present in the scene. Object-like segments are extracted from a global segmentation map, which is built online using the input of segmented RGB-D images. These segments are stored in a database, matched among each other, and merged with other previously observed instances. This allows us to create and improve object models on the fly and to use these merged models to also reconstruct unobserved parts of the scene. The database contains each (potentially merged) object model only once, together with a set of poses where it was observed. We evaluate our pipeline on one public dataset and on a newly created Google Tango dataset containing four indoor scenes with some of the objects appearing multiple times, both within and across scenes.
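
The incremental database logic can be sketched as match-then-merge; the cosine-similarity descriptor matching and the naive point-cloud merge below are assumptions for illustration, not the paper's choices.

```python
import numpy as np

def insert_segment(db, desc, cloud, pose, thresh=0.8):
    """db: list of {'desc', 'cloud', 'poses'} entries; desc: [D] segment
    descriptor; cloud: [N, 3] segment points; pose: observation pose."""
    for entry in db:
        sim = desc @ entry["desc"] / (np.linalg.norm(desc)
                                      * np.linalg.norm(entry["desc"]))
        if sim > thresh:                                   # match: merge
            entry["cloud"] = np.vstack([entry["cloud"], cloud])
            entry["poses"].append(pose)  # one model, many observation poses
            return entry
    db.append({"desc": desc, "cloud": cloud, "poses": [pose]})  # new object
    return db[-1]
```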

IROS 2018 Conference Paper

Towards Robust Visual Odometry with a Multi-Camera System

  • Peidong Liu 0001
  • Marcel Geppert
  • Lionel Heng
  • Torsten Sattler
  • Andreas Geiger 0001
  • Marc Pollefeys

We present a visual odometry (VO) algorithm for a multi-camera system, designed for robust operation in challenging environments. Our algorithm consists of a pose tracker and a local mapper. The tracker estimates the current pose by minimizing photometric errors between the most recent keyframe and the current frame. The mapper initializes the depths of all sampled feature points using plane-sweeping stereo. To reduce pose drift, a sliding window optimizer is used to refine poses and structure jointly. Our formulation is flexible enough to support an arbitrary number of stereo cameras. We evaluate our algorithm thoroughly on five datasets. The datasets were captured in different conditions: daytime, nighttime with near-infrared (NIR) illumination, and nighttime without NIR illumination. Experimental results show that a multi-camera setup makes the VO more robust to challenging environments, especially night-time conditions, in which a single stereo configuration fails easily due to the lack of features.
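
The tracker's objective is the classic direct-alignment photometric error; a scalar sketch with nearest-neighbor sampling (real trackers interpolate bilinearly and sum this cost over all cameras in the rig):

```python
import numpy as np

def photometric_error(ref_intensities, cur_image, proj_uv):
    """ref_intensities: [N] keyframe intensities of sampled feature points;
    proj_uv: [N, 2] their projections into the current frame under a
    candidate pose. The tracker minimizes this over the 6-DoF pose."""
    u = np.clip(np.round(proj_uv[:, 0]).astype(int), 0, cur_image.shape[1] - 1)
    v = np.clip(np.round(proj_uv[:, 1]).astype(int), 0, cur_image.shape[0] - 1)
    r = cur_image[v, u].astype(np.float64) - ref_intensities
    return 0.5 * float(np.sum(r * r))
```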

IROS 2017 Conference Paper

Direct visual odometry for a fisheye-stereo camera

  • Peidong Liu 0001
  • Lionel Heng
  • Torsten Sattler
  • Andreas Geiger 0001
  • Marc Pollefeys

We present a direct visual odometry algorithm for a fisheye-stereo camera. Our algorithm performs simultaneous camera motion estimation and semi-dense reconstruction. The pipeline consists of two threads: a tracking thread and a mapping thread. In the tracking thread, we estimate the camera pose via semi-dense direct image alignment. To have a wider field of view (FoV) which is important for robotic perception, we use fisheye images directly without converting them to conventional pinhole images which come with a limited FoV. To address the epipolar curve problem, plane-sweeping stereo is used for stereo matching and depth initialization. Multiple depth hypotheses are tracked for selected pixels to better capture the uncertainty characteristics of stereo matching. Temporal motion stereo is then used to refine the depth and remove false positive depth hypotheses. Our implementation runs at an average of 20 Hz on a low-end PC. We run experiments in outdoor environments to validate our algorithm, and discuss the experimental results. We experimentally show that we are able to estimate 6D poses with low drift, and at the same time, do semi-dense 3D reconstruction with high accuracy. To the best of our knowledge, there is no other existing semi-dense direct visual odometry algorithm for a fisheye-stereo camera.
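
Using fisheye images directly means swapping the pinhole projection for a fisheye model inside the alignment; the equidistant model below is one common choice, shown for illustration (the paper's calibrated camera model may differ).

```python
import numpy as np

def equidistant_project(points_cam, f, cx, cy):
    """Equidistant fisheye projection (r = f * theta): maps [N, 3] camera-
    frame points to [N, 2] pixels, supporting fields of view beyond what a
    pinhole model can represent."""
    x, y, z = points_cam.T
    theta = np.arctan2(np.hypot(x, y), z)  # angle from the optical axis
    phi = np.arctan2(y, x)
    r = f * theta
    return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=1)
```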

ICRA 2017 Conference Paper

Embedded real-time multi-baseline stereo

  • Dominik Honegger
  • Torsten Sattler
  • Marc Pollefeys

Dense depth map estimation from stereo cameras has many applications in robotic vision, e.g., obstacle detection, especially when performed in real-time. The range in which depth values can be accurately estimated is usually limited for two-camera stereo setups due to the fixed baseline between the cameras. In addition, two-camera setups suffer from wrong depth estimates caused by local minima in the matching cost functions. Both problems can be alleviated by adding more cameras as this creates multiple baselines of different lengths and since multi-image matching leads to unique minima. However, using more cameras usually comes at an increase in run-time. In this paper, we present a novel embedded system for multi-baseline stereo. By exploiting the parallelization capabilities within FPGAs, we are able to estimate a depth map from multiple cameras in real-time. We show that our approach requires only a little more power and weight compared to a two-camera stereo system. At the same time, we show that our system produces significantly better depth maps and is able to handle occlusion of some cameras, resulting in the redundancy typically desired for autonomous vehicles. Our system is small and lightweight and can be employed even on a MAV platform with very strict power, weight, and size requirements.
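
Why extra baselines help follows from the disparity relation d_i = f * b_i / Z: a single depth hypothesis predicts one consistent disparity per baseline, and summing the matching costs across pairs suppresses the spurious local minima of any single pair. A sketch of that cost (the FPGA design evaluates such costs in parallel):

```python
import numpy as np

def expected_disparities(depth_m, focal_px, baselines_m):
    """Per-baseline disparity predicted by a single depth hypothesis."""
    return focal_px * np.asarray(baselines_m) / depth_m

def multi_baseline_cost(ref_patch, patches_per_cam):
    """Sum of per-camera SAD matching costs for one depth hypothesis."""
    return float(sum(np.abs(ref_patch - p).sum() for p in patches_per_cam))
```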

IROS 2015 Conference Paper

Obstacle detection for self-driving cars using only monocular cameras and wheel odometry

  • Christian Häne
  • Torsten Sattler
  • Marc Pollefeys

Mapping the environment is crucial to enable path planning and obstacle avoidance for self-driving vehicles and other robots. In this paper, we concentrate on ground-based vehicles and present an approach which extracts static obstacles from depth maps computed from multiple consecutive images. In contrast to existing approaches, our system does not require accurate visual inertial odometry estimation but solely relies on the readily available wheel odometry. To handle the resulting higher pose uncertainty, our system fuses obstacle detections over time and between cameras to estimate the free and occupied space around the vehicle. Using monocular fisheye cameras, we are able to cover a wider field of view and detect obstacles closer to the car, which are often not within the standard field of view of a classical binocular stereo camera setup. Our quantitative analysis shows that our system is accurate enough for navigation purposes of self-driving cars and runs in real-time.
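
Fusing detections over time is naturally expressed as a log-odds occupancy grid update; a 2D sketch with illustrative constants (the paper's exact fusion scheme may differ):

```python
import numpy as np

def fuse_detections(log_odds, occ_cells, free_cells, l_occ=0.85, l_free=-0.4):
    """log_odds: [H, W] grid around the vehicle; occ_cells / free_cells:
    [N, 2] integer cell indices observed as obstacle / free space in one
    frame. Repeated frames sharpen the map despite noisy odometry poses."""
    log_odds[tuple(occ_cells.T)] += l_occ
    log_odds[tuple(free_cells.T)] += l_free
    return np.clip(log_odds, -10.0, 10.0)  # keep values bounded
```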