Arrow Research search

Author name cluster

Daniel Cremers

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

65 papers
2 author rows

Possible papers

65

AAAI Conference 2026 Conference Paper

HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction (Abstract Reprint)

  • Wei Zhang
  • Qing Cheng
  • David Skuddis
  • Niclas Zeller
  • Daniel Cremers
  • Norbert Haala

We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. While existing Neural SLAM and 3DGS-based SLAM methods often trade off rendering quality against geometric accuracy, we demonstrate that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to improve geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then to use 3D Gaussian splatting as the core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose-graph bundle adjustment and instant map updates, explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy that maintains scale consistency in the prior depths for finer depth details. Through extensive experiments on Replica, ScanNet, and ScanNet++, we demonstrate significant improvements over existing Neural SLAM methods, even surpassing RGB-D-based methods in both reconstruction and rendering quality.

NeurIPS Conference 2025 Conference Paper

FlowFeat: Pixel-Dense Embedding of Motion Profiles

  • Nikita Araslanov
  • Anna Sonnweber
  • Daniel Cremers

Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state-of-the-art networks, such as transformers, produce low-resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present FlowFeat, a high-resolution and multi-task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or motion profiles. By leveraging optical flow networks and diverse video data, we develop an effective self-supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state-of-the-art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step towards reliable and versatile dense image representations.

ICRA Conference 2025 Conference Paper

Ground-Aware Automotive Radar Odometry

  • Daniel Casado Herraez
  • Franz Kaschner
  • Matthias Zeller
  • Dominik Muhle
  • Jens Behley
  • Michael Heidingsfeld
  • Daniel Cremers
  • Cyrill Stachniss

Odometry is crucial for the navigation of autonomous vehicles in unknown environments. While cameras and LiDARs are commonly used to estimate the ego-motion of a vehicle, these sensors face limitations under bad lighting and severe weather conditions. Automotive radars overcome these challenges, but radar point clouds are generally sparse and noisy, making it difficult to identify useful features within a radar scan. In this paper, we address the problem of ego-motion estimation using a single automotive radar sensor. We propose a simple, yet effective, heuristic-based method to extract the ground plane from single radar scans and perform ground plane matching between consecutive scans. Additionally, we perform a windowed factor-graph optimization of the poses together with the ground plane, improving the accuracy of the pose estimation. We put our work to the test using the 4DRadarDataset. Our findings illustrate the state-of-the-art performance of our odometry approach compared to existing alternatives that use radar point clouds.
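The abstract does not spell out its ground-plane heuristic; as a rough illustration of the plane-extraction step, a least-squares fit of a plane to candidate ground points (all names hypothetical, and real radar scans would first be filtered for low-height returns) might look like:

```python
import numpy as np

def fit_ground_plane(points):
    """Least-squares fit of z = a*x + b*y + c to candidate ground points.

    A stand-in sketch for the paper's (unspecified) heuristic.
    """
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs  # (a, b, c)

# Synthetic scan: points sampled from the plane z = 0.1x + 0.2y + 1
xy = np.random.default_rng(0).uniform(-10, 10, size=(50, 2))
pts = np.c_[xy, 0.1 * xy[:, 0] + 0.2 * xy[:, 1] + 1.0]
a, b, c = fit_ground_plane(pts)
```

Matching such fitted planes between consecutive scans then constrains the vehicle's pitch, roll, and height.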

ICLR Conference 2025 Conference Paper

Implicit Neural Surface Deformation with Explicit Velocity Fields

  • Lu Sang
  • Zehranaz Canfes
  • Dongliang Cao
  • Florian Bernard 0001
  • Daniel Cremers

In this work, we introduce the first unsupervised method that simultaneously predicts time-varying neural implicit surfaces and deformations between pairs of point clouds. We propose to model the point movement using an explicit velocity field and directly deform a time-varying implicit field using the modified level-set equation. This equation utilizes an iso-surface evolution with Eikonal constraints in a compact formulation, ensuring the integrity of the signed distance field. By applying a smooth, volume-preserving constraint to the velocity field, our method successfully recovers physically plausible intermediate shapes. Our method is able to handle both rigid and non-rigid deformations without any intermediate shape supervision. Our experimental results demonstrate that our method significantly outperforms existing works, delivering superior results in both quality and efficiency.
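The "modified level-set equation" is not written out in the abstract; the standard level-set advection it builds on, for an implicit field $f(x, t)$ transported by a velocity field $v$, is

```latex
\frac{\partial f}{\partial t} + v \cdot \nabla f = 0,
\qquad \|\nabla f\| = 1 \quad \text{(Eikonal constraint)},
```

where the Eikonal constraint keeps $f$ a valid signed distance field as the iso-surface evolves.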

NeurIPS Conference 2025 Conference Paper

IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals

  • Markus Gross
  • Aya Fahmy
  • Danit Niwattananan
  • Dominik Muhle
  • Rui Song
  • Daniel Cremers
  • Henri Meeß

Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly learning scene geometry and semantics, enabling downstream applications such as navigation in mobile robotics. The recent generalization to Panoptic Scene Completion (PSC) advances the SSC domain by integrating instance-level information, thereby enhancing object-level sensitivity in scene understanding. While PSC was introduced using LiDAR modality, methods based on camera images remain largely unexplored. Moreover, recent Transformer-based approaches utilize a fixed set of learned queries to reconstruct objects within the scene volume. Although these queries are typically updated with image context during training, they remain static at test time, limiting their ability to dynamically adapt specifically to the observed scene. To overcome these limitations, we propose IPFormer, the first method that leverages context-adaptive instance proposals at train and test time to address vision-based 3D Panoptic Scene Completion. Specifically, IPFormer adaptively initializes these queries as panoptic instance proposals derived from image context and further refines them through attention-based encoding and decoding to reason about semantic instance-voxel relationships. Extensive experimental results show that our approach achieves state-of-the-art in-domain performance, exhibits superior zero-shot generalization on out-of-domain data, and achieves a runtime reduction exceeding 14×. These results highlight our introduction of context-adaptive instance proposals as a pioneering effort in addressing vision-based 3D Panoptic Scene Completion. Code available at https://github.com/markus-42/ipformer.

ICRA Conference 2025 Conference Paper

MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models

  • Johannes Meier
  • Louis Inchingolo
  • Oussema Dhaouadi
  • Yan Xia 0003
  • Jacques Kaiser
  • Daniel Cremers

We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

NeurIPS Conference 2025 Conference Paper

OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata

  • Oussema Dhaouadi
  • Riccardo Marin
  • Johannes Meier
  • Jacques Kaiser
  • Daniel Cremers

Accurate visual localization from aerial views is a fundamental problem with applications in mapping, large-area inspection, and search-and-rescue operations. In many scenarios, these systems require high-precision localization while operating with limited resources (e.g., no internet connection or GNSS/GPS support), making large image databases or heavy 3D models impractical. Surprisingly, little attention has been given to leveraging orthographic geodata as an alternative paradigm, which is lightweight and increasingly available through free releases by governmental authorities (e.g., the European Union). To fill this gap, we propose OrthoLoC, the first large-scale dataset comprising 16,425 UAV images from Germany and the United States with multiple modalities. The dataset addresses domain shifts between UAV imagery and geospatial data. Its paired structure enables fair benchmarking of existing solutions by decoupling image retrieval from feature matching, allowing isolated evaluation of localization and calibration performance. Through comprehensive evaluation, we examine the impact of domain shifts, data resolutions, and covisibility on localization accuracy. Finally, we introduce a refinement technique called AdHoP, which can be integrated with any feature matcher, improving matching by up to 95% and reducing translation error by up to 63%. The dataset and code are available at: https://deepscenario.github.io/OrthoLoC.

IROS Conference 2025 Conference Paper

The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking

  • Mateo de Mayo
  • Daniel Cremers
  • Taihú Pire

Humanoid robots and mixed reality headsets benefit from the use of head-mounted sensors for tracking. While advancements in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) have produced new and high-quality state-of-the-art tracking systems, we show that these are still unable to gracefully handle many of the challenging settings presented in the head-mounted use cases. Common scenarios like high-intensity motions, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, and saturation of sensors, to name a few, continue to be covered poorly by existing datasets in the literature. In this way, systems may inadvertently overlook these essential real-world issues. To address this, we present the Monado SLAM dataset, a set of real sequences taken from multiple virtual reality headsets. We release the dataset under a permissive CC BY 4.0 license, to drive advancements in VIO/SLAM research and development.

NeurIPS Conference 2025 Conference Paper

UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception

  • Karthikeyan Chandra Sekaran
  • Markus Geisler
  • Dominik Rößle
  • Adithya Mohan
  • Daniel Cremers
  • Wolfgang Utschick
  • Michael Botsch
  • Werner Huber

Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome challenges such as occlusions and improving overall scene understanding. While some existing real-world datasets incorporate both vehicle-to-vehicle and vehicle-to-infrastructure interactions, they are typically limited to a single intersection or a single vehicle. A comprehensive perception dataset featuring multiple connected vehicles and infrastructure sensors across several intersections remains unavailable, limiting the benchmarking of algorithms in diverse traffic environments. Consequently, overfitting can occur, and models may demonstrate misleadingly high performance due to similar intersection layouts and traffic participant behavior. To address this gap, we introduce UrbanIng-V2X, the first large-scale, multi-modal dataset supporting cooperative perception involving vehicles and infrastructure sensors deployed across three urban intersections in Ingolstadt, Germany. UrbanIng-V2X consists of 34 temporally aligned and spatially calibrated sensor sequences, each lasting 20 seconds. All sequences contain recordings from one of three intersections, involving two vehicles and up to three infrastructure-mounted sensor poles operating in coordinated scenarios. In total, UrbanIng-V2X provides data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs. All sequences are annotated at a frequency of 10 Hz with 3D bounding boxes spanning 13 object classes, resulting in approximately 712k annotated instances across the dataset. We provide comprehensive evaluations using state-of-the-art cooperative perception methods and publicly release the codebase, dataset, HD map, and a digital twin of the complete data collection environment via https://github.com/thi-ad/UrbanIng-V2X.

ICLR Conference 2024 Conference Paper

An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment

  • Sergei Solonets
  • Daniil Sinitsyn
  • Lukas von Stumberg
  • Nikita Araslanov
  • Daniel Cremers

Direct image alignment is a widely used technique for relative 6DoF pose estimation between two images, but its accuracy strongly depends on pose initialization. Therefore, recent end-to-end frameworks increase the convergence basin of the learned feature descriptors with special training objectives, such as the Gauss-Newton loss. However, the training data may exhibit bias toward a specific type of motion and pose initialization, thus limiting the generalization of these methods. In this work, we derive a closed-form solution to the expected optimum of the Gauss-Newton loss. The solution is agnostic to the underlying feature representation and allows us to dynamically adjust the basin of convergence according to our assumptions about the uncertainty in the current estimates. These properties allow for effective control over the convergence in the alignment process. Despite using self-supervised feature embeddings, our solution achieves compelling accuracy w.r.t. the state-of-the-art direct image alignment methods trained end-to-end with pose supervision, and demonstrates improved robustness to pose initialization. Our analytical solution exposes some inherent limitations of end-to-end learning with the Gauss-Newton loss, and establishes an intriguing connection between direct image alignment and feature-matching approaches.
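For context, the Gauss-Newton update underlying direct image alignment can be sketched on a generic least-squares residual (a toy illustration, not the paper's learned-feature alignment or its closed-form analysis):

```python
import numpy as np

def gauss_newton_step(x, residual_fn, jac_fn):
    # One Gauss-Newton update: x <- x - (J^T J)^{-1} J^T r
    r = residual_fn(x)
    J = jac_fn(x)
    delta = np.linalg.solve(J.T @ J, J.T @ r)
    return x - delta

# Toy 1-parameter "pose": both residuals vanish at x = 3
residual = lambda x: np.array([x[0] - 3.0, 2.0 * (x[0] - 3.0)])
jacobian = lambda x: np.array([[1.0], [2.0]])
x = gauss_newton_step(np.array([0.0]), residual, jacobian)
```

In direct alignment the residuals are (feature-)photometric differences between warped images, and the convergence basin depends on how smooth those residuals are in the pose, which is exactly what the Gauss-Newton loss shapes.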

NeurIPS Conference 2024 Conference Paper

An Image is Worth 32 Tokens for Reconstruction and Generation

  • Qihang Yu
  • Mark Weber
  • Xueqing Deng
  • Xiaohui Shen
  • Daniel Cremers
  • Liang-Chieh Chen

Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce the Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 × 256 × 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves competitive performance to state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, significantly outperforming the MaskGIT baseline by 4.21 on the ImageNet 256 × 256 benchmark. The advantages of TiTok become even more significant when it comes to higher resolution. At the ImageNet 512 × 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the image tokens by 64×, leading to a 410× faster generation process. Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74× faster. Codes and models are available at https://github.com/bytedance/1d-tokenizer

ICLR Conference 2024 Conference Paper

HoloNets: Spectral Convolutions do extend to Directed Graphs

  • Christian Koke
  • Daniel Cremers

Within the graph learning community, conventional wisdom dictates that spectral convolutional networks may only be deployed on undirected graphs: Only there could the existence of a well-defined graph Fourier transform be guaranteed, so that information may be translated between spatial- and spectral domains. Here we show this traditional reliance on the graph Fourier transform to be superfluous and -- making use of certain advanced tools from complex analysis and spectral theory -- extend spectral convolutions to directed graphs. We provide a frequency-response interpretation of newly developed filters, investigate the influence of the basis used to express filters and discuss the interplay with characteristic operators on which networks are based. In order to thoroughly test the developed theory, we conduct experiments in real world settings, showcasing that directed spectral convolutional networks provide new state of the art results for heterophilic node classification on many datasets and -- as opposed to baselines -- may be rendered stable to resolution-scale varying topological perturbations.
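HoloNets construct their filters via tools from complex analysis; a much simpler observation in the same spirit is that a polynomial filter in the characteristic operator is well defined on directed graphs, since it needs no eigendecomposition or Fourier basis at all. A minimal sketch (an illustrative simplification, not the paper's holomorphic construction):

```python
import numpy as np

def polynomial_filter(A, X, coeffs):
    # y = sum_k c_k A^k X: defined for any square operator A,
    # symmetric or not -- no graph Fourier transform required
    out = np.zeros_like(X)
    P = np.eye(A.shape[0])
    for c in coeffs:
        out = out + c * (P @ X)
        P = A @ P
    return out

# Directed 3-cycle adjacency (non-symmetric, hence no orthogonal eigenbasis)
A = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
X = np.eye(3)
Y = polynomial_filter(A, X, [0.5, 0.5])  # 0.5*X + 0.5*(A @ X)
```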

TMLR Journal 2024 Journal Article

MaskBit: Embedding-free Image Generation via Bit Tokens

  • Mark Weber
  • Lijun Yu
  • Qihang Yu
  • Xueqing Deng
  • Xiaohui Shen
  • Daniel Cremers
  • Liang-Chieh Chen

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet $256\times256$ benchmark, with a compact generator model of a mere 305M parameters. The code for this project is available on https://github.com/markweberdev/maskbit.
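The core of a "bit token" is binary quantization of latent channels; one common scheme (a sketch of the general idea, not necessarily the paper's exact quantizer) simply takes the sign of each channel:

```python
import numpy as np

def to_bit_tokens(latents):
    # Each latent channel becomes one bit: 1 if positive, else 0.
    # A K-channel latent thus indexes one of 2^K implicit codewords
    # without any learned embedding table.
    return (latents > 0).astype(np.uint8)

z = np.array([[0.3, -1.2, 0.7, -0.1]])  # one toy 4-channel latent
bits = to_bit_tokens(z)
```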

IROS Conference 2024 Conference Paper

Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments

  • Lei Cheng
  • Junpeng Hu
  • Haodong Yan
  • Mariia Gladkova
  • Tianyu Huang
  • Yun-Hui Liu 0001
  • Daniel Cremers
  • Haoang Li

Photometric bundle adjustment (PBA) is widely used in estimating the camera pose and 3D geometry by assuming a Lambertian world. However, the assumption of photometric consistency is often violated since non-diffuse reflection is common in real-world environments. The photometric inconsistency significantly affects the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce physically-based weights regarding material, illumination, and light path. These weights distinguish the pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation based on sequential images and illumination estimation based on point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth of illumination and material. Extensive experiments demonstrate that our PBA method outperforms existing approaches in accuracy.

ICML Conference 2024 Conference Paper

Variational Learning is Effective for Large Deep Networks

  • Yuesong Shen
  • Nico Daheim
  • Bai Cong
  • Peter Nickl
  • Gian Maria Marconi
  • Clement Bazan
  • Rio Yokota
  • Iryna Gurevych

We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON’s computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective. Code is available at https://github.com/team-approx-bayes/ivon.

ICML Conference 2023 Conference Paper

Beyond In-Domain Scenarios: Robust Density-Aware Calibration

  • Christian Tomani
  • Futa Kai Waseda
  • Yuesong Shen
  • Daniel Cremers

Calibrating deep learning models to yield uncertainty-aware predictions is crucial as deep neural networks get increasingly deployed in safety-critical applications. While existing post-hoc calibration methods achieve impressive results on in-domain test datasets, they are limited by their inability to yield reliable uncertainty estimates in domain-shift and out-of-domain (OOD) scenarios. We aim to bridge this gap by proposing DAC, an accuracy-preserving as well as Density-Aware Calibration method based on k-nearest-neighbors (KNN). In contrast to existing post-hoc methods, we utilize hidden layers of classifiers as a source for uncertainty-related information and study their importance. We show that DAC is a generic method that can readily be combined with state-of-the-art post-hoc methods. DAC boosts the robustness of calibration performance in domain-shift and OOD, while maintaining excellent in-domain predictive uncertainty estimates. We demonstrate that DAC leads to consistently better calibration across a large number of model architectures, datasets, and metrics. Additionally, we show that DAC improves calibration substantially on recent large-scale neural networks pre-trained on vast amounts of data.
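The density signal that DAC exploits can be illustrated with a mean k-nearest-neighbor distance in a classifier's feature space (a simplified sketch of the underlying idea, not the published method):

```python
import numpy as np

def knn_density_score(train_feats, test_feat, k=3):
    # Mean distance to the k nearest training features; larger values
    # indicate lower density, i.e. a sample further from the training domain,
    # where predictive confidence should be tempered.
    d = np.linalg.norm(train_feats - test_feat, axis=1)
    return float(np.sort(d)[:k].mean())

rng = np.random.default_rng(0)
train = rng.standard_normal((200, 8))               # in-domain feature cloud
in_domain = knn_density_score(train, np.zeros(8))   # near the cloud
ood = knn_density_score(train, np.full(8, 10.0))    # far from the cloud
```

In DAC this kind of signal, gathered from multiple hidden layers, modulates the calibrated confidence produced by a standard post-hoc method.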

ICML Conference 2023 Conference Paper

Learning Expressive Priors for Generalization and Uncertainty Estimation in Neural Networks

  • Dominik Schnaus
  • Jongseok Lee
  • Daniel Cremers
  • Rudolph Triebel

In this work, we propose a novel prior learning method for advancing generalization and uncertainty estimation in deep neural networks. The key idea is to exploit scalable and structured posteriors of neural networks as informative priors with generalization guarantees. Our learned priors provide expressive probabilistic representations at large scale, like Bayesian counterparts of pre-trained models on ImageNet, and further produce non-vacuous generalization bounds. We also extend this idea to a continual learning framework, where the favorable properties of our priors are desirable. Major enablers are our technical contributions: (1) the sums-of-Kronecker-product computations, and (2) the derivations and optimizations of tractable objectives that lead to improved generalization bounds. Empirically, we exhaustively show the effectiveness of this method for uncertainty estimation and generalization.
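The sums-of-Kronecker-product structure is what keeps large posterior (covariance-like) matrices tractable: a few pairs of small factors stand in for one huge dense matrix. A minimal numerical illustration (toy factors, not the paper's posterior):

```python
import numpy as np

# C = A1 (x) B1 + A2 (x) B2: two pairs of 2x2 factors represent a 4x4
# matrix; at scale, n x n and m x m factors represent an nm x nm matrix
# with O(n^2 + m^2) parameters instead of O(n^2 m^2).
A1, B1 = np.diag([1.0, 2.0]), np.eye(2)
A2, B2 = np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])
C = np.kron(A1, B1) + np.kron(A2, B2)
```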

TMLR Journal 2023 Journal Article

Semantic Self-adaptation: Enhancing Generalization with a Single Sample

  • Sherwin Bahmani
  • Oliver Hahn
  • Eduard Zamfir
  • Nikita Araslanov
  • Daniel Cremers
  • Stefan Roth

The lack of out-of-domain generalization is a critical weakness of deep networks for semantic segmentation. Previous studies relied on the assumption of a static model, i.e., once the training process is complete, model parameters remain fixed at test time. In this work, we challenge this premise with a self-adaptive approach for semantic segmentation that adjusts the inference process to each input sample. Self-adaptation operates on two levels. First, it fine-tunes the parameters of convolutional layers to the input image using consistency regularization. Second, in Batch Normalization layers, self-adaptation interpolates between the training and the reference distribution derived from a single test sample. Despite both techniques being well known in the literature, their combination sets new state-of-the-art accuracy on synthetic-to-real generalization benchmarks. Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time for improving deep network generalization to out-of-domain data. Our code and pre-trained models are available at https://github.com/visinf/self-adaptive.
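The second level of self-adaptation, interpolating Batch Normalization statistics, can be sketched as follows (names and the blending weight are illustrative, not the paper's exact formulation):

```python
import numpy as np

def interpolate_bn_stats(train_mean, train_var, x, alpha=0.5):
    # Blend stored training statistics with those estimated from a single
    # test sample's activations x (shape: [positions, channels]).
    # alpha = 1 keeps pure training stats; alpha = 0 uses only the sample.
    mean = alpha * train_mean + (1 - alpha) * x.mean(axis=0)
    var = alpha * train_var + (1 - alpha) * x.var(axis=0)
    return mean, var

x = np.array([[1.0, 2.0], [3.0, 6.0]])     # toy activations, 2 channels
m, v = interpolate_bn_stats(np.zeros(2), np.ones(2), x, alpha=0.0)
```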

NeurIPS Conference 2022 Conference Paper

Deep Combinatorial Aggregation

  • Yuesong Shen
  • Daniel Cremers

Neural networks are known to produce poor uncertainty estimations, and a variety of approaches have been proposed to remedy this issue. This includes deep ensemble, a simple and effective method that achieves state-of-the-art results for uncertainty-aware learning tasks. In this work, we explore a combinatorial generalization of deep ensemble called deep combinatorial aggregation (DCA). DCA creates multiple instances of network components and aggregates their combinations to produce diversified model proposals and predictions. DCA components can be defined at different levels of granularity, and we find that coarse-grained DCAs can outperform deep ensemble for uncertainty-aware learning in terms of both predictive performance and uncertainty estimation. For fine-grained DCAs, we find that an average parameterization approach named deep combinatorial weight averaging (DCWA) can improve the baseline training. It is on par with stochastic weight averaging (SWA) but does not require any custom training schedule or adaptation of BatchNorm layers. Furthermore, we propose a consistency-enforcing loss that helps the training of DCWA and modelwise DCA. We experiment on in-domain, distributional-shift, and out-of-distribution image classification tasks, and empirically confirm the effectiveness of the DCWA and DCA approaches.
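Parameter-wise averaging of the kind DCWA performs is conceptually simple; a minimal sketch over toy "state dicts" (the real method averages the instances of each DCA component):

```python
def average_parameters(state_dicts):
    # Plain per-parameter mean across model instances -- in spirit like
    # SWA, but with no custom schedule or BatchNorm adaptation involved.
    keys = state_dicts[0].keys()
    n = len(state_dicts)
    return {k: sum(d[k] for d in state_dicts) / n for k in keys}

avg = average_parameters([{"w": 1.0, "b": 0.0},
                          {"w": 3.0, "b": 2.0}])
```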

IROS Conference 2022 Conference Paper

DirectTracker: 3D Multi-Object Tracking Using Direct Image Alignment and Photometric Bundle Adjustment

  • Mariia Gladkova
  • Nikita Korobov
  • Nikolaus Demmel
  • Aljosa Osep
  • Laura Leal-Taixé
  • Daniel Cremers

Direct methods have shown excellent performance in the applications of visual odometry and SLAM. In this work we propose to leverage their effectiveness for the task of 3D multi-object tracking. To this end, we propose DirectTracker, a framework that effectively combines direct image alignment for the short-term tracking and sliding-window photometric bundle adjustment for 3D object detection. Object proposals are estimated based on the sparse sliding-window pointcloud and further refined using an optimization-based cost function that carefully combines 3D and 2D cues to ensure consistency in image and world space. We propose to evaluate 3D tracking using the recently introduced higher-order tracking accuracy (HOTA) metric and the generalized intersection over union similarity measure to mitigate the limitations of the conventional use of intersection over union for the evaluation of vision-based trackers. We perform evaluation on the KITTI Tracking benchmark for the Car class and show competitive performance in tracking objects both in 2D and 3D.

AAAI Conference 2022 Conference Paper

Joint Deep Multi-Graph Matching and 3D Geometry Learning from Inhomogeneous 2D Image Collections

  • Zhenzhang Ye
  • Tarun Yenamandra
  • Florian Bernard
  • Daniel Cremers

Graph matching aims to establish correspondences between vertices of graphs such that both the node and edge attributes agree. Various learning-based methods were recently proposed for finding correspondences between image key points based on deep graph matching formulations. While these approaches mainly focus on learning node and edge attributes, they completely ignore the 3D geometry of the underlying 3D objects depicted in the 2D images. We fill this gap by proposing a trainable framework that takes advantage of graph neural networks for learning a deformable 3D geometry model from inhomogeneous image collections, i.e., a set of images that depict different instances of objects from the same category. Experimentally, we demonstrate that our method outperforms recent learning-based approaches for graph matching considering both accuracy and cycle-consistency error, while we in addition obtain the underlying 3D geometry of the objects depicted in the 2D images.

ICRA Conference 2022 Conference Paper

Vision-Based Large-scale 3D Semantic Mapping for Autonomous Driving Applications

  • Qing Cheng 0001
  • Niclas Zeller
  • Daniel Cremers

In this paper, we present a complete pipeline for 3D semantic mapping solely based on a stereo camera system. The pipeline comprises a direct sparse visual odometry frontend as well as a back-end for global optimization including GNSS integration, and semantic 3D point cloud labeling. We propose a simple but effective temporal voting scheme which improves the quality and consistency of the 3D point labels. Qualitative and quantitative evaluations of our pipeline are performed on the KITTI-360 dataset. The results show the effectiveness of our proposed voting scheme and the capability of our pipeline for efficient large-scale 3D semantic mapping. The large-scale mapping capabilities of our pipeline are furthermore demonstrated by presenting a very large-scale semantic map covering 8000 km of roads generated from data collected by a fleet of vehicles.
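A temporal voting scheme of this kind can be sketched as a per-point majority vote over the semantic labels observed across frames (a simplification; the paper's exact weighting may differ):

```python
from collections import Counter

def temporal_vote(frame_labels):
    # Majority vote over the labels one 3D point received across the
    # frames in which it was observed; outvotes single-frame mislabels.
    return Counter(frame_labels).most_common(1)[0][0]

label = temporal_vote(["road", "road", "car", "road", "sidewalk"])
```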

NeurIPS Conference 2022 Conference Paper

What Makes Graph Neural Networks Miscalibrated?

  • Hans Hao-Hsun Hsu
  • Yuesong Shen
  • Christian Tomani
  • Daniel Cremers

Given the importance of getting calibrated predictions and reliable uncertainty estimations, various post-hoc calibration methods have been developed for neural networks on standard multi-class classification tasks. However, these methods are not well suited for calibrating graph neural networks (GNNs), which presents unique challenges such as accounting for the graph structure and the graph-induced correlations between the nodes. In this work, we conduct a systematic study on the calibration qualities of GNN node predictions. In particular, we identify five factors which influence the calibration of GNNs: general under-confident tendency, diversity of nodewise predictive distributions, distance to training nodes, relative confidence level, and neighborhood similarity. Furthermore, based on the insights from this study, we design a novel calibration method named Graph Attention Temperature Scaling (GATS), which is tailored for calibrating graph neural networks. GATS incorporates designs that address all the identified influential factors and produces nodewise temperature scaling using an attention-based architecture. GATS is accuracy-preserving, data-efficient, and expressive at the same time. Our experiments empirically verify the effectiveness of GATS, demonstrating that it can consistently achieve state-of-the-art calibration results on various graph datasets for different GNN backbones.
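GATS generalizes classic temperature scaling, which uses one global temperature, to a per-node temperature produced by an attention mechanism. The base operation being generalized is:

```python
import numpy as np

def temperature_scale(logits, T):
    # Softmax over logits divided by temperature T: T > 1 softens the
    # distribution (less confident), T < 1 sharpens it; in GATS, T varies
    # per node instead of being a single global scalar.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p_sharp = temperature_scale(np.array([2.0, 0.5, 0.1]), T=0.5)
p_soft = temperature_scale(np.array([2.0, 0.5, 0.1]), T=2.0)
```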

UAI Conference 2021 Conference Paper

Explicit pairwise factorized graph neural network for semi-supervised node classification

  • Yu Wang 0158
  • Yuesong Shen
  • Daniel Cremers

Node features and structural information of a graph are both crucial for semi-supervised node classification problems. A variety of graph neural network (GNN) based approaches have been proposed to tackle these problems, which typically determine output labels through feature aggregation. This can be problematic, as it implies conditional independence of output nodes given hidden representations, despite their direct connections in the graph. To learn the direct influence among output nodes in a graph, we propose the Explicit Pairwise Factorized Graph Neural Network (EPFGNN), which models the whole graph as a partially observed Markov Random Field. It contains explicit pairwise factors to model output-output relations and uses a GNN backbone to model input-output relations. To balance model complexity and expressivity, the pairwise factors have a shared component and a separate scaling coefficient for each edge. We apply the EM algorithm to train our model, and utilize a star-shaped piecewise likelihood for the tractable surrogate objective. We conduct experiments on various datasets, which show that our model can effectively improve the performance for semi-supervised node classification on graphs.

NeurIPS Conference 2021 Conference Paper

Sparse Quadratic Optimisation over the Stiefel Manifold with Application to Permutation Synchronisation

  • Florian Bernard
  • Daniel Cremers
  • Johan Thunberg

We address the non-convex optimisation problem of finding a sparse matrix on the Stiefel manifold (matrices with mutually orthogonal columns of unit length) that maximises (or minimises) a quadratic objective function. Optimisation problems on the Stiefel manifold occur for example in spectral relaxations of various combinatorial problems, such as graph matching, clustering, or permutation synchronisation. Although sparsity is a desirable property in such settings, it is mostly neglected in spectral formulations since existing solvers, e.g. based on eigenvalue decomposition, are unable to account for sparsity while at the same time maintaining global optimality guarantees. We fill this gap and propose a simple yet effective sparsity-promoting modification of the Orthogonal Iteration algorithm for finding the dominant eigenspace of a matrix. By doing so, we can guarantee that our method finds a Stiefel matrix that is globally optimal with respect to the quadratic objective function, while in addition being sparse. As a motivating application we consider the task of permutation synchronisation, which can be understood as a constrained clustering problem that has particular relevance for matching multiple images or 3D shapes in computer vision, computer graphics, and beyond. We demonstrate that the proposed approach outperforms previous methods in this domain.
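For reference, the unmodified Orthogonal Iteration that the paper builds on can be sketched as follows; the sparsity-promoting step the authors add is not reproduced here:

```python
import numpy as np

def orthogonal_iteration(A, k, iters=100, seed=0):
    """Baseline Orthogonal Iteration for the dominant k-dimensional
    eigenspace of a symmetric matrix A. The returned matrix has
    orthonormal columns, i.e. it lies on the Stiefel manifold."""
    rng = np.random.default_rng(seed)
    # Random orthonormal start.
    X, _ = np.linalg.qr(rng.standard_normal((A.shape[0], k)))
    for _ in range(iters):
        # Multiply by A, then re-orthonormalize the columns via QR.
        X, _ = np.linalg.qr(A @ X)
    return X
```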

NeurIPS Conference 2021 Conference Paper

STEP: Segmenting and Tracking Every Pixel

  • Mark Weber
  • Jun Xie
  • Yukun Zhu
  • Paul Voigtlaender
  • Bo Chen
  • Bradley Green
  • Andreas Geiger
  • Bastian Leibe

The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation. Our work is the first that targets this task in a real-world setting requiring dense interpretation in both spatial and temporal domains. As the ground-truth for this task is difficult and expensive to obtain, existing datasets are either constructed synthetically or only sparsely annotated within short video clips. To overcome this, we introduce a new benchmark encompassing two datasets, KITTI-STEP and MOTChallenge-STEP. The datasets contain long video sequences, providing challenging examples and a test-bed for studying long-term pixel-precise segmentation and tracking under real-world conditions. We further propose a novel evaluation metric, Segmentation and Tracking Quality (STQ), that fairly balances semantic and tracking aspects of this task and is more appropriate for evaluating sequences of arbitrary length. Finally, we provide several baselines to evaluate the status of existing methods on this new challenging dataset. We have made our datasets, metric, benchmark servers, and baselines publicly available, and hope this will inspire future research.
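A toy sketch of the two ingredients of STQ, assuming the segmentation term is the mean per-class IoU and the association term is supplied externally (the actual AQ computation over tracking tubes is more involved):

```python
import numpy as np

def segmentation_quality(pred, gt, num_classes):
    """Mean per-class IoU over all frames of a video (the SQ term)."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def stq(aq, sq):
    """STQ combines association quality (tracking) and segmentation
    quality (semantics) as a geometric mean, so neither aspect can
    dominate the score."""
    return (aq * sq) ** 0.5
```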

ICRA Conference 2021 Conference Paper

Tight Integration of Feature-based Relocalization in Monocular Direct Visual Odometry

  • Mariia Gladkova
  • Rui Wang 0037
  • Niclas Zeller
  • Daniel Cremers

In this paper we propose a framework for integrating map-based relocalization into online direct visual odometry. To achieve map-based relocalization for direct methods, we integrate image features into Direct Sparse Odometry (DSO) and rely on feature matching to associate online visual odometry (VO) with a previously built map. The integration of the relocalization poses is threefold. Firstly, they are incorporated as pose priors in the direct image alignment of the front-end tracking. Secondly, they are tightly integrated into the back-end bundle adjustment. Thirdly, an online fusion module is further proposed to combine relative VO poses and global relocalization poses in a pose graph to estimate keyframe-wise smooth and globally accurate poses. We evaluate our method on two multi-weather datasets showing the benefits of integrating different handcrafted and learned features and demonstrating promising improvements on camera tracking accuracy.

IROS Conference 2021 Conference Paper

Towards Robust Monocular Visual Odometry for Flying Robots on Planetary Missions

  • Martin Wudenka
  • Marcus Gerhard Müller
  • Nikolaus Demmel
  • Armin Wedler
  • Rudolph Triebel
  • Daniel Cremers
  • Wolfgang Stürzl

In the future, extraterrestrial expeditions will not only be conducted by rovers but also by flying robots. The technical demonstration drone Ingenuity, which just landed on Mars, will mark the beginning of a new era of exploration unhindered by terrain traversability. Robust self-localization is crucial for that. Cameras, which are lightweight, cheap and information-rich sensors, are already used to estimate the ego-motion of vehicles. However, methods proven to work in man-made environments cannot simply be deployed on other planets. The highly repetitive textures present in the wastelands of Mars pose a huge challenge to descriptor-matching-based approaches. In this paper, we present an advanced robust monocular odometry algorithm that uses efficient optical flow tracking to obtain feature correspondences between images and a refined keyframe selection criterion. In contrast to most other approaches, our framework can also handle rotation-only motions that are particularly challenging for monocular odometry systems. Furthermore, we present a novel approach to estimate the current risk of scale drift based on a principal component analysis of the relative translation information matrix. This way we obtain an implicit measure of uncertainty. We evaluate the validity of our approach on all sequences of a challenging real-world dataset captured in a Mars-like environment and show that it outperforms state-of-the-art approaches. The source code is publicly available at: https://github.com/DLR-RM/granit.
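The scale-drift risk estimate above rests on a principal component analysis of the relative translation information matrix. A minimal eigenvalue-based sketch, assuming a simple smallest-to-largest eigenvalue ratio as the risk indicator (the paper's exact score may differ):

```python
import numpy as np

def scale_drift_risk(info_matrix):
    """Heuristic risk indicator from the 3x3 relative-translation
    information matrix: a small smallest eigenvalue means translation
    (and hence scale) is weakly constrained in some direction."""
    eigvals = np.linalg.eigvalsh(info_matrix)  # ascending order
    # Ratio of the weakest to the strongest constraint direction,
    # in [0, 1]; values near 0 indicate high scale-drift risk.
    return eigvals[0] / eigvals[-1]
```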

IROS Conference 2021 Conference Paper

TUM-VIE: The TUM Stereo Visual-Inertial Event Dataset

  • Simon Klenk
  • Jason Chui
  • Nikolaus Demmel
  • Daniel Cremers

Event cameras are bio-inspired vision sensors which measure per pixel brightness changes. They offer numerous benefits over traditional, frame-based cameras, including low latency, high dynamic range, high temporal resolution and low power consumption. Thus, these sensors are suited for robotics and virtual reality applications. To foster the development of 3D perception and navigation algorithms with event cameras, we present the TUM-VIE dataset. It consists of a large variety of handheld and head-mounted sequences in indoor and outdoor environments, including rapid motion during sports and high dynamic range scenarios. The dataset contains stereo event data, stereo grayscale frames at 20 Hz as well as IMU data at 200 Hz. Timestamps between all sensors are synchronized in hardware. The event cameras contain a large sensor of 1280x720 pixels, which is significantly larger than the sensors used in existing stereo event datasets (at least by a factor of ten). We provide ground truth poses from a motion capture system at 120 Hz during the beginning and end of each sequence, which can be used for trajectory evaluation. TUM-VIE includes challenging sequences where state-of-the-art visual SLAM algorithms either fail or result in large drift. Hence, our dataset can help to push the boundary of future research on event-based visual-inertial perception algorithms.

ICML Conference 2021 Conference Paper

Variational Data Assimilation with a Learned Inverse Observation Operator

  • Thomas Frerix
  • Dmitrii Kochkov
  • Jamie A. Smith
  • Daniel Cremers
  • Michael P. Brenner
  • Stephan Hoyer

Variational data assimilation optimizes for an initial state of a dynamical system such that its evolution fits observational data. The physical model can subsequently be evolved into the future to make predictions. This principle is a cornerstone of large scale forecasting applications such as numerical weather prediction. As such, it is implemented in current operational systems of weather forecasting agencies across the globe. However, finding a good initial state poses a difficult optimization problem in part due to the non-invertible relationship between physical states and their corresponding observations. We learn a mapping from observational data to physical states and show how it can be used to improve optimizability. We employ this mapping in two ways: to better initialize the non-convex optimization problem, and to reformulate the objective function in better behaved physics space instead of observation space. Our experimental results for the Lorenz96 model and a two-dimensional turbulent fluid flow demonstrate that this procedure significantly improves forecast quality for chaotic systems.
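A toy version of the underlying variational data assimilation problem, with assumed linear dynamics and a partial observation operator; the paper's learned inverse observation operator and its chaotic test systems (Lorenz96, turbulent flow) are not reproduced:

```python
import numpy as np

A = np.array([[0.9, 0.1], [-0.1, 0.9]])  # assumed linear dynamics
H = np.array([[1.0, 0.0]])               # observe the first state component only

def rollout(x0, steps):
    """Evolve the initial state forward: x_t = A^t x0 for t = 1..steps."""
    xs, x = [], x0
    for _ in range(steps):
        x = A @ x
        xs.append(x)
    return xs

def assimilate(obs, steps, lr=0.2, iters=500):
    """Gradient descent on 0.5 * sum_t ||H x_t - y_t||^2 over the
    initial state x0, with the gradient accumulated analytically."""
    x0 = np.zeros(2)
    for _ in range(iters):
        xs = rollout(x0, steps)
        grad = np.zeros(2)
        for t, (x, y) in enumerate(zip(xs, obs)):
            r = H @ x - y                          # observation residual
            J = np.linalg.matrix_power(A, t + 1)   # dx_t / dx0
            grad += J.T @ (H.T @ r)
        x0 = x0 - lr * grad
    return x0
```

Because only part of the state is observed, the objective is better behaved in state space than in observation space, which is exactly the gap the learned inverse observation operator is meant to close.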

ICRA Conference 2021 Conference Paper

Vision-Based Mobile Robotics Obstacle Avoidance With Deep Reinforcement Learning

  • Patrick Wenzel
  • Torsten Schön
  • Laura Leal-Taixé
  • Daniel Cremers

Obstacle avoidance is a fundamental and challenging problem for autonomous navigation of mobile robots. In this paper, we consider the problem of obstacle avoidance in simple 3D environments where the robot has to solely rely on a single monocular camera. In particular, we are interested in solving this problem without relying on localization, mapping, or planning techniques. Most existing work considers obstacle avoidance as two separate problems, namely obstacle detection and control. Inspired by the recent advances of deep reinforcement learning in Atari games and understanding highly complex situations in Go, we tackle the obstacle avoidance problem as a data-driven end-to-end deep learning approach. Our approach takes raw images as input and generates control commands as output. We show that discrete action spaces outperform continuous control commands in terms of expected average reward in maze-like environments. Furthermore, we show how to accelerate the learning and increase the robustness of the policy by incorporating depth maps predicted by a generative adversarial network.

NeurIPS Conference 2020 Conference Paper

Deep Shells: Unsupervised Shape Correspondence with Optimal Transport

  • Marvin Eisenberger
  • Aysim Toker
  • Laura Leal-Taixé
  • Daniel Cremers

We propose a novel unsupervised learning approach to 3D shape correspondence that builds a multiscale matching pipeline into a deep neural network. This approach is based on smooth shells, the current state-of-the-art axiomatic correspondence method, which requires an a priori stochastic search over the space of initial poses. Our goal is to replace this costly preprocessing step by directly learning good initializations from the input surfaces. To that end, we systematically derive a fully differentiable, hierarchical matching pipeline from entropy regularized optimal transport. This allows us to combine it with a local feature extractor based on smooth, truncated spectral convolution filters. Finally, we show that the proposed unsupervised method significantly improves over the state-of-the-art on multiple datasets, even in comparison to the most recent supervised methods. Moreover, we demonstrate compelling generalization results by applying our learned filters to examples that significantly deviate from the training set.
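The entropy-regularized optimal transport primitive underlying the matching pipeline above can be sketched with standard Sinkhorn iterations (uniform marginals assumed; the hierarchical, learned parts of Deep Shells are omitted):

```python
import numpy as np

def sinkhorn(C, eps=0.1, iters=200):
    """Entropy-regularized optimal transport between two uniform
    marginals given a cost matrix C. Returns the (soft) transport
    plan, a differentiable relaxation of a hard matching."""
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    K = np.exp(-C / eps)                     # Gibbs kernel
    v = np.ones(m)
    for _ in range(iters):
        # Alternate scaling so rows match a and columns match b.
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

Smaller `eps` yields a sharper, more permutation-like plan at the cost of slower convergence; this trade-off is what makes the relaxation usable inside a deep network.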

ICRA Conference 2020 Conference Paper

DirectShape: Direct Photometric Alignment of Shape Priors for Visual Vehicle Pose and Shape Estimation

  • Rui Wang 0037
  • Nan Yang 0007
  • Jörg Stückler
  • Daniel Cremers

Scene understanding from images is a challenging problem encountered in autonomous driving. On the object level, while 2D methods have gradually evolved from computing simple bounding boxes to delivering finer grained results like instance segmentations, the 3D family is still dominated by estimating 3D bounding boxes. In this paper, we propose a novel approach to jointly infer the 3D rigid-body poses and shapes of vehicles from a stereo image pair using shape priors. Unlike previous works that geometrically align shapes to point clouds from dense stereo reconstruction, our approach works directly on images by combining a photometric and a silhouette alignment term in the energy function. An adaptive sparse point selection scheme is proposed to efficiently measure the consistency with both terms. In experiments, we show superior performance of our method on 3D pose and shape estimation over the previous geometric approach and demonstrate that our method can also be applied as a refinement step and significantly boost the performances of several state-of-the-art deep learning based 3D object detectors. All related materials and demonstration videos are available at the project page https://vision.in.tum.de/research/vslam/direct-shape.

ICRA Conference 2020 Conference Paper

PrimiTect: Fast Continuous Hough Voting for Primitive Detection

  • Christiane Sommer
  • Yumin Sun
  • Erik Bylow
  • Daniel Cremers

This paper tackles the problem of data abstraction in the context of 3D point sets. Our method classifies points into different geometric primitives, such as planes and cones, leading to a compact representation of the data. Being based on a semi-global Hough voting scheme, the method does not need initialization and is robust, accurate, and efficient. We use a local, low-dimensional parameterization of primitives to determine type, shape and pose of the object that a point belongs to. This makes our algorithm suitable to run on devices with low computational power, as often required in robotics applications. The evaluation shows that our method outperforms state-of-the-art methods both in terms of accuracy and robustness.
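A toy illustration of Hough voting for planes from oriented points, assuming each point votes for the plane defined by its normal n and offset d = n·p; the paper's parameterization covers more primitive types (cones, etc.) and is more refined than this discretization:

```python
import numpy as np
from collections import Counter

def hough_plane_votes(points, normals, angle_res=0.1, dist_res=0.05):
    """Accumulate plane votes from oriented points.

    Each point p with unit normal n votes for the plane (n, d) with
    d = n.p, discretized into bins; the bin with the most votes is
    the dominant plane hypothesis.
    """
    votes = Counter()
    for p, n in zip(points, normals):
        key = (tuple(np.round(n / angle_res).astype(int)),
               int(round(float(n @ p) / dist_res)))
        votes[key] += 1
    return votes.most_common(1)[0]  # (dominant bin, vote count)
```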

ICML Conference 2019 Conference Paper

Flat Metric Minimization with Applications in Generative Modeling

  • Thomas Möllenhoff
  • Daniel Cremers

We take the novel perspective to view data not as a probability distribution but rather as a current. Primarily studied in the field of geometric measure theory, k-currents are continuous linear functionals acting on compactly supported smooth differential forms and can be understood as a generalized notion of oriented k-dimensional manifold. By moving from distributions (which are 0-currents) to k-currents, we can explicitly orient the data by attaching a k-dimensional tangent plane to each sample point. Based on the flat metric which is a fundamental distance between currents, we derive FlatGAN, a formulation in the spirit of generative adversarial networks but generalized to k-currents. In our theoretical contribution we prove that the flat metric between a parametrized current and a reference current is Lipschitz continuous in the parameters. In experiments, we show that the proposed shift to k>0 leads to interpretable and disentangled latent representations which behave equivariantly to the specified oriented tangent planes.

IROS Conference 2019 Conference Paper

Rolling-Shutter Modelling for Direct Visual-Inertial Odometry

  • David Schubert
  • Nikolaus Demmel
  • Lukas von Stumberg
  • Vladyslav Usenko
  • Daniel Cremers

We present a direct visual-inertial odometry (VIO) method which estimates the motion of the sensor setup and sparse 3D geometry of the environment based on measurements from a rolling-shutter camera and an inertial measurement unit (IMU). The visual part of the system performs a photometric bundle adjustment on a sparse set of points. This direct approach does not extract feature points and is able to track not only corners, but any pixels with sufficient gradient magnitude. Neglecting rolling-shutter effects in the visual part severely degrades accuracy and robustness of the system. In this paper, we incorporate a rolling-shutter model into the photometric bundle adjustment that estimates a set of recent keyframe poses and the inverse depth of a sparse set of points. IMU information is accumulated between several frames using measurement preintegration, and is inserted into the optimization as an additional constraint between selected keyframes. For every keyframe we estimate not only the pose but also velocity and biases to correct the IMU measurements. Unlike systems with global-shutter cameras, we use both IMU measurements and rolling-shutter effects of the camera to estimate velocity and biases for every state. Last, we evaluate our system on a new dataset that contains global-shutter and rolling-shutter images, IMU data and ground-truth poses for ten different sequences, which we make publicly available. Evaluation shows that the proposed method outperforms a system where rolling shutter is not modelled and achieves similar accuracy to the global-shutter method on global-shutter data.

IROS Conference 2019 Conference Paper

Towards Generalizing Sensorimotor Control Across Weather Conditions

  • Qadeer Khan
  • Patrick Wenzel
  • Daniel Cremers
  • Laura Leal-Taixé

The ability of deep learning models to generalize well across different scenarios depends primarily on the quality and quantity of annotated data. Labeling large amounts of data for all possible scenarios that a model may encounter would not be feasible, if even possible. We propose a framework to deal with limited labeled training data and demonstrate it on the application of vision-based vehicle control. We show how limited steering angle data available for only one condition can be transferred to multiple different weather scenarios. This is done by leveraging unlabeled images in a teacher-student learning paradigm complemented with an image-to-image translation network. The translation network transfers the images to a new domain, whereas the teacher provides soft supervised targets to train the student on this domain. Furthermore, we demonstrate how utilization of auxiliary networks can reduce the size of a model at inference time, without affecting the accuracy. The experiments show that our approach generalizes well across multiple different weather conditions using only ground truth labels from one domain.

ICRA Conference 2018 Conference Paper

Direct Sparse Visual-Inertial Odometry Using Dynamic Marginalization

  • Lukas von Stumberg
  • Vladyslav Usenko
  • Daniel Cremers

We present VI-DSO, a novel approach for visual-inertial odometry, which jointly estimates camera poses and sparse scene geometry by minimizing photometric and IMU measurement errors in a combined energy functional. The visual part of the system performs a bundle-adjustment like optimization on a sparse set of points, but unlike key-point based systems it directly minimizes a photometric error. This makes it possible for the system to track not only corners, but any pixels with large enough intensity gradients. IMU information is accumulated between several frames using measurement preintegration and is inserted into the optimization as an additional constraint between keyframes. We explicitly include scale and gravity direction into our model and jointly optimize them together with other variables such as poses. As the scale is often not immediately observable using IMU data this allows us to initialize our visual-inertial system with an arbitrary scale instead of having to delay the initialization until everything is observable. We perform partial marginalization of old variables so that updates can be computed in a reasonable time. In order to keep the system consistent we propose a novel strategy which we call “dynamic marginalization”. This technique allows us to use partial marginalization even in cases where the initial scale estimate is far from the optimum. We evaluate our method on the challenging EuRoC dataset, showing that VI-DSO outperforms the state of the art.

IROS Conference 2018 Conference Paper

Incremental Semi-Supervised Learning from Streams for Object Classification

  • Ioannis Chiotellis
  • Franziska Zimmermann
  • Daniel Cremers
  • Rudolph Triebel

The Label Propagation (LP) algorithm, first introduced by Zhu and Ghahramani [1], is a semi-supervised method used in transductive learning scenarios, where all data are available already in the beginning. In this work, we present a novel extension of the LP algorithm for applications where data samples are observed sequentially - as is the case in autonomous driving. Specifically, our "Incremental Label Propagation" algorithm efficiently approximates the so-called harmonic solution on a nearest-neighbor graph that is regularly updated by new labeled and unlabeled nodes. We achieve this by reformulating the original algorithm based on an active set of nodes and by introducing a threshold to decide whether the label of a given node should be updated or not. Our method can also deal with graphs that are not fully connected, and we give a formal convergence proof for this general case. In experiments on the challenging KITTI benchmark data stream, we show superior performance in terms of both test accuracy and number of required training labels compared to state-of-the-art online learning methods.
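For context, the batch label propagation that the incremental algorithm approximates can be sketched as follows; the active set and the update threshold of the incremental variant are not reproduced:

```python
import numpy as np

def label_propagation(W, labels, labeled_mask, iters=200):
    """Iterate towards the harmonic solution on an affinity graph.

    W:            (N, N) symmetric affinity matrix
    labels:       (N, C) one-hot rows for labeled nodes
    labeled_mask: (N,) boolean, True for labeled nodes
    """
    D_inv = 1.0 / W.sum(axis=1)
    F = labels.astype(float)
    for _ in range(iters):
        F = D_inv[:, None] * (W @ F)            # average neighbor scores
        F[labeled_mask] = labels[labeled_mask]  # clamp labeled nodes
    return F.argmax(axis=1)
```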

IROS Conference 2018 Conference Paper

LDSO: Direct Sparse Odometry with Loop Closure

  • Xiang Gao 0006
  • Rui Wang 0037
  • Nikolaus Demmel
  • Daniel Cremers

In this paper we present an extension of Direct Sparse Odometry (DSO) [1] to a monocular visual SLAM system with loop closure detection and pose-graph optimization (LDSO). As a direct technique, DSO can utilize any image pixel with sufficient intensity gradient, which makes it robust even in featureless areas. LDSO retains this robustness, while at the same time ensuring repeatability of some of these points by favoring corner features in the tracking frontend. This repeatability makes it possible to reliably detect loop closure candidates with a conventional feature-based bag-of-words (BoW) approach. Loop closure candidates are verified geometrically and Sim(3) relative pose constraints are estimated by jointly minimizing 2D and 3D geometric error terms. These constraints are fused with a co-visibility graph of relative poses extracted from DSO's sliding window optimization. Our evaluation on publicly available datasets demonstrates that the modified point selection strategy retains the tracking accuracy and robustness, and the integrated pose-graph optimization significantly reduces the accumulated rotation-, translation- and scale-drift, resulting in an overall performance comparable to state-of-the-art feature-based systems, even without global bundle adjustment.

ICRA Conference 2018 Conference Paper

StaticFusion: Background Reconstruction for Dense RGB-D SLAM in Dynamic Environments

  • Raluca Scona
  • Mariano Jaimez
  • Yvan R. Petillot
  • Maurice F. Fallon
  • Daniel Cremers

Dynamic environments are challenging for visual SLAM as moving objects can impair camera pose tracking and cause corruptions to be integrated into the map. In this paper, we propose a method for robust dense RGB-D SLAM in dynamic environments which detects moving objects and simultaneously reconstructs the background structure. While most methods employ implicit robust penalisers or outlier filtering techniques in order to handle moving objects, our approach is to simultaneously estimate the camera motion as well as a probabilistic static/dynamic segmentation of the current RGB-D image pair. This segmentation is then used for weighted dense RGB-D fusion to estimate a 3D model of only the static parts of the environment. By leveraging the 3D model for frame-to-model alignment, as well as static/dynamic segmentation, camera motion estimation has reduced overall drift - as well as being more robust to the presence of dynamics in the scene. Demonstrations are presented which compare the proposed method to related state-of-the-art approaches using both static and dynamic sequences. The proposed method achieves similar performance in static environments and improved accuracy and robustness in dynamic scenes.
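The weighted fusion idea above can be illustrated with a one-cell running average, assuming each new depth sample is weighted by its static probability (a strong simplification of the paper's dense RGB-D fusion):

```python
def fuse(d_old, w_old, d_new, p_static):
    """One step of a weighted running average for a single depth cell:
    a new sample contributes in proportion to its probability of being
    static, so likely-dynamic pixels barely corrupt the background model."""
    w_new = w_old + p_static
    if w_new == 0:
        return d_old, w_old
    return (w_old * d_old + p_static * d_new) / w_new, w_new
```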

IROS Conference 2018 Conference Paper

The TUM VI Benchmark for Evaluating Visual-Inertial Odometry

  • David Schubert
  • Thore Goll
  • Nikolaus Demmel
  • Vladyslav Usenko
  • Jörg Stückler
  • Daniel Cremers

Visual odometry and SLAM methods have a large variety of applications in domains such as augmented reality or robotics. Complementing vision sensors with inertial measurements tremendously improves tracking accuracy and robustness, and thus has spawned large interest in the development of visual-inertial (VI) odometry approaches. In this paper, we propose the TUM VI benchmark, a novel dataset with a diverse set of sequences in different scenes for evaluating VI odometry. It provides camera images with 1024×1024 resolution at 20 Hz, high dynamic range and photometric calibration. An IMU measures accelerations and angular velocities on 3 axes at 200 Hz, while the cameras and IMU sensors are time-synchronized in hardware. For trajectory evaluation, we also provide accurate pose ground truth from a motion capture system at high frequency (120 Hz) at the start and end of the sequences which we accurately aligned with the camera and IMU measurements. The full dataset with raw and calibrated data is publicly available. We also evaluate state-of-the-art VI odometry approaches on our dataset.

ICRA Conference 2017 Conference Paper

De-noising, stabilizing and completing 3D reconstructions on-the-go using plane priors

  • Maksym Dzitsiuk
  • Jürgen Sturm
  • Robert Maier 0001
  • Lingni Ma
  • Daniel Cremers

Creating 3D maps on robots and other mobile devices has become a reality in recent years. Online 3D reconstruction enables many exciting applications in robotics and AR/VR gaming. However, the reconstructions are noisy and generally incomplete. Moreover, during online reconstruction, the surface changes with every newly integrated depth image which poses a significant challenge for physics engines and path planning algorithms. This paper presents a novel, fast and robust method for obtaining and using information about planar surfaces, such as walls, floors, and ceilings as a stage in 3D reconstruction based on Signed Distance Fields (SDFs). Our algorithm recovers clean and accurate surfaces, reduces the movement of individual mesh vertices caused by noise during online reconstruction and fills in the occluded and unobserved regions. We implemented and evaluated two different strategies to generate plane candidates and two strategies for merging them. Our implementation is optimized to run in real-time on mobile devices such as the Tango tablet. In an extensive set of experiments, we validated that our approach works well in a large number of natural environments despite the presence of a significant amount of occlusion, clutter and noise, which occur frequently. We further show that plane fitting enables in many cases a meaningful semantic segmentation of real-world scenes.

ICRA Conference 2017 Conference Paper

Fast odometry and scene flow from RGB-D cameras based on geometric clustering

  • Mariano Jaimez
  • Christian Kerl
  • Javier González Jiménez 0001
  • Daniel Cremers

In this paper we propose an efficient solution to jointly estimate the camera motion and a piecewise-rigid scene flow from an RGB-D sequence. The key idea is to perform a two-fold segmentation of the scene, dividing it into geometric clusters that are, in turn, classified as static or moving elements. Representing the dynamic scene as a set of rigid clusters drastically accelerates the motion estimation, while segmenting it into static and dynamic parts allows us to separate the camera motion (odometry) from the rest of motions observed in the scene. The resulting method robustly and accurately determines the motion of an RGB-D camera in dynamic environments with an average runtime of 80 milliseconds on a multi-core CPU. The code is available for public use/test.

IROS Conference 2017 Conference Paper

Multi-view deep learning for consistent semantic mapping with RGB-D cameras

  • Lingni Ma
  • Jörg Stückler
  • Christian Kerl
  • Daniel Cremers

Visual scene understanding is an important capability that enables robots to purposefully act in their environment. In this paper, we propose a novel deep neural network approach to predict semantic segmentation from RGB-D sequences. The key innovation is to train our network to predict multi-view consistent semantics in a self-supervised way. At test time, its semantic predictions can be fused more consistently in semantic keyframe maps than predictions of a network trained on individual views. We base our network architecture on a recent single-view deep learning approach to RGB and depth fusion for semantic object-class segmentation and enhance it with multi-scale loss minimization. We obtain the camera trajectory using RGB-D SLAM and warp the predictions of RGB-D images into ground-truth annotated frames in order to enforce multi-view consistency during training. At test time, predictions from multiple views are fused into keyframes. We propose and analyze several methods for enforcing multi-view consistency during training and testing. We evaluate the benefit of multi-view consistency training and demonstrate that pooling of deep features and fusion over multiple views outperforms single-view baselines on the NYUDv2 benchmark for semantic segmentation. Our end-to-end trained network achieves state-of-the-art performance on the NYUDv2 dataset in single-view segmentation as well as multi-view semantic fusion.

IROS Conference 2017 Conference Paper

Real-time trajectory replanning for MAVs using uniform B-splines and a 3D circular buffer

  • Vladyslav Usenko
  • Lukas von Stumberg
  • Andrej Pangercic
  • Daniel Cremers

In this paper, we present a real-time approach to local trajectory replanning for microaerial vehicles (MAVs). Current trajectory generation methods for multicopters achieve high success rates in cluttered environments, but assume that the environment is static and require prior knowledge of the map. In the presented study, we use the results of such planners and extend them with a local replanning algorithm that can handle unmodeled (possibly dynamic) obstacles while keeping the MAV close to the global trajectory. To ensure that the proposed approach is real-time capable, we maintain information about the environment around the MAV in an occupancy grid stored in a three-dimensional circular buffer, which moves together with the drone, and represent the trajectories by using uniform B-splines. This representation ensures that the trajectory is sufficiently smooth and simultaneously allows for efficient optimization.
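A minimal sketch of the 3D circular-buffer occupancy grid, assuming plain modulo indexing; the real system also shifts the buffer origin and clears the cells that wrap to a new world location as the MAV moves, which is omitted here:

```python
import numpy as np

class CircularGrid3D:
    """Occupancy grid stored in a fixed-size 3D ring buffer: world cell
    indices wrap around via modulo, so moving the map with the vehicle
    needs no memory copying."""

    def __init__(self, size=8, resolution=0.1):
        self.size = size
        self.resolution = resolution
        self.grid = np.zeros((size, size, size), dtype=np.uint8)

    def _slot(self, point):
        # World cell index of the point, wrapped into the buffer.
        world_idx = np.floor(np.asarray(point) / self.resolution).astype(int)
        return tuple(np.mod(world_idx, self.size))

    def set_occupied(self, point):
        self.grid[self._slot(point)] = 1

    def is_occupied(self, point):
        return bool(self.grid[self._slot(point)])
```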

ICRA Conference 2016 Conference Paper

CPA-SLAM: Consistent plane-model alignment for direct RGB-D SLAM

  • Lingni Ma
  • Christian Kerl
  • Jörg Stückler
  • Daniel Cremers

Planes are predominant features of man-made environments which have been exploited in many mapping approaches. In this paper, we propose a real-time capable RGB-D SLAM system that consistently integrates frame-to-keyframe and frame-to-plane alignment. Our method models the environment with a global plane model and - besides direct image alignment - it uses the planes for tracking and global graph optimization. This way, our method makes use of the dense image information available in keyframes for accurate short-term tracking. At the same time it uses a global model to reduce drift. Both components are integrated consistently in an expectation-maximization framework. In experiments, we demonstrate the benefits of our approach and its state-of-the-art accuracy on challenging benchmarks.

ICRA Conference 2016 Conference Paper

Direct visual-inertial odometry with stereo cameras

  • Vladyslav Usenko
  • Jakob J. Engel
  • Jörg Stückler
  • Daniel Cremers

We propose a novel direct visual-inertial odometry method for stereo cameras. Camera pose, velocity and IMU biases are simultaneously estimated by minimizing a combined photometric and inertial energy functional. This allows us to exploit the complementary nature of vision and inertial data. At the same time, and in contrast to all existing visual-inertial methods, our approach is fully direct: geometry is estimated in the form of semi-dense depth maps instead of manually designed sparse keypoints. Depth information is obtained both from static stereo - relating the fixed-baseline images of the stereo camera - and temporal stereo - relating images from the same camera, taken at different points in time. We show that our method not only outperforms vision-only and loosely coupled approaches, but also achieves more accurate results than state-of-the-art keypoint-based methods on different datasets, including rapid motion and significant illumination changes. In addition, our method provides high-fidelity semi-dense, metric reconstructions of the environment, and runs in real-time on a CPU.

NeurIPS Conference 2016 Conference Paper

Protein contact prediction from amino acid co-evolution using convolutional networks for graph-valued images

  • Vladimir Golkov
  • Marcin Skwark
  • Antonij Golkov
  • Alexey Dosovitskiy
  • Thomas Brox
  • Jens Meiler
  • Daniel Cremers

Proteins are the "building blocks of life", the most abundant organic molecules, and the central focus of most areas of biomedicine. Protein structure is strongly related to protein function, thus structure prediction is a crucial task on the way to solve many biological questions. A contact map is a compact representation of the three-dimensional structure of a protein via the pairwise contacts between the amino acids constituting the protein. We use a convolutional network to calculate protein contact maps from inferred statistical coupling between positions in the protein sequence. The input to the network has an image-like structure amenable to convolutions, but every "pixel", instead of color channels, contains a bipartite undirected edge-weighted graph. We propose several methods for treating such "graph-valued images" in a convolutional network. The proposed method outperforms state-of-the-art methods by a large margin. It also allows for great flexibility with regard to the input data, which makes it useful for studying a wide range of problems.

ICRA Conference 2016 Conference Paper

Stream-based Active Learning for efficient and adaptive classification of 3D objects

  • Alexander Narr
  • Rudolph Triebel
  • Daniel Cremers

We present a new Active Learning approach for classifying objects from streams of 3D point cloud data. The major problems here are the non-uniform occurrence of class instances and the unbalanced numbers of samples per class. We show that standard online learning methods based on decision trees perform comparably poorly for such data streams, which are, however, particularly relevant for mobile robots that need to learn semantics persistently. To address this, we use Mondrian forests (MF), a recent online learning algorithm that is independent of the data order. We present an extension of that algorithm and show that MFs are less overconfident than standard Random Forests. In experiments on the KITTI benchmark, we show that this leads to a substantially improved classification performance for data streams, rendering our approach very attractive for lifelong robot learning applications.

ICRA Conference 2015 Conference Paper

A primal-dual framework for real-time dense RGB-D scene flow

  • Mariano Jaimez
  • Mohamed Souiai
  • Javier González Jiménez 0001
  • Daniel Cremers

This paper presents the first method to compute dense scene flow in real-time for RGB-D cameras. It is based on a variational formulation where brightness constancy and geometric consistency are imposed. Accounting for the depth data provided by RGB-D cameras, regularization of the flow field is imposed on the 3D surface (or set of surfaces) of the observed scene instead of on the image plane, leading to more geometrically consistent results. The minimization problem is efficiently solved by a primal-dual algorithm which is implemented on a GPU, achieving previously unseen temporal performance. Several tests have been conducted to compare our approach with a state-of-the-art work (RGB-D flow) where quantitative and qualitative results are evaluated. Moreover, an additional set of experiments has been carried out to show the applicability of our work to estimate motion in real-time. Results demonstrate the accuracy of our approach, which outperforms the RGB-D flow, and which is able to estimate heterogeneous and non-rigid motions at a high frame rate.

ICRA Conference 2015 Conference Paper

Active online confidence boosting for efficient object classification

  • Dennis Mund
  • Rudolph Triebel
  • Daniel Cremers

We present a novel efficient algorithm for object classification. Our method is based on the active learning framework, in which training and classification are performed in loops, and new ground truth labels are queried from the supervisor in each loop. Our underlying classifier is from the family of boosting methods, but in contrast to earlier methods, our Confidence Boosting particularly focuses on misclassified samples that are assigned high classification confidence. We show that weighting these samples more than others leads to a decrease of overconfidence, for which we give a formal definition. As a result, our classifier is better suited for active learning, leading to steeper learning curves and less required label queries. We show the benefits of our approach on standard data sets from machine learning and robotics.

IROS Conference 2015 Conference Paper

Large-scale direct SLAM for omnidirectional cameras

  • David Caruso
  • Jakob J. Engel
  • Daniel Cremers

We propose a real-time, direct monocular SLAM method for omnidirectional or wide field-of-view fisheye cameras. Both tracking (direct image alignment) and mapping (pixel-wise distance filtering) are directly formulated for the unified omnidirectional model, which can model central imaging devices with a field of view above 180 °. This is in contrast to existing direct mono-SLAM approaches like DTAM or LSD-SLAM, which operate on rectified images, in practice limiting the field of view to around 130 ° diagonally. Not only does this allow us to observe - and reconstruct - a larger portion of the surrounding environment, but it also makes the system more robust to degenerate (rotation-only) movement. The two main contributions are (1) the formulation of direct image alignment for the unified omnidirectional model, and (2) a fast yet accurate approach to incremental stereo directly on distorted images. We evaluate our framework on real-world sequences taken with a 185 ° fisheye lens, and compare it to a rectified and a piecewise rectified approach.

IROS Conference 2015 Conference Paper

Large-scale direct SLAM with stereo cameras

  • Jakob J. Engel
  • Jörg Stückler
  • Daniel Cremers

We propose a novel Large-Scale Direct SLAM algorithm for stereo cameras (Stereo LSD-SLAM) that runs in real-time at high frame rate on standard CPUs. In contrast to sparse interest-point based methods, our approach aligns images directly based on the photoconsistency of all high-contrast pixels, including corners, edges and high texture areas. It concurrently estimates the depth at these pixels from two types of stereo cues: Static stereo through the fixed-baseline stereo camera setup as well as temporal multi-view stereo exploiting the camera motion. By incorporating both disparity sources, our algorithm can even estimate depth of pixels that are under-constrained when only using fixed-baseline stereo. Using a fixed baseline, on the other hand, avoids scale-drift that typically occurs in pure monocular SLAM. We furthermore propose a robust approach to enforce illumination invariance, capable of handling aggressive brightness changes between frames - greatly improving the performance in realistic settings. In experiments, we demonstrate state-of-the-art results on stereo SLAM benchmarks such as KITTI and on challenging datasets from the EuRoC Challenge 3 for micro aerial vehicles.

IROS Conference 2015 Conference Paper

Semi-supervised online learning for efficient classification of objects in 3D data streams

  • Ye Tao
  • Rudolph Triebel
  • Daniel Cremers

We present a novel learning algorithm especially designed for challenging, large-scale classification problems in mobile robotics. Our method addresses two important aims: first, it reduces the required amount of interaction with a human supervisor, which increases the level of autonomy of the learning process. And second, it has the capability to update its internal representation online with every new observed data sample, which makes it adaptive to new environments. The proposed method is based on a combination of two established methods, namely Online Star Clustering and Label Propagation, but it extends and modifies these in such a way that significant shortcomings such as classification inaccuracy and run time inefficiency can be resolved. In experiments on large benchmark data sets, we show that our approach can quickly learn to classify 3D objects with a significantly reduced amount of required ground truth labels for training.

ICRA Conference 2014 Conference Paper

Event-based 3D SLAM with a depth-augmented dynamic vision sensor

  • David Weikersdorfer
  • David B. Adrian
  • Daniel Cremers
  • Jörg Conradt

We present the D-eDVS, a combined event-based 3D sensor, and a novel event-based full-3D simultaneous localization and mapping algorithm which works exclusively with the sparse stream of visual data provided by the D-eDVS. The D-eDVS is a combination of the established PrimeSense RGB-D sensor and a biologically inspired embedded dynamic vision sensor. Dynamic vision sensors only react to dynamic contrast changes and output data in the form of a sparse stream of events which represent individual pixel locations. We demonstrate how an event-based dynamic vision sensor can be fused with a classic frame-based RGB-D sensor to produce a sparse stream of depth-augmented 3D points. The advantages of a sparse, event-based stream are a much smaller amount of generated data, thus more efficient resource usage, and a continuous representation of motion allowing lag-free tracking. Our event-based SLAM algorithm is highly efficient and runs 20 times faster than real time, provides localization updates at several hundred Hertz, and produces excellent results. We compare our method against ground truth from an external tracking system and two state-of-the-art algorithms on a new dataset which we release in combination with this paper.

ICRA Conference 2014 Conference Paper

Volumetric 3D mapping in real-time on a CPU

  • Frank Steinbrücker
  • Jürgen Sturm
  • Daniel Cremers

In this paper we propose a novel volumetric multi-resolution mapping system for RGB-D images that runs on a standard CPU in real-time. Our approach generates a textured triangle mesh from a signed distance function that it continuously updates as new RGB-D images arrive. We propose to use an octree as the primary data structure which allows us to represent the scene at multiple scales. Furthermore, it allows us to grow the reconstruction volume dynamically. As most space is either free or unknown, we allocate and update only those voxels that are located in a narrow band around the observed surface. In contrast to a regular grid, this approach saves enormous amounts of memory and computation time. The major challenge is to generate and maintain a consistent triangle mesh, as neighboring cells in the octree are more difficult to find and may have different resolutions. To remedy this, we present in this paper a novel algorithm that keeps track of these dependencies, and efficiently updates corresponding parts of the triangle mesh. In our experiments, we demonstrate the real-time capability on a large set of RGB-D sequences. As our approach does not require a GPU, it is well suited for applications on mobile or flying robots with limited computational resources.
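
The narrow-band idea the abstract describes can be illustrated with a generic signed-distance-function update: only voxels within a truncation band around the observed surface are touched, each as a running weighted average. This is a standard TSDF sketch, not the paper's octree-based implementation; names and parameters are illustrative.

```python
import numpy as np

def update_tsdf(tsdf, weight, voxel_depths, measured_depth, trunc=0.05):
    """Weighted-average TSDF update for voxels along one camera ray (sketch).

    tsdf, weight:  per-voxel value and integration-weight arrays.
    voxel_depths:  depth of each voxel center along the ray.
    measured_depth: surface depth observed in the current frame.
    Only voxels inside the truncation band around the surface are updated.
    """
    sdf = measured_depth - voxel_depths            # signed distance to surface
    band = np.abs(sdf) <= trunc                    # narrow band only
    d = np.clip(sdf[band] / trunc, -1.0, 1.0)      # truncated, normalized
    w = weight[band]
    tsdf[band] = (tsdf[band] * w + d) / (w + 1.0)  # running weighted mean
    weight[band] = w + 1.0
    return tsdf, weight
```

Because voxels outside the band are never allocated or touched, the memory and compute savings over a dense regular grid are exactly what makes a CPU real-time system plausible.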

IROS Conference 2013 Conference Paper

Dense visual SLAM for RGB-D cameras

  • Christian Kerl
  • Jürgen Sturm
  • Daniel Cremers

In this paper, we propose a dense visual SLAM method for RGB-D cameras that minimizes both the photometric and the depth error over all pixels. In contrast to sparse, feature-based methods, this allows us to better exploit the available information in the image data which leads to higher pose accuracy. Furthermore, we propose an entropy-based similarity measure for keyframe selection and loop closure detection. From all successful matches, we build up a graph that we optimize using the g2o framework. We evaluated our approach extensively on publicly available benchmark datasets, and found that it performs well in scenes with low texture as well as low structure. In direct comparison to several state-of-the-art methods, our approach yields a significantly lower trajectory error. We release our software as open-source.

IROS Conference 2013 Conference Paper

FollowMe: Person following and gesture recognition with a quadrocopter

  • Tayyab Naseer
  • Jürgen Sturm
  • Daniel Cremers

In this paper, we present an approach that allows a quadrocopter to follow a person and to recognize simple gestures using an onboard depth camera. This enables novel applications such as hands-free filming and picture taking. The problem of tracking a person with an onboard camera however is highly challenging due to the self-motion of the platform. To overcome this problem, we stabilize the depth image by warping it to a virtual-static camera, using the estimated pose of the quadrocopter obtained from vision and inertial sensors using an Extended Kalman filter. We show that such a stabilized depth video is well suited for use with existing person trackers such as the OpenNI tracker. Using this approach, the quadrocopter not only obtains the position and orientation of the tracked person, but also the full body pose — which can then, for example, be used to recognize hand gestures to control the quadrocopter's behaviour. We implemented a small set of example commands (“follow me”, “take picture”, “land”), and generated corresponding motion commands. We demonstrate the practical performance of our approach in an extensive set of experiments with a quadrocopter. Although our current system is limited to indoor environments and small motions due to the restrictions of the used depth sensor, it indicates that there is large potential for such applications in the near future.

ICRA Conference 2013 Conference Paper

Robust odometry estimation for RGB-D cameras

  • Christian Kerl
  • Jürgen Sturm
  • Daniel Cremers

The goal of our work is to provide a fast and accurate method to estimate the camera motion from RGB-D images. Our approach registers two consecutive RGB-D frames directly upon each other by minimizing the photometric error. We estimate the camera motion using non-linear minimization in combination with a coarse-to-fine scheme. To allow for noise and outliers in the image data, we propose to use a robust error function that reduces the influence of large residuals. Furthermore, our formulation allows for the inclusion of a motion model which can be based on prior knowledge, temporal filtering, or additional sensors like an IMU. Our method is attractive for robots with limited computational resources as it runs in real-time on a single CPU core and has a small, constant memory footprint. In an extensive set of experiments carried out both on a benchmark dataset and synthetic data, we demonstrate that our approach is more accurate and robust than previous methods. We provide our software under an open source license.
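
The robust error function mentioned above can be illustrated with a standard choice such as the Huber weighting used in iteratively reweighted least squares: small residuals keep full weight, large (likely outlier) residuals are down-weighted. This is a generic stand-in, not necessarily the exact function the paper employs.

```python
import numpy as np

def huber_weights(residuals, delta=1.345):
    """Per-residual IRLS weights for the Huber cost (illustrative).

    Residuals with |r| <= delta stay in the quadratic regime (weight 1);
    larger residuals fall in the linear regime and get weight delta/|r|,
    which reduces the influence of outliers on the pose estimate.
    """
    r = np.abs(residuals)
    w = np.ones_like(r)
    mask = r > delta
    w[mask] = delta / r[mask]
    return w
```

In a direct-odometry loop, these weights would multiply the per-pixel photometric residuals inside each Gauss-Newton iteration of the coarse-to-fine scheme.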

IROS Conference 2012 Conference Paper

A benchmark for the evaluation of RGB-D SLAM systems

  • Jürgen Sturm
  • Nikolas Engelhard
  • Felix Endres
  • Wolfram Burgard
  • Daniel Cremers

In this paper, we present a novel benchmark for the evaluation of RGB-D SLAM systems. We recorded a large set of image sequences from a Microsoft Kinect with highly accurate and time-synchronized ground truth camera poses from a motion capture system. The sequences contain both the color and depth images in full sensor resolution (640 × 480) at video frame rate (30 Hz). The ground-truth trajectory was obtained from a motion-capture system with eight high-speed tracking cameras (100 Hz). The dataset consists of 39 sequences that were recorded in an office environment and an industrial hall. The dataset covers a large variety of scenes and camera motions. We provide sequences for debugging with slow motions as well as longer trajectories with and without loop closures. Most sequences were recorded from a handheld Kinect with unconstrained 6-DOF motions but we also provide sequences from a Kinect mounted on a Pioneer 3 robot that was manually navigated through a cluttered indoor environment. To stimulate the comparison of different approaches, we provide automatic evaluation tools both for the evaluation of drift of visual odometry systems and the global pose error of SLAM systems. The benchmark website [1] contains all data, detailed descriptions of the scenes, specifications of the data formats, sample code, and evaluation tools.
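
The global pose error mentioned above can be illustrated with a much-simplified absolute trajectory error: align the estimated trajectory to ground truth and report the RMSE of per-pose position differences. The sketch below uses translation-only alignment for brevity (evaluation tools typically perform a full rigid-body alignment), so treat it as illustrative rather than a reimplementation of the benchmark's scripts.

```python
import numpy as np

def ate_rmse(gt, est):
    """Simplified absolute trajectory error (sketch).

    gt, est: (N, 3) arrays of time-associated camera positions.
    Aligns the estimate to ground truth by translation only, then
    returns the RMSE of the per-pose position differences.
    """
    est_aligned = est - est.mean(axis=0) + gt.mean(axis=0)
    err = np.linalg.norm(gt - est_aligned, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```

A trajectory that differs from ground truth by a constant offset scores zero under this metric, which is the desired behavior: only the shape of the trajectory, not its absolute placement, is penalized.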

ICRA Conference 2012 Conference Paper

A generalized framework for opening doors and drawers in kitchen environments

  • Thomas Rühr
  • Jürgen Sturm
  • Dejan Pangercic
  • Michael Beetz
  • Daniel Cremers

In this paper, we present a generalized framework for robustly operating previously unknown cabinets in kitchen environments. Our framework consists of the following four components: (1) a module for detecting both Lambertian and non-Lambertian (i.e., specular) handles, (2) a module for opening and closing novel cabinets using impedance control and for learning their kinematic models, (3) a module for storing and retrieving information about these objects in the map, and (4) a module for reliably operating cabinets of which the kinematic model is known. The presented work is the result of a collaboration of three PR2 beta sites. We rigorously evaluated our approach on 29 cabinets in five real kitchens located at our institutions. These kitchens contained 13 drawers, 12 doors, 2 refrigerators and 2 dishwashers. We evaluated the overall performance of detecting the handle of a novel cabinet, operating it and storing its model in a semantic map. We found that our approach was successful in 51.9% of all 104 trials. With this work, we contribute a well-tested building block of open-source software for future robotic service applications.

ICRA Conference 2012 Conference Paper

An evaluation of the RGB-D SLAM system

  • Felix Endres
  • Jürgen Hess 0001
  • Nikolas Engelhard
  • Jürgen Sturm
  • Daniel Cremers
  • Wolfram Burgard

We present an approach to simultaneous localization and mapping (SLAM) for RGB-D cameras like the Microsoft Kinect. Our system concurrently estimates the trajectory of a hand-held Kinect and generates a dense 3D model of the environment. We present the key features of our approach and evaluate its performance thoroughly on a recently published dataset, including a large set of sequences of different scenes with varying camera speeds and illumination conditions. In particular, we evaluate the accuracy, robustness, and processing time for three different feature descriptors (SIFT, SURF, and ORB). The experiments demonstrate that our system can robustly deal with difficult data in common indoor scenarios while being fast enough for online operation. Our system is fully available as open-source.

IROS Conference 2012 Conference Paper

Camera-based navigation of a low-cost quadrocopter

  • Jakob J. Engel
  • Jürgen Sturm
  • Daniel Cremers

In this paper, we describe a system that enables a low-cost quadrocopter coupled with a ground-based laptop to navigate autonomously in previously unknown and GPS-denied environments. Our system consists of three components: a monocular SLAM system, an extended Kalman filter for data fusion and state estimation, and a PID controller to generate steering commands. Next to a working system, the main contribution of this paper is a novel, closed-form solution to estimate the absolute scale of the generated visual map from inertial and altitude measurements. In an extensive set of experiments, we demonstrate that our system is able to navigate in previously unknown environments at absolute scale without requiring artificial markers or external sensors. Furthermore, we show (1) its robustness to temporary loss of visual tracking and significant delays in the communication process, (2) the elimination of odometry drift as a result of the visual SLAM system and (3) accurate, scale-aware pose estimation and navigation.
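
The flavor of such a scale estimate can be seen in a simple least-squares fit between paired visual and metric distance measurements. This is a simplified sketch of the general idea, not the paper's exact closed-form estimator, and the function name is illustrative.

```python
import numpy as np

def estimate_scale(visual, metric):
    """Least-squares scale fit (illustrative).

    Given paired distances traveled, measured in the (scale-free) visual
    map and in metric units from inertial/altitude sensors, return the
    scale lambda minimizing sum((metric - lambda * visual)^2).
    """
    visual = np.asarray(visual, dtype=float)
    metric = np.asarray(metric, dtype=float)
    return float(visual @ metric / (visual @ visual))
```

Multiplying all visual-map coordinates by the fitted scale yields a metric map, which is what allows the controller to issue commands in absolute units.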

IROS Conference 2012 Conference Paper

Real-time human motion tracking using multiple depth cameras

  • Licong Zhang
  • Jürgen Sturm
  • Daniel Cremers
  • Dongheui Lee

In this paper, we consider the problem of tracking human motion with a 22-DOF kinematic model from depth images. In contrast to existing approaches, our system naturally scales to multiple sensors. The motivation behind our approach, termed Multiple Depth Camera Approach (MDCA), is that by using several cameras, we can significantly improve the tracking quality and reduce ambiguities caused, for example, by occlusions. By fusing the depth images of all available cameras into one joint point cloud, we can seamlessly incorporate the available information from multiple sensors into the pose estimation. To track the high-dimensional human pose, we employ state-of-the-art annealed particle filtering and partition sampling. We compute the particle likelihood based on the truncated signed distance of each observed point to a parameterized human shape model. We apply a coarse-to-fine scheme to recognize a wide range of poses to initialize the tracker. In our experiments, we demonstrate that our approach can accurately track human motion in real-time (15 Hz) on a GPGPU. In direct comparison to two existing trackers (OpenNI, Microsoft Kinect SDK), we found that our approach is significantly more robust for unconstrained motions and under (partial) occlusions.