Author name cluster

Federico Tombari

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

55 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization

  • Yan Li
  • Ze Yang
  • Keisuke Tateno
  • Federico Tombari
  • Liang Zhao
  • Gim Hee Lee

Minimal parametrization of 3D lines plays a critical role in camera localization and structural mapping. Existing representations in robotics and computer vision predominantly handle independent lines, overlooking structural regularities such as sets of parallel lines that are pervasive in man-made environments. This paper introduces RiemanLine, a unified minimal representation for 3D lines formulated on Riemannian manifolds that jointly accommodates both individual lines and parallel-line groups. Our key idea is to decouple each line landmark into global and local components: a shared vanishing direction optimized on the unit sphere, and scaled normal vectors constrained on orthogonal subspaces, enabling compact encoding of structural regularities. For n parallel lines, the proposed representation reduces the parameter space from 4n (orthonormal form) to 2n+2, naturally embedding parallelism without explicit constraints. We further integrate this parameterization into a factor graph framework, allowing global direction alignment and local reprojection optimization within a unified manifold-based bundle adjustment. Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks demonstrate that our method achieves significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.
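The parameter counts quoted in the abstract above can be checked with a few lines of arithmetic. This is only a sketch of the counting, not the paper's implementation: the function names are hypothetical, and reading "2n+2" as two parameters for the shared direction on the unit sphere plus two per line is an assumption consistent with the stated totals.

```python
def orthonormal_param_count(n: int) -> int:
    """Parameters for n independent lines in the 4-DoF orthonormal form (4n)."""
    return 4 * n


def shared_direction_param_count(n: int) -> int:
    """Parameters for n parallel lines as quoted in the abstract (2n + 2).

    Assumed breakdown: 2 for the shared direction on the unit sphere,
    plus 2 per line for its local component.
    """
    return 2 * n + 2


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        full = orthonormal_param_count(n)
        compact = shared_direction_param_count(n)
        print(f"n={n}: 4n={full}, 2n+2={compact}, saved={full - compact}")
```

For a single line the two counts coincide (4 parameters each); the saving of 2n - 2 parameters only appears once at least two lines share a direction.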

ICLR Conference 2025 Conference Paper

CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation

  • Nikolai Kalischek
  • Michael Oechsle
  • Fabian Manhardt
  • Philipp Henzler
  • Konrad Schindler
  • Federico Tombari

We introduce a novel method for generating 360° panoramas from text prompts or images. Our approach leverages recent advances in 3D generation by employing multi-view diffusion models to jointly synthesize the six faces of a cubemap. Unlike previous methods that rely on processing equirectangular projections or autoregressive generation, our method treats each face as a standard perspective image, simplifying the generation process and enabling the use of existing multi-view diffusion models. We demonstrate that these models can be adapted to produce high-quality cubemaps without requiring correspondence-aware attention layers. Our model allows for fine-grained text control, generates high-resolution panorama images, and generalizes well beyond its training set, whilst achieving state-of-the-art results, both qualitatively and quantitatively.

NeurIPS Conference 2025 Conference Paper

Gatekeeper: Improving Model Cascades Through Confidence Tuning

  • Stephan Rabanser
  • Nathalie Rauschmayr
  • Achin Kulshrestha
  • Petra Poklukar
  • Wittawat Jitkrittum
  • Sean Augenstein
  • Congchao Wang
  • Federico Tombari

Large-scale machine learning models deliver strong performance across a wide range of tasks but come with significant computational and resource constraints. To mitigate these challenges, local smaller models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work, we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy and is broadly applicable across various tasks and domains without any architectural changes. We evaluated our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
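The deferral setup described in the abstract above can be illustrated with a generic confidence-threshold cascade. This is a minimal sketch of the inference-time routing only, not the Gatekeeper loss itself, which the abstract does not specify; the function names, the toy models, and the `threshold` value are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: the small model returns class probabilities,
# the large model returns a class index directly.
SmallModel = Callable[[str], List[float]]
LargeModel = Callable[[str], int]


def cascade_predict(
    x: str,
    small_model: SmallModel,
    large_model: LargeModel,
    threshold: float = 0.8,
) -> Tuple[int, str]:
    """Answer with the small model when it is confident, otherwise defer."""
    probs = small_model(x)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "small"
    return large_model(x), "large"


def toy_small_model(x: str) -> List[float]:
    # Toy stand-in: confident on "easy" inputs, uncertain otherwise.
    return [0.9, 0.1] if x == "easy" else [0.55, 0.45]


def toy_large_model(x: str) -> int:
    # Toy stand-in: always predicts class 1.
    return 1


if __name__ == "__main__":
    print(cascade_predict("easy", toy_small_model, toy_large_model))
    print(cascade_predict("hard", toy_small_model, toy_large_model))
```

A routing rule of this kind is only as good as the small model's confidence scores, which is the quantity the abstract says the proposed loss is designed to tune.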

NeurIPS Conference 2025 Conference Paper

Learning Neural Exposure Fields for View Synthesis

  • Michael Niemeyer
  • Fabian Manhardt
  • Marie-Julie Rakotosaona
  • Michael Oechsle
  • Christina Tsalicoglou
  • Keisuke Tateno
  • Jonathan Barron
  • Federico Tombari

Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per-image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. At the core of our method, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need for post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks, improving by over 55% over best-performing baselines.

AAAI Conference 2025 Conference Paper

Learning to Prompt with Text Only Supervision for Vision-Language Models

  • Muhammad Uzair Khattak
  • Muhammad Ferjad Naeem
  • Muzammal Naseer
  • Luc Van Gool
  • Federico Tombari

Foundational vision-language models like CLIP are emerging as a promising paradigm in vision due to their excellent generalization. However, adapting these models for downstream tasks while maintaining their generalization remains challenging. In literature, one branch of methods adapts CLIP by learning prompts using images. While effective, these methods often rely on image-label data, which is not always practical, and struggle to generalize to new datasets due to overfitting on few-shot source data. Another approach explores training-free methods by generating class captions from large language models (LLMs) and performing prompt ensembling, but these methods often produce static, class-specific prompts that cannot be transferred to new classes and incur additional costs by generating LLM descriptions for each class separately. In this work, we aim to combine the strengths of both approaches by learning prompts using only text data derived from LLMs. As supervised training of prompts in the image-free setup is non-trivial, we develop a language-only efficient training approach that enables prompts to distill rich contextual knowledge from LLM data. Furthermore, by mapping the LLM contextual text data within the learned prompts, our approach enables zero-shot transfer of prompts to new classes and datasets, potentially reducing the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized and transferable prompts for image tasks using only text data. We perform evaluations on 4 benchmarks, where ProText improves over ensembling methods while being competitive with those using labeled images.

ICRA Conference 2025 Conference Paper

LiLoc: Lifelong Localization Using Adaptive Submap Joining and Egocentric Factor Graph

  • Yixin Fang
  • Yanyan Li
  • Kun Qian
  • Federico Tombari
  • Yue Wang
  • Gim Hee Lee

This paper proposes a versatile graph-based lifelong localization framework using LiDAR, LiLoc, which enhances its timeliness by maintaining a single central session while improving accuracy through multi-modal factors between the central and subsidiary sessions. First, an adaptive submap joining strategy is employed to generate prior submaps (keyframes and poses) for the central session, and to provide priors for subsidiaries when constraints are needed for robust localization. Next, a coarse-to-fine pose initialization for subsidiary sessions is performed using vertical recognition and ICP refinement in the global coordinate frame. To elevate the accuracy of subsequent localization, we propose an egocentric factor graph (EFG) module that integrates the IMU preintegration, LiDAR odometry and scan match factors in a joint optimization manner. Specifically, the scan match factors are constructed by a novel propagation model that efficiently distributes the prior constraints as edges to the relevant prior pose nodes, weighted by noises based on keyframe registration errors. Additionally, the framework supports flexible switching between two modes: relocalization (RLM) and incremental localization (ILM) based on the proposed overlap-based mechanism to select or update the prior submaps from the central session. The proposed LiLoc is tested on public and custom datasets, demonstrating accurate localization performance against state-of-the-art methods. Our code will be publicly available at https://github.com/Yixin-F/LiLoc.

NeurIPS Conference 2025 Conference Paper

LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

  • Jonas Kulhanek
  • Marie-Julie Rakotosaona
  • Fabian Manhardt
  • Christina Tsalicoglou
  • Michael Niemeyer
  • Torsten Sattler
  • Songyou Peng
  • Federico Tombari

In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.

NeurIPS Conference 2025 Conference Paper

Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

  • Gaia Di Lorenzo
  • Federico Tombari
  • Marc Pollefeys
  • Daniel Barath

Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g., images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.

ICLR Conference 2025 Conference Paper

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

  • Haiyang Wang
  • Yue Fan
  • Muhammad Ferjad Naeem
  • Yongqin Xian
  • Jan Eric Lenssen
  • Liwei Wang
  • Federico Tombari
  • Bernt Schiele

Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce Tokenformer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.git.

NeurIPS Conference 2025 Conference Paper

Video Perception Models for 3D Scene Synthesis

  • Rui Huang
  • Guangyao Zhai
  • Zuria Bauer
  • Marc Pollefeys
  • Federico Tombari
  • Leonidas Guibas
  • Gao Huang
  • Francis Engelmann

Automating the expert-dependent and labor-intensive task of 3D scene synthesis would significantly benefit fields such as architectural design, robotics simulation, and virtual reality. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or strong visual priors from image generation models. However, current LLMs exhibit limited 3D spatial reasoning, undermining the realism and global coherence of synthesized scenes, while image-generation-based methods often constrain viewpoint control and introduce multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene. This enables flexible scene synthesis with high realism and structural consistency. For a more sufficient evaluation on coherence and plausibility, we further introduce First-Person View Score (FPVScore), utilizing a continuous first-person perspective to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios.

ICLR Conference 2024 Conference Paper

Denoising Diffusion via Image-Based Rendering

  • Titas Anciukevicius
  • Fabian Manhardt
  • Federico Tombari
  • Paul Henderson

Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks or depths. This supports 3D reconstruction and generation in a unified architecture. Third, we develop a principled approach to avoid trivial 3D solutions when integrating the image-based rendering with the diffusion model, by dropping out representations of some images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction.

IROS Conference 2024 Conference Paper

DNS-SLAM: Dense Neural Semantic-Informed SLAM

  • Kunyi Li
  • Michael Niemeyer
  • Nassir Navab
  • Federico Tombari

In recent years, coordinate-based neural implicit representations have shown promising results for the task of Simultaneous Localization and Mapping (SLAM). While achieving impressive performance on small synthetic scenes, these methods often suffer from losing details, especially for complex real-world scenes. In this work, we introduce DNS-SLAM, a novel neural RGB-D semantic SLAM approach featuring a hybrid representation. Relying only on 2D semantic priors, we propose the first semantic neural SLAM method that trains class-wise scene representations while providing stable camera tracking at the same time. Our method integrates multi-view geometry constraints with image-based feature extraction to improve appearance details and to output color, occupancy, and semantic class information, enabling many downstream applications. To further enable fast tracking, we introduce a lightweight coarse scene representation which is trained in a self-supervised manner in latent space. Our experimental results achieve state-of-the-art performance on both synthetic data and real-world data tracking while maintaining a commendable operational speed on off-the-shelf hardware. Further, our method outputs class-wise decomposed reconstructions with better texture, capturing appearance and geometric details.

ICML Conference 2024 Conference Paper

Extracting Training Data From Document-Based VQA Models

  • Francesco Pinto
  • Nathalie Rauschmayr
  • Florian Tramèr
  • Philip H. S. Torr
  • Federico Tombari

Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image). In this work, we show these models can memorize responses for training samples and regurgitate them even when the relevant visual information has been removed. This includes Personal Identifiable Information (PII) repeated once in the training set, indicating these models could divulge memorised sensitive information and therefore pose a privacy risk. We quantitatively measure the extractability of information in controlled experiments and differentiate between cases where it arises from generalization capabilities or from memorization. We further investigate the factors that influence memorization across multiple state-of-the-art models and propose an effective heuristic countermeasure that empirically prevents the extractability of PII.

IROS Conference 2024 Conference Paper

Neural Semantic Map-Learning for Autonomous Vehicles

  • Markus Herb
  • Nassir Navab
  • Federico Tombari

Autonomous vehicles demand detailed maps to maneuver reliably through traffic, which need to be kept up-to-date to ensure a safe operation. A promising way to adapt the maps to the ever-changing road-network is to use crowd-sourced data from a fleet of vehicles. In this work, we present a mapping system that fuses local submaps gathered from a fleet of vehicles at a central instance to produce a coherent map of the road environment including drivable area, lane markings, poles, obstacles and more as a 3D mesh. Each vehicle contributes locally reconstructed submaps as lightweight meshes, making our method applicable to a wide range of reconstruction methods and sensor modalities. Our method jointly aligns and merges the noisy and incomplete local submaps using a scene-specific Neural Signed Distance Field, which is supervised using the submap meshes to predict a fused environment representation. We leverage memory-efficient sparse feature-grids to scale to large areas and introduce a confidence score to model uncertainty in scene reconstruction. Our approach is evaluated on two datasets with different local mapping methods, showing improved pose alignment and reconstruction over existing methods. Additionally, we demonstrate the benefit of multi-session mapping and examine the required amount of data to enable high-fidelity map learning for autonomous vehicles.

ICLR Conference 2024 Conference Paper

OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

  • Francis Engelmann
  • Fabian Manhardt
  • Michael Niemeyer
  • Keisuke Tateno
  • Federico Tombari

Large visual-language models (VLMs), like CLIP, enable open-set image segmentation to segment arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, i.e., where models can only segment classes from a pre-defined training set. More recently, first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of the visual-language models. Indeed, point cloud and 3D meshes typically have a lower resolution than images and the reconstructed 3D scene geometry might not project well to the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF, however our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. Our OpenNeRF further leverages NeRF’s ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.

ICRA Conference 2024 Conference Paper

Physics-Encoded Graph Neural Networks for Deformation Prediction under Contact

  • Mahdi Saleh
  • Michael Sommersperger
  • Nassir Navab
  • Federico Tombari

In robotics, it’s crucial to understand object deformation during tactile interactions. A precise understanding of deformation can elevate robotic simulations and have broad implications across different industries. We introduce a method using Physics-Encoded Graph Neural Networks (GNNs) for such predictions. Similar to robotic grasping and manipulation scenarios, we focus on modeling the dynamics between a rigid mesh contacting a deformable mesh under external forces. Our approach represents both the soft body and the rigid body within graph structures, where nodes hold the physical states of the meshes. We also incorporate cross-attention mechanisms to capture the interplay between the objects. By jointly learning geometry and physics, our model reconstructs consistent and detailed deformations. We’ve made our code and dataset public to advance research in robotic simulation and grasping.

ICRA Conference 2024 Conference Paper

SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs

  • Guangyao Zhai
  • Xiaoni Cai
  • Dianye Huang
  • Yan Di
  • Fabian Manhardt
  • Federico Tombari
  • Nassir Navab
  • Benjamin Busam

Object rearrangement is pivotal in robotic-environment interactions, representing a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that utilizes a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics, seamlessly blending the consideration of commonsense knowledge with automatic generation capabilities. SG-Bot employs a three-fold procedure (observation, imagination, and execution) to adeptly address the task. Initially, objects are discerned and extracted from a cluttered scene during the observation. These objects are first coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. Then, this scene graph informs a generative model, which forms a fine-grained goal scene considering the shape information from the initial scene and object semantics. Finally, for execution, the initial and envisioned goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.

NeurIPS Conference 2024 Conference Paper

Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

  • Matthew Zheng
  • Enis Simsar
  • Hidir Yesiltepe
  • Federico Tombari
  • Joel Simon
  • Pinar Yanardag

Text-to-image models are becoming increasingly popular, revolutionizing the landscape of digital art creation by enabling highly detailed and creative visual content generation. These models have been widely employed across various domains, particularly in art generation, where they facilitate a broad spectrum of creative expression and democratize access to artistic creation. In this paper, we introduce STYLEBREEDER, a comprehensive dataset of 6.8M images and 1.8M prompts generated by 95K users on Artbreeder, a platform that has emerged as a significant hub for creative exploration with over 13M users. We introduce a series of tasks with this dataset aimed at identifying diverse artistic styles, generating personalized content, and recommending styles based on user interests. By documenting unique, user-generated styles that transcend conventional categories like 'cyberpunk' or 'Picasso,' we explore the potential for unique, crowd-sourced styles that could provide deep insights into the collective creative psyche of users worldwide. We also evaluate different personalization methods to enhance artistic expression and introduce a style atlas, making these models available in LoRA format for public use. Our research demonstrates the potential of text-to-image diffusion models to uncover and promote unique artistic expressions, further democratizing AI in art and fostering a more diverse and inclusive artistic community. The dataset, code, and models are available at https://stylebreeder.github.io under a Public Domain (CC0) license.

NeurIPS Conference 2024 Conference Paper

UniSDF: Unifying Neural Representations for High-Fidelity 3D Reconstruction of Complex Scenes with Reflections

  • Fangjinhua Wang
  • Marie-Julie Rakotosaona
  • Michael Niemeyer
  • Richard Szeliski
  • Marc Pollefeys
  • Federico Tombari

Neural 3D scene representations have shown great potential for 3D reconstruction from 2D images. However, reconstructing real-world captures of complex scenes still remains a challenge. Existing generic 3D reconstruction methods often struggle to represent fine geometric details and do not adequately model reflective surfaces of large-scale scenes. Techniques that explicitly focus on reflective surfaces can model complex and detailed reflections by exploiting better reflection parameterizations. However, we observe that these methods are often not robust in real scenarios where non-reflective as well as reflective components are present. In this work, we propose UniSDF, a general purpose 3D reconstruction method that can reconstruct large complex scenes with reflections. We investigate both camera view as well as reflected view-based color parameterization techniques and find that explicitly blending these representations in 3D space enables reconstruction of surfaces that are more geometrically accurate, especially for reflective surfaces. We further combine this representation with a multi-resolution grid backbone that is trained in a coarse-to-fine manner, enabling faster reconstructions than prior methods. Extensive experiments on object-level datasets DTU, Shiny Blender as well as unbounded datasets Mip-NeRF 360 and Ref-NeRF real demonstrate that our method is able to robustly reconstruct complex large-scale scenes with fine details and reflective surfaces, leading to the best overall performance. Project page: https://fangjinhuawang.github.io/UniSDF.

IROS Conference 2024 Conference Paper

Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation

  • Francesco Di Felice
  • Alberto Remus
  • Stefano Gasperini
  • Benjamin Busam
  • Lionel Ott
  • Federico Tombari
  • Roland Siegwart
  • Carlo Alberto Avizzano

Estimating the pose of objects through vision is essential to make robotic platforms interact with the environment. Yet, it presents many challenges, often related to the lack of flexibility and generalizability of state-of-the-art solutions. Diffusion models are a cutting-edge neural architecture transforming 2D and 3D computer vision, showing remarkable performance in zero-shot novel-view synthesis. Such a use case is particularly intriguing for reconstructing 3D objects. However, localizing objects in unstructured environments is rather unexplored. To this end, this work presents Zero123-6D, the first work to demonstrate the utility of diffusion model-based novel-view synthesizers in enhancing RGB 6D pose estimation at category level, by integrating them with feature extraction techniques. Novel view synthesis yields a coarse pose that is refined through an online optimization method introduced in this work to deal with intra-category geometric differences. In this way, the method reduces data requirements, removes the need for depth information in zero-shot category-level 6D pose estimation, and increases performance, as demonstrated quantitatively through experiments on the CO3D dataset.

NeurIPS Conference 2023 Conference Paper

CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion

  • Guangyao Zhai
  • Evin Pınar Örnek
  • Shun-Cheng Wu
  • Yan Di
  • Federico Tombari
  • Nassir Navab
  • Benjamin Busam

Controllable scene synthesis aims to create interactive environments for numerous industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships in the scene graph while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Due to the lack of a scene graph dataset offering high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT, where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Codes and the dataset are available on the website.

NeurIPS Conference 2023 Conference Paper

DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field

  • Chenyangguang Zhang
  • Yan Di
  • Ruida Zhang
  • Guangyao Zhai
  • Fabian Manhardt
  • Federico Tombari
  • Xiangyang Ji

Reconstructing hand-held objects from a single RGB image is an important and challenging problem. Existing works utilizing Signed Distance Fields (SDF) reveal limitations in comprehensively capturing the complex hand-object interactions, since SDF is only reliable within the proximity of the target, and hence, infeasible to simultaneously encode local hand and object cues. To address this issue, we propose DDF-HO, a novel approach leveraging Directed Distance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in 3D space, consisting of an origin and a direction, to corresponding DDF values, including a binary visibility signal determining whether the ray intersects the objects and a distance value measuring the distance from origin to target in the given direction. We randomly sample multiple rays and collect local to global geometric features for them by introducing a novel 2D ray-based feature aggregation scheme and a 3D intersection-aware hand pose embedding, combining 2D-3D features to model hand-object interactions. Extensive experiments on synthetic and real-world datasets demonstrate that DDF-HO consistently outperforms all baseline methods by a large margin, especially under Chamfer Distance, with about 80% leap forward. Codes are available at https://github.com/ZhangCYG/DDFHO.
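To make the DDF definition in this abstract concrete, the following is a minimal sketch of an analytic directed distance field for a single sphere centered at the coordinate origin. It is an illustration only and not the learned, image-conditioned model of the paper; the function name `sphere_ddf` and its parameters are hypothetical.

```python
import math
from typing import Sequence, Tuple


def sphere_ddf(origin: Sequence[float],
               direction: Sequence[float],
               radius: float = 1.0) -> Tuple[bool, float]:
    """Toy DDF: map a ray (origin, direction) to (visibility, distance).

    The shape is a sphere of the given radius centered at (0, 0, 0).
    This is an analytic stand-in chosen for illustration, not DDF-HO itself.
    """
    # Normalize the direction so the returned distance is metric.
    norm = math.sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]

    # Solve |o + t d|^2 = r^2, i.e. t^2 + 2 b t + c = 0 with b = o.d.
    b = sum(o * c for o, c in zip(origin, d))
    c = sum(o * o for o in origin) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return False, math.inf  # the ray's line misses the sphere

    root = math.sqrt(disc)
    t_near, t_far = -b - root, -b + root
    if t_near >= 0.0:
        return True, t_near     # first hit in front of the origin
    if t_far >= 0.0:
        return True, t_far      # origin lies inside the sphere
    return False, math.inf      # sphere is entirely behind the ray


# A ray starting 3 units away and pointing at the sphere hits after 2 units.
visible, distance = sphere_ddf((0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
```

Unlike an SDF query, which takes only a point, the query here depends on the direction as well: the same origin returns "not visible" when the ray points away from the shape.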

ICRA Conference 2023 Conference Paper

MonoGraspNet: 6-DoF Grasping with a Single RGB Image

  • Guangyao Zhai
  • Dianye Huang
  • Shun-Cheng Wu
  • HyunJun Jung
  • Yan Di
  • Fabian Manhardt
  • Federico Tombari
  • Nassir Navab

6-DoF robotic grasping is a long-lasting but unsolved problem. Recent methods utilize strong 3D networks to extract geometric grasping representations from depth sensors, demonstrating superior accuracy on common objects but performing unsatisfactorily on photometrically challenging objects, e.g., objects in transparent or reflective materials. The bottleneck lies in that the surface of these objects cannot reflect accurate depth due to the absorption or refraction of light. In this paper, in contrast to exploiting the inaccurate depth data, we propose the first RGB-only 6-DoF grasping pipeline called MonoGraspNet that utilizes stable 2D features to simultaneously handle arbitrary object grasping and overcome the problems induced by photometrically challenging objects. MonoGraspNet leverages a keypoint heatmap and a normal map to recover the 6-DoF grasping poses represented by our novel representation parameterized with 2D keypoints with corresponding depth, grasping direction, grasping width, and angle. Extensive experiments in real scenes demonstrate that our method can achieve competitive results in grasping common objects and surpass the depth-based competitor by a large margin in grasping photometrically challenging objects. To further stimulate robotic manipulation research, we annotate and open-source a multi-view grasping dataset in the real world containing 44 sequence collections of mixed photometric complexity with nearly 20M accurate grasping labels.

NeurIPS Conference 2023 Conference Paper

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

  • Ayca Takmaz
  • Elisabetta Fedele
  • Robert Sumner
  • Marc Pollefeys
  • Federico Tombari
  • Francis Engelmann

We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D’s ability to segment object properties based on free-form queries describing geometry, affordances, and materials.

ICLR Conference 2023 Conference Paper

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

  • Andy Zeng 0001
  • Maria Attarian
  • Brian Ichter
  • Krzysztof Choromanski
  • Adrian Wong
  • Stefan Welker
  • Federico Tombari
  • Aveek Purohit

We investigate how multimodal prompt engineering can use language as the intermediate representation to combine complementary knowledge from different pretrained (potentially multimodal) language models for a variety of tasks. This approach is both distinct from and complementary to the dominant paradigm of joint multimodal training. It also recalls a traditional systems-building view as in classical NLP pipelines, but with prompting large pretrained multimodal models. We refer to these as Socratic Models (SMs): a modular class of systems in which multiple pretrained models may be composed zero-shot via multimodal-informed prompting to capture new multimodal capabilities, without additional finetuning. We show that these systems provide competitive state-of-the-art performance for zero-shot image captioning and video-to-text retrieval, and also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes), and (iii) robot perception and planning. We hope this work provides (a) results for stronger zero-shot baseline performance with analysis also highlighting their limitations, (b) new perspectives for building multimodal systems powered by large pretrained models, and (c) practical application advantages in certain regimes limited by data scarcity, training compute, or model access.

IROS Conference 2022 Conference Paper

CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning

  • Mahdi Saleh
  • Yige Wang
  • Nassir Navab
  • Benjamin Busam
  • Federico Tombari

Processing 3D data efficiently has always been a challenge. Spatial operations on large-scale point clouds, stored as sparse data, require extra cost. Attracted by the success of transformers, researchers are using multi-head attention for vision tasks. However, attention calculations in transformers come with quadratic complexity in the number of inputs and miss spatial intuition on sets like point clouds. We redesign set transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation. We propose our local attention unit, which captures features in a spatial neighborhood. We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration. Finally, to mitigate the non-heterogeneity of point clouds, we propose an efficient Multi-Scale Tokenization (MST), which extracts scale-invariant tokens for attention operations. The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods while requiring significantly fewer computations. Our proposed architecture predicts segmentation labels with around half the latency and parameter count of the previous most efficient method with comparable performance. The code is available at https://github.com/YigeWang-WHU/CloudAttention.

NeurIPS Conference 2022 Conference Paper

I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

  • Muhammad Ferjad Naeem
  • Yongqin Xian
  • Luc V Gool
  • Federico Tombari

Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes, therefore can be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. In order to distill discriminative visual words from noisy documents, we introduce a new cross-modal attention module that learns fine-grained interactions between image patches and document words. Consequently, our I2DFormer not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to localize visually relevant words in image regions. Quantitatively, we demonstrate that our I2DFormer significantly outperforms previous unsupervised semantic embeddings under both zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method leads to highly interpretable results where document words can be grounded in the image regions.

ICML Conference 2022 Conference Paper

On the Practicality of Deterministic Epistemic Uncertainty

  • Janis Postels
  • Mattia Segù
  • Tao Sun 0019
  • Luca Daniel Sieber
  • Luc Van Gool
  • Fisher Yu 0001
  • Federico Tombari

A set of novel approaches for estimating epistemic uncertainty in deep neural networks with a single forward pass has recently emerged as a valid alternative to Bayesian Neural Networks. On the premise of informative representations, these deterministic uncertainty methods (DUMs) achieve strong performance on detecting out-of-distribution (OOD) data while adding negligible computational costs at inference time. However, it remains unclear whether DUMs are well calibrated and can seamlessly scale to real-world applications - both prerequisites for their practical deployment. To this end, we first provide a taxonomy of DUMs, and evaluate their calibration under continuous distributional shifts. Then, we extend them to semantic segmentation. We find that, while DUMs scale to realistic vision tasks and perform well on OOD detection, the practicality of current methods is undermined by poor calibration under distributional shifts.

IROS Conference 2022 Conference Paper

SSP-Pose: Symmetry-Aware Shape Prior Deformation for Direct Category-Level Object Pose Estimation

  • Ruida Zhang
  • Yan Di
  • Fabian Manhardt
  • Federico Tombari
  • Xiangyang Ji

Category-level pose estimation is a challenging problem due to intra-class shape variations. Recent methods deform pre-computed shape priors to map the observed point cloud into the normalized object coordinate space and then retrieve the pose via post-processing, i.e., Umeyama's Algorithm. The shortcomings of this two-stage strategy lie in two aspects: 1) The surrogate supervision on the intermediate results cannot directly guide the learning of pose, resulting in large pose error after post-processing. 2) The inference speed is limited by the post-processing step. In this paper, to handle these shortcomings, we propose an end-to-end trainable network SSP-Pose for category-level pose estimation, which integrates shape priors into a direct pose regression network. SSP-Pose stacks four individual branches on a shared feature extractor, where two branches are designed to deform and match the prior model with the observed instance, and the other two branches are applied for directly regressing the totally 9 degrees-of-freedom pose and performing symmetry reconstruction and point-wise inlier mask prediction respectively. Consistency loss terms are then naturally exploited to align the outputs of different branches and promote the performance. During inference, only the direct pose regression branch is needed. In this manner, SSP-Pose not only learns category-level pose-sensitive characteristics to boost performance but also keeps a real-time inference speed. Moreover, we utilize the symmetry information of each category to guide the shape prior deformation, and propose a novel symmetry-aware loss to mitigate the matching ambiguity. Extensive experiments on public datasets demonstrate that SSP-Pose produces superior performance compared with competitors with a real-time inference speed at about 25Hz. The codes will be released soon.

IROS Conference 2021 Conference Paper

Content Disentanglement for Semantically Consistent Synthetic-to-Real Domain Adaptation

  • Mert Keser
  • Artem Savkin
  • Federico Tombari

Synthetic data generation is an appealing approach to generate novel traffic scenarios in autonomous driving. However, deep learning perception algorithms trained solely on synthetic data encounter serious performance drops when they are tested on real data. Such performance drops are commonly attributed to the domain gap between real and synthetic data. Domain adaptation methods that have been applied to mitigate the aforementioned domain gap achieve visually appealing results, but usually introduce semantic inconsistencies into the translated samples. In this work, we propose a novel, unsupervised, end-to-end domain adaptation network architecture that enables semantically consistent sim2real image transfer. Our method performs content disentanglement by employing shared content encoder and fixed style code.

ICRA Conference 2021 Conference Paper

Lightweight Semantic Mesh Mapping for Autonomous Vehicles

  • Markus Herb
  • Tobias Weiherer
  • Nassir Navab
  • Federico Tombari

Lightweight and semantically meaningful environment maps are crucial for many applications in robotics and autonomous driving to facilitate higher-level tasks such as navigation and planning. In this paper we present a novel approach to incrementally build a meaningful and lightweight semantic map directly as a 3D mesh from a monocular or stereo sequence. Our system leverages existing feature-based visual odometry paired with learned depth prediction and semantic image segmentation to identify and reconstruct semantically relevant environment structure. We introduce a probabilistic fusion scheme to incrementally refine and extend a 3D mesh with semantic labels for each face without intermediate voxel-based fusion. To demonstrate its effectiveness, we evaluate our system in outdoor driving scenarios with monocular depth prediction and stereo and present quantitative and qualitative reconstruction results with comparison to ground truth. Our results show that the proposed approach achieves reconstruction quality comparable to current state-of-the-art voxel-based methods while being much more lightweight both in storage and computation.

ICRA Conference 2021 Conference Paper

ManhattanSLAM: Robust Planar Tracking and Mapping Leveraging Mixture of Manhattan Frames

  • Raza Yunus
  • Yanyan Li 0001
  • Federico Tombari

In this paper, a robust RGB-D SLAM system is proposed to utilize the structural information in indoor scenes, allowing for accurate tracking and efficient dense mapping on a CPU. Prior works have used the Manhattan World (MW) assumption to estimate low-drift camera pose, in turn limiting the applications of such systems. This paper, in contrast, proposes a novel approach delivering robust tracking in MW and non-MW environments. We check orthogonal relations between planes to directly detect Manhattan Frames, modeling the scene as a Mixture of Manhattan Frames. For MW scenes, we decouple pose estimation and provide a novel drift-free rotation estimation based on Manhattan Frame observations. For translation estimation in MW scenes and full camera pose estimation in non-MW scenes, we make use of point, line and plane features for robust tracking in challenging scenes. Additionally, by exploiting plane features detected in each frame, we also propose an efficient surfel-based dense mapping strategy, which divides each image into planar and non-planar regions. Planar surfels are initialized directly from sparse planes in our map while non-planar surfels are built by extracting superpixels. We evaluate our method on public benchmarks for pose estimation, drift and reconstruction accuracy, achieving superior performance compared to other state-of-the-art methods. We will open-source our code in the future.
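The orthogonality check mentioned in this abstract can be illustrated with a small sketch. It assumes that plane normals are already available and treats any three mutually orthogonal normals (within an angular tolerance) as one Manhattan Frame candidate; the function names, the tolerance value, and the brute-force search over triples are assumptions made for illustration, and the paper's actual detection procedure may differ.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Vec3 = Sequence[float]


def _unit(v: Vec3) -> List[float]:
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def is_orthogonal(a: Vec3, b: Vec3, tol_deg: float = 5.0) -> bool:
    """True if the angle between a and b is within tol_deg of 90 degrees."""
    cos_angle = abs(sum(x * y for x, y in zip(_unit(a), _unit(b))))
    # |cos(theta)| <= cos(90 - tol) = sin(tol) means theta is near 90 degrees.
    return cos_angle <= math.sin(math.radians(tol_deg))


def find_manhattan_frames(normals: Sequence[Vec3],
                          tol_deg: float = 5.0) -> List[Tuple[int, int, int]]:
    """Return index triples of plane normals that are mutually orthogonal."""
    frames = []
    for i, j, k in combinations(range(len(normals)), 3):
        if (is_orthogonal(normals[i], normals[j], tol_deg)
                and is_orthogonal(normals[i], normals[k], tol_deg)
                and is_orthogonal(normals[j], normals[k], tol_deg)):
            frames.append((i, j, k))
    return frames


# Floor, two walls, and one slanted plane: only the first three form a frame.
plane_normals = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 0.0)]
frames = find_manhattan_frames(plane_normals)
```

A scene with several differently oriented room segments would yield several such triples, which matches the idea of modeling the scene as a mixture of frames rather than a single global one.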

ICRA Conference 2021 Conference Paper

RGB-D SLAM with Structural Regularities

  • Yanyan Li 0001
  • Raza Yunus
  • Nikolas Brasch
  • Nassir Navab
  • Federico Tombari

This work proposes a RGB-D SLAM system specifically designed for structured environments and aimed at improved tracking and mapping accuracy by relying on geometric features that are extracted from the surrounding. Structured environments offer, in addition to points, also an abundance of geometrical features such as lines and planes, which we exploit to design both the tracking and mapping components of our SLAM system. For the tracking part, we explore geometric relationships between these features based on the assumption of a Manhattan World (MW). We propose a decoupling-refinement method based on points, lines, and planes, as well as the use of Manhattan relationships in an additional pose refinement module. For the mapping part, different levels of maps from sparse to dense are reconstructed at a low computational cost. We propose an instance-wise meshing strategy to build a dense map by meshing plane instances independently. The overall performance in terms of pose estimation and reconstruction is evaluated on public benchmarks and shows improved performance compared to state-of-the-art methods. The code is released at https://github.com/yanyan-li/PlanarSLAM.

IROS Conference 2021 Conference Paper

Semantic Image Alignment for Vehicle Localization

  • Markus Herb
  • Matthias Lemberger
  • Marcel M. Schmitt
  • Alexander Kurz 0003
  • Tobias Weiherer
  • Nassir Navab
  • Federico Tombari

Accurate and reliable localization is a fundamental requirement for autonomous vehicles to use map information in higher-level tasks such as navigation or planning. In this paper, we present a novel approach to vehicle localization in dense semantic maps, including vectorized high-definition maps or 3D meshes, using semantic segmentation from a monocular camera. We formulate the localization task as a direct image alignment problem on semantic images, which allows our approach to robustly track the vehicle pose in semantically labeled maps by aligning virtual camera views rendered from the map to sequences of semantically segmented camera images. In contrast to existing visual localization approaches, the system does not require additional keypoint features, handcrafted localization landmark extractors or expensive LiDAR sensors. We demonstrate the wide applicability of our method on a diverse set of semantic mesh maps generated from stereo or LiDAR as well as manually annotated HD maps and show that it achieves reliable and accurate localization in real-time.

ICRA Conference 2021 Conference Paper

TSDF++: A Multi-Object Formulation for Dynamic Object Tracking and Reconstruction

  • Margarita Grinvald
  • Federico Tombari
  • Roland Siegwart
  • Juan I. Nieto 0001

The ability to simultaneously track and reconstruct multiple objects moving in the scene is of the utmost importance for robotic tasks such as autonomous navigation and interaction. Virtually all of the previous attempts to map multiple dynamic objects have evolved to store individual objects in separate reconstruction volumes and track the relative pose between them. While simple and intuitive, such formulation does not scale well with respect to the number of objects in the scene and introduces the need for an explicit occlusion handling strategy. In contrast, we propose a map representation that allows maintaining a single volume for the entire scene and all the objects therein. To this end, we introduce a novel multi-object TSDF formulation that can encode multiple object surfaces at any given location in the map. In a multiple dynamic object tracking and reconstruction scenario, our representation allows maintaining accurate reconstruction of surfaces even while they become temporarily occluded by other objects moving in their proximity. We evaluate the proposed TSDF++ formulation on a public synthetic dataset and demonstrate its ability to preserve reconstructions of occluded surfaces when compared to the standard TSDF map representation. Code is available at https://github.com/ethz-asl/tsdf-plusplus.
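As a rough illustration of a single volume that holds several object surfaces per location, the sketch below stores one truncated signed distance and one weight per object ID inside each voxel and updates them with the standard weighted running average used in TSDF fusion. The class name, the dictionary layout, and the default truncation distance are assumptions for illustration, not the data structure of the released TSDF++ code.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class MultiObjectVoxel:
    """One map location holding a (distance, weight) pair per object ID."""
    entries: Dict[int, Tuple[float, float]] = field(default_factory=dict)

    def integrate(self, object_id: int, sdf: float,
                  truncation: float = 0.1, weight: float = 1.0) -> None:
        """Fuse one signed-distance observation for one object.

        `weight` is assumed to be positive. The update is the usual
        weighted running average of truncated distances.
        """
        d = max(-truncation, min(truncation, sdf))  # truncate the distance
        old_d, old_w = self.entries.get(object_id, (0.0, 0.0))
        new_w = old_w + weight
        self.entries[object_id] = ((old_d * old_w + d * weight) / new_w, new_w)


voxel = MultiObjectVoxel()
voxel.integrate(object_id=1, sdf=0.05)   # object 1 observed near this voxel
voxel.integrate(object_id=1, sdf=0.50)   # truncated to 0.1 before averaging
voxel.integrate(object_id=2, sdf=-0.02)  # object 2 passes through the same voxel
```

Because the entry for object 1 is kept when object 2 is integrated, the earlier surface estimate survives a temporary occlusion, which is the behavior the abstract contrasts with a standard single-surface TSDF map.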

IROS Conference 2021 Conference Paper

Unsupervised Traffic Scene Generation with Synthetic 3D Scene Graphs

  • Artem Savkin
  • Rachid Ellouze
  • Nassir Navab
  • Federico Tombari

Image synthesis driven by computer graphics has recently achieved remarkable realism, yet synthetic image data generated this way reveals a significant domain gap with respect to real-world data. This is especially true in autonomous driving scenarios, where the gap is a critical obstacle to using synthetic data for training neural networks. We propose a method based on domain-invariant scene representation to directly synthesize traffic scene imagery without rendering. Specifically, we rely on synthetic scene graphs as our internal representation and introduce an unsupervised neural network architecture for realistic traffic scene synthesis. We enhance synthetic scene graphs with spatial information about the scene and demonstrate the effectiveness of our approach through scene manipulation.

ICRA Conference 2020 Conference Paper

Adversarial Appearance Learning in Augmented Cityscapes for Pedestrian Recognition in Autonomous Driving

  • Artem Savkin
  • Thomas Lapotre
  • Kevin Strauss
  • Uzair Akbar
  • Federico Tombari

In the autonomous driving area, synthetic data is crucial for covering specific traffic scenarios that an autonomous vehicle must handle. This data commonly introduces a domain gap between synthetic and real domains. In this paper we deploy data augmentation to generate custom traffic scenarios with VRUs in order to improve pedestrian recognition. We provide a pipeline for augmentation of the Cityscapes dataset with virtual pedestrians. In order to improve the augmentation realism of the pipeline, we present a novel generative network architecture for adversarial learning of the dataset lighting conditions. We also evaluate our approach on the tasks of semantic and instance segmentation.

ICRA Conference 2020 Conference Paper

Binary DAD-Net: Binarized Driveable Area Detection Network for Autonomous Driving

  • Alexander Frickenstein
  • Manoj Rohit Vemparala
  • Jakob Mayr
  • Naveen Shankar Nagaraja
  • Christian Unger
  • Federico Tombari
  • Walter Stechele

Driveable area detection is a key component for various applications in the field of autonomous driving (AD), such as ground-plane detection, obstacle detection and maneuver planning. Additionally, bulky and over-parameterized networks can be easily forgone and replaced with smaller networks for faster inference on embedded systems. The driveable area detection, posed as a two class segmentation task, can be efficiently modeled with slim binary networks. This paper proposes a novel binarized driveable area detection network (binary DAD-Net), which uses only binary weights and activations in the encoder, the bottleneck, and the decoder part. The latent space of the bottleneck is efficiently increased (×32→×16 downsampling) through binary dilated convolutions, learning more complex features. Along with automatically generated training data, the binary DAD-Net outperforms state-of-the-art semantic segmentation networks on public datasets. In comparison to a full-precision model, our approach has a ×14.3 reduced compute complexity on an FPGA and it requires only 0.9MB memory resources. Therefore, commodity SIMD-based AD-hardware is capable of accelerating the binary DAD-Net.

IROS Conference 2020 Conference Paper

KLIEP-based Density Ratio Estimation for Semantically Consistent Synthetic to Real Images Adaptation in Urban Traffic Scenes

  • Artem Savkin
  • Federico Tombari

Synthetic data has been applied in many deep learning based computer vision tasks. The limited performance of algorithms trained solely on synthetic data has been approached with domain adaptation techniques such as the ones based on the generative adversarial framework. We demonstrate how adversarial training alone can introduce semantic inconsistencies in translated images. To tackle this issue, we propose a density prematching strategy using a KLIEP-based density ratio estimation procedure. Finally, we show that the aforementioned strategy improves the quality of the translated images of the underlying method and their usability for the semantic segmentation task in the context of autonomous driving.

ICLR Conference 2020 Conference Paper

Restricting the Flow: Information Bottlenecks for Attribution

  • Karl Schulz
  • Leon Sixt
  • Federico Tombari
  • Tim Landgraf

Attribution methods provide insights into the decision-making of machine learning models like artificial neural networks. For a given input sample, they assign a relevance score to each individual input variable, such as the pixels of an image. In this work, we adopt the information bottleneck concept for attribution. By adding noise to intermediate feature maps, we restrict the flow of information and can quantify (in bits) how much information image regions provide. We compare our method against ten baselines using three different metrics on VGG-16 and ResNet-50, and find that our methods outperform all baselines in five out of six settings. The method’s information-theoretic foundation provides an absolute frame of reference for attribution values (bits) and a guarantee that regions scored close to zero are not necessary for the network's decision.

ICRA Conference 2019 Conference Paper

Attention-based Lane Change Prediction

  • Oliver Scheel
  • Naveen Shankar Nagaraja
  • Loren Arthur Schwarz
  • Nassir Navab
  • Federico Tombari

Lane change prediction of surrounding vehicles is a key building block of path planning. The focus has been on increasing the accuracy of prediction by posing it purely as a function estimation problem at the cost of model understandability. However, the efficacy of any lane change prediction model can be improved when both corner and failure cases are humanly understandable. We propose an attention-based recurrent model to tackle both understandability and prediction quality. We also propose metrics which reflect the discomfort felt by the driver. We show encouraging results on a publicly available dataset and proprietary fleet data.

IROS Conference 2019 Conference Paper

Crowd-sourced Semantic Edge Mapping for Autonomous Vehicles

  • Markus Herb
  • Tobias Weiherer
  • Nassir Navab
  • Federico Tombari

Highly accurate maps of the road infrastructure are a crucial cornerstone for self-driving cars to enable navigation in complex traffic scenarios. Traditional methods for creating detailed maps of road environments involve expensive survey vehicles that cannot keep up with the frequent changes in the road network. In this paper, we propose a novel method to derive detailed high-definition maps by crowd sourcing data using commodity sensors. Our system uses multi-session feature-based visual SLAM to align submaps recorded by individual vehicles on a central backend server. We reconstruct 3D boundaries of road infrastructure elements such as road markings and road boundaries from semantic object contours detected in keyframes by a neural network. The result is a concise map of semantically meaningful objects suitable both for localization and higher-level planning tasks of automated vehicles. We evaluate our method on real-world data against a globally referenced ground-truth map demonstrating a high level of detail and metric accuracy.

IROS Conference 2018 Conference Paper

Fast and Accurate Semantic Mapping through Geometric-based Incremental Segmentation

  • Yoshikatsu Nakajima
  • Keisuke Tateno
  • Federico Tombari
  • Hideo Saito 0001

We propose an efficient and scalable method for incrementally building a dense, semantically annotated 3D map in real-time. The proposed method assigns class probabilities to each region, not each element (e.g., surfel and voxel), of the 3D map which is built up through a robust SLAM framework and incrementally segmented with a geometric-based segmentation method. Differently from all other approaches, our method has a capability of running at over 30Hz while performing all processing components, including SLAM, segmentation, 2D recognition, and updating class probabilities of each segmentation label at every incoming frame, thanks to the high efficiency that characterizes the computationally intensive stages of our framework. By utilizing a specifically designed CNN to improve the frame-wise segmentation result, we can also achieve high accuracy. We validate our method on the NYUv2 dataset by comparing with the state of the art in terms of accuracy and computational efficiency, and by means of an analysis in terms of time and space complexity.

IROS Conference 2018 Conference Paper

Self-Supervised Learning of the Drivable Area for Autonomous Vehicles

  • Jakob Mayr
  • Christian Unger
  • Federico Tombari

We propose a new approach for generating training data for the task of drivable area segmentation with deep neural networks (DNN). The impressive progress of deep learning in recent years demonstrated a superior performance of DNNs over traditional machine learning and deterministic algorithms for various tasks. Nevertheless, the acquisition of large-scale datasets with associated ground truth labels still poses an expensive and labor-intensive problem. We contribute to the solution of this problem for the task of road segmentation by proposing an automatic labeling pipeline which leverages a deterministic stereo-based approach for ground plane detection to create large datasets suitable for training neural networks. Based on the popular Cityscapes [1] and KITTI dataset [2] and two off-the-shelf DNNs for semantic segmentation, we show that we can achieve good segmentation results on monocular images, which substantially exceed the performance of the algorithm employed for automatic labeling without the need of any manual annotation.

IROS Conference 2018 Conference Paper

Semantic Monocular SLAM for Highly Dynamic Environments

  • Nikolas Brasch
  • Aljaz Bozic
  • Joé Lallemand
  • Federico Tombari

Recent advances in monocular SLAM have enabled real-time capable systems which run robustly under the assumption of a static environment, but fail in the presence of dynamic scene changes and motion, since they lack explicit handling of dynamic outliers. We propose a semantic monocular SLAM framework designed to deal with highly dynamic environments, combining feature-based and direct approaches to achieve robustness under challenging conditions. The proposed approach exploits semantic information extracted from the scene within an explicit probabilistic model, which maximizes the probability for both tracking and mapping to rely on those scene parts that do not present a relative motion with respect to the camera. We show more stable pose estimation in dynamic environments and comparable performance to the state of the art on static sequences on the Virtual KITTI and Synthia datasets.

ICRA Conference 2018 Conference Paper

Situation Assessment for Planning Lane Changes: Combining Recurrent Models and Prediction

  • Oliver Scheel
  • Loren Arthur Schwarz
  • Nassir Navab
  • Federico Tombari

One of the greatest challenges towards fully autonomous cars is the understanding of complex and dynamic scenes. Such understanding is needed for planning of maneuvers, especially those that are particularly frequent such as lane changes. While in recent years advanced driver-assistance systems have made driving safer and more comfortable, these have mostly focused on car following scenarios, and less on maneuvers involving lane changes. In this work we propose a situation assessment algorithm for classifying driving situations with respect to their suitability for lane changing. For this, we propose a deep learning architecture based on a Bidirectional Recurrent Neural Network, which uses Long Short-Term Memory units, and integrates a prediction component in the form of the Intelligent Driver Model. We prove the feasibility of our algorithm on the publicly available NGSIM datasets, where we outperform existing methods.

IROS Conference 2016 Conference Paper

Incremental scene understanding on dense SLAM

  • Chi Li
  • Han Xiao
  • Keisuke Tateno
  • Federico Tombari
  • Nassir Navab
  • Gregory D. Hager

We present an architecture for online, incremental scene modeling which combines a SLAM-based scene understanding framework with semantic segmentation and object pose estimation. The core of this approach comprises a probabilistic inference scheme that predicts semantic labels for object hypotheses at each new frame. From these hypotheses, recognized scene structures are incrementally constructed and tracked. Semantic labels are inferred using a multi-domain convolutional architecture which operates on the image time series and which enables efficient propagation of features as well as robust model registration. To evaluate this architecture, we introduce a large-scale RGB-D dataset, JHUSEQ-25, as a new benchmark for sequence-based scene understanding in complex and densely cluttered scenes. This dataset contains 25 RGB-D video sequences with 100,000 labeled frames in total. We validate our method on this dataset and demonstrate improved performance of semantic segmentation and 6-DoF object pose estimation compared with methods based on a single view.

IROS Conference 2016 Conference Paper

Sensor substitution for video-based action recognition

  • Christian Rupprecht
  • Colin Lea
  • Federico Tombari
  • Nassir Navab
  • Gregory D. Hager

There are many applications where domain-specific sensing, such as accelerometers, kinematics, or force sensing, provides unique and important information for control or for analysis of motion. However, it is not always the case that these sensors can be deployed or accessed beyond laboratory environments. For example, it is possible to instrument humans or robots to measure motion in the laboratory in ways that are not possible to replicate in the wild. An alternative, which we explore in this paper, is to address situations where accurate sensing is available while training an algorithm, but for which only video is available for deployment. We present two examples of this sensory substitution methodology. The first example trains a convolutional neural network to regress real-valued signals, including robot end-effector pose, from video. The second example regresses binary signals derived from accelerometer data which signify when specific objects are in motion. We evaluate these on the JIGSAWS dataset for robotic surgery training assessment and the 50 Salads dataset for modeling complex structured cooking tasks. We evaluate the trained models for video-based action recognition and show that the trained models provide information that is comparable to the sensory signals they replace.

ICRA Conference 2016 Conference Paper

When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM

  • Keisuke Tateno
  • Federico Tombari
  • Nassir Navab

While the main trend of 3D object recognition has been to infer object detection from single views of the scene (i.e., 2.5D data), this work explores performing object recognition on 3D data that is reconstructed from multiple viewpoints, under the conjecture that such data can improve the robustness of an object recognition system. To achieve this goal, we propose a framework which is able (i) to carry out incremental real-time segmentation of a 3D scene while it is being reconstructed via Simultaneous Localization And Mapping (SLAM), and (ii) to simultaneously and incrementally carry out 3D object recognition and pose estimation on the reconstructed and segmented 3D representations. Experimental results demonstrate the advantages of our approach with respect to traditional single view-based object recognition and pose estimation approaches, as well as its usefulness in robotic perception and augmented reality applications.

IROS Conference 2015 Conference Paper

Real-time and scalable incremental segmentation on dense SLAM

  • Keisuke Tateno
  • Federico Tombari
  • Nassir Navab

This work proposes a real-time segmentation method for 3D point clouds obtained via Simultaneous Localization And Mapping (SLAM). The proposed method incrementally merges segments obtained from each input depth image in a unified global model using a SLAM framework. Differently from all other approaches, our method is able to yield segmentation of scenes reconstructed from multiple views in real-time, with a complexity that does not depend on the size of the global model. At the same time, it is also general, as it can be deployed with any frame-wise segmentation approach as well as any SLAM algorithm. We validate our proposal by a comparison with the state of the art in terms of computational efficiency and accuracy on a benchmark dataset, as well as by showing how our method can enable real-time segmentation from reconstructions of diverse real indoor environments.

IROS Conference 2014 Conference Paper

Automatic detection of pole-like structures in 3D urban environments

  • Federico Tombari
  • Nicola Fioraio
  • Tommaso Cavallari
  • Samuele Salti
  • Alioscia Petrelli
  • Luigi Di Stefano

This work aims at automatic detection of man-made pole-like structures in scans of urban environments acquired by a 3D sensor mounted on top of a moving vehicle. Pole-like structures, such as road signs and streetlights, are widespread in these environments, and their reliable detection is relevant to applications dealing with autonomous navigation, facility damage detection, city planning and maintenance. Yet, due to their characteristic thin shape, detection of man-made pole-like structures is significantly prone to noise as well as to occlusions and clutter, the latter being pervasive nuisances when scanning urban environments. Our approach is based on a “local” stage, whereby local features are classified and clustered together, followed by a “global” stage aimed at further classification of candidate entities. The proposed pipeline proves effective in experiments on a standard publicly available dataset as well as on a challenging dataset acquired during the project for validation purposes.

ICRA Conference 2013 Conference Paper

Multimodal cue integration through Hypotheses Verification for RGB-D object recognition and 6DOF pose estimation

  • Aitor Aldoma
  • Federico Tombari
  • Johann Prankl
  • Andreas Richtsfeld
  • Luigi Di Stefano
  • Markus Vincze

This paper proposes an effective algorithm for recognizing objects and accurately estimating their 6DOF pose in scenes acquired by an RGB-D sensor. The proposed method is based on a combination of different recognition pipelines, each exploiting the data in a diverse manner and generating object hypotheses that are ultimately fused together in a Hypothesis Verification stage that globally enforces geometrical consistency between model hypotheses and the scene. Such a scheme boosts the overall recognition performance as it enhances the strength of the different recognition pipelines while diminishing the impact of their specific weaknesses. The proposed method outperforms the state of the art on two challenging benchmark datasets for object recognition comprising 35 object models and, respectively, 176 and 353 scenes.

ICRA Conference 2012 Conference Paper

Supervised learning of hidden and non-hidden 0-order affordances and detection in real scenes

  • Aitor Aldoma
  • Federico Tombari
  • Markus Vincze

The ability to perceive possible interactions with the environment is a key capability of task-guided robotic agents. An important subset of possible interactions depends solely on the objects of interest and their position and orientation in the scene. We call these object-based interactions 0-order affordances and divide them into non-hidden and hidden, depending on whether the current configuration of an object in the scene renders its affordance directly usable or not. In contrast to other works, we propose that detecting affordances that are not directly perceivable increases the usefulness of robotic agents with manipulation capabilities, so that by appropriate manipulation they can modify the object configuration until the sought affordance becomes available. In this paper we show how 0-order affordances depending on the geometry of the objects and their pose can be learned using a supervised learning strategy on 3D mesh representations of the objects, allowing the use of the whole object geometry. Moreover, we show how the learned affordances can be detected in real scenes obtained with a low-cost depth sensor like the Microsoft Kinect through object recognition and 6DOF pose estimation, and present results for both learning on meshes and detection on real scenes to demonstrate the practical application of the presented approach.

IROS Conference 2011 Conference Paper

Online learning for automatic segmentation of 3D data

  • Federico Tombari
  • Luigi Di Stefano
  • Simone Giardino

We propose a method to perform automatic segmentation of 3D scenes based on a standard classifier, whose learning model is continuously improved by means of new samples, and a grouping stage, that enforces local consistency among classified labels. The new samples are automatically delivered to the system by a feedback loop based on a feature selection approach that exploits the outcome of the grouping stage. By experimental results on several datasets we demonstrate that the proposed online learning paradigm is effective in increasing the accuracy of the whole 3D segmentation thanks to the improvement of the learning model of the classifier by means of newly acquired, unsupervised data.