Arrow Research search

Author name cluster

Francis Engelmann

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers (10)

NeurIPS 2025 · Conference Paper

HouseLayout3D: A Benchmark and Training-free Baseline for 3D Layout Estimation in the Wild

  • Valentin Bieri
  • Marie-Julie Rakotosaona
  • Keisuke Tateno
  • Francis Engelmann
  • Leonidas Guibas

Current 3D layout estimation models are predominantly trained on synthetic datasets biased toward simplistic, single-floor scenes. This prevents them from generalizing to complex, multi-floor buildings, often forcing a per-floor processing approach that sacrifices global context. Few works have attempted to holistically address multi-floor layouts. In this work, we introduce HouseLayout3D, a real-world benchmark dataset, which highlights the limitations of existing research when handling expansive, architecturally complex spaces. Additionally, we propose MultiFloor3D, a baseline method leveraging recent advances in 3D reconstruction and 2D segmentation. Our approach significantly outperforms state-of-the-art methods on both our new and existing datasets. Remarkably, it does not require any layout-specific training.

NeurIPS 2025 · Conference Paper

Video Perception Models for 3D Scene Synthesis

  • Rui Huang
  • Guangyao Zhai
  • Zuria Bauer
  • Marc Pollefeys
  • Federico Tombari
  • Leonidas Guibas
  • Gao Huang
  • Francis Engelmann

Automating the expert-dependent and labor-intensive task of 3D scene synthesis would significantly benefit fields such as architectural design, robotics simulation, and virtual reality. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or strong visual priors from image generation models. However, current LLMs exhibit limited 3D spatial reasoning, undermining the realism and global coherence of synthesized scenes, while image-generation-based methods often constrain viewpoint control and introduce multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the commonsense knowledge of the 3D physical world encoded in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene. This enables flexible scene synthesis with high realism and structural consistency. For a more thorough evaluation of coherence and plausibility, we further introduce the First-Person View Score (FPVScore), which uses a continuous first-person perspective to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios.
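The abstract describes VIPScene as a staged pipeline rather than a single model. The skeleton below is purely illustrative: every stage function (generate_video, reconstruct_3d, perceive_objects) is a hypothetical stub standing in for the corresponding model, and only the staging order is taken from the abstract.

```python
# Hypothetical pipeline skeleton; all stages are stubs, not real APIs.
from dataclasses import dataclass
from typing import List

@dataclass
class PlacedObject:
    label: str
    position: tuple          # (x, y, z) in scene coordinates

def generate_video(prompt: str) -> List[str]:          # video generation stage
    return [f"frame_{i}" for i in range(8)]

def reconstruct_3d(frames: List[str]) -> dict:         # feedforward 3D reconstruction
    return {"num_points": 10_000}

def perceive_objects(frames: List[str], geom: dict) -> List[PlacedObject]:
    return [PlacedObject("sofa", (1.0, 0.0, 2.0))]     # open-vocabulary perception

def synthesize_scene(prompt: str) -> List[PlacedObject]:
    frames = generate_video(prompt)                    # text/image prompt -> video
    geom = reconstruct_3d(frames)                      # video -> 3D geometry
    return perceive_objects(frames, geom)              # per-object semantic analysis

print(synthesize_scene("a cozy living room"))
```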

ICLR 2024 · Conference Paper

AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation

  • Yuanwen Yue
  • Sabarinath Mahadevan
  • Jonas Schult
  • Francis Engelmann
  • Bastian Leibe
  • Konrad Schindler
  • Theodora Kontogianni

During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. Moreover, we also verify its practicality in real-world setups with real user studies. Project page: https://ywyue.github.io/AGILE3D.
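To make the click-as-query idea concrete, here is a minimal sketch (not the authors' code) of how user clicks could act as queries that attend to pre-computed scene features, so that only a light decoder reruns per click; all shapes and module choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

D, N = 128, 4096                          # feature dim, points (toy sizes)
scene_feats = torch.randn(1, N, D)        # encoded once per scene
attn = nn.MultiheadAttention(D, 8, batch_first=True)

def segment(click_queries):
    """click_queries: (1, C, D), one query per user click (positive or
    negative, per object). Returns per-point logits for each click's object."""
    refined, _ = attn(click_queries, scene_feats, scene_feats)  # click attention
    return refined @ scene_feats.transpose(1, 2)                # (1, C, N) logits

clicks = torch.randn(1, 3, D)             # e.g. clicks on two objects + background
logits = segment(clicks)                  # cheap to rerun after each new click
```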

ICRA 2024 · Conference Paper

ICGNet: A Unified Approach for Instance-Centric Grasping

  • René Zurbrügg
  • Yifan Liu 0001
  • Francis Engelmann
  • Suryansh Kumar 0001
  • Marco Hutter 0001
  • Vaishakh Patil
  • Fisher Yu 0001

Accurate grasping is the key to several robotic tasks, including assembly and household robotics. Executing a successful grasp in a cluttered environment requires multiple levels of scene understanding: First, the robot needs to analyze the geometric properties of individual objects to find feasible grasps. These grasps need to be compliant with the local object geometry. Second, for each proposed grasp, the robot needs to reason about interactions with other objects in the scene. Finally, the robot must compute a collision-free grasp trajectory while taking into account the geometry of the target object. Most grasp detection algorithms directly predict grasp poses in a monolithic fashion, which does not capture the composability of the environment. In this paper, we introduce an end-to-end architecture for object-centric grasping. The method takes point cloud data from a single arbitrary viewing direction as input and generates an instance-centric representation for each partially observed object in the scene. This representation is further used for object reconstruction and grasp detection in cluttered table-top scenes. We show the effectiveness of the proposed method by extensively evaluating it against state-of-the-art methods on synthetic datasets, indicating superior performance for grasping and reconstruction. Additionally, we demonstrate real-world applicability by decluttering scenes with varying numbers of objects. Videos and code are available at icgraspnet.github.io.

ICLR 2024 · Conference Paper

OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

  • Francis Engelmann
  • Fabian Manhardt
  • Michael Niemeyer
  • Keisuke Tateno
  • Federico Tombari

Large visual-language models (VLMs), like CLIP, enable open-set image segmentation, i.e., segmenting arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, where models can only segment classes from a pre-defined training set. More recently, the first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of visual-language models. Indeed, point clouds and 3D meshes typically have a lower resolution than images, and the reconstructed 3D scene geometry might not project well onto the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF, which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF; however, our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. Our OpenNeRF further leverages NeRF's ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.
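As a rough illustration of "directly encoding VLM features within the NeRF", the sketch below adds a feature head to a toy NeRF-style MLP next to the usual density and color heads. This is a minimal sketch under our own assumptions (no positional encoding, toy sizes), not the released model.

```python
import torch
import torch.nn as nn

class FeatureFieldMLP(nn.Module):
    """Toy NeRF-style field that also regresses a pixel-aligned VLM feature."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.density = nn.Linear(hidden, 1)          # volume density
        self.color = nn.Linear(hidden, 3)            # RGB
        self.vlm_feat = nn.Linear(hidden, feat_dim)  # distilled open-set feature

    def forward(self, xyz):
        h = self.trunk(xyz)
        return self.density(h), self.color(h), self.vlm_feat(h)

# Per-sample outputs for a batch of 3D positions; features would be
# supervised against pixel-wise VLM features and volume-rendered like color.
sigma, rgb, feat = FeatureFieldMLP()(torch.rand(4096, 3))
```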

ICRA 2023 · Conference Paper

Mask3D: Mask Transformer for 3D Semantic Instance Segmentation

  • Jonas Schult
  • Francis Engelmann
  • Alexander Hermans
  • Or Litany
  • Siyu Tang 0001
  • Bastian Leibe

Modern 3D semantic instance segmentation approaches predominantly rely on specialized voting mechanisms followed by carefully designed geometric clustering techniques. Building on the successes of recent Transformer-based methods for object detection and image segmentation, we propose the first Transformer-based approach for 3D semantic instance segmentation. We show that we can leverage generic Transformer building blocks to directly predict instance masks from 3D point clouds. In our model, called Mask3D, each object instance is represented as an instance query. Using Transformer decoders, the instance queries are learned by iteratively attending to point cloud features at multiple scales. Combined with point features, the instance queries directly yield all instance masks in parallel. Mask3D has several advantages over current state-of-the-art approaches: it neither relies on (1) voting schemes, which require hand-selected geometric properties (such as centers), nor (2) geometric grouping mechanisms requiring manually tuned hyperparameters (e.g., radii), and (3) it enables a loss that directly optimizes instance masks. Mask3D sets a new state-of-the-art on ScanNet test (+6.2 mAP), S3DIS 6-fold (+10.1 mAP), STPLS3D (+11.2 mAP), and ScanNet200 test (+12.4 mAP).
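The query-based mask readout the abstract describes can be illustrated in a few lines: instance queries are refined by a Transformer decoder attending to point features, and a dot product between refined queries and per-point features yields all masks in parallel. This is a minimal sketch with toy sizes and a stock PyTorch decoder layer, not the Mask3D architecture itself.

```python
import torch
import torch.nn as nn

D, Q, N = 128, 100, 2048            # feature dim, instance queries, points (toy)
queries = nn.Parameter(torch.randn(1, Q, D))     # learned instance queries
decoder = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)

point_feats = torch.randn(1, N, D)               # per-point backbone features
refined = decoder(queries, point_feats)          # queries attend to the scene
mask_logits = refined @ point_feats.transpose(1, 2)   # (1, Q, N) mask logits
masks = mask_logits.sigmoid() > 0.5              # all instance masks in parallel
```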

NeurIPS 2023 · Conference Paper

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

  • Ayca Takmaz
  • Elisabetta Fedele
  • Robert Sumner
  • Marc Pollefeys
  • Federico Tombari
  • Francis Engelmann

We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D’s ability to segment object properties based on free-form queries describing geometry, affordances, and materials.
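A minimal sketch of the per-mask aggregation step the abstract describes: embeddings of a mask's image crops are fused across views, and the fused feature is compared against a text query embedding. Random vectors stand in for CLIP image/text embeddings, and the visibility weighting is our own simplification.

```python
import numpy as np

def aggregate_mask_features(view_embeds, visibility):
    """Visibility-weighted mean of per-view embeddings for one 3D instance mask."""
    w = visibility / visibility.sum()
    return (view_embeds * w[:, None]).sum(axis=0)

V, D = 5, 512                                 # views, embedding dim (illustrative)
crops = np.random.randn(V, D)                 # stand-ins for CLIP image embeddings
vis = np.array([0.9, 0.7, 0.0, 0.4, 0.8])     # per-view mask visibility
mask_feat = aggregate_mask_features(crops, vis)

query = np.random.randn(D)                    # stand-in for a CLIP text embedding
cos = mask_feat @ query / (np.linalg.norm(mask_feat) * np.linalg.norm(query))
```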

ICRA 2020 · Conference Paper

Dilated Point Convolutions: On the Receptive Field Size of Point Convolutions on 3D Point Clouds

  • Francis Engelmann
  • Theodora Kontogianni
  • Bastian Leibe

In this work, we propose Dilated Point Convolutions (DPC). In a thorough ablation study, we show that the receptive field size is directly related to the performance of 3D point cloud processing tasks, including semantic segmentation and object classification. Point convolutions are widely used to efficiently process 3D data representations such as point clouds or graphs. However, we observe that the receptive field size of recent point convolutional networks is inherently limited. Our dilated point convolutions alleviate this issue by significantly increasing the receptive field size of point convolutions. Importantly, our dilation mechanism can easily be integrated into most existing point convolutional networks. To evaluate the resulting network architectures, we visualize the receptive field and report competitive scores on popular point cloud benchmarks.
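The dilation mechanism lends itself to a short sketch: instead of a point's k nearest neighbors, select every dilation-th neighbor among the k·dilation nearest, which widens the receptive field at roughly the same cost per convolution. The NumPy version below is a minimal sketch of that neighbor selection, not the paper's implementation.

```python
import numpy as np

def dilated_knn(points, k=16, dilation=4):
    """For each point, keep every `dilation`-th of its k*dilation nearest
    neighbors, yielding k dilated neighbor indices per point."""
    # Pairwise squared distances; O(N^2), fine for a small demo cloud.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)           # neighbors sorted by distance
    return order[:, :k * dilation:dilation]  # dilated index set, shape (N, k)

pts = np.random.rand(1024, 3).astype(np.float32)
neighbors = dilated_knn(pts)                 # (1024, 16) neighbor indices
```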

IROS 2017 · Conference Paper

Keyframe-based visual-inertial online SLAM with relocalization

  • Anton Kasyanov
  • Francis Engelmann
  • Jörg Stückler
  • Bastian Leibe

Complementing images with inertial measurements has become one of the most popular approaches to achieve highly accurate and robust real-time camera pose tracking. In this paper, we present a keyframe-based approach to visual-inertial simultaneous localization and mapping (SLAM) for monocular and stereo cameras. Our visual-inertial SLAM system is based on a real-time capable visual-inertial odometry method that provides locally consistent trajectory and map estimates. We achieve global consistency in the estimate through online loop-closing and non-linear optimization. Furthermore, our system supports relocalization in a map that has been previously obtained and allows for continued SLAM operation. We evaluate our approach in terms of accuracy, relocalization capability and run-time efficiency on public indoor benchmark datasets and on newly recorded outdoor sequences. We demonstrate state-of-the-art performance of our system compared to a visual-inertial odometry method and baseline visual SLAM approaches in recovering the trajectory of the camera.

ICRA 2016 · Conference Paper

Multi-scale object candidates for generic object tracking in street scenes

  • Aljosa Osep
  • Alexander Hermans
  • Francis Engelmann
  • Dirk Klostermann
  • Markus Mathias
  • Bastian Leibe

Most vision-based systems for object tracking in urban environments focus on a limited number of important object categories, such as cars or pedestrians, for which powerful detectors are available. However, practical driving scenarios contain many additional objects of interest, for which suitable detectors either do not yet exist or would be cumbersome to obtain. In this paper we propose a more general tracking approach which does not follow the often-used tracking-by-detection principle. Instead, we investigate how far we can get by tracking unknown, generic objects in challenging street scenes. As such, we do not restrict ourselves to only tracking the most common categories, but are able to handle a large variety of static and moving objects. We evaluate our approach on the KITTI dataset and show competitive results for the annotated classes, even though we are not restricted to them.