Arrow Research search

Author name cluster

Dieter Fox

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

148 papers
2 author rows

Possible papers

148

IROS Conference 2025 Conference Paper

ACGD: Visual Multitask Policy Learning with Asymmetric Critic Guided Distillation

  • Krishnan Srinivasan
  • Jie Xu 0028
  • Henry Ang
  • Eric Heiden
  • Dieter Fox
  • Jeannette Bohg
  • Animesh Garg

We present Asymmetric Critic Guided Distillation (ACGD), a framework for learning multi-task dexterous manipulation policies that can manipulate articulated objects using images as input. ACGD is a scalable student-teacher distillation approach that uses behavior cloning to distill multiple expert policies into a single vision-based, multi-task student policy for dexterous manipulation. The expert policies are trained with traditional RL techniques with access to privileged state information about both the robot and the manipulated object, while the distilled student policy operates under realistic sensory constraints, specifically using only camera images and robot proprioception. During distillation, we use an expert critic that provides action labels and value estimates to refine the student's action sampling through a dual IL/RL objective. In the multi-task setting, we achieve this through an aggregate critic over the different single-task experts. Our approach exhibits strong performance compared to a number of state-of-the-art imitation learning (IL) and reinforcement learning (RL) baselines. We evaluate across a variety of multi-task dexterous manipulation benchmarks, including bimanual manipulation, single-hand object articulation tasks, and a tendon-actuated hand, and achieve state-of-the-art performance with a 10-15% improvement over the baseline algorithms. Visit our website for more details.
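The dual IL/RL objective described in the abstract can be illustrated with a minimal sketch. All names here (`acgd_loss`, `aggregate_critic`, the weights, and the use of a squared-error BC term plus a critic-value term) are illustrative assumptions, not the paper's actual loss:

```python
def acgd_loss(student_action, expert_action, critic_value,
              bc_weight=1.0, rl_weight=0.5):
    """Hypothetical dual IL/RL distillation objective (a sketch, not the
    paper's exact formulation): a behavior-cloning term that matches the
    expert's action label, plus an RL-style term that rewards actions the
    expert critic assigns high value."""
    # Behavior cloning: squared error to the expert's action label.
    bc = sum((s - e) ** 2 for s, e in zip(student_action, expert_action))
    # Critic guidance: higher critic value lowers the loss.
    rl = -critic_value
    return bc_weight * bc + rl_weight * rl

def aggregate_critic(task_values):
    """Toy aggregate critic for the multi-task setting: combine per-task
    critic estimates (a plain mean here, as an illustrative choice)."""
    return sum(task_values) / len(task_values)
```

A perfectly imitated action with a high critic value then yields the lowest loss, which is the trade-off the dual objective encodes.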

ICLR Conference 2025 Conference Paper

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

  • Jiafei Duan
  • Wilbert Pumacay
  • Nishanth Kumar
  • Yi Ru Wang
  • Shulin Tian
  • Wentao Yuan
  • Ranjay Krishna
  • Dieter Fox

Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they still struggle with failure recognition, limiting their real-world applicability. We introduce AHA, an open-source VLM designed to detect and reason about failures in robotic manipulation using natural language. By framing failure detection as a free-form reasoning task, AHA identifies failures and provides detailed, adaptable explanations across different robots, tasks, and environments. We fine-tuned AHA using FailGen, a scalable framework that generates the first large-scale dataset of robotic failure trajectories, the AHA dataset. FailGen achieves this by procedurally perturbing successful demonstrations from simulation. Despite being trained solely on the AHA dataset, AHA generalizes effectively to real-world failure datasets, robotic systems, and unseen tasks. It surpasses the second-best model (GPT-4o in-context learning) by 10.3% and exceeds the average performance of six compared models including five state-of-the-art VLMs by 35.3% across multiple metrics and datasets. We integrate AHA into three manipulation frameworks that utilize LLMs/VLMs for reinforcement learning, task and motion planning, and zero-shot trajectory generation. AHA’s failure feedback enhances these policies' performances by refining dense reward functions, optimizing task planning, and improving sub-task verification, boosting task success rates by an average of 21.4% across all three tasks compared to GPT-4 models. Project page: https://aha-vlm.github.io

ICRA Conference 2025 Conference Paper

Guiding Long-Horizon Task and Motion Planning with Vision Language Models

  • Zhutian Yang
  • Caelan Reed Garrett
  • Dieter Fox
  • Tomás Lozano-Pérez
  • Leslie Pack Kaelbling

Vision-Language Models (VLMs) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, many prerequisite steps, such as opening drawers to access objects, are often omitted from their plans. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but they do not scale to everyday problems that require common-sense knowledge and involve large state spaces comprising many variables. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate both semantically meaningful and horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, both in terms of success rates (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%). See the project site https://zt-yang.github.io/vlm-tamp-robot/ for more information.
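The hierarchical loop the abstract describes (VLM proposes subgoals, TAMP refines them, replan on failure) can be sketched as follows. `propose_subgoals` and `refine` are illustrative stand-ins for the VLM call and the TAMP solver; the retry policy is an assumption:

```python
def vlm_tamp(goal, propose_subgoals, refine, max_retries=3):
    """Sketch of a VLM-guided TAMP loop. `propose_subgoals(goal)` stands in
    for a VLM query returning intermediate subgoals; `refine(subgoal)`
    stands in for the TAMP solver and returns a motion plan, or None when
    the subgoal cannot be refined."""
    for _ in range(max_retries):
        subgoals = propose_subgoals(goal)  # ask the VLM for subgoals
        plan, feasible = [], True
        for sg in subgoals:
            motion = refine(sg)            # TAMP refinement of one subgoal
            if motion is None:             # infeasible -> query the VLM again
                feasible = False
                break
            plan.append((sg, motion))
        if feasible:
            return plan
    return None  # no feasible plan within the retry budget
```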

ICLR Conference 2025 Conference Paper

HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation

  • Yi Li 0038
  • Yuquan Deng
  • Jesse Zhang
  • Joel Jang
  • Marius Memmel
  • Caelan Reed Garrett
  • Fabio Ramos 0001
  • Dieter Fox

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is the lack of robotic data, which are typically obtained through expensive on-robot operation. A promising remedy is to leverage cheaper, *off-domain* data such as action-free videos, hand-drawn sketches, or simulation data. In this work, we posit that *hierarchical* vision-language-action (VLA) models can be more effective in utilizing off-domain data than standard monolithic VLA models that directly finetune vision-language models (VLMs) to predict actions. In particular, we study a class of hierarchical VLA models, where the high-level VLM is finetuned to produce a coarse 2D path indicating the desired robot end-effector trajectory given an RGB image and a task description. The intermediate 2D path prediction then serves as guidance for the low-level, 3D-aware control policy capable of precise manipulation. Doing so relieves the high-level VLM of fine-grained action prediction, while reducing the low-level policy's burden on complex task-level reasoning. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios, including differences in embodiment, dynamics, visual appearance, and task semantics. In the real-robot experiments, we observe an average of 20% improvement in success rate across seven different axes of generalization over OpenVLA, representing a 50% relative gain. Visual results are provided at: [https://hamster-robot.github.io/](https://hamster-robot.github.io/)

ICRA Conference 2025 Conference Paper

Inference-Time Policy Steering Through Human Interactions

  • Yanwei Wang
  • Lirui Wang
  • Yilun Du
  • Balakumar Sundaralingam
  • Xuning Yang
  • Yu-Wei Chao
  • Claudia Pérez-D'Arpino
  • Dieter Fox

Generative policies trained with human demonstrations can autonomously accomplish multimodal, long-horizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, leading to constraint violations or execution failures. To better align policy output with human intent without inducing out-of-distribution errors, we propose an Inference-Time Policy Steering (ITPS) framework that leverages human interactions to bias the generative sampling process, rather than finetuning the policy on interaction data. We evaluate ITPS across three simulated and real-world benchmarks, testing three forms of human interaction and associated alignment distance metrics. Among six sampling strategies, our proposed stochastic sampling with diffusion policy achieves the best trade-off between alignment and distribution shift. Videos are available at https://yanweiw.github.io/itps/.
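The core idea of biasing the sampling process rather than finetuning can be sketched as guidance injected into each denoising step. `denoise_step` and `align_grad` are illustrative placeholders for the pre-trained policy's update and the gradient of an alignment distance to the human's input; the specific update rule is an assumption, not the paper's sampler:

```python
def steered_sample(denoise_step, align_grad, x, steps=10, guidance=0.5):
    """Sketch of inference-time steering for a diffusion-style policy:
    after each denoising update from the frozen policy, nudge the sample
    down the gradient of an alignment distance to the human's input
    (e.g. a pointed sub-goal). No policy weights are changed."""
    for t in reversed(range(steps)):
        x = denoise_step(x, t)  # the pre-trained policy's own update
        g = align_grad(x)       # gradient of the alignment distance
        x = [xi - guidance * gi for xi, gi in zip(x, g)]
    return x
```

With an identity denoiser and a quadratic alignment distance centered on the human's target, the sample converges toward that target, illustrating the alignment-versus-distribution-shift trade-off the guidance weight controls.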

ICLR Conference 2025 Conference Paper

Latent Action Pretraining from Videos

  • Seonghyeon Ye
  • Joel Jang
  • Byeongguk Jeon
  • Se June Joo
  • Jianwei Yang
  • Baolin Peng
  • Ajay Mandlekar
  • Reuben Tan

We introduce Latent Action Pretraining for general Action models (LAPA), the first unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging a VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation models.
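The quantization step at the heart of the pipeline (mapping a continuous inter-frame latent change to a discrete "latent action") can be sketched with a nearest-codebook lookup. The function name, the toy codebook, and the squared-distance metric are illustrative; the actual model learns both the encoder and the codebook end-to-end:

```python
def quantize_latent_action(delta, codebook):
    """Sketch of the VQ step in a VQ-VAE-style action quantizer: map a
    continuous latent delta between two frames to the index of the nearest
    codebook entry, yielding a discrete latent action."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(delta, codebook[i]))
    return idx, codebook[idx]
```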

ICRA Conference 2025 Conference Paper

MatchMaker: Automated Asset Generation for Robotic Assembly

  • Yian Wang 0001
  • Bingjie Tang
  • Chuang Gan 0001
  • Dieter Fox
  • Kaichun Mo
  • Yashraj Narang
  • Iretiayo Akinola

Robotic assembly remains a significant challenge due to complexities in visual perception, functional grasping, contact-rich manipulation, and performing high-precision tasks. Simulation-based learning and sim-to-real transfer have led to recent success in solving assembly tasks in the presence of object pose variation, perception noise, and control error; however, the development of a generalist (i.e., multi-task) agent for a broad range of assembly tasks has been limited by the need to manually curate assembly assets, which greatly constrains the number and diversity of assembly problems that can be used for policy learning. Inspired by the recent success of using generative AI to scale up robot learning, we propose MatchMaker, a pipeline to automatically generate diverse, simulation-compatible assembly asset pairs to facilitate learning assembly skills. Specifically, MatchMaker can 1) take a simulation-incompatible, interpenetrating asset pair as input, and automatically convert it into a simulation-compatible, interpenetration-free pair, 2) take an arbitrary single asset as input, and generate a geometrically-mating asset to create an asset pair, 3) automatically erode contact surfaces from (1) or (2) according to a user-specified clearance parameter to generate realistic parts. We demonstrate that data generated by MatchMaker outperforms previous work in terms of diversity and effectiveness for downstream assembly skill learning. Project page: https://wangyian-me.github.io/MatchMaker/.

IROS Conference 2025 Conference Paper

OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

  • Soofiyan Atar
  • Yi Li 0038
  • Markus Grotz
  • Michael Wolf
  • Dieter Fox
  • Joshua R. Smith 0001

In warehouse environments, robots require robust picking capabilities to manage a wide variety of objects. Effective deployment demands minimal hardware, strong generalization to new products, and resilience in diverse settings. Current methods often rely on depth sensors for structural information, which suffer from high costs, complex setups, and technical limitations. Inspired by recent advancements in computer vision, we propose an innovative approach that leverages foundation models to enhance suction grasping using only RGB images. Trained solely on a synthetic dataset, our method generalizes its grasp prediction capabilities to real-world robots and a diverse range of novel objects not included in the training set. Our network achieves an 81.9% success rate in real-world applications. The project website with code and data will be available at http://optigrasp.github.io.

NeurIPS Conference 2025 Conference Paper

RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills

  • Chunru Lin
  • Haotian Yuan
  • Yian Wang
  • Xiaowen Qiu
  • Tsun-Hsuan Johnson Wang
  • Minghao Guo
  • Bohan Wang
  • Yashraj Narang

Endowing robots with tool design abilities is critical for enabling them to solve complex manipulation tasks that would otherwise be intractable. While recent generative frameworks can automatically synthesize task settings—such as 3D scenes and reward functions—they have not yet addressed the challenge of tool-use scenarios. Simply retrieving human-designed tools might not be ideal since many tools (e.g., a rolling pin) are difficult for robotic manipulators to handle. Furthermore, existing tool design approaches either rely on predefined templates with limited parameter tuning or apply generic 3D generation methods that are not optimized for tool creation. To address these limitations, we propose RobotSmith, an automated pipeline that leverages the implicit physical knowledge embedded in vision-language models (VLMs) alongside the more accurate physics provided by physics simulations to design and use tools for robotic manipulation. Our system (1) iteratively proposes tool designs using collaborative VLM agents, (2) generates low-level robot trajectories for tool use, and (3) jointly optimizes tool geometry and usage for task performance. We evaluate our approach across a wide range of manipulation tasks involving rigid, deformable, and fluid objects. Experiments show that our method consistently outperforms strong baselines in both task success rate and overall performance. Notably, our approach achieves a 50.0% average success rate, significantly surpassing other baselines such as 3D generation (21.4%) and tool retrieval (11.1%). Finally, we deploy our system in real-world settings, demonstrating that the generated tools and their usage plans transfer effectively to physical execution, validating the practicality and generalization capabilities of our approach.

ICML Conference 2025 Conference Paper

SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation

  • Haoquan Fang
  • Markus Grotz
  • Wilbert Pumacay
  • Yi Ru Wang
  • Dieter Fox
  • Ranjay Krishna
  • Jiafei Duan

Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalization to complex environmental variations and addressing memory-dependent tasks. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from a large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance gap under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves an average success rate of 94.3% on memory-based tasks in MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-based robotic systems. Project page: sam2act.github.io.

ICLR Conference 2025 Conference Paper

SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks

  • Yijie Guo
  • Bingjie Tang
  • Iretiayo Akinola
  • Dieter Fox
  • Abhishek Gupta 0004
  • Yashraj Narang

Enabling robots to learn novel tasks in a data-efficient manner is a long-standing challenge. Common strategies involve carefully leveraging prior experiences, especially transition data collected on related tasks. Although much progress has been made for general pick-and-place manipulation, far fewer studies have investigated contact-rich assembly tasks, where precise control is essential. We introduce SRSA (Skill Retrieval and Skill Adaptation), a novel framework designed to address this problem by utilizing a pre-existing skill library containing policies for diverse assembly tasks. The challenge lies in identifying which skill from the library is most relevant for fine-tuning on a new task. Our key hypothesis is that skills showing higher zero-shot success rates on a new task are better suited for rapid and effective fine-tuning on that task. To this end, we propose to predict the transfer success for all skills in the skill library on a novel task, and then use this prediction to guide the skill retrieval process. We establish a framework that jointly captures features of object geometry, physical dynamics, and expert actions to represent the tasks, allowing us to efficiently learn the transfer success predictor. Extensive experiments demonstrate that SRSA significantly outperforms the leading baseline. When retrieving and fine-tuning skills on unseen tasks, SRSA achieves a 19% relative improvement in success rate, exhibits 2.6x lower standard deviation across random seeds, and requires 2.4x fewer transition samples to reach a satisfactory success rate, compared to the baseline. In a continual learning setup, SRSA efficiently learns policies for new tasks and incorporates them into the skill library, enhancing future policy learning. Furthermore, policies trained with SRSA in simulation achieve a 90% mean success rate when deployed in the real world. Please visit our project webpage https://srsa2024.github.io/.
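The retrieval step described above reduces to scoring every library skill with a transfer-success predictor and keeping the argmax. In this sketch, `predict_success` is an illustrative stand-in for the learned predictor over task features:

```python
def retrieve_skill(task_feat, skill_library, predict_success):
    """Sketch of SRSA-style skill retrieval: score each skill in the
    library by its predicted zero-shot transfer success on the new task,
    then return the highest-scoring skill as the fine-tuning starting
    point. `predict_success` stands in for the learned predictor."""
    scored = [(predict_success(task_feat, skill), skill) for skill in skill_library]
    best_score, best_skill = max(scored, key=lambda pair: pair[0])
    return best_skill, best_score
```

With a toy predictor that scores skills by similarity to the task, the most similar skill is retrieved, matching the paper's hypothesis that higher predicted zero-shot success indicates a better fine-tuning candidate.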

ICRA Conference 2025 Conference Paper

TWIN: Two-handed Intelligent Benchmark for Bimanual Manipulation

  • Markus Grotz
  • Mohit Shridhar
  • Yu-Wei Chao
  • Tamim Asfour
  • Dieter Fox

Bimanual manipulation is challenging due to precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by presenting a benchmark for bimanual manipulation. A key functionality is the ability to autonomously generate training data without the necessity of human demonstrations to the robot. We open-source our code and benchmark, which comprises 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To initiate the benchmark, we extended multiple state-of-the-art techniques to the domain of bimanual manipulation. The project website with code is available at: http://bimanual.github.io.

ICLR Conference 2024 Conference Paper

ASID: Active Exploration for System Identification in Robotic Manipulation

  • Marius Memmel
  • Andrew Wagenmaker
  • Chuning Zhu
  • Dieter Fox
  • Abhishek Gupta 0004

Model-free methods such as reinforcement learning can learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer.
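The simulation-refinement step can be illustrated as fitting a physical parameter so that simulated rollouts match the small amount of real data. The grid search here is an illustrative simplification; the paper's contribution also includes designing the exploration policies that make the real data informative:

```python
def identify_parameter(real_traj, simulate, candidates):
    """Sketch of system identification from a small real dataset: pick the
    physical parameter (e.g. a mass) whose simulated rollout best matches
    the observed real trajectory, by squared error over a candidate grid."""
    def sim_error(theta):
        sim_traj = simulate(theta)
        return sum((s - r) ** 2 for s, r in zip(sim_traj, real_traj))
    return min(candidates, key=sim_error)
```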

IROS Conference 2024 Conference Paper

DiMSam: Diffusion Models as Samplers for Task and Motion Planning under Partial Observability

  • Xiaolin Fang 0002
  • Caelan Reed Garrett
  • Clemens Eppner
  • Tomás Lozano-Pérez
  • Leslie Pack Kaelbling
  • Dieter Fox

Generative models, such as diffusion models, excel at capturing high-dimensional distributions with diverse input modalities, e.g., robot trajectories, but are less effective at multi-step constraint reasoning. Task and Motion Planning (TAMP) approaches are suited for planning multi-step autonomous robot manipulation. However, it can be difficult to apply them to domains where the environment and its dynamics are not fully known. We propose to overcome these limitations by composing diffusion models using a TAMP system. We use the learned components for constraints and samplers that are difficult to engineer in the planning model, and use a TAMP solver to search for the task plan with constraint-satisfying action parameter values. To tractably make predictions for unseen objects in the environment, we define the learned samplers and TAMP operators on a learned latent embedding of changing object states. We evaluate our approach in a simulated articulated object manipulation domain and show how the combination of classical TAMP, generative modeling, and latent embedding enables multi-step constraint-based reasoning. We also apply the learned sampler in the real world. Website: https://sites.google.com/view/dimsam-tamp.

IROS Conference 2024 Conference Paper

Fast Explicit-Input Assistance for Teleoperation in Clutter

  • Nick Walker 0001
  • Xuning Yang
  • Animesh Garg
  • Maya Cakmak
  • Dieter Fox
  • Claudia Pérez-D'Arpino

The performance of prediction-based assistance for robot teleoperation degrades in unseen or goal-rich environments due to incorrect or quickly-changing intent inferences. Poor predictions can confuse operators or cause them to change their control input to implicitly signal their goal. We present a new assistance interface for robotic manipulation where an operator can explicitly communicate a manipulation goal by pointing the end-effector. The pointing target specifies a region for local pose generation and optimization, providing interactive control over grasp and placement pose candidates. We evaluate this explicit pointing interface against an implicit inference-based assistance scheme and an unassisted control condition in a within-subjects user study (N=20), where participants teleoperate a simulated robot to complete a multi-step singulation and stacking task in cluttered environments. We find that operators prefer the explicit interface, experience fewer pick failures and report lower cognitive workload. Our code is available at: github.com/NVlabs/fast-explicit-teleop.

IROS Conference 2024 Conference Paper

IntervenGen: Interventional Data Generation for Robust and Data-Efficient Robot Imitation Learning

  • Ryan Hoque
  • Ajay Mandlekar
  • Caelan Reed Garrett
  • Ken Goldberg
  • Dieter Fox

Imitation learning is a promising paradigm for training robot control policies, but these policies can suffer from distribution shift, where the conditions at evaluation time differ from those in the training data. A popular approach for increasing policy robustness to distribution shift is interactive imitation learning (i.e., DAgger and variants), where a human operator provides corrective interventions during policy rollouts. However, collecting a sufficient amount of interventions to cover the distribution of policy mistakes can be burdensome for human operators. We propose IntervenGen (I-Gen), a novel data augmentation system for robot control that autonomously produces a large set of corrective interventions with rich coverage of the state space from a small number of human interventions. We apply I-Gen to 4 simulated environments and 1 physical environment with object pose estimation error and show that it can increase policy robustness by up to 39× with only 10 human interventions. Videos and more results are available at https://sites.google.com/view/intervengen2024.

ICRA Conference 2023 Conference Paper

CabiNet: Scaling Neural Collision Detection for Object Rearrangement with Procedural Scene Generation

  • Adithyavairavan Murali
  • Arsalan Mousavian
  • Clemens Eppner
  • Adam Fishman
  • Dieter Fox

We address the important problem of generalizing robotic rearrangement to clutter without any explicit object models. We first generate over 650K cluttered scenes, orders of magnitude more than prior work, in diverse everyday environments, such as cabinets and shelves. We render synthetic partial point clouds from this data and use them to train our CabiNet model architecture. CabiNet is a collision model that accepts object and scene point clouds, captured from a single-view depth observation, and predicts collisions for SE(3) object poses in the scene. Our representation has a fast inference speed of 7μs/query with nearly 20% higher performance than baseline approaches in challenging environments. We use this collision model in conjunction with a Model Predictive Path Integral (MPPI) planner to generate collision-free trajectories for picking and placing in clutter. CabiNet also predicts waypoints, computed from the scene's signed distance field (SDF), that allow the robot to navigate tight spaces during rearrangement. This improves rearrangement performance by nearly 35% compared to baselines. We systematically evaluate our approach, procedurally generate simulated experiments, and demonstrate that our approach directly transfers to the real world, despite training exclusively in simulation. Supplementary material and videos of robot experiments in completely unknown scenes are available at: cabinet-object-rearrangement.github.io.
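How a learned collision model slots into planning can be sketched as a filter over candidate object poses. `collision_score` is an illustrative placeholder for the trained network's predicted collision probability; the threshold value is an assumption:

```python
def collision_free_poses(candidate_poses, collision_score, threshold=0.5):
    """Sketch of using a CabiNet-style collision model inside a planner:
    keep only candidate SE(3) object poses whose predicted collision
    probability falls below a threshold, so the downstream planner (e.g.
    MPPI) only considers poses the model deems collision-free."""
    return [pose for pose in candidate_poses if collision_score(pose) < threshold]
```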

IROS Conference 2023 Conference Paper

Constrained Generative Sampling of 6-DoF Grasps

  • Jens Lundell
  • Francesco Verdoja
  • Tran Nguyen Le
  • Arsalan Mousavian
  • Dieter Fox
  • Ville Kyrki

Most state-of-the-art data-driven grasp sampling methods propose stable and collision-free grasps uniformly on the target object. For bin-picking, executing any of those reachable grasps is sufficient. However, for completing specific tasks, such as squeezing out liquid from a bottle, we want the grasp to be on a specific part of the object's body while avoiding other locations, such as the cap. This work presents a generative grasp sampling network, VCGS, capable of constrained 6-degree-of-freedom (DoF) grasp sampling. In addition, we also curate a new dataset designed to train and evaluate methods for constrained grasping. The new dataset, called CONG, consists of over 14 million training samples of synthetically rendered point clouds and grasps at random target areas on 2889 objects. VCGS is benchmarked against GraspNet, a state-of-the-art unconstrained grasp sampler, in simulation and on a real robot. The results demonstrate that VCGS achieves a 10-15% higher grasp success rate than the baseline while being 2–3 times as sample-efficient. Supplementary material is available on our project website.

ICRA Conference 2023 Conference Paper

CuRobo: Parallelized Collision-Free Robot Motion Generation

  • Balakumar Sundaralingam
  • Siva Kumar Sastry Hari
  • Adam Fishman
  • Caelan Reed Garrett
  • Karl Van Wyk
  • Valts Blukis
  • Alexander Millane
  • Helen Oleynikova

This paper explores the problem of collision-free motion generation for manipulators by formulating it as a global motion optimization problem. We develop a parallel optimization technique to solve this problem and demonstrate its effectiveness on massively parallel GPUs. We show that combining simple optimization techniques with many parallel seeds leads to solving difficult motion generation problems within 53ms on average, 62x faster than SOTA trajectory optimization methods. We achieve SOTA performance by combining L-BFGS step direction estimation with a novel parallel noisy line search scheme and a particle-based optimization solver. To further aid trajectory optimization, we develop a parallel geometric planner that is at least 28x faster than SOTA RRTConnect implementations. We also introduce a collision-free IK solver that can solve over 9000 queries/s. We are releasing our GPU accelerated library CuRobo that contains core components for robot motion generation. Additional details are available at sites.google.com/nvidia.com/curobo.
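The "simple optimizer plus many parallel seeds" idea can be illustrated with a minimal sequential sketch. Plain finite-difference gradient descent stands in for the paper's L-BFGS with noisy line search, and the seeds run in a loop rather than on a GPU; all names and hyperparameters here are illustrative:

```python
def parallel_seeded_optimize(cost, init_seeds, steps=50, lr=0.1):
    """Sketch of many-seed optimization: run a simple local optimizer from
    each seed and keep the lowest-cost result. A GPU implementation would
    evaluate all seeds in parallel instead of looping."""
    def grad(x, eps=1e-4):
        # Forward finite-difference gradient of the cost at x.
        return [(cost(x[:i] + [x[i] + eps] + x[i + 1:]) - cost(x)) / eps
                for i in range(len(x))]
    best = None
    for seed in init_seeds:
        x = list(seed)
        for _ in range(steps):
            g = grad(x)
            x = [xi - lr * gi for xi, gi in zip(x, g)]  # gradient step
        if best is None or cost(x) < cost(best):
            best = x  # keep the best seed's result
    return best
```

Even when some seeds start far from the optimum, the best-of-many selection recovers a good solution, which is why cheap per-seed optimizers can suffice at scale.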

ICRA Conference 2023 Conference Paper

DefGraspNets: Grasp Planning on 3D Fields with Graph Neural Nets

  • Isabella Huang
  • Yashraj Narang
  • Ruzena Bajcsy
  • Fabio Ramos 0001
  • Tucker Hermans
  • Dieter Fox

Robotic grasping of 3D deformable objects is critical for real-world applications such as food handling and robotic surgery. Unlike rigid and articulated objects, 3D deformable objects have infinite degrees of freedom. Fully defining their state requires 3D deformation and stress fields, which are exceptionally difficult to analytically compute or experimentally measure. Thus, evaluating grasp candidates for grasp planning typically requires accurate, but slow 3D finite element method (FEM) simulation. Sampling-based grasp planning is often impractical, as it requires evaluation of a large number of grasp candidates. Gradient-based grasp planning can be more efficient, but requires a differentiable model to synthesize optimal grasps from initial candidates. Differentiable FEM simulators may fill this role, but are typically no faster than standard FEM. In this work, we propose learning a predictive graph neural network (GNN), DefGraspNets, to act as our differentiable model. We train DefGraspNets to predict 3D stress and deformation fields based on FEM-based grasp simulations. DefGraspNets not only runs up to 1500x faster than the FEM simulator, but also enables fast gradient-based grasp optimization over 3D stress and deformation metrics. We design DefGraspNets to align with real-world grasp planning practices and demonstrate generalization across multiple test sets, including real-world experiments.

ICLR Conference 2023 Conference Paper

Impossibly Good Experts and How to Follow Them

  • Aaron Walsman
  • Muru Zhang
  • Sanjiban Choudhury
  • Dieter Fox
  • Ali Farhadi

We consider the sequential decision making problem of learning from an expert that has access to more information than the learner. For many problems this extra information will enable the expert to achieve greater long term reward than any policy without this privileged information access. We call these experts "Impossibly Good" because no learning algorithm will be able to reproduce their behavior. However, in these settings it is reasonable to attempt to recover the best policy possible given the agent's restricted access to information. We provide a set of necessary criteria on the expert that will allow a learner to recover the optimal policy in the reduced information space from the expert's advice alone. We also provide a new approach called Elf Distillation (Explorer Learning from Follower) that can be used in cases where these criteria are not met and environmental rewards must be taken into account. We show that this algorithm performs better than a variety of strong baselines on a challenging suite of Minigrid and Vizdoom environments.

ICRA Conference 2023 Conference Paper

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

  • Ishika Singh
  • Valts Blukis
  • Arsalan Mousavian
  • Ankit Goyal 0001
  • Danfei Xu
  • Jonathan Tremblay
  • Dieter Fox
  • Jesse Thomason

Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state of the art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website at progprompt.github.io

ICRA Conference 2022 Conference Paper

HandoverSim: A Simulation Framework and Benchmark for Human-to-Robot Object Handovers

  • Yu-Wei Chao
  • Chris Paxton 0001
  • Yu Xiang 0001
  • Wei Yang 0019
  • Balakumar Sundaralingam
  • Tao Chen 0046
  • Adithyavairavan Murali
  • Maya Cakmak

We introduce a new simulation benchmark "HandoverSim" for human-to-robot object handovers. To simulate the giver's motion, we leverage a recent motion capture dataset of hand grasping of objects. We create training and evaluation environments for the receiver with standardized protocols and metrics. We analyze the performance of a set of baselines and show a correlation with a real-world evaluation. Code is open sourced at https://handover-sim.github.io.

ICRA Conference 2022 Conference Paper

Model Predictive Control for Fluid Human-to-Robot Handovers

  • Wei Yang 0019
  • Balakumar Sundaralingam
  • Chris Paxton 0001
  • Iretiayo Akinola
  • Yu-Wei Chao
  • Maya Cakmak
  • Dieter Fox

Human-robot handover is a fundamental yet challenging task in human-robot interaction and collaboration. Recently, remarkable progress has been made in human-to-robot handovers of unknown objects by using learning-based grasp generators. However, how to responsively generate smooth motions to take an object from a human is still an open question. Specifically, planning motions that take human comfort into account is not a part of the human-robot handover process in most prior works. In this paper, we propose to generate smooth motions via an efficient model-predictive control (MPC) framework that integrates perception and complex domain-specific constraints into the optimization problem. We introduce a learning-based grasp reachability model to select candidate grasps which maximize the robot's manipulability, giving it more freedom to satisfy these constraints. Finally, we integrate a neural net force/torque classifier that detects contact events from noisy data. We conducted human-to-robot handover experiments on a diverse set of objects with several users (N = 4) and performed a systematic evaluation of each module. The study shows that the users preferred our MPC approach over the baseline system by a large margin.

ICRA Conference 2022 Conference Paper

StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects

  • Weiyu Liu
  • Chris Paxton 0001
  • Tucker Hermans
  • Dieter Fox

Geometric organization of objects into semantically meaningful arrangements pervades the built world. As such, assistive robots operating in warehouses, offices, and homes would greatly benefit from the ability to recognize and rearrange objects into these semantically meaningful structures. To be useful, these robots must contend with previously unseen objects and receive instructions without significant programming. While previous works have examined recognizing pairwise semantic relations and sequential manipulation to change these simple relations, none have shown the ability to arrange objects into complex structures such as circles or table settings. To address this problem, we propose a novel transformer-based neural network, StructFormer, which takes as input a partial-view point cloud of the current object arrangement and a structured language command encoding the desired object configuration. We show through rigorous experiments that StructFormer enables a physical robot to rearrange novel objects into semantically meaningful structures with multi-object relational constraints inferred from the language command.

ICRA Conference 2021 Conference Paper

ACRONYM: A Large-Scale Grasp Dataset Based on Simulation

  • Clemens Eppner
  • Arsalan Mousavian
  • Dieter Fox

We introduce ACRONYM, a dataset for robot grasp planning based on physics simulation. The dataset contains 17.7M parallel-jaw grasps, spanning 8872 objects from 262 different categories, each labeled with the grasp result obtained from a physics simulator. We show the value of this large and diverse dataset by using it to train two state-of-the-art learning-based grasp planning algorithms. Grasp performance improves significantly when compared to the original smaller dataset. Data and tools can be accessed at https://sites.google.com/nvidia.com/graspdataset.

ICRA Conference 2021 Conference Paper

Alternative Paths Planner (APP) for Provably Fixed-time Manipulation Planning in Semi-structured Environments

  • Fahad Islam 0002
  • Chris Paxton 0001
  • Clemens Eppner
  • Bryan Peele
  • Maxim Likhachev
  • Dieter Fox

In many applications, including logistics and manufacturing, robot manipulators operate in semi-structured environments alongside humans or other robots. These environments are largely static, but they may contain some movable obstacles that the robot must avoid. Manipulation tasks in these applications are often highly repetitive, but require fast and reliable motion planning capabilities, often under strict time constraints. Existing preprocessing-based approaches are beneficial when the environments are highly-structured, but their performance degrades in the presence of movable obstacles, since these are not modelled a priori. We propose a novel preprocessing-based method called Alternative Paths Planner (APP) that provides provably fixed-time planning guarantees in semi-structured environments. APP plans a set of alternative paths offline such that, for any configuration of the movable obstacles, at least one of the paths from this set is collision-free. During online execution, a collision-free path can be looked up efficiently within a few microseconds. We evaluate APP on a 7 DoF robot arm in semi-structured domains of varying complexity and demonstrate that APP is several orders of magnitude faster than state-of-the-art motion planners for each domain. We further validate this approach with real-time experiments on a robotic manipulator.

ICRA Conference 2021 Conference Paper

Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes

  • Martin Sundermeyer
  • Arsalan Mousavian
  • Rudolph Triebel
  • Dieter Fox

Grasping unseen objects in unconstrained, cluttered environments is an essential skill for autonomous robotic manipulation. Despite recent progress in full 6-DoF grasp learning, existing approaches often consist of complex sequential pipelines that possess several potential failure points and run-times unsuitable for closed-loop grasping. Therefore, we propose an end-to-end network that efficiently generates a distribution of 6-DoF parallel-jaw grasps directly from a depth recording of a scene. Our novel grasp representation treats 3D points of the recorded point cloud as potential grasp contacts. By rooting the full 6-DoF grasp pose and width in the observed point cloud, we can reduce the dimensionality of our grasp representation to 4-DoF which greatly facilitates the learning process. Our class-agnostic approach is trained on 17 million simulated grasps and generalizes well to real world sensor data. In a robotic grasping study of unseen objects in structured clutter we achieve over 90% success rate, cutting the failure rate in half compared to a recent state-of-the-art method. Video of the real world experiments and code are available at https://research.nvidia.com/publication/2021-03_Contact-GraspNet%3A--Efficient.

ICRA Conference 2021 Conference Paper

Object Rearrangement Using Learned Implicit Collision Functions

  • Michael Danielczuk
  • Arsalan Mousavian
  • Clemens Eppner
  • Dieter Fox

Robotic object rearrangement combines the skills of picking and placing objects. When object models are unavailable, typical collision-checking models may be unable to predict collisions in partial point clouds with occlusions, making generation of collision-free grasping or placement trajectories challenging. We propose a learned collision model that accepts scene and query object point clouds and predicts collisions for 6DOF object poses within the scene. We train the model on a synthetic set of 1 million scene/object point cloud pairs and 2 billion collision queries. We leverage the learned collision model as part of a model predictive path integral (MPPI) policy in a tabletop rearrangement task and show that the policy can plan collision-free grasps and placements for objects unseen in training in both simulated and physical cluttered scenes with a Franka Panda robot. The learned model outperforms both traditional pipelines and learned ablations by 9.8% in accuracy on a dataset of simulated collision queries and is 75x faster than the best-performing baseline. Videos and supplementary material are available at https://research.nvidia.com/publication/2021-03_Object-Rearrangement-Using.

ICRA Conference 2021 Conference Paper

Reactive Human-to-Robot Handovers of Arbitrary Objects

  • Wei Yang 0019
  • Chris Paxton 0001
  • Arsalan Mousavian
  • Yu-Wei Chao
  • Maya Cakmak
  • Dieter Fox

Human-robot object handovers have been an actively studied area of robotics over the past decade; however, very few techniques and systems have addressed the challenge of handing over diverse objects with arbitrary appearance, size, shape, and deformability. In this paper, we present a vision-based system that enables reactive human-to-robot handovers of unknown objects. Our approach combines closed-loop motion planning with real-time, temporally consistent grasp generation to ensure reactivity and motion smoothness. Our system is robust to different object positions and orientations, and can grasp both rigid and non-rigid objects. We demonstrate the generalizability, usability, and robustness of our approach on a novel benchmark set of 26 diverse household objects, a user study with six participants handing over a subset of 15 objects, and a systematic evaluation examining different ways of handing objects.

IROS Conference 2021 Conference Paper

Reactive Long Horizon Task Execution via Visual Skill and Precondition Models

  • Shohin Mukherjee
  • Chris Paxton 0001
  • Arsalan Mousavian
  • Adam Fishman
  • Maxim Likhachev
  • Dieter Fox

Zero-shot execution of unseen robotic tasks is important for allowing robots to perform a wide variety of tasks in human environments, but collecting the amounts of data necessary to train end-to-end policies in the real world is often infeasible. We describe an approach for sim-to-real training that can accomplish unseen robotic tasks using models learned in simulation to ground components of a simple task planner. We learn a library of parameterized skills, along with a set of predicate-based preconditions and termination conditions, entirely in simulation. We explore a block-stacking task because it has a clear structure, where multiple skills must be chained together, but our methods are applicable to a wide range of other problems and domains, and can transfer from simulation to the real world with no fine tuning. The system is able to recognize failures and accomplish long-horizon tasks from perceptual input, which is critical for real-world execution. We evaluate our proposed approach in both simulation and in the real world, showing an increase in success rate from 91.6% to 98% in simulation and from 10% to 80% success rate in the real world as compared with naive baselines. For experiment videos including both real-world and simulation, see: https://www.youtube.com/playlist?list=PL-oD0xHUngeLfQmpngYkGFZarstfPOXqX

ICRA Conference 2021 Conference Paper

Sim-to-Real for Robotic Tactile Sensing via Physics-Based Simulation and Learned Latent Projections

  • Yashraj Narang
  • Balakumar Sundaralingam
  • Miles Macklin
  • Arsalan Mousavian
  • Dieter Fox

Tactile sensing is critical for robotic grasping and manipulation of objects under visual occlusion. However, in contrast to simulations of robot arms and cameras, current simulations of tactile sensors have limited accuracy, speed, and utility. In this work, we develop an efficient 3D finite element method (FEM) model of the SynTouch BioTac sensor using an open-access, GPU-based robotics simulator. Our simulations closely reproduce results from an experimentally-validated model in an industry-standard, CPU-based simulator, but at 75x the speed. We then learn latent representations for simulated BioTac deformations and real-world electrical output through self-supervision, as well as projections between the latent spaces using a small supervised dataset. Using these learned latent projections, we accurately synthesize real-world BioTac electrical output and estimate contact patches, both for unseen contact interactions. This work contributes an efficient, freely-accessible FEM model of the BioTac and comprises one of the first efforts to combine self-supervision, cross-modal transfer, and sim-to-real transfer for tactile sensors.

IROS Conference 2021 Conference Paper

Towards Coordinated Robot Motions: End-to-End Learning of Motion Policies on Transform Trees

  • Muhammad Asif Rana
  • Anqi Li 0001
  • Dieter Fox
  • Sonia Chernova
  • Byron Boots
  • Nathan D. Ratliff

Generating robot motion that fulfills multiple tasks simultaneously is challenging due to the geometric constraints imposed on the robot. In this paper, we propose to solve multi-task problems through learning structured policies from human demonstrations. Our structured policy is inspired by RMPflow, a framework for combining subtask policies on different spaces. The policy structure provides the user an interface for 1) specifying the spaces that are directly relevant to the completion of the tasks, and 2) designing policies for certain tasks that do not need to be learned. We derive an end-to-end learning objective that is suitable for the multi-task problem, emphasizing the distance between generated motions and demonstrations measured on task spaces. Furthermore, the motion generated from the learned policy class is guaranteed to be stable. We validate the effectiveness of our proposed learning framework through qualitative and quantitative evaluations on three robotic tasks on a 7-DOF Rethink Sawyer robot.

ICML Conference 2021 Conference Paper

Value Iteration in Continuous Actions, States and Time

  • Michael Lutter
  • Shie Mannor
  • Jan Peters 0001
  • Dieter Fox
  • Animesh Garg

Classical value iteration approaches are not applicable to environments with continuous states and actions. For such environments the states and actions must be discretized, which leads to an exponential increase in computational complexity. In this paper, we propose continuous fitted value iteration (cFVI). This algorithm enables dynamic programming for continuous states and actions with a known dynamics model. Exploiting the continuous time formulation, the optimal policy can be derived for non-linear control-affine dynamics. This closed-form solution enables the efficient extension of value iteration to continuous environments. We show in non-linear control experiments that the dynamic programming solution obtains the same quantitative performance as deep reinforcement learning methods in simulation but excels when transferred to the physical system. The policy obtained by cFVI is more robust to changes in the dynamics despite using only a deterministic model and without explicitly incorporating robustness in the optimization.

ICRA Conference 2020 Conference Paper

6-DOF Grasping for Target-driven Object Manipulation in Clutter

  • Adithyavairavan Murali
  • Arsalan Mousavian
  • Clemens Eppner
  • Chris Paxton 0001
  • Dieter Fox

Grasping in cluttered environments is a fundamental but challenging robotic skill. It requires both reasoning about unseen object parts and potential collisions with the manipulator. Most existing data-driven approaches avoid this problem by limiting themselves to top-down planar grasps which is insufficient for many real-world scenarios and greatly limits possible grasps. We present a method that plans 6-DOF grasps for any desired object in a cluttered scene from partial point cloud observations. Our method achieves a grasp success of 80.3%, outperforming baseline approaches by 17.6% and clearing 9 cluttered table scenes (which contain 23 unknown objects and 51 picks in total) on a real robotic platform. By using our learned collision checking module, we can even reason about effective grasp sequences to retrieve objects that are not immediately accessible. Supplementary video can be found here.

ICRA Conference 2020 Conference Paper

Camera-to-Robot Pose Estimation from a Single Image

  • Timothy E. Lee
  • Jonathan Tremblay
  • Thang To
  • Jia Cheng
  • Terry Mosier
  • Oliver Kroemer
  • Dieter Fox
  • Stan Birchfield

We present an approach for estimating the pose of an external camera with respect to a robot using a single RGB image of the robot. The image is processed by a deep neural network to detect 2D projections of keypoints (such as joints) associated with the robot. The network is trained entirely on simulated data using domain randomization to bridge the reality gap. Perspective-n-point (PnP) is then used to recover the camera extrinsics, assuming that the camera intrinsics and joint configuration of the robot manipulator are known. Unlike classic hand-eye calibration systems, our method does not require an off-line calibration step. Rather, it is capable of computing the camera extrinsics from a single frame, thus opening the possibility of on-line calibration. We show experimental results for three different robots and camera sensors, demonstrating that our approach is able to achieve accuracy with a single frame that is comparable to that of classic off-line hand-eye calibration using multiple frames. With additional frames from a static pose, accuracy improves even further. Code, datasets, and pretrained models for three widely-used robot manipulators are made available.

NeurIPS Conference 2020 Conference Paper

Causal Discovery in Physical Systems from Videos

  • Yunzhu Li
  • Antonio Torralba
  • Anima Anandkumar
  • Dieter Fox
  • Animesh Garg

Causal discovery is at the core of human cognition. It enables us to reason about the environment and make counterfactual predictions about unseen scenarios that can vastly differ from our previous experiences. We consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure. In particular, our goal is to discover the structural dependencies among environmental and object variables: inferring the type and strength of interactions that have a causal effect on the behavior of the dynamical system. Our model consists of (a) a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, (b) an inference module for determining the graph distribution induced by the detected keypoints, and (c) a dynamics module that can predict the future by conditioning on the inferred graph. We assume access to different configurations and environmental conditions, i.e., data from unknown interventions on the underlying system; thus, we can hope to discover the correct underlying causal graph without explicit interventions. We evaluate our method in a planar multi-body interaction environment and scenarios involving fabrics of different shapes like shirts and pants. Experiments demonstrate that our model can correctly identify the interactions from a short sequence of images and make long-term future predictions. The causal structure assumed by the model also allows it to make counterfactual predictions and extrapolate to systems of unseen interaction graphs or graphs of various sizes.

IROS Conference 2020 Conference Paper

Collaborative Interaction Models for Optimized Human-Robot Teamwork

  • Adam Fishman
  • Chris Paxton 0001
  • Wei Yang 0019
  • Dieter Fox
  • Byron Boots
  • Nathan D. Ratliff

Effective human-robot collaboration requires informed anticipation. The robot must anticipate the human’s actions, but also react quickly and intuitively when its predictions are wrong. The robot must plan its actions to account for the human’s own plan, with the knowledge that the human’s behavior will change based on what the robot actually does. This cyclical game of predicting a human’s future actions and generating a corresponding motion plan is extremely difficult to model using standard techniques. In this work, we describe a novel Model Predictive Control (MPC)-based framework for finding optimal trajectories in a collaborative, multi-agent setting, in which we simultaneously plan for the robot while predicting the actions of its external collaborators. We use human-robot handovers to demonstrate that with a strong model of the collaborator, our framework produces fluid, reactive human-robot interactions in novel, cluttered environments. Our method efficiently generates coordinated trajectories, and achieves a high success rate in handover, even in the presence of significant sensor noise.

ICRA Conference 2020 Conference Paper

DexPilot: Vision-Based Teleoperation of Dexterous Robotic Hand-Arm System

  • Ankur Handa
  • Karl Van Wyk
  • Wei Yang 0019
  • Jacky Liang
  • Yu-Wei Chao
  • Qian Wan
  • Stan Birchfield
  • Nathan D. Ratliff

Teleoperation offers the possibility of imparting robotic systems with sophisticated reasoning skills, intuition, and creativity to perform tasks. However, teleoperation solutions for high degree-of-actuation (DoA), multi-fingered robots are generally cost-prohibitive, while low-cost offerings usually offer reduced degrees of control. Herein, a low-cost, depth-based teleoperation system, DexPilot, was developed that allows for complete control over the full 23 DoA robotic system by merely observing the bare human hand. DexPilot enabled operators to solve a variety of complex manipulation tasks that go beyond simple pick-and-place operations and performance was measured through speed and reliability metrics. DexPilot cost-effectively enables the production of high dimensional, multi-modality, state-action data that can be leveraged in the future to learn sensorimotor policies for challenging manipulation tasks. The videos of the experiments can be found at https://sites.google.com/view/dex-pilot.

ICRA Conference 2020 Conference Paper

Guided Uncertainty-Aware Policy Optimization: Combining Learning and Model-Based Strategies for Sample-Efficient Policy Learning

  • Michelle A. Lee
  • Carlos Florensa
  • Jonathan Tremblay
  • Nathan D. Ratliff
  • Animesh Garg
  • Fabio Ramos 0001
  • Dieter Fox

Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state. On the other hand, reinforcement learning approaches can operate directly from raw sensory inputs with only a reward signal to describe the task, but are extremely sample-inefficient and brittle. In this work, we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline, while requiring minimal interactions with the environment. This is achieved by leveraging uncertainty estimates to divide the space in regions where the given model-based policy is reliable, and regions where it may have flaws or not be well defined. In these uncertain regions, we show that a locally learned policy can be used directly with raw sensory inputs. We test our algorithm, Guided Uncertainty-Aware Policy Optimization (GUAPO), on a real-world robot performing peg insertion. Videos are available at: https://sites.google.com/view/guapo-rl.

IROS Conference 2020 Conference Paper

Human Grasp Classification for Reactive Human-to-Robot Handovers

  • Wei Yang 0019
  • Chris Paxton 0001
  • Maya Cakmak
  • Dieter Fox

Transfer of objects between humans and robots is a critical capability for collaborative robots. Although there has been a recent surge of interest in human-robot handovers, most prior research focuses on robot-to-human handovers. Further, work on the equally critical human-to-robot handovers often assumes humans can place the object in the robot's gripper. In this paper, we propose an approach for human-to-robot handovers in which the robot meets the human halfway, by classifying the human's grasp of the object and quickly planning a trajectory accordingly to take the object from the human's hand according to their intent. To do this, we collect a human grasp dataset which covers typical ways of holding objects with various hand shapes and poses, and learn a deep model on this dataset to classify the hand grasps into one of these categories. We present a planning and execution approach that takes the object from the human hand according to the detected grasp and hand position, and replans as necessary when the handover is interrupted. Through a systematic evaluation, we demonstrate that our system results in more fluent handovers versus two baselines. We also present findings from a user study (N = 9) demonstrating the effectiveness and usability of our approach with naive users in different scenarios. More information can be found at http://wyang.me/handovers.

ICRA Conference 2020 Conference Paper

In-Hand Object Pose Tracking via Contact Feedback and GPU-Accelerated Robotic Simulation

  • Jacky Liang
  • Ankur Handa
  • Karl Van Wyk
  • Viktor Makoviychuk
  • Oliver Kroemer
  • Dieter Fox

Tracking the pose of an object while it is being held and manipulated by a robot hand is difficult for vision-based methods due to significant occlusions. Prior works have explored using contact feedback and particle filters to localize in-hand objects. However, they have mostly focused on the static grasp setting and not when the object is in motion, as doing so requires modeling of complex contact dynamics. In this work, we propose using GPU-accelerated parallel robot simulations and derivative-free, sample-based optimizers to track in-hand object poses with contact feedback during manipulation. We use physics simulation as the forward model for robot-object interactions, and the algorithm jointly optimizes for the state and the parameters of the simulations, so they better match with those of the real world. Our method runs in real-time (30Hz) on a single GPU, and it achieves an average point cloud distance error of 6mm in simulation experiments and 13mm in the real-world ones.

ICRA Conference 2020 Conference Paper

Inferring the Material Properties of Granular Media for Robotic Tasks

  • Carolyn Matl
  • Yashraj Narang
  • Ruzena Bajcsy
  • Fabio Ramos 0001
  • Dieter Fox

Granular media (e.g., cereal grains, plastic resin pellets, and pills) are ubiquitous in robotics-integrated industries, such as agriculture, manufacturing, and pharmaceutical development. This prevalence mandates the accurate and efficient simulation of these materials. This work presents a software and hardware framework that automatically calibrates a fast physics simulator to accurately simulate granular materials by inferring material properties from real-world depth images of granular formations (i.e., piles and rings). Specifically, coefficients of sliding friction, rolling friction, and restitution of grains are estimated from summary statistics of grain formations using likelihood-free Bayesian inference. The calibrated simulator accurately predicts unseen granular formations in both simulation and experiment; furthermore, simulator predictions are shown to generalize to more complex tasks, including using a robot to pour grains into a bowl, as well as to create a desired pattern of piles and rings.

ICRA Conference 2020 Conference Paper

IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data

  • Ajay Mandlekar
  • Fabio Ramos 0001
  • Byron Boots
  • Silvio Savarese
  • Li Fei-Fei 0001
  • Animesh Garg
  • Dieter Fox

Learning from offline task demonstrations is a problem of great interest in robotics. For simple short-horizon manipulation tasks with modest variation in task instances, offline learning from a small set of demonstrations can produce controllers that successfully solve the task. However, leveraging a fixed batch of data can be problematic for larger datasets and longer-horizon tasks with greater variations. The data can exhibit substantial diversity and consist of suboptimal solution approaches. In this paper, we propose Implicit Reinforcement without Interaction at Scale (IRIS), a novel framework for learning from large-scale demonstration datasets. IRIS factorizes the control problem into a goal-conditioned low-level controller that imitates short demonstration sequences and a high-level goal selection mechanism that sets goals for the low-level and selectively combines parts of suboptimal solutions leading to more successful task completions. We evaluate IRIS across three datasets, including the RoboTurk Cans dataset collected by humans via crowdsourcing, and show that performant policies can be learned from purely offline learning. Additional results at https://sites.google.com/stanford.edu/iris/.

ICRA Conference 2020 Conference Paper

Motion Reasoning for Goal-Based Imitation Learning

  • De-An Huang
  • Yu-Wei Chao
  • Chris Paxton 0001
  • Xinke Deng
  • Li Fei-Fei 0001
  • Juan Carlos Niebles
  • Animesh Garg
  • Dieter Fox

We address goal-based imitation learning, where the aim is to output the symbolic goal from a third-person video demonstration. This enables the robot to plan for execution and reproduce the same goal in a completely different environment. The key challenge is that the goal of a video demonstration is often ambiguous at the level of semantic actions. The human demonstrators might unintentionally achieve certain subgoals in the demonstrations with their actions. Our main contribution is to propose a motion reasoning framework that combines task and motion planning to disambiguate the true intention of the demonstrator in the video demonstration. This allows us to recognize the goals that cannot be disambiguated by previous action-based approaches. We evaluate our approach on a new dataset of 96 video demonstrations in a mockup kitchen environment. We show that our motion reasoning plays an important role in recognizing the actual goal of the demonstrator and improves the success rate by over 20%. We further show that by using the automatically inferred goal from the video demonstration, our robot is able to reproduce the same task in a real kitchen environment.

IROS Conference 2020 Conference Paper

Online BayesSim for Combined Simulator Parameter Inference and Policy Improvement

  • Rafael Possas
  • Lucas Barcelos
  • Rafael Oliveira 0001
  • Dieter Fox
  • Fabio Ramos 0001

Recent advancements in Bayesian likelihood-free inference enable a probabilistic treatment for the problem of estimating simulation parameters and their uncertainty given sequences of observations. Domain randomization can be performed much more effectively when a posterior distribution provides the correct uncertainty over parameters in a simulated environment. In this paper, we study the integration of simulation parameter inference with both model-free reinforcement learning and model-based control in a novel sequential algorithm that alternates between learning a better estimation of parameters and improving the controller. This approach exploits the interdependence between the two problems to generate computational efficiencies and improved reliability when a black-box simulator is available. Experimental results suggest that both control strategies have better performance when compared to traditional domain randomization methods.
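
The alternation the abstract describes can be sketched as follows; the `infer_parameters` and `improve_policy` stand-ins and all numbers are hypothetical, illustrating only the interleaving of parameter inference and policy improvement, not BayesSim itself.

```python
def infer_parameters(posterior, rollouts):
    # Shrink the posterior toward the value implied by real-world rollouts.
    mean, std = posterior
    target = sum(rollouts) / len(rollouts)
    return (0.5 * mean + 0.5 * target, 0.7 * std)

def improve_policy(policy_gain, posterior):
    # Nudge the controller toward the current parameter estimate.
    mean, _ = posterior
    return policy_gain + 0.5 * (mean - policy_gain)

true_param = 2.0          # unknown physical parameter of the real system
posterior = (0.0, 1.0)    # prior over the simulator parameter (mean, std)
policy_gain = 0.0

for _ in range(10):       # alternate inference and policy improvement
    rollouts = [true_param] * 5  # idealized noise-free real-world observations
    posterior = infer_parameters(posterior, rollouts)
    policy_gain = improve_policy(policy_gain, posterior)
```

Each pass tightens the parameter posterior and then trains the policy against it, so both estimates converge together instead of randomizing over a fixed, possibly wrong, distribution.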

ICRA Conference 2020 Conference Paper

Online Replanning in Belief Space for Partially Observable Task and Motion Problems

  • Caelan Reed Garrett
  • Chris Paxton 0001
  • Tomás Lozano-Pérez
  • Leslie Pack Kaelbling
  • Dieter Fox

To solve multi-step manipulation tasks in the real world, an autonomous robot must take actions to observe its environment and react to unexpected observations. This may require opening a drawer to observe its contents or moving an object out of the way to examine the space behind it. Upon receiving a new observation, the robot must update its belief about the world and compute a new plan of action. In this work, we present an online planning and execution system for robots faced with these challenges. We perform deterministic cost-sensitive planning in the space of hybrid belief states to select likely-to-succeed observation actions and continuous control actions. After execution and observation, we replan using our new state estimate. We initially enforce that the planner reuses the structure of the unexecuted tail of the last plan. This both improves planning efficiency and ensures that the overall policy does not undo its progress towards achieving the goal. Our approach is able to efficiently solve partially observable problems both in simulation and in a real-world kitchen.

ICRA Conference 2020 Conference Paper

Scaling Local Control to Large-Scale Topological Navigation

  • Xiangyun Meng
  • Nathan D. Ratliff
  • Yu Xiang 0001
  • Dieter Fox

Visual topological navigation has been revitalized recently thanks to the advancement of deep learning that substantially improves robot perception. However, scalability and reliability issues remain challenging due to the complexity and ambiguity of real-world images and the mechanical constraints of real robots. We present an intuitive approach to show that by accurately measuring the capability of a local controller, large-scale visual topological navigation can be achieved while being scalable and robust. Our approach achieves state-of-the-art results in trajectory following and planning in large-scale environments. It also generalizes well to real robots and new environments without retraining or finetuning.

ICRA Conference 2020 Conference Paper

Self-supervised 6D Object Pose Estimation for Robot Manipulation

  • Xinke Deng
  • Yu Xiang 0001
  • Arsalan Mousavian
  • Clemens Eppner
  • Timothy Bretl
  • Dieter Fox

To teach robots skills, it is crucial to obtain data with supervision. Since annotating real world data is time-consuming and expensive, enabling robots to learn in a self-supervised way is important. In this work, we introduce a robot system for self-supervised 6D object pose estimation. Starting from modules trained in simulation, our system is able to label real world images with accurate 6D object poses for self-supervised learning. In addition, the robot interacts with objects in the environment to change the object configuration by grasping or pushing objects. In this way, our system is able to continuously collect data and improve its pose estimation modules. We show that the self-supervised learning improves object segmentation and 6D pose estimation performance, and consequently enables the system to grasp objects more reliably. A video showing the experiments can be found at https://youtu.be/W1Y0Mmh1Gd8.

ICRA Conference 2020 Conference Paper

Transferable Task Execution from Pixels through Deep Planning Domain Learning

  • Kei Kase
  • Chris Paxton 0001
  • Hammad Mazhar
  • Tetsuya Ogata
  • Dieter Fox

While robots can learn models to solve many manipulation tasks from raw visual input, they cannot usually use these models to solve new problems. On the other hand, symbolic planning methods such as STRIPS have long been able to solve new problems given only a domain definition and a symbolic goal, but these approaches often struggle on real-world robotic tasks due to the challenges of grounding these symbols from sensor data in a partially-observable world. We propose Deep Planning Domain Learning (DPDL), an approach that combines the strengths of both methods to learn a hierarchical model. DPDL learns a high-level model which predicts values for a large set of logical predicates that constitute the current symbolic world state, and separately learns a low-level policy which translates symbolic operators into executable actions on the robot. This allows us to perform complex, multistep tasks even when the robot has not been explicitly trained on them. We show our method on manipulation tasks in a photorealistic kitchen scenario.

ICRA Conference 2019 Conference Paper

Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience

  • Yevgen Chebotar
  • Ankur Handa
  • Viktor Makoviychuk
  • Miles Macklin
  • Jan Issac
  • Nathan D. Ratliff
  • Dieter Fox

We consider the problem of transferring policies to the real world by training on a distribution of simulated scenarios. Rather than manually tuning the randomization of simulations, we adapt the simulation parameter distribution using a few real world roll-outs interleaved with policy training. In doing so, we are able to change the distribution of simulations to improve the policy transfer by matching the policy behavior in simulation and the real world. We show that policies trained with our method are able to reliably transfer to different robots in two real world tasks: swing-peg-in-hole and opening a cabinet drawer. The video of our experiments can be found at https://sites.google.com/view/simopt.

IROS Conference 2019 Conference Paper

ContactGrasp: Functional Multi-finger Grasp Synthesis from Contact

  • Samarth Brahmbhatt
  • Ankur Handa
  • James Hays
  • Dieter Fox

Grasping and manipulating objects is an important human skill. Since most objects are designed to be manipulated by human hands, anthropomorphic hands can enable richer human-robot interaction. Desirable grasps are not only stable, but also functional: they enable post-grasp actions with the object. However, functional grasp synthesis for high degree-of-freedom anthropomorphic hands from object shape alone is challenging. We present ContactGrasp, a framework for functional grasp synthesis from object shape and contact on the object surface. Contact can be manually specified or obtained through demonstrations. Our contact representation is object-centric and allows functional grasp synthesis even for hand models different than the one used for demonstration. Using a dataset of contact demonstrations from humans grasping diverse household objects, we synthesize functional grasps for three hand models and two functional intents. The project webpage is https://contactdb.cc.gatech.edu/contactgrasp.html.

IROS Conference 2019 Conference Paper

EARLY FUSION for Goal Directed Robotic Vision

  • Aaron Walsman
  • Yonatan Bisk
  • Saadia Gabriel
  • Dipendra Misra
  • Yoav Artzi
  • Yejin Choi 0001
  • Dieter Fox

Building perceptual systems for robotics which perform well under tight computational budgets requires novel architectures which rethink the traditional computer vision pipeline. Modern vision architectures require the agent to build a summary representation of the entire scene, even if most of the input is irrelevant to the agent’s current goal. In this work, we flip this paradigm by introducing EARLYFUSION vision models that condition on a goal to build custom representations for downstream tasks. We show that these goal-specific representations are learned more quickly, substantially more parameter efficient, and more robust than existing attention mechanisms in our domain. We demonstrate the effectiveness of these methods on a simulated item retrieval problem that is trained in a fully end-to-end manner via imitation learning.

ICRA Conference 2019 Conference Paper

Joint Inference of Kinematic and Force Trajectories with Visuo-Tactile Sensing

  • Alexander Lambert
  • Mustafa Mukadam
  • Balakumar Sundaralingam
  • Nathan D. Ratliff
  • Byron Boots
  • Dieter Fox

To perform complex tasks, robots must be able to interact with and manipulate their surroundings. One of the key challenges in accomplishing this is robust state estimation during physical interactions, where the state involves not only the robot and the object being manipulated, but also the state of the contact itself. In this work, within the context of planar pushing, we extend previous inference-based approaches to state estimation in several ways. We estimate the robot, object, and the contact state on multiple manipulation platforms configured with a vision-based articulated model tracker, and either a biomimetic tactile sensor or a force-torque sensor. We show how to fuse raw measurements from the tracker and tactile sensors to jointly estimate the trajectory of the kinematic states and the forces in the system via probabilistic inference on factor graphs, in both batch and incremental settings. We perform several benchmarks with our framework and show how performance is affected by incorporating various geometric and physics based constraints, occluding vision sensors, or injecting noise in tactile sensors. We also compare with prior work on multiple datasets and demonstrate that our approach can effectively optimize over multi-modal sensor data and reduce uncertainty to find better state estimates.

ICRA Conference 2019 Conference Paper

Learning Latent Space Dynamics for Tactile Servoing

  • Giovanni Sutanto
  • Nathan D. Ratliff
  • Balakumar Sundaralingam
  • Yevgen Chebotar
  • Zhe Su
  • Ankur Handa
  • Dieter Fox

To achieve dexterous robotic manipulation, we need to endow our robot with tactile feedback capability, i.e., the ability to drive action based on tactile sensing. In this paper, we specifically address the challenge of tactile servoing: given the current tactile sensing and a target/goal tactile sensing - memorized from a successful task execution in the past - what action will bring the current tactile sensing closer to the target tactile sensing at the next time step? We develop a data-driven approach to acquire a dynamics model for tactile servoing by learning from demonstration. Moreover, our method represents the tactile sensing information as lying on a surface - or a 2D manifold - and performs manifold learning, making it applicable to any tactile skin geometry. We evaluate our method on a contact point tracking task using a robot equipped with a tactile finger.

ICRA Conference 2019 Conference Paper

Neural Autonomous Navigation with Riemannian Motion Policy

  • Xiangyun Meng
  • Nathan D. Ratliff
  • Yu Xiang 0001
  • Dieter Fox

End-to-end learning for autonomous navigation has received substantial attention recently as a promising method for reducing modeling error. However, its data complexity, especially around generalization to unseen environments, is high. We introduce a novel image-based autonomous navigation technique that leverages policy structure using the Riemannian Motion Policy (RMP) framework for deep learning of vehicular control. We design a deep neural network to predict control point RMPs of the vehicle from visual images, from which the optimal control commands can be computed analytically. We show that our network trained in the Gibson environment can be used for indoor obstacle avoidance and navigation on a real RC car, and our RMP representation generalizes better to unseen environments than predicting local geometry or predicting control commands directly.

ICRA Conference 2019 Conference Paper

Part Segmentation for Highly Accurate Deformable Tracking in Occlusions via Fully Convolutional Neural Networks

  • Weilin Wan 0001
  • Aaron Walsman
  • Dieter Fox

Successfully tracking the human body is an important perceptual challenge for robots that must work around people. Existing methods fall into two broad categories: geometric tracking and direct pose estimation using machine learning. While recent work has shown direct estimation techniques can be quite powerful, geometric tracking methods using point clouds can provide a very high level of 3D accuracy which is necessary for many robotic applications. However, these approaches can have difficulty in clutter when large portions of the subject are occluded. To overcome this limitation, we propose a solution based on fully convolutional neural networks (FCN). We develop an optimized Fast-FCN network architecture for our application which allows us to filter observed point clouds and improve tracking accuracy while maintaining interactive frame rates. We also show that this model can be trained with a limited number of examples and almost no manual labelling by using an existing geometric tracker and data augmentation to automatically generate segmentation maps. We demonstrate the accuracy of our full system by comparing it against an existing geometric tracker, and show significant improvement in these challenging scenarios.

ICRA Conference 2019 Conference Paper

Prospection: Interpretable plans from language by predicting the future

  • Chris Paxton 0001
  • Yonatan Bisk
  • Jesse Thomason
  • Arunkumar Byravan
  • Dieter Fox

High-level human instructions often correspond to behaviors with multiple implicit steps. In order for robots to be useful in the real world, they must be able to reason over both motions and intermediate goals implied by human instructions. In this work, we propose a framework for learning representations that convert from a natural-language command to a sequence of intermediate goals for execution on a robot. A key feature of this framework is prospection, training an agent not just to correctly execute the prescribed command, but to predict a horizon of consequences of an action before taking it. We demonstrate the fidelity of plans generated by our framework when interpreting real, crowd-sourced natural language commands for a robot in simulated scenes.

IROS Conference 2019 Conference Paper

Representing Robot Task Plans as Robust Logical-Dynamical Systems

  • Chris Paxton 0001
  • Nathan D. Ratliff
  • Clemens Eppner
  • Dieter Fox

It is difficult to create robust, reusable, and reactive behaviors for robots that can be easily extended and combined. Frameworks such as Behavior Trees are flexible but difficult to characterize, especially when designing reactions and recovery behaviors to consistently converge to a desired goal condition. We propose a framework which we call Robust Logical-Dynamical Systems (RLDS), which combines the advantages of task representations like behavior trees with theoretical guarantees on performance. RLDS can also be constructed automatically from simple sequential task plans and will still achieve robust, reactive behavior in dynamic real-world environments. In this work, we describe both our proposed framework and a case study on a simple household manipulation task, with examples for how specific pieces can be implemented to achieve robust behavior. Finally, we show how in the context of these manipulation tasks, a combination of an RLDS with planning can achieve better results under adversarial conditions.

ICRA Conference 2019 Conference Paper

Robust Learning of Tactile Force Estimation through Robot Interaction

  • Balakumar Sundaralingam
  • Alexander Lambert
  • Ankur Handa
  • Byron Boots
  • Tucker Hermans
  • Stan Birchfield
  • Nathan D. Ratliff
  • Dieter Fox

Current methods for estimating force from tactile sensor signals are either inaccurate analytic models or task-specific learned models. In this paper, we explore learning a robust model that maps tactile sensor signals to force. We specifically explore learning a mapping for the SynTouch BioTac sensor via neural networks. We propose a voxelized input feature layer for spatial signals and leverage information about the sensor surface to regularize the loss function. To learn a robust tactile force model that transfers across tasks, we generate ground truth data from three different sources: (1) the BioTac rigidly mounted to a force torque (FT) sensor, (2) a robot interacting with a ball rigidly attached to the same FT sensor, and (3) through force inference on a planar pushing task by formalizing the mechanics as a system of particles and optimizing over the object motion. A total of 140k samples were collected from the three sources. We achieve a median angular accuracy of 3.5 degrees in predicting force direction (66% improvement over the current state of the art) and a median magnitude accuracy of 0.06 N (93% improvement) on a test dataset. Additionally, we evaluate the learned force model in a force feedback grasp controller performing object lifting and gentle placement. Our results can be found on https://sites.google.com/view/tactile-force.

IROS Conference 2019 Conference Paper

Synthesizing Robot Manipulation Programs from a Single Observed Human Demonstration

  • Justin Huang
  • Dieter Fox
  • Maya Cakmak

Programming by Demonstration (PbD) lets users with little technical background program a wide variety of manipulation tasks for robots, but it should be as intuitive as possible for users while requiring as little time as possible. In this paper, we present a Programming by Demonstration system that synthesizes manipulation programs from a single observed demonstration, allowing users to program new tasks for a robot simply by performing the task once themselves. A human-in-the-loop interface helps users make corrections to the perceptual state as needed. We introduce Object Interaction Programs as a representation of multi-object, bimanual manipulation tasks and present algorithms for extracting programs from observed demonstrations and transferring programs to a robot to perform the task in a new scene. We demonstrate the expressivity and generalizability of our approach through an evaluation on a benchmark of complex tasks.

ICRA Conference 2018 Conference Paper

Real-time 3D Glint Detection in Remote Eye Tracking Based on Bayesian Inference

  • David Geisler
  • Dieter Fox
  • Enkelejda Kasneci

As human gaze provides information on our cognitive states, actions, and intentions, gaze-based interaction has the potential to enable a fluent and natural human-robot collaboration. In this work, we focus on reliable gaze estimation in remote eye tracking based on calibration-free methods. Although these methods work well in controlled settings, they fail when illumination conditions change or other objects induce noise. We propose a novel, adaptive method based on a probabilistic model, which reliably detects glints from stereo images, and evaluate our method using a data set that contains different challenges with regard to lighting and reflections.

ICRA Conference 2018 Conference Paper

SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Control

  • Arunkumar Byravan
  • Felix Leeb
  • Franziska Meier
  • Dieter Fox

In this work, we present an approach to deep visuomotor control using structured deep dynamics models. Our model, a variant of SE3-Nets, learns a low-dimensional pose embedding for visuomotor control via an encoder-decoder structure. Unlike prior work, our model is structured: given an input scene, our network explicitly learns to segment salient parts and predict their pose embedding and motion, modeled as a change in the pose due to the applied actions. We train our model using a pair of point clouds separated by an action and show that given supervision only through point-wise data associations between the frames our network is able to learn a meaningful segmentation of the scene along with consistent poses. We further show that our model can be used for closed-loop control directly in the learned low-dimensional pose space, where the actions are computed by minimizing pose error using gradient-based methods, similar to traditional model-based control. We present results on controlling a Baxter robot from raw depth data in simulation and RGBD data in the real world and compare against two baseline deep networks. We also test the robustness and generalization performance of our controller under changes in camera pose, lighting, occlusion, and motion. Our method is robust, runs in real-time, achieves good prediction of scene dynamics, and outperforms baselines on multiple control runs. Video results can be found at: https://rse-lab.cs.washington.edu/se3-structured-deep-ctrl/.

ICRA Conference 2017 Conference Paper

SE3-nets: Learning rigid body motion using deep neural networks

  • Arunkumar Byravan
  • Dieter Fox

We introduce SE3-Nets which are deep neural networks designed to model and learn rigid body motion from raw point cloud data. Based only on sequences of depth images along with action vectors and point-wise data associations, SE3-Nets learn to segment affected object parts and predict their motion resulting from the applied force. Rather than learning point-wise flow vectors, SE3-Nets predict SE(3) transformations for different parts of the scene. Using simulated depth data of a table top scene and a robot manipulator, we show that the structure underlying SE3-Nets enables them to generate a far more consistent prediction of object motion than traditional flow based networks. Additional experiments with a depth camera observing a Baxter robot pushing objects on a table show that SE3-Nets also work well on real data.

ICRA Conference 2017 Conference Paper

Visual closed-loop control for pouring liquids

  • Connor Schenck
  • Dieter Fox

Pouring a specific amount of liquid is a challenging task. In this paper we develop methods for robots to use visual feedback to perform closed-loop control for pouring liquids. We propose both a model-based and a model-free method utilizing deep learning for estimating the volume of liquid in a container. Our results show that the model-free method is better able to estimate the volume. We combine this with a simple PID controller to pour specific amounts of liquid, and show that the robot is able to achieve an average deviation of 38 ml from the target amount. To our knowledge, this is the first use of raw visual feedback to pour liquids in robotics.
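
The closed-loop idea can be sketched with a textbook PID controller driving a made-up first-order pour model; in the paper the volume estimate comes from vision, whereas here it is simulated directly.

```python
class PID:
    """Textbook PID controller on the volume error (a sketch, not the paper's gains)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

target_ml = 200.0
poured_ml = 0.0
controller = PID(kp=0.8, ki=0.0, kd=0.1)
dt = 0.1
for _ in range(300):                  # 30 s of simulated pouring
    # Controller output is a commanded flow rate; the container tilt
    # saturates how fast liquid can actually leave it.
    flow = controller.step(target_ml - poured_ml, dt)
    flow = max(0.0, min(flow, 50.0))  # clamp to a physically plausible ml/s range
    poured_ml += flow * dt            # flow accumulates into the receiving cup
```

The flow is clamped at zero because, unlike a generic plant, pouring is irreversible: the controller can only slow down as the measured volume approaches the target.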

IROS Conference 2016 Conference Paper

Autonomous question answering with mobile robots in human-populated environments

  • Michael Jae-Yoon Chung
  • Andrzej Pronobis
  • Maya Cakmak
  • Dieter Fox
  • Rajesh P. N. Rao

Autonomous mobile robots will soon become ubiquitous in human-populated environments. Besides their typical applications in fetching, delivery, or escorting, such robots present the opportunity to assist human users in their daily tasks by gathering and reporting up-to-date knowledge about the environment. In this paper, we explore this use case and present an end-to-end framework that enables a mobile robot to answer natural language questions about the state of a large-scale, dynamic environment asked by the inhabitants of that environment. The system parses the question and estimates an initial viewpoint that is likely to contain information for answering the question based on prior environment knowledge. Then, it autonomously navigates towards the viewpoint while dynamically adapting to changes and new information. The output of the system is an image of the most relevant part of the environment that allows the user to obtain an answer to their question. We additionally demonstrate the benefits of a continuously operating information gathering robot by showing how the system can answer retrospective questions about the past state of the world using incidentally recorded sensory data. We evaluate our approach with a custom mobile robot deployed in a university building, with questions collected from occupants of the building. We demonstrate our system's ability to respond to these questions in different environmental conditions.

ICRA Conference 2016 Conference Paper

NEOL: Toward Never-Ending Object Learning for robots

  • Yuyin Sun
  • Dieter Fox

Learning to recognize objects based on names is a crucial capability for personal robots. Recent recognition methods successfully learn to recognize objects in a train-once-then-test setting. Yet, these methods do not apply readily to robotic settings, where a robot might continuously encounter new objects and new names. In this work, we present a framework for Never-Ending Object Learning (NEOL). Our framework automatically learns to organize object names into a semantic hierarchy using crowdsourcing and background knowledge bases. It then uses the hierarchy to improve the consistency and efficiency of annotating objects. It also adapts information from additional image datasets to learn object classifiers from a very small number of training examples. We present experiments to test the performance of the adaptation method and demonstrate the full system in a never-ending object learning experiment.

IJCAI Conference 2015 Conference Paper

Building Hierarchies of Concepts via Crowdsourcing

  • Yuyin Sun
  • Adish Singla
  • Dieter Fox
  • Andreas Krause

Hierarchies of concepts are useful in many applications from navigation to organization of objects. Usually, a hierarchy is created in a centralized manner by employing a group of domain experts, a time-consuming and expensive process. The experts often design one single hierarchy to best explain the semantic relationships among the concepts, and ignore the natural uncertainty that may exist in the process. In this paper, we propose a crowdsourcing system to build a hierarchy and furthermore capture the underlying uncertainty. Our system maintains a distribution over possible hierarchies and actively selects questions to ask using an information gain criterion. We evaluate our methodology on simulated data and on a set of real world application domains. Experimental results show that our system is robust to noise, efficient in picking questions, cost-effective, and builds high quality hierarchies.
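
The information-gain criterion for picking questions can be sketched as follows; the candidate hierarchies, belief values, and questions are invented for illustration.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Belief over three hypothetical candidate hierarchies.
belief = {"h1": 0.5, "h2": 0.3, "h3": 0.2}

# answers[q][h] = the answer a crowd worker would give if hierarchy h were true,
# so each yes/no question partitions the hypotheses.
answers = {
    "is a mug a kind of container?": {"h1": "yes", "h2": "yes", "h3": "no"},
    "is a mug a kind of tool?":      {"h1": "no",  "h2": "yes", "h3": "no"},
}

def expected_information_gain(question):
    prior_h = entropy(belief.values())
    gain = 0.0
    for ans in ("yes", "no"):
        # Probability of this answer, and the posterior over hierarchies given it.
        mass = sum(p for h, p in belief.items() if answers[question][h] == ans)
        if mass == 0:
            continue
        post = [p / mass for h, p in belief.items() if answers[question][h] == ans]
        gain += mass * (prior_h - entropy(post))
    return gain

# Actively pick the question whose answer is expected to reduce uncertainty most.
best = max(answers, key=expected_information_gain)
```

With these numbers the second question wins because its answer isolates h2, the hypothesis whose probability is hardest to resolve otherwise; the real system applies the same criterion over a distribution maintained from crowd answers.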

ICRA Conference 2015 Conference Paper

Depth-based tracking with physical constraints for robot manipulation

  • Tanner Schmidt
  • Katharina Hertkorn
  • Richard A. Newcombe
  • Zoltán-Csaba Márton
  • Michael Suppa
  • Dieter Fox

This work integrates visual and physical constraints to perform real-time depth-only tracking of articulated objects, with a focus on tracking a robot's manipulators and manipulation targets in realistic scenarios. As such, we extend DART, an existing visual articulated object tracker, to additionally avoid interpenetration of multiple interacting objects, and to make use of contact information collected via torque sensors or touch sensors. To achieve greater stability, the tracker uses a switching model to detect when an object is stationary relative to the table or relative to the palm and then uses information from multiple frames to converge to an accurate and stable estimate. Deviation from stable states is detected in order to remain robust to failed grasps and dropped objects. The tracker is integrated into a shared autonomy system in which it provides state estimates used by a grasp planner and the controller of two anthropomorphic hands. We demonstrate the advantages and performance of the tracking system in simulation and on a real robot. Qualitative results are also provided for a number of challenging manipulations that are made possible by the speed, accuracy, and stability of the tracking system.

IROS Conference 2015 Conference Paper

Designing information gathering robots for human-populated environments

  • Michael Jae-Yoon Chung
  • Andrzej Pronobis
  • Maya Cakmak
  • Dieter Fox
  • Rajesh P. N. Rao

Advances in mobile robotics have enabled robots that can autonomously operate in human-populated environments. Although primary tasks for such robots might be fetching, delivery, or escorting, they present an untapped potential as information gathering agents that can answer questions for the community of co-inhabitants. In this paper, we seek to better understand requirements for such information gathering robots (InfoBots) from the perspective of the user requesting the information. We present findings from two studies: (i) a user survey conducted in two office buildings and (ii) a 4-day long deployment in one of the buildings, during which inhabitants of the building could ask questions to an InfoBot through a web-based interface. These studies allow us to characterize the types of information that InfoBots can provide for their users.

IJCAI Conference 2015 Conference Paper

Graph-Based Inverse Optimal Control for Robot Manipulation

  • Arunkumar Byravan
  • Mathew Monfort
  • Brian Ziebart
  • Byron Boots
  • Dieter Fox

Inverse optimal control (IOC) is a powerful approach for learning robotic controllers from demonstration that estimates a cost function which rationalizes demonstrated control trajectories. Unfortunately, it is difficult to apply in settings where optimal control can only be solved approximately. While local IOC approaches have been shown to successfully learn cost functions in such settings, they rely on the availability of good reference trajectories, which might not be available at test time. We address the problem of using IOC in these computationally challenging control tasks by using a graph-based discretization of the trajectory space. Our approach projects continuous demonstrations onto this discrete graph, where a cost function can be tractably learned via IOC. Discrete control trajectories from the graph are then projected back to the original space and locally optimized using the learned cost function. We demonstrate the effectiveness of the approach with experiments conducted on two 7-degree of freedom robotic arms.

ICRA Conference 2014 Conference Paper

Hierarchical sparse coded surface models

  • Michael Ruhnke
  • Liefeng Bo
  • Dieter Fox
  • Wolfram Burgard

In this paper, we describe a novel approach to construct textured 3D environment models in a hierarchical fashion based on local surface patches. Compared to previous approaches, the hierarchy enables our method to represent the environment with differently sized surface patches. The reconstruction scheme starts at a coarse resolution with large patches and in an iterative fashion uses the reconstruction error to guide the decision as to whether the resolution should be refined. This leads to variable resolution models that represent areas with few variations at low resolution and areas with large variations at high resolution. In addition, we compactly describe local surface attributes via sparse coding based on an overcomplete dictionary. In this way, we additionally exploit similarities in structure and texture, which leads to compact models. We learn the dictionary directly from the input data and independently for every level in the hierarchy in an unsupervised fashion. Practical experiments with large-scale datasets demonstrate that our method yields more compact models than two state-of-the-art techniques while being comparable in accuracy.

AAAI Conference 2014 Conference Paper

Learning from Unscripted Deictic Gesture and Language for Human-Robot Interactions

  • Cynthia Matuszek
  • Liefeng Bo
  • Luke Zettlemoyer
  • Dieter Fox

As robots become more ubiquitous, it is increasingly important for untrained users to be able to interact with them intuitively. In this work, we investigate how people refer to objects in the world during relatively unstructured communication with robots. We collect a corpus of deictic interactions from users describing objects, which we use to train language and gesture models that allow our robot to determine what objects are being indicated. We introduce a temporal extension to state-of-the-art hierarchical matching pursuit features to support gesture understanding, and demonstrate that combining multiple communication modalities captures user intent more effectively than relying on a single type of input. Finally, we present initial interactions with a robot that uses the learned models to follow commands.

ICRA Conference 2014 Conference Paper

Learning predictive models of a depth camera & manipulator from raw execution traces

  • Byron Boots
  • Arunkumar Byravan
  • Dieter Fox

In this paper, we attack the problem of learning a predictive model of a depth camera and manipulator directly from raw execution traces. While the problem of learning manipulator models from visual and proprioceptive data has been addressed before, existing techniques often rely on assumptions about the structure of the robot or tracked features in observation space. We make no such assumptions. Instead, we formulate the problem as that of learning a high-dimensional controlled stochastic process. We leverage recent work on nonparametric predictive state representations to learn a generative model of the depth camera and robotic arm from sequences of uninterpreted actions and observations. We perform several experiments in which we demonstrate that our learned model can accurately predict future depth camera observations in response to sequences of motor commands.

ICRA Conference 2014 Conference Paper

Learning to identify new objects

  • Yuyin Sun
  • Liefeng Bo
  • Dieter Fox

Identifying objects based on language descriptions is an important capability for robots interacting with people in everyday environments. People naturally use attributes and names to refer to objects of interest. Due to the complexity of indoor environments and the fact that people use various ways to refer to objects, a robot frequently encounters new objects or object names. To deal with such situations, a robot must be able to continuously grow its object knowledge base. In this work we introduce a system that organizes objects and names in a semantic hierarchy. Similarity between name words is learned via a hierarchy embedded vector representation. The hierarchy enables reasoning about unknown objects and names. Novel objects are inserted automatically into the knowledge base, where the exact location in the hierarchy is determined by asking a user questions. The questions are informed by the current hierarchy and the appearance of the object. Experiments demonstrate that the learned representation captures the meaning of names and is helpful for object identification with new names.

ICRA Conference 2014 Conference Paper

Multi-task policy search for robotics

  • Marc Peter Deisenroth
  • Peter Englert
  • Jan Peters 0001
  • Dieter Fox

Learning policies that generalize across multiple tasks is an important and challenging research topic in reinforcement learning and robotics. Training individual policies for every single potential task is often impractical, especially for continuous task variations, requiring more principled approaches to share and transfer knowledge among similar tasks. We present a novel approach for learning a nonlinear feedback policy that generalizes across multiple tasks. The key idea is to define a parametrized policy as a function of both the state and the task, which allows learning a single policy that generalizes across multiple known and unknown tasks. Applications of our novel approach to reinforcement and imitation learning in real-robot experiments are shown.

ICRA Conference 2014 Conference Paper

Space-time functional gradient optimization for motion planning

  • Arunkumar Byravan
  • Byron Boots
  • Siddhartha S. Srinivasa
  • Dieter Fox

Functional gradient algorithms (e.g., CHOMP) have recently shown great promise for producing locally optimal motion for complex many degree-of-freedom robots. A key limitation of such algorithms is the difficulty in incorporating constraints and cost functions that explicitly depend on time. We present T-CHOMP, a functional gradient algorithm that overcomes this limitation by directly optimizing in space-time. We outline a framework for joint space-time optimization, derive an efficient trajectory-wide update for maintaining time monotonicity, and demonstrate the significance of T-CHOMP over CHOMP in several scenarios. By manipulating time, T-CHOMP produces lower-cost trajectories leading to behavior that is meaningfully different from CHOMP.

ICRA Conference 2014 Conference Paper

ST-HMP: Unsupervised Spatio-Temporal feature learning for tactile data

  • Marianna Madry
  • Liefeng Bo
  • Danica Kragic
  • Dieter Fox

Tactile sensing plays an important role in robot grasping and object recognition. In this work, we propose a new descriptor named Spatio-Temporal Hierarchical Matching Pursuit (ST-HMP) that captures properties of a time series of tactile sensor measurements. It is based on the concept of unsupervised hierarchical feature learning realized using sparse coding. The ST-HMP extracts rich spatio-temporal structures from raw tactile data without the need to predefine discriminative data characteristics. We apply it to two different applications: (1) grasp stability assessment and (2) object instance recognition, demonstrating its broad applicability. An extensive evaluation on several synthetic and real datasets collected using the Schunk Dexterous, Schunk Parallel and iCub hands shows that our approach outperforms previously published results by a large margin.

ICRA Conference 2014 Conference Paper

Toward online 3-D object segmentation and mapping

  • Evan Herbst
  • Peter Henry
  • Dieter Fox

We build on recent fast and accurate 3-D reconstruction techniques to segment objects during scene reconstruction. We take object outline information from change detection to build 3-D models of rigid objects and represent the scene as static and dynamic components. Object models are updated online during mapping, and can integrate segmentation information from sources other than change detection.

ICRA Conference 2014 Conference Paper

Unsupervised feature learning for 3D scene labeling

  • Kevin Lai 0001
  • Liefeng Bo
  • Dieter Fox

This paper presents an approach for labeling objects in 3D scenes. We introduce HMP3D, a hierarchical sparse coding technique for learning features from 3D point cloud data. HMP3D classifiers are trained using a synthetic dataset of virtual scenes generated using CAD models from an online database. Our scene labeling system combines features learned from raw RGB-D images and 3D point clouds directly, without any hand-designed features, to assign an object label to every 3D point in the scene. Experiments on the RGB-D Scenes Dataset v2 demonstrate that the proposed approach can be used to label indoor scenes containing both small tabletop objects and large furniture pieces.

ICRA Conference 2013 Conference Paper

Attribute based object identification

  • Yuyin Sun
  • Liefeng Bo
  • Dieter Fox

Over the past few years, the robotics community has made substantial progress in detection and 3D pose estimation of known and unknown objects. However, the question of how to identify objects based on language descriptions has not been investigated in detail. While the computer vision community recently started to investigate the use of attributes for object recognition, these approaches do not consider the task settings typically observed in robotics, where a combination of appearance attributes and object names might be used in referral language to identify specific objects in a scene. In this paper, we introduce an approach for identifying objects based on natural language containing appearance and name attributes. To learn rich RGB-D features needed for attribute classification, we extend recently introduced sparse coding techniques so as to automatically learn attribute-dependent features. We introduce a large data set of attribute descriptions of objects in the RGB-D object dataset. Experiments on this data set demonstrate the strong performance of our approach to language based object identification. We also show that our attribute-dependent features provide significantly better generalization to previously unseen attribute values, thereby enabling more rapid learning of new attribute values.

AAAI Conference 2013 Conference Paper

Compact RGBD Surface Models Based on Sparse Coding

  • Michael Ruhnke
  • Liefeng Bo
  • Dieter Fox
  • Wolfram Burgard

In this paper, we describe a novel approach to construct compact colored 3D environment models representing local surface attributes via sparse coding. Our method decomposes a set of colored point clouds into local surface patches and encodes them based on an overcomplete dictionary. Instead of storing the entire point cloud, we store a dictionary, surface patch positions, and a sparse code description of the depth and RGB attributes for every patch. The dictionary is learned in an unsupervised way from surface patches sampled from indoor maps. We show that better dictionaries can be learned by extending the K-SVD method with a binary weighting scheme that ignores undefined surface cells. Through experimental evaluation on real world laser and RGBD datasets we demonstrate that our method produces compact and accurate models. Furthermore, we clearly outperform an existing state of the art method in terms of compactness, accuracy, and computation time. Additionally, we demonstrate that our sparse code descriptions can be utilized for other important tasks such as object detection.

ICRA Conference 2013 Conference Paper

RGB-D flow: Dense 3-D motion estimation using color and depth

  • Evan Herbst
  • Xiaofeng Ren
  • Dieter Fox

3-D motion estimation is a fundamental problem that has far-reaching implications in robotics. A scene flow formulation is attractive as it makes no assumptions about scene complexity, object rigidity, or camera motion. RGB-D cameras provide new information useful for computing dense 3-D flow in challenging scenes. In this work we show how to generalize two-frame variational 2-D flow algorithms to 3-D. We show that scene flow can be reliably computed using RGB-D data, overcoming depth noise and outperforming previous results on a variety of scenes. We apply dense 3-D flow to rigid motion segmentation.

ICRA Conference 2012 Conference Paper

Detection-based object labeling in 3D scenes

  • Kevin Lai 0001
  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

We propose a view-based approach for labeling objects in 3D scenes reconstructed from RGB-D (color+depth) videos. We utilize sliding window detectors trained from object views to assign class probabilities to pixels in every RGB-D frame. These probabilities are projected into the reconstructed 3D scene and integrated using a voxel representation. We perform efficient inference on a Markov Random Field over the voxels, combining cues from view-based detection and 3D shape, to label the scene. Our detection-based approach produces accurate scene labeling on the RGB-D Scenes Dataset and improves the robustness of object detection.

ICRA Conference 2012 Conference Paper

Exploiting segmentation for robust 3D object matching

  • Michael Krainin
  • Kurt Konolige
  • Dieter Fox

While Iterative Closest Point (ICP) algorithms have been successful at aligning 3D point clouds, they do not take into account constraints arising from sensor viewpoints. More recent beam-based models take into account sensor noise and viewpoint, but problems still remain. In particular, good optimization strategies are still lacking for the beam-based model. In situations of occlusion and clutter, both beam-based and ICP approaches can fail to find good solutions. In this paper, we present both an optimization method for beam-based models and a novel framework for modeling observation dependencies in beam-based models using over-segmentations. This technique enables reasoning about object extents and works well in heavy clutter. We also make available a ground-truth 3D dataset for testing algorithms in this area.

ICRA Conference 2012 Conference Paper

Interactive singulation of objects from a pile

  • Lillian Y. Chang
  • Joshua R. Smith 0001
  • Dieter Fox

Interaction with unstructured groups of objects allows a robot to discover and manipulate novel items in cluttered environments. We present a framework for interactive singulation of individual items from a pile. The proposed framework provides an overall approach for tasks involving operation on multiple objects, such as counting, arranging, or sorting items in a pile. A perception module combined with pushing actions accumulates evidence of singulated items over multiple pile interactions. A decision module scores the likelihood of a single-item pile to a multiple-item pile based on the magnitude of motion and matching determined from the perception module. Three variations of the singulation framework were evaluated on a physical robot for an arrangement task. The proposed interactive singulation method with adaptive pushing reduces the grasp errors on non-singulated piles compared to alternative methods without the perception and decision modules. This work contributes the general pile interaction framework, a specific method for integrating perception and action plans with grasp decisions, and an experimental evaluation of the cost trade-offs for different singulation methods.

ICRA Conference 2011 Conference Paper

A large-scale hierarchical multi-view RGB-D object dataset

  • Kevin Lai 0001
  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

Over the last decade, the availability of public image repositories and recognition benchmarks has enabled rapid progress in visual object category and instance detection. Today we are witnessing the birth of a new generation of sensing technologies capable of providing high quality synchronized videos of both color and depth, the RGB-D (Kinect-style) camera. With its advanced sensing capabilities and the potential for mass adoption, this technology represents an opportunity to dramatically increase robotic object recognition, manipulation, navigation, and interaction capabilities. In this paper, we introduce a large-scale, hierarchical multi-view object dataset collected using an RGB-D camera. The dataset contains 300 objects organized into 51 categories and has been made publicly available to the research community so as to enable rapid progress based on this promising technology. This paper describes the dataset collection procedure and introduces techniques for RGB-D based object recognition and detection, demonstrating that combining color and depth information substantially improves quality of results.

AAAI Conference 2011 Conference Paper

A Scalable Tree-Based Approach for Joint Object and Pose Recognition

  • Kevin Lai 0001
  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

Recognizing possibly thousands of objects is a crucial capability for an autonomous agent to understand and interact with everyday environments. Practical object recognition comes in multiple forms: Is this a coffee mug? (category recognition). Is this Alice’s coffee mug? (instance recognition). Is the mug with the handle facing left or right? (pose recognition). We present a scalable framework, Object-Pose Tree, which efficiently organizes data into a semantically structured tree. The tree structure enables both scalable training and testing, allowing us to solve recognition over thousands of object poses in near real-time. Moreover, by simultaneously optimizing all three tasks, our approach outperforms standard nearest neighbor and 1-vs-all classifications, with large improvements on pose recognition. We evaluate the proposed technique on a dataset of 300 household objects collected using a Kinect-style 3D camera. Experiments demonstrate that our system achieves robust and efficient object category, instance, and pose recognition on challenging everyday objects.

ICRA Conference 2011 Conference Paper

Autonomous generation of complete 3D object models using next best view manipulation planning

  • Michael Krainin
  • Brian Curless
  • Dieter Fox

Recognizing and manipulating objects is an important task for mobile robots performing useful services in everyday environments. In this paper, we develop a system that enables a robot to grasp an object and to move it in front of its depth camera so as to build a 3D surface model of the object. We derive an information gain based variant of the next best view algorithm in order to determine how the manipulator should move the object in front of the camera. By considering occlusions caused by the robot manipulator, our technique also determines when and how the robot should re-grasp the object in order to build a complete model.

IROS Conference 2011 Conference Paper

Depth kernel descriptors for object recognition

  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

Consumer depth cameras, such as the Microsoft Kinect, are capable of providing frames of dense depth values in real time. One fundamental question in utilizing depth cameras is how to best extract features from depth frames. Motivated by local descriptors on images, in particular kernel descriptors, we develop a set of kernel features on depth images that model size, 3D shape, and depth edges in a single framework. Through extensive experiments on object recognition, we show that (1) our local features capture different aspects of cues from a depth frame/view that complement one another; (2) our kernel features significantly outperform traditional 3D features (e.g., Spin images); and (3) we significantly improve the capabilities of depth and RGB-D (color+depth) recognition, achieving 10–15% improvement in accuracy over the state of the art.

ICRA Conference 2011 Conference Paper

Gambit: An autonomous chess-playing robotic system

  • Cynthia Matuszek
  • Brian Mayton
  • Roberto Aimi
  • Marc Peter Deisenroth
  • Liefeng Bo
  • Robert Chu
  • Mike Kung
  • Louis LeGrand
  • Joshua R. Smith 0001
  • Dieter Fox

This paper presents Gambit, a custom, mid-cost 6-DoF robot manipulator system that can play physical board games against human opponents in non-idealized environments. Historically, unconstrained robotic manipulation in board games has often proven to be more challenging than the underlying game reasoning, making it an ideal testbed for small-scale manipulation. The Gambit system includes a low-cost Kinect-style visual sensor, a custom manipulator, and state-of-the-art learning algorithms for automatic detection and recognition of the board and objects on it. As a use-case, we describe playing chess quickly and accurately with arbitrary, uninstrumented boards and pieces, demonstrating that Gambit's engineering and design represent a new state-of-the-art in fast, robust tabletop manipulation.

NeurIPS Conference 2011 Conference Paper

Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports).

IROS Conference 2011 Conference Paper

RGB-D object discovery via multi-scene analysis

  • Evan Herbst
  • Xiaofeng Ren
  • Dieter Fox

We introduce an algorithm for object discovery from RGB-D (color plus depth) data, building on recent progress in using RGB-D cameras for 3-D reconstruction. A set of 3-D maps are built from multiple visits to the same scene. We introduce a multi-scene MRF model to detect objects that moved between visits, combining shape, visibility, and color cues. We measure similarities between candidate objects using both 2-D and 3-D matching, and apply spectral clustering to infer object clusters from noisy links. Our approach can robustly detect objects and their motion between scenes even when objects are textureless or have the same shape as other objects.

ICRA Conference 2011 Conference Paper

Sparse distance learning for object recognition combining RGB and depth information

  • Kevin Lai 0001
  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

In this work we address joint object category and instance recognition in the context of RGB-D (depth) cameras. Motivated by local distance learning, where a novel view of an object is compared to individual views of previously seen objects, we define a view-to-object distance where a novel view is compared simultaneously to all views of a previous object. This novel distance is based on a weighted combination of feature differences between views. We show, through jointly learning per-view weights, that this measure leads to superior classification performance on object category and instance recognition. More importantly, the proposed distance allows us to find a sparse solution via Group-Lasso regularization, where a small subset of representative views of an object is identified and used, with the rest discarded. This significantly reduces computational cost without compromising recognition accuracy. We evaluate the proposed technique, Instance Distance Learning (IDL), on the RGB-D Object Dataset, which consists of 300 object instances in 51 everyday categories and about 250,000 views of objects with both RGB color and depth. We empirically compare IDL to several alternative state-of-the-art approaches and also validate the use of visual and shape cues and their combination.

ICRA Conference 2011 Conference Paper

Toward object discovery and modeling via 3-D scene comparison

  • Evan Herbst
  • Peter Henry
  • Xiaofeng Ren
  • Dieter Fox

The performance of indoor robots that stay in a single environment can be enhanced by gathering detailed knowledge of objects that frequently occur in that environment. We use an inexpensive sensor providing dense color and depth, and fuse information from multiple sensing modalities to detect changes between two 3-D maps. We adapt a recent SLAM technique to align maps. A probabilistic model of sensor readings lets us reason about movement of surfaces. Our method handles arbitrary shapes and motions, and is robust to lack of texture. We demonstrate the ability to find whole objects in complex scenes by regularizing over surface patches.

NeurIPS Conference 2010 Conference Paper

Kernel Descriptors for Visual Recognition

  • Liefeng Bo
  • Xiaofeng Ren
  • Dieter Fox

The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT and HOG, are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA). Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.

ICRA Conference 2010 Conference Paper

Learning to navigate through crowded environments

  • Peter Henry
  • Christian Vollmer
  • Brian Ferris
  • Dieter Fox

The goal of this research is to enable mobile robots to navigate through crowded environments such as indoor shopping malls, airports, or downtown sidewalks. The key research question addressed in this paper is how to learn planners that generate human-like motion behavior. Our approach uses inverse reinforcement learning (IRL) to learn human-like navigation behavior based on example paths. Since robots have only limited sensing, we extend existing IRL methods to the case of partially observable environments. We demonstrate the capabilities of our approach using a realistic crowd flow simulator in which we modeled multiple scenarios in crowded environments. We show that our planner learned to guide the robot along the flow of people when the environment is crowded, and along the shortest path if no people are around.

ICRA Conference 2009 Conference Paper

Anatomically correct testbed hand control: Muscle and joint control strategies

  • Ashish D. Deshpande
  • Jonathan Ko
  • Dieter Fox
  • Yoky Matsuoka

Human hands are capable of many dexterous grasping and manipulation tasks. To understand human levels of dexterity and to achieve it with robotic hands, we constructed an anatomically correct testbed (ACT) hand which allows for the investigation of the biomechanical features and neural control strategies of the human hand. This paper focuses on developing control strategies for the index finger motion of the ACT Hand. A direct muscle position control and a force-optimized joint control are implemented as building blocks and tools for comparisons with future biological control approaches. We show how Gaussian process regression techniques can be used to determine the relationships between the muscle and joint motions in both controllers. Our experiments demonstrate that the direct muscle position controller allows for accurate and fast position tracking, while the force-optimized joint controller allows for exploitation of the actuation redundancy in the finger, which is critical for this redundant system. Furthermore, a comparison between Gaussian processes and least squares regression method shows that Gaussian processes provide better parameter estimation and tracking performance. This first control investigation on the ACT hand opens doors to implement biological strategies observed in humans and achieve the ultimate human-level dexterity.

IROS Conference 2008 Conference Paper

GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models

  • Jonathan Ko
  • Dieter Fox

Bayesian filtering is a general framework for recursively estimating the state of a dynamical system. The most common instantiations of Bayes filters are Kalman filters (extended and unscented) and particle filters. Key components of each Bayes filter are probabilistic prediction and observation models. Recently, Gaussian processes have been introduced as a non-parametric technique for learning such models from training data. In the context of unscented Kalman filters, these models have been shown to provide estimates that can be superior to those achieved with standard, parametric models. In this paper we show how Gaussian process models can be integrated into other Bayes filters, namely particle filters and extended Kalman filters. We provide a complexity analysis of these filters and evaluate the alternative techniques using data collected with an autonomous micro-blimp.

IJCAI Conference 2007 Conference Paper

WiFi-SLAM Using Gaussian Process Latent Variable Models

  • Brian Ferris
  • Dieter Fox
  • Neil Lawrence

WiFi localization, the task of determining the physical location of a mobile device from wireless signal strengths, has been shown to be an accurate method of indoor and outdoor localization and a powerful building block for location-aware applications. However, most localization techniques require a training set of signal strength readings labeled against a ground truth location map, which is prohibitive to collect and maintain as maps grow large. In this paper we propose a novel technique for solving the WiFi SLAM problem using the Gaussian Process Latent Variable Model (GP-LVM) to determine the latent-space locations of unlabeled signal strength data. We show how GP-LVM, in combination with an appropriate motion dynamics model, can be used to reconstruct a topological connectivity graph from a signal strength sequence which, in combination with the learned Gaussian Process signal strength model, can be used to perform efficient localization.

IJCAI Conference 2007 Conference Paper

Training Conditional Random Fields Using Virtual Evidence Boosting

  • Lin Liao
  • Tanzeem Choudhury
  • Dieter Fox
  • Henry Kautz

While conditional random fields (CRFs) have been applied successfully in a variety of domains, their training remains a challenging task. In this paper, we introduce a novel training method for CRFs, called virtual evidence boosting, which simultaneously performs feature selection and parameter estimation. To achieve this, we extend standard boosting to handle virtual evidence, where an observation can be specified as a distribution rather than a single number. This extension allows us to develop a unified framework for learning both local and compatibility features in CRFs. In experiments on synthetic data as well as real activity classification problems, our new training algorithm outperforms other training approaches including maximum likelihood, maximum pseudo-likelihood, and the most recent boosted random fields.

IJCAI Conference 2007 Conference Paper

Efficient Failure Detection on Mobile Robots Using Particle Filters with Gaussian Process Proposals

  • Christian Plagemann
  • Dieter Fox
  • Wolfram Burgard

The ability to detect failures and to analyze their causes is one of the preconditions of truly autonomous mobile robots. Especially online failure detection is a complex task, since the effects of failures are typically difficult to model and often resemble the noisy system behavior in a fault-free operational mode. The extremely low a priori likelihood of failures poses additional challenges for detection algorithms. In this paper, we present an approach that applies Gaussian process classification and regression techniques for learning highly effective proposal distributions of a particle filter that is applied to track the state of the system. As a result, the efficiency and robustness of the state estimation process is substantially improved. In practical experiments carried out with a real robot we demonstrate that our system is capable of detecting collisions with unseen obstacles while at the same time estimating the changing point of contact with the obstacle.

IJCAI Conference 2007 Conference Paper

Voronoi Random Fields: Extracting the Topological Structure of Indoor Environments via Place Labeling

  • Stephen Friedman
  • Hanna Pasula
  • Dieter Fox

The ability to build maps of indoor environments is extremely important for autonomous mobile robots. In this paper we introduce Voronoi random fields (VRFs), a novel technique for mapping the topological structure of indoor environments. Our maps describe environments in terms of their spatial layout along with information about the different places and their connectivity. To build these maps, we extract a Voronoi graph from an occupancy grid map generated with a laser range-finder, and then represent each point on the Voronoi graph as a node of a conditional random field, which is a discriminatively trained graphical model. The resulting VRF estimates the label of each node, integrating features from both the map and the Voronoi topology. The labels provide a segmentation of an environment, with the different segments corresponding to rooms, hallways, or doorways. Experiments using different maps show that our technique is able to label unknown environments based on parameters learned from other environments.

IROS Conference 2007 Conference Paper

A spatio-temporal probabilistic model for multi-sensor object recognition

  • Bertrand Douillard
  • Dieter Fox
  • Fabio Ramos 0001

This paper presents a general framework for multi-sensor object recognition through a discriminative probabilistic approach modelling spatial and temporal correlations. The algorithm is developed in the context of Conditional Random Fields (CRFs) trained with virtual evidence boosting. The resulting system is able to integrate arbitrary sensor information and incorporate features extracted from the data. The spatial relationships captured by the model are further integrated into a smoothing algorithm to improve recognition over time. We demonstrate the benefits of modelling spatial and temporal relationships for the problem of detecting cars using laser and vision data in outdoor environments.

ICRA Conference 2007 Conference Paper

CRF-Filters: Discriminative Particle Filters for Sequential State Estimation

  • Benson Limketkai
  • Dieter Fox
  • Lin Liao

Particle filters have been applied with great success to various state estimation problems in robotics. However, particle filters often require extensive parameter tweaking in order to work well in practice. This is based on two observations. First, particle filters typically rely on independence assumptions such as "the beams in a laser scan are independent given the robot's location in a map". Second, even when the noise parameters of the dynamical system are perfectly known, the sample-based approximation can result in poor filter performance. In this paper we introduce CRF-filters, a novel variant of particle filtering for sequential state estimation. CRF-filters are based on conditional random fields, which are discriminative models that can handle arbitrary dependencies between observations. We show how to learn the parameters of CRF-filters based on labeled training data. Experiments using a robot equipped with a laser range-finder demonstrate that our technique is able to learn parameters of the robot's motion and sensor models that result in good localization performance, without the need of additional parameter tweaking.

ICRA Conference 2007 Conference Paper

Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp

  • Jonathan Ko
  • Daniel J. Klein
  • Dieter Fox
  • Dirk Hähnel

Blimps are a promising platform for aerial robotics and have been studied extensively for this purpose. Unlike other aerial vehicles, blimps are relatively safe and also possess the ability to loiter for long periods. These advantages, however, have been difficult to exploit because blimp dynamics are complex and inherently non-linear. The classical approach to system modeling represents the system as an ordinary differential equation (ODE) based on Newtonian principles. A more recent modeling approach is based on representing state transitions as a Gaussian process (GP). In this paper, we present a general technique for system identification that combines these two modeling approaches into a single formulation. This is done by training a Gaussian process on the residual between the non-linear model and ground truth training data. The result is a GP-enhanced model that provides an estimate of uncertainty in addition to giving better state predictions than either ODE or GP alone. We show how the GP-enhanced model can be used in conjunction with reinforcement learning to generate a blimp controller that is superior to those learned with ODE or GP models alone.

IROS Conference 2007 Conference Paper

GP-UKF: Unscented Kalman filters with Gaussian process prediction and observation models

  • Jonathan Ko
  • Daniel J. Klein
  • Dieter Fox
  • Dirk Hähnel

This paper considers the use of non-parametric system models for sequential state estimation. In particular, motion and observation models are learned from training examples using Gaussian process (GP) regression. The state estimator is an unscented Kalman filter (UKF). The resulting GP-UKF algorithm has a number of advantages over standard (parametric) UKFs. These include the ability to estimate the state of arbitrary nonlinear systems, improved tracking quality compared to a parametric UKF, and graceful degradation with increased model uncertainty. These advantages stem from the fact that GPs consider both the noise in the system and the uncertainty in the model. If an approximate parametric model is available, it can be incorporated into the GP, resulting in further performance improvements. In experiments, we show how the GP-UKF algorithm can be applied to the problem of tracking an autonomous micro-blimp.
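The sigma-point machinery that a GP-UKF shares with a standard UKF can be sketched independently of the learned models. Below is a minimal unscented transform using the standard scaled sigma-point weights; the function name and parameter defaults are illustrative, and a GP-UKF would plug the learned GP mean in as `f` and fold the GP's predictive variance into the resulting covariance.

```python
import numpy as np

def unscented_transform(mean, cov, f, alpha=1.0, beta=2.0, kappa=0.0):
    """Propagate a Gaussian N(mean, cov) through a nonlinearity f
    using the 2n+1 sigma points of the unscented transform."""
    n = len(mean)
    lam = alpha ** 2 * (n + kappa) - n
    root = np.linalg.cholesky((n + lam) * cov)      # scaled matrix square root
    sigma = np.vstack([mean, mean + root.T, mean - root.T])
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))  # mean weights
    wc = wm.copy()                                  # covariance weights
    wm[0] = lam / (n + lam)
    wc[0] = wm[0] + (1.0 - alpha ** 2 + beta)
    y = np.array([f(s) for s in sigma])             # push points through f
    y_mean = wm @ y
    d = y - y_mean
    y_cov = (wc[:, None] * d).T @ d
    return y_mean, y_cov
```

For a linear `f` the transform is exact (it reproduces `A @ mean` and `A @ cov @ A.T`), which is a convenient sanity check before substituting a nonlinear or learned model.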

AIJ Journal 2007 Journal Article

Learning and inferring transportation routines

  • Lin Liao
  • Donald J. Patterson
  • Dieter Fox
  • Henry Kautz

This paper introduces a hierarchical Markov model that can learn and infer a user's daily movements through an urban community. The model uses multiple levels of abstraction in order to bridge the gap between raw GPS sensor measurements and high-level information such as a user's destination and mode of transportation. To achieve efficient inference, we apply Rao–Blackwellized particle filters at multiple levels of the model hierarchy. Locations such as bus stops and parking lots, where the user frequently changes mode of transportation, are learned from GPS data logs without manual labeling of training data. We experimentally demonstrate how to accurately detect novel behavior or user errors (e.g., taking a wrong bus) by explicitly modeling activities in the context of the user's historical data. Finally, we discuss an application called “Opportunity Knocks” that employs our techniques to help cognitively-impaired people use public transportation safely.

UAI Conference 2006 Conference Paper

Recognizing Activities and Spatial Context Using Wearable Sensors

  • Amar Subramanya
  • Alvin Raj
  • Jeff A. Bilmes
  • Dieter Fox

We introduce a new dynamic model with the capability of recognizing both the activities that an individual is performing and where that individual is located. Our model is novel in that it utilizes a dynamic graphical model to jointly estimate both activity and spatial context over time based on the simultaneous use of asynchronous observations consisting of GPS measurements and measurements from a small mountable sensor board. Joint inference is quite desirable as it has the ability to improve accuracy of the model. A key goal, however, in designing our overall system is to be able to perform accurate inference decisions while minimizing the amount of hardware an individual must wear. This minimization leads to greater comfort and flexibility, decreased power requirements and therefore increased battery life, and reduced cost. We show results indicating that our joint measurement model outperforms measurements from either the sensor board or GPS alone, using two types of probabilistic inference procedures, namely particle filtering and pruned exact inference.

ICRA Conference 2005 Conference Paper

Autonomous Terrain Mapping and Classification Using Hidden Markov Models

  • Denis Fernando Wolf
  • Gaurav S. Sukhatme
  • Dieter Fox
  • Wolfram Burgard

This paper presents a new approach for terrain mapping and classification using mobile robots with 2D laser range finders. Our algorithm generates 3D terrain maps and classifies navigable and non-navigable regions on those maps using Hidden Markov models. The maps generated by our approach can be used for path planning, navigation, local obstacle avoidance, detection of changes in the terrain, and object recognition. We propose a map segmentation algorithm based on Markov Random Fields, which removes small errors in the classification. In order to validate our algorithms, we present experimental results using two robotic platforms.

NeurIPS Conference 2005 Conference Paper

Location-based activity recognition

  • Lin Liao
  • Dieter Fox
  • Henry Kautz

Learning patterns of human behavior from sensor data is extremely important for high-level activity inference. We show how to extract and label a person's activities and significant places from traces of GPS data. In contrast to existing techniques, our approach simultaneously detects and classifies the significant locations of a person and takes the high-level context into account. Our system uses relational Markov networks to represent the hierarchical activity model that encodes the complex relations among GPS readings, activities and significant places. We apply FFT-based message passing to perform efficient summation over large numbers of nodes in the networks. We present experiments that show significant improvements over existing techniques.

IJCAI Conference 2005 Conference Paper

Location-Based Activity Recognition using Relational Markov Networks

  • Lin Liao
  • Dieter Fox
  • Henry Kautz

In this paper we define a general framework for activity recognition by building upon and extending Relational Markov Networks. Using the example of activity recognition from location data, we show that our model can represent a variety of features including temporal information such as time of day, spatial information extracted from geographic databases, and global constraints such as the number of homes or workplaces of a person. We develop an efficient inference and learning technique based on MCMC. Using GPS location data collected by multiple people we show that the technique can accurately label a person’s activity locations. Furthermore, we show that it is possible to learn good models from less data by using priors extracted from other people’s data.

IROS Conference 2004 Conference Paper

Bayesian color estimation for adaptive vision-based robot localization

  • Dirk Schulz 0001
  • Dieter Fox

In this article we introduce a hierarchical Bayesian model to estimate a set of colors with a mobile robot. Estimating colors is particularly important if objects in an environment can only be distinguished by their color. Since the appearance of colors can change due to variations in the lighting condition, a robot needs to adapt its color model to such changes. We propose a two level Gaussian model in which the lighting conditions are estimated at the upper level using a switching Kalman filter. A hierarchical Bayesian technique learns Gaussian priors from data collected in other environments. Furthermore, since estimation of the color model depends on knowledge of the robot's location, we employ a Rao-Blackwellised particle filter to maintain a joint posterior over robot positions and lighting conditions. We evaluate the technique in the context of the RoboCup AIBO league, where a legged AIBO robot has to localize itself in an environment similar to a soccer field. Our experiments show that the robot can localize under different lighting conditions and adapt to changes in the lighting condition, for example, due to a light being turned on or off.

AAAI Conference 2004 System Paper

Centibots: Very Large Scale Distributed Robotic Teams

  • Charlie Ortiz
  • Regis Vincent
  • Andrew Agno
  • Dieter Fox
  • Jonathan Ko

In this paper, we describe the development of Centibots, a framework for very large teams of robots that are able to perceive, explore, plan and collaborate in unknown environments. Teams consist of approximately 100 robots which can be deployed in unexplored areas and which can efficiently distribute tasks among themselves; the system also makes use of a mixed initiative mode of interaction in which a user can easily influence missions as necessary. In contrast to simulation-based systems which abstract away aspects of the environment for examining component technologies, our design reflects an integrated, end-to-end system.

ICRA Conference 2004 Conference Paper

Mapping and Localization with RFID Technology

  • Dirk Hähnel
  • Wolfram Burgard
  • Dieter Fox
  • Kenneth P. Fishkin
  • Matthai Philipose

We analyze whether radio frequency identification (RFID) technology can be used to improve the localization of mobile robots and persons in their environment. In particular we study the problem of localizing RFID tags with a mobile platform that is equipped with a pair of RFID antennas. We present a probabilistic measurement model for RFID readers that allows us to accurately localize RFID tags in the environment. We also demonstrate how such maps can be used to localize a robot and persons in their environment. Finally, we present experiments illustrating that the computational requirements for global robot localization can be substantially reduced by fusing RFID information with laser data.

IROS Conference 2004 Conference Paper

Reinforcement learning for sensing strategies

  • Cody C. T. Kwok
  • Dieter Fox

Since sensors have limited range and coverage, mobile robots often have to make decisions on where to point their sensors. A good sensing strategy allows a robot to collect information that is useful for its tasks. Most existing solutions to this active sensing problem choose the direction that maximally reduces the uncertainty in a single state variable. In more complex problem domains, however, uncertainties exist in multiple state variables, and they affect the performance of the robot in different ways. The robot thus needs to have more sophisticated sensing strategies in order to decide which uncertainties to reduce, and to make the correct trade-offs. In this work, we apply a least squares reinforcement learning method to solve this problem. We implemented and tested the learning approach in the RoboCup domain, where the robot attempts to reach a ball and accurately kick it into the goal. We present experimental results that suggest our approach is able to learn highly effective sensing strategies.

IROS Conference 2003 Conference Paper

A practical, decision-theoretic approach to multi-robot mapping and exploration

  • Jonathan Ko
  • Benjamin Stewart
  • Dieter Fox
  • Kurt Konolige
  • Benson Limketkai

An important assumption underlying virtually all approaches to multi-robot exploration is prior knowledge about their relative locations. This is due to the fact that robots need to merge their maps so as to coordinate their exploration strategies. The key step in map merging is to estimate the relative locations of the individual robots. This paper presents a novel approach to multi-robot map merging under global uncertainty about the robots' relative locations. Our approach uses an adapted version of particle filters to estimate the position of one robot in the other robot's partial map. The risk of false-positive map matches is avoided by verifying match hypotheses using a rendezvous approach. We show how to seamlessly integrate this approach into a decision-theoretic multi-robot coordination strategy. The experiments show that our sample-based technique can reliably find good hypotheses for map matches. Furthermore, we present results obtained with two robots successfully merging their maps using the decision-theoretic rendezvous strategy.

ICRA Conference 2003 Conference Paper

Adaptive real-time particle filters for robot localization

  • Cody C. T. Kwok
  • Dieter Fox
  • Marina Meila

Particle filters have recently been applied with great success to mobile robot localization. This success is mostly due to their simplicity and their ability to represent arbitrary, multi-modal densities over a robot's state space. The increased representational power, however, comes at the cost of higher computational complexity. In this paper we introduce adaptive real-time particle filters that greatly increase the performance of particle filters under limited computational resources. Our approach improves the efficiency of state estimation by adapting the size of sample sets on-the-fly. Furthermore, even when large sample sets are needed to represent a robot's uncertainty, the approach takes every sensor measurement into account, thereby avoiding the risk of losing valuable sensor information during the update of the filter. We demonstrate empirically that this new algorithm drastically improves the performance of particle filters for robot localization.

IROS Conference 2003 Conference Paper

An efficient fastSLAM algorithm for generating maps of large-scale cyclic environments from raw laser range measurements

  • Dirk Hähnel
  • Wolfram Burgard
  • Dieter Fox
  • Sebastian Thrun

The ability to learn a consistent model of its environment is a prerequisite for autonomous mobile robots. A particularly challenging problem in acquiring environment maps is that of closing loops; loops in the environment create challenging data association problems [J.-S. Gutmann et al., 1999]. This paper presents a novel algorithm that combines Rao-Blackwellized particle filtering and scan matching. In our approach, scan matching is used for minimizing odometric errors during mapping. A probabilistic model of the residual errors of the scan matching process is then used for the resampling steps. This way, the number of samples required is substantially reduced. Simultaneously, we reduce the particle depletion problem that typically prevents the robot from closing large loops. We present extensive experiments that illustrate the superior performance of our approach compared to previous approaches.

IROS Conference 2003 Conference Paper

Map merging for distributed robot navigation

  • Kurt Konolige
  • Dieter Fox
  • Benson Limketkai
  • Jonathan Ko
  • Benjamin Stewart

A set of robots mapping an area can potentially combine their information to produce a distributed map more efficiently than a single robot alone. We describe a general framework for distributed map building in the presence of uncertain communication. Within this framework, we then present a technical solution to the key decision problem of determining relative location within partial maps.

IJCAI Conference 2003 Conference Paper

People Tracking with Anonymous and ID-Sensors Using Rao-Blackwellised Particle Filters

  • Dirk Schulz
  • Dieter Fox
  • Jeffrey Hightower

Estimating the location of people using a network of sensors placed throughout an environment is a fundamental challenge in smart environments and ubiquitous computing. ID-sensors such as infrared badges provide explicit object identity information but coarse location information, while anonymous sensors such as laser range-finders provide accurate location information only. Tracking using both sensor types simultaneously is an open research challenge. We present a novel approach to tracking multiple objects that combines the accuracy benefits of anonymous sensors and the identification certainty of ID-sensors. Rao-Blackwellised particle filters are used to estimate object locations. Each particle represents the association history between Kalman filtered object tracks and observations. After using only anonymous sensors until ID estimates are certain enough, ID assignments are sampled as well, resulting in a fully Rao-Blackwellised particle filter over both object tracks and ID assignments. Our approach was implemented and tested successfully using data collected in an indoor environment.

IROS Conference 2003 Conference Paper

Voronoi tracking: location estimation using sparse and noisy sensor data

  • Lin Liao
  • Dieter Fox
  • Jeffrey Hightower
  • Henry A. Kautz
  • Dirk Schulz 0001

Tracking the activity of people in indoor environments has gained considerable attention in the robotics community in recent years. Most of the existing approaches are based on sensors which can accurately determine the locations of people but provide no means of distinguishing between different persons. In this paper we propose a novel approach to tracking moving objects and their identity using noisy, sparse information collected by ID-sensors such as infrared and ultrasound badge systems. The key idea of our approach is to use particle filters to estimate the locations of people on the Voronoi graph of the environment. By restricting particles to a graph, we make use of the inherent structure of indoor environments. The approach has two key advantages. First, it is far more efficient and robust than unconstrained particle filters. Second, the Voronoi graph provides a natural discretization of human motion, which allows us to apply unsupervised learning techniques to derive typical motion patterns of the people in the environment. Experiments using a robot to collect ground-truth data indicate the superior performance of Voronoi tracking. Furthermore, we demonstrate that EM-based learning of behavior patterns increases the tracking performance and provides valuable information for high-level behavior recognition.

IROS Conference 2002 Conference Paper

An experimental comparison of localization methods continued

  • Jens-Steffen Gutmann
  • Dieter Fox

Localization is one of the fundamental problems in mobile robot navigation. Past experiments have shown that, in general, grid-based Markov localization is more robust than Kalman filtering, while the latter can be more accurate than the former. Recently, new methods for localization employing particle filters have become popular. In this paper, we compare different localization methods using Kalman filtering, grid-based Markov localization, Monte Carlo Localization (MCL), and combinations thereof. We give experimental evidence that a combination of Markov localization and Kalman filtering, as well as a variant of MCL, outperform the other methods in terms of accuracy, robustness, and time needed for recovering from manual robot displacement, while requiring only modest computational resources.

NeurIPS Conference 2002 Conference Paper

Real-Time Particle Filters

  • Cody Kwok
  • Dieter Fox
  • Marina Meila

Particle filters estimate the state of dynamical systems from sensor information. In many real-time applications of particle filters, however, sensor information arrives at a significantly higher rate than the update rate of the filter. The prevalent approach to dealing with such situations is to update the particle filter as often as possible and to discard sensor information that cannot be processed in time. In this paper we present real-time particle filters, which make use of all sensor information even when the filter update rate is below the update rate of the sensors. This is achieved by representing posteriors as mixtures of sample sets, where each mixture component integrates one observation arriving during a filter update. The weights of the mixture components are set so as to minimize the approximation error introduced by the mixture representation. Thereby, our approach focuses computational resources (samples) on valuable sensor information. Experiments using data collected with a mobile robot show that our approach yields strong improvements over other approaches.

NeurIPS Conference 2001 Conference Paper

KLD-Sampling: Adaptive Particle Filters

  • Dieter Fox

Over the last years, particle filters have been applied with great success to a variety of state estimation problems. We present a statistical approach to increasing the efficiency of particle filters by adapting the size of sample sets on-the-fly. The key idea of the KLD-sampling method is to bound the approximation error introduced by the sample-based representation of the particle filter. The name KLD-sampling is due to the fact that we measure the approximation error by the Kullback-Leibler distance. Our adaptation approach chooses a small number of samples if the density is focused on a small part of the state space, and it chooses a large number of samples if the state uncertainty is high. Both the implementation and computational overhead of this approach are small. Extensive experiments using mobile robot localization as a test application show that our approach yields drastic improvements over particle filters with fixed sample set sizes and over a previously introduced adaptation technique.
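The KLD bound described above has a commonly cited closed form, n ≈ ((k−1)/2ε)·(1 − 2/(9(k−1)) + sqrt(2/(9(k−1)))·z₁₋δ)³, where k is the number of histogram bins with support. A sketch assuming that form (the function name and default thresholds are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def kld_sample_size(k, epsilon=0.05, delta=0.01):
    """Number of particles needed so that, with probability 1 - delta, the
    KL divergence between the sample-based belief and the true posterior
    (discretized into k occupied histogram bins) stays below epsilon.
    Uses the Wilson-Hilferty approximation of the chi-square quantile."""
    if k <= 1:
        return 1
    z = NormalDist().inv_cdf(1.0 - delta)   # upper standard-normal quantile
    a = 2.0 / (9.0 * (k - 1))
    return int((k - 1) / (2.0 * epsilon) * (1.0 - a + sqrt(a) * z) ** 3) + 1
```

With these thresholds a spread-out belief occupying 100 bins needs roughly 1,350 particles, while a focused belief occupying 2 bins needs well under a hundred, which is exactly the adaptive behavior the abstract describes.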

AIJ Journal 2001 Journal Article

Robust Monte Carlo localization for mobile robots

  • Sebastian Thrun
  • Dieter Fox
  • Wolfram Burgard
  • Frank Dellaert

Mobile robot localization is the problem of determining a robot's pose from sensor data. This article presents a family of probabilistic localization algorithms known as Monte Carlo Localization (MCL). MCL algorithms represent a robot's belief by a set of weighted hypotheses (samples), which approximate the posterior under a common Bayesian formulation of the localization problem. Building on the basic MCL algorithm, this article develops a more robust algorithm called Mixture-MCL, which integrates two complementary ways of generating samples in the estimation. To apply this algorithm to mobile robots equipped with range finders, a kernel density tree is learned that permits fast sampling. Systematic empirical results illustrate the robustness and computational efficiency of the approach.

ICRA Conference 2001 Conference Paper

Tracking Multiple Moving Targets with a Mobile Robot using Particle Filters and Statistical Data Association

  • Dirk Schulz 0001
  • Wolfram Burgard
  • Dieter Fox
  • Armin B. Cremers

One of the goals in the field of mobile robotics is the development of mobile platforms which operate in populated environments and offer various services to humans. For many tasks it is highly desirable that a robot can determine the positions of the humans in its surrounding. In this paper we present a method for tracking multiple moving objects with a mobile robot. We introduce a sample-based variant of joint probabilistic data association filters to track features originating from individual objects and to solve the correspondence problem between the detected features and the filters. In contrast to standard methods, occlusions are handled explicitly during data association. The technique has been implemented and tested on a real robot. Experiments carried out in a typical office environment show that the method is able to track multiple persons even when the trajectories of two people are crossing each other.

ICRA Conference 2000 Conference Paper

A Real-Time Algorithm for Mobile Robot Mapping With Applications to Multi-Robot and 3D Mapping

  • Sebastian Thrun
  • Wolfram Burgard
  • Dieter Fox

We present an incremental method for concurrent mapping and localization for mobile robots equipped with 2D laser range finders. The approach uses a fast implementation of scan-matching for mapping, paired with a sample-based probabilistic method for localization. Compact 3D maps are generated using a multi-resolution approach adopted from the computer graphics literature, fed by data from a dual laser system. Our approach builds 3D maps of large, cyclic environments in real-time, and it is robust. Experimental results illustrate that accurate maps of large, cyclic environments can be generated even in the absence of any odometric data.

ICRA Conference 2000 Conference Paper

Collaborative Multi-Robot Exploration

  • Wolfram Burgard
  • Mark Moors
  • Dieter Fox
  • Reid G. Simmons
  • Sebastian Thrun

In this paper we consider the problem of exploring an unknown environment by a team of robots. As in single-robot exploration the goal is to minimize the overall exploration time. The key problem to be solved therefore is to choose appropriate target points for the individual robots so that they simultaneously explore different regions of their environment. We present a probabilistic approach for the coordination of multiple robots which, in contrast to previous approaches, simultaneously takes into account the costs of reaching a target point and the utility of target points. The utility of target points is given by the size of the unexplored area that a robot can cover with its sensors upon reaching a target position. Whenever a target point is assigned to a specific robot, the utility of the unexplored area visible from this target position is reduced for the other robots. This way, a team of multiple robots assigns different target points to the individual robots. The technique has been implemented and tested extensively in real-world experiments and simulation runs. The results given in this paper demonstrate that our coordination technique significantly reduces the exploration time compared to previous approaches.
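The cost-utility coordination idea above can be illustrated with a toy 1-D sketch; everything here (names, distance-based cost, the scalar discount) is an illustrative simplification, not the paper's actual algorithm. Each round, the best (robot, target) pair by utility minus travel cost is assigned, and the utility of targets within sensor range of the chosen target is discounted so remaining robots spread toward unexplored regions.

```python
def assign_targets(robots, positions, utility, sensor_range=2.0, discount=0.2):
    """Greedy cost-utility assignment: repeatedly pick the (robot, target)
    pair with the highest utility - cost score, then discount the utility
    of every target within sensor range of the chosen target."""
    utility = dict(utility)                    # don't mutate the caller's dict
    cost = lambda r, t: abs(positions[r] - t)  # 1-D travel cost stand-in
    unassigned = list(robots)
    assignment = {}
    while unassigned:
        r, t = max(((r, t) for r in unassigned for t in utility),
                   key=lambda p: utility[p[1]] - cost(*p))
        assignment[r] = t
        unassigned.remove(r)
        for other in utility:                  # targets "visible" from t
            if abs(other - t) <= sensor_range:
                utility[other] *= discount
    return assignment

# two robots at the origin, two frontier targets at 1 and 6
plan = assign_targets(["A", "B"], {"A": 0.0, "B": 0.0}, {1: 10.0, 6: 10.0})
```

Here the discount sends the second robot to the far target at 6; with the discount disabled (`discount=1.0`) both robots would be assigned the closer target at 1, which is the uncoordinated behavior the paper's utility reduction is designed to avoid.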

IROS Conference 2000 Conference Paper

Coordinated deployment of multiple, heterogeneous robots

  • Reid G. Simmons
  • David Apfelbaum
  • Dieter Fox
  • Robert P. Goldman
  • Karen Zita Haigh
  • David J. Musliner
  • Michael J. S. Pelican
  • Sebastian Thrun

To be truly useful, mobile robots need to be fairly autonomous and easy to control. This is especially true in situations where multiple robots are used, due to the increase in sensory information and the fact that the robots can interfere with one another. The paper describes a system that integrates autonomous navigation, a task executive, task planning, and an intuitive graphical user interface to control multiple, heterogeneous robots. We have demonstrated a prototype system that plans and coordinates the deployment of teams of robots. Testing has shown the effectiveness and robustness of the system, and of the coordination strategies in particular.

ICRA Conference 1999 Conference Paper

Coastal Navigation: Mobile Robot Navigation with Uncertainty in Dynamic Environments

  • Nicholas Roy
  • Wolfram Burgard
  • Dieter Fox
  • Sebastian Thrun

Ships often use the coasts of continents for navigation in the absence of better tools such as GPS, since being close to land allows sailors to determine with high accuracy where they are. Similarly for mobile robots, in many environments global and accurate localization is not always feasible. Environments can lack features, and dynamic obstacles such as people can confuse and block sensors. We demonstrate a technique for generating trajectories that take into account both the information content of the environment, and the density of the people in the environment. These trajectories reduce the average positional uncertainty as the robot moves, reducing the likelihood the robot will become lost at any point. Our method was successfully implemented and used by the mobile robot Minerva, a museum tourguide robot, for a 2 week period in the Smithsonian National Museum of American History.

AIJ Journal 1999 Journal Article

Experiences with an interactive museum tour-guide robot

  • Wolfram Burgard
  • Armin B. Cremers
  • Dieter Fox
  • Dirk Hähnel
  • Gerhard Lakemeyer
  • Dirk Schulz
  • Walter Steiner
  • Sebastian Thrun

This article describes the software architecture of an autonomous, interactive tour-guide robot. It presents a modular and distributed software architecture, which integrates localization, mapping, collision avoidance, planning, and various modules concerned with user interaction and Web-based telepresence. At its heart, the software approach relies on probabilistic computation, on-line learning, and any-time algorithms. It enables robots to operate safely, reliably, and at high speeds in highly dynamic environments, and does not require any modifications of the environment to aid the robot's operation. Special emphasis is placed on the design of interactive capabilities that appeal to people's intuition. The interface provides new means for human-robot interaction with crowds of people in public places, and it also provides people all around the world with the ability to establish a “virtual telepresence” using the Web. To illustrate our approach, results are reported that were obtained in mid-1997, when our robot “RHINO” was deployed for a period of six days in a densely populated museum. The empirical results demonstrate reliable operation in public environments. The robot successfully raised the museum's attendance by more than 50%. In addition, thousands of people all over the world controlled the robot through the Web. We conjecture that these innovations carry over to a much larger range of application domains for service robots.

ICRA Conference 1999 Conference Paper

MINERVA: A Second-Generation Museum Tour-Guide Robot

  • Sebastian Thrun
  • Maren Bennewitz
  • Wolfram Burgard
  • Armin B. Cremers
  • Frank Dellaert
  • Dieter Fox
  • Dirk Hähnel
  • Charles R. Rosenberg

This paper describes an interactive tour-guide robot, which was successfully exhibited in a Smithsonian museum. During its two weeks of operation, the robot interacted with thousands of people, traversing more than 44 km at speeds of up to 163 cm/sec. Our approach specifically addresses issues such as safe navigation in unmodified and dynamic environments, and short-term human-robot interaction. It uses learning pervasively at all levels of the software architecture.

ICRA Conference 1999 Conference Paper

Monte Carlo Localization for Mobile Robots

  • Frank Dellaert
  • Dieter Fox
  • Wolfram Burgard
  • Sebastian Thrun

To navigate reliably in indoor environments, a mobile robot must know where it is. Thus, reliable position estimation is a key problem in mobile robotics. We believe that probabilistic approaches are among the most promising candidates for providing a comprehensive and real-time solution to the robot localization problem. However, current methods still face considerable hurdles. In particular, the problems encountered are closely related to the type of representation used to represent probability densities over the robot's state space. Earlier work on Bayesian filtering with particle-based density representations opened up a new approach for mobile robot localization based on these principles. We introduce the Monte Carlo localization method, where we represent the probability density involved by maintaining a set of samples that are randomly drawn from it. By using a sampling-based representation we obtain a localization method that can represent arbitrary distributions. We show experimentally that the resulting method is able to efficiently localize a mobile robot without knowledge of its starting location. It is faster, more accurate and less memory-intensive than earlier grid-based methods.
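
The sampling-based update the abstract describes can be sketched as a single prediction/correction/resampling step. This is a toy 1-D illustration under stated assumptions, not the paper's implementation; the function names, models, and noise parameters are all invented for the example:

```python
import random

def monte_carlo_localization(particles, control, measurement,
                             motion_model, sensor_model):
    """One step of sample-based localization.

    particles    : list of pose hypotheses (any representation the models accept)
    motion_model : samples a successor pose given a pose and a control
    sensor_model : returns the likelihood of the measurement at a pose
    """
    # Prediction: propagate every sample through the motion model.
    predicted = [motion_model(p, control) for p in particles]

    # Correction: weight each sample by the measurement likelihood.
    weights = [sensor_model(measurement, p) for p in predicted]
    total = sum(weights)
    if total == 0.0:  # measurement inconsistent with every sample: keep prior
        return predicted
    weights = [w / total for w in weights]

    # Resampling: draw the next sample set in proportion to the weights.
    return random.choices(predicted, weights=weights, k=len(particles))
```

With a peaked `sensor_model`, repeated updates concentrate the sample set around the true pose, which is what lets a sample-based filter localize from an unknown starting location.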

AAAI Conference 1999 Conference Paper

Monte Carlo Localization: Efficient Position Estimation for Mobile Robots

  • Dieter Fox
  • Wolfram Burgard
  • Frank Dellaert
  • Sebastian Thrun

This paper presents a new algorithm for mobile robot localization, called Monte Carlo Localization (MCL). MCL is a version of Markov localization, a family of probabilistic approaches that have recently been applied with great practical success. However, previous approaches were either computationally cumbersome (such as grid-based approaches that represent the state space by high-resolution 3D grids), or had to resort to extremely coarse-grained resolutions. Our approach is computationally efficient while retaining the ability to represent (almost) arbitrary distributions. MCL applies sampling-based methods for approximating probability distributions, in a way that places computation “where needed.” The number of samples is adapted on-line, thereby invoking large sample sets only when necessary. Empirical results illustrate that MCL yields improved accuracy while requiring an order of magnitude less computation when compared to previous approaches. It is also much easier to implement.
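
One way to picture the on-line adaptation of the sample-set size: keep drawing samples until enough likelihood mass has accumulated, so well-localized robots get small sets and uncertain ones get large sets. This is a hedged sketch only; the threshold, bounds, and names below are illustrative, not the paper's actual scheme:

```python
import random

def adaptive_resample(particles, weights, mass_threshold=50.0,
                      n_min=100, n_max=5000):
    """Draw new samples until enough likelihood mass has been collected.

    When the sensor readings fit well (high weights), few samples suffice;
    under high uncertainty the set grows toward n_max.
    """
    new, mass = [], 0.0
    while (len(new) < n_min or mass < mass_threshold) and len(new) < n_max:
        i = random.choices(range(len(particles)), weights=weights, k=1)[0]
        new.append(particles[i])
        mass += weights[i]
    return new
```

The design intuition is the one the abstract gives: computation is spent "where needed", with large sample sets invoked only when the accumulated evidence is weak.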

ICRA Conference 1998 Conference Paper

A Hybrid Collision Avoidance Method for Mobile Robots

  • Dieter Fox
  • Wolfram Burgard
  • Sebastian Thrun
  • Armin B. Cremers

Proposes a hybrid approach to the problem of collision avoidance for indoor mobile robots. The µDWA (model-based dynamic window approach) integrates sensor data from various sensors with information extracted from a map of the environment, to generate collision-free motion. A novel integration rule ensures that with high likelihood, the robot avoids collisions with obstacles not detectable with its sensors, even if it is uncertain about its position. The approach was implemented and tested extensively as part of an installation, in which a mobile robot gave interactive tours to visitors of the "Deutsches Museum Bonn." Here our approach was essential for the success of the entire mission, because a large number of ill-shaped obstacles prohibited the use of purely sensor-based methods for collision avoidance.

IROS Conference 1998 Conference Paper

An experimental comparison of localization methods

  • Jens-Steffen Gutmann
  • Wolfram Burgard
  • Dieter Fox
  • Kurt Konolige

Localization is the process of updating the pose of a robot in an environment, based on sensor readings. In this experimental study, we compare two methods for localization of indoor mobile robots: Markov localization, which uses a probability distribution across a grid of robot poses; and scan matching, which uses Kalman filtering techniques based on matching sensor scans. Both these techniques are dense matching methods, that is, they match dense sets of environment features to an a priori map. To arrive at results for a range of situations, we utilize several different types of environments, and add noise to both the dead-reckoning and the sensors. Analysis shows that, roughly, the scan-matching techniques are more efficient and accurate, but Markov localization is better able to cope with large amounts of noise. These results suggest hybrid methods that are efficient, accurate and robust to noise.

IROS Conference 1998 Conference Paper

Integrating global position estimation and position tracking for mobile robots: the dynamic Markov localization approach

  • Wolfram Burgard
  • Andreas Derr
  • Dieter Fox
  • Armin B. Cremers

Localization is one of the fundamental problems of mobile robots. In order to efficiently perform useful tasks such as office delivery, mobile robots must know their position in their environment. Existing approaches can be distinguished according to the type of localization problem they are designed to solve. Tracking techniques aim at monitoring the robot's position. They assume that the position is initially known and cannot recover from situations in which they lost track of the robot's position. Global localization techniques, on the other hand, are able to estimate the robot's position under complete uncertainty. We present the dynamic Markov localization technique as a uniform approach to position estimation, which is able (1) to globally estimate the position of the robot, (2) to efficiently track its position whenever the robot's certainty is high, and (3) to detect and recover from localization failures. The approach has been implemented and intensively tested in real-world environments. We present several experiments illustrating the strength of our method.

AAAI Conference 1998 Conference Paper

Integrating Topological and Metric Maps for Mobile Robot Navigation: A Statistical Approach

  • Sebastian Thrun
  • Dieter Fox

The problem of concurrent mapping and localization has received considerable attention in the mobile robotics community. Existing approaches can largely be grouped into two distinct paradigms: topological and metric. This paper proposes a method that integrates both. It poses the mapping problem as a statistical maximum likelihood problem, and devises an efficient algorithm for search in likelihood space. It presents a novel mapping algorithm that integrates two phases: a topological and a metric mapping phase. The topological mapping phase solves a global position alignment problem between potentially indistinguishable, significant places. The subsequent metric mapping phase produces a fine-grained metric map of the environment in floating-point resolution. The approach is demonstrated empirically to scale up to large, cyclic, and highly ambiguous environments.

AAAI Conference 1998 Conference Paper

Position Estimation for Mobile Robots in Dynamic Environments

  • Dieter Fox
  • Sebastian Thrun

For mobile robots to be successful, they have to navigate safely in populated and dynamic environments. While recent research has led to a variety of localization methods that can track robots well in static environments, we still lack methods that can robustly localize mobile robots in dynamic environments, in which people block the robot’s sensors for extensive periods of time or the position of furniture may change. This paper proposes extensions to Markov localization algorithms enabling them to localize mobile robots even in densely populated environments. Two different filters for determining the “believability” of sensor readings are employed. These filters are designed to detect sensor readings that are corrupted by humans or unexpected changes in the environment. The technique was recently implemented and applied as part of an installation, in which a mobile robot gave interactive tours to visitors of the “Deutsches Museum Bonn.” Extensive empirical tests involving datasets recorded during peak traffic hours in the museum demonstrate that this approach is able to accurately estimate the robot’s position in more than 98% of the cases even in such highly dynamic environments.
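
The "believability" idea can be illustrated with a toy distance filter that drops range readings far shorter than the map predicts, on the assumption that such readings are caused by people standing in front of the sensor. This is illustrative only; the `threshold` parameter and the comparison rule are assumptions, not the paper's actual filters:

```python
def distance_filter(readings, expected, threshold=0.9):
    """Keep only range readings consistent with the map-predicted distance.

    readings : measured ranges from the sensor
    expected : ranges the map predicts at the current pose estimate
    A reading much shorter than expected is likely blocked by a person
    and is discarded before the localization update.
    """
    return [z for z, e in zip(readings, expected) if z >= threshold * e]
```

Only the surviving readings would then be fed into the Markov localization update, so transient obstructions do not corrupt the position belief.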

ICRA Conference 1998 Conference Paper

Probabilistic Mapping of an Environment by a Mobile Robot

  • Sebastian Thrun
  • Dieter Fox
  • Wolfram Burgard

This paper addresses the problem of building large-scale maps of indoor environments with mobile robots. It proposes a statistical approach that describes the map building problem as a constrained maximum-likelihood estimation problem, for which it devises a practical algorithm. Experimental results in large, cyclic environments illustrate the appropriateness of the approach.

AAAI Conference 1998 Conference Paper

The Interactive Museum Tour-Guide Robot

  • Wolfram Burgard
  • Dieter Fox
  • Gerhard Lakemeyer
  • Walter Steiner

This paper describes the software architecture of an autonomous tour-guide/tutor robot. This robot was recently deployed in the “Deutsches Museum Bonn,” where it guided hundreds of visitors through the museum during a six-day deployment period. The robot’s control software integrates low-level probabilistic reasoning with high-level problem solving embedded in first order logic. A collection of software innovations, described in this paper, enabled the robot to navigate at high speeds through dense crowds, while reliably avoiding collisions with obstacles—some of which could not even be perceived. Also described in this paper is a user interface tailored towards non-expert users, which was essential for the robot’s success in the museum. Based on these experiences, this paper argues that the time is ripe for the development of AI-based commercial service robots that assist people in everyday life.

IJCAI Conference 1997 Conference Paper

Active Mobile Robot Localization

  • Wolfram Burgard
  • Dieter Fox
  • Sebastian Thrun

Localization is the problem of determining the position of a mobile robot from sensor data. Most existing localization approaches are passive, i.e., they do not exploit the opportunity to control the robot's effectors during localization. This paper proposes an active localization approach. The approach provides rational criteria for (1) setting the robot's motion direction (exploration), and (2) determining the pointing direction of the sensors so as to most efficiently localize the robot. Furthermore, it is able to deal with noisy sensors and approximate world models. The appropriateness of our approach is demonstrated empirically using a mobile robot in a structured office environment.

IROS Conference 1996 Conference Paper

Controlling synchro-drive robots with the dynamic window approach to collision avoidance

  • Dieter Fox
  • Wolfram Burgard
  • Sebastian Thrun

This paper proposes the dynamic window approach to reactive collision avoidance for mobile robots equipped with synchro-drives. The approach is derived directly from the motion dynamics of the robot and is therefore particularly well-suited for robots operating at high speed. It differs from previous approaches in that the search for commands controlling the translational and rotational velocity of the robot is carried out directly in the space of velocities. The advantage of our approach is that it correctly and in a rigorous way incorporates the dynamics of the robot. This is done by reducing the search space to the dynamic window, which consists of the velocities reachable within a short time interval. Within the dynamic window the approach only considers admissible velocities yielding a trajectory on which the robot is able to stop safely. Among these velocities the combination of translational and rotational velocity is chosen by maximizing an objective function. The objective function includes a measure of progress towards a goal location, the forward velocity of the robot, and the distance to the next obstacle on the trajectory. In extensive experiments the approach presented here has been found to safely control our mobile robot RHINO with speeds of up to 95 cm/sec, in populated and dynamic environments.
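
The velocity-space search the abstract describes can be sketched as a coarse grid search over the dynamic window. This is a toy sketch under stated assumptions: the acceleration limits, grid resolution, objective weights, and the `obstacles_dist`/`goal_heading` callbacks are all invented for the example, not taken from the paper:

```python
import math

def dynamic_window_search(v, w, dt, obstacles_dist, goal_heading,
                          v_max=0.95, a_max=0.5, w_max=1.0, aw_max=1.0,
                          alpha=0.8, beta=0.1, gamma=0.1):
    """Pick (v', w') from the dynamic window around the current velocities.

    v, w           : current translational / rotational velocity
    obstacles_dist : (v', w') -> distance to the nearest obstacle on the
                     circular arc driven with (v', w')
    goal_heading   : (v', w') -> |angle to the goal| after one step
    """
    best, best_score = (0.0, 0.0), -math.inf
    # Dynamic window: velocities reachable within one time step dt.
    v_lo, v_hi = max(0.0, v - a_max * dt), min(v_max, v + a_max * dt)
    w_lo, w_hi = max(-w_max, w - aw_max * dt), min(w_max, w + aw_max * dt)
    for i in range(11):
        for j in range(11):
            vc = v_lo + (v_hi - v_lo) * i / 10
            wc = w_lo + (w_hi - w_lo) * j / 10
            d = obstacles_dist(vc, wc)
            # Admissibility: the robot must be able to stop before d.
            if vc * vc / (2 * a_max) > d:
                continue
            score = (alpha * (math.pi - goal_heading(vc, wc))  # progress
                     + beta * vc                               # speed
                     + gamma * d)                              # clearance
            if score > best_score:
                best, best_score = (vc, wc), score
    return best
```

The admissibility test is the key restriction the abstract mentions: only velocities on whose trajectory the robot can still stop safely are scored at all.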