Author name cluster

Stefanie Tellex

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

43 papers

2 author rows

ICRA Conference 2025 Conference Paper

Verifiably Following Complex Robot Instructions with Foundation Models

Benedict Quartey
Eric Rosen
Stefanie Tellex
George Konidaris 0001

When instructing robots, users want to flexibly express constraints, refer to arbitrary landmarks, and verify robot behavior, while robots must disambiguate instructions into specifications and ground instruction referents in the real world. To address this problem, we propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives and affords the synthesis of correct-by-construction robot behaviors. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment in diverse, unstructured domains. LIMP performs comparably to state-of-the-art baselines on standard open-vocabulary tasks and additionally achieves a 79% success rate on complex spatiotemporal instructions, significantly outperforming baselines that only reach 38%. See supplementary materials and demo videos at robotlimp.github.io

IROS Conference 2025 Conference Paper

λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics

Ahmed Jaafar
Shreyas Sundara Raman
Sudarshan S. Harithas
Yichen Wei
Sofia Juliani
Anneke Wernerfelt
Benedict Quartey
Ifrah Idrees

Learning to execute long-horizon mobile manipulation tasks is crucial for advancing robotics in household and workplace settings. However, current approaches are typically data-inefficient, underscoring the need for improved models that require realistically sized benchmarks to evaluate their efficiency. To address this, we introduce the LAMBDA (λ) benchmark, Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities, which evaluates the data efficiency of models on language-conditioned, long-horizon, multi-room, multi-floor, pick-and-place tasks using a dataset of manageable size, more feasible for collection. Our benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings. Unlike planner-generated data, these trajectories offer natural variability and replay-verifiability, ensuring robust learning and evaluation. We leverage λ to benchmark current end-to-end learning methods and a modular neuro-symbolic approach that combines foundation models with task and motion planning. We find that learning methods, even when pretrained, yield lower success rates, while a neuro-symbolic method performs significantly better and requires less data.

IJCAI Conference 2024 Conference Paper

A Survey of Robotic Language Grounding: Tradeoffs between Symbols and Embeddings

Vanya Cohen
Jason Xinyu Liu
Raymond Mooney
Stefanie Tellex
David Watkins

With large language models, robots can understand language more flexibly and more capably than ever before. This survey reviews and situates recent literature into a spectrum with two poles: 1) mapping between language and some manually defined formal representation of meaning, and 2) mapping between language and high-dimensional vector spaces that translate directly to low-level robot policy. Using a formal representation allows the meaning of the language to be precisely represented, limits the size of the learning problem, and leads to a framework for interpretability and formal safety guarantees. Methods that embed language and perceptual data into high-dimensional spaces avoid this manually specified symbolic structure and thus have the potential to be more general when fed enough data but require more data and computing to train. We discuss the benefits and tradeoffs of each approach and finish by providing directions for future work that achieves the best of both worlds.

ICRA Conference 2024 Conference Paper

CAPE: Corrective Actions from Precondition Errors using Large Language Models

Shreyas Sundara Raman
Vanya Cohen
Ifrah Idrees
Eric Rosen
Raymond Mooney
Stefanie Tellex
David Paulius

Extracting knowledge and reasoning from large language models (LLMs) offers a path to designing intelligent robots. Common approaches that leverage LLMs for planning are unable to recover when actions fail and resort to retrying failed actions without resolving the underlying cause. We propose a novel approach (CAPE) that generates corrective actions to resolve precondition errors during planning. CAPE improves the quality of generated plans through few-shot reasoning on action preconditions. Our approach enables embodied agents to execute more tasks than baseline methods while maintaining semantic correctness and minimizing re-prompting. In VirtualHome, CAPE improves a human-annotated plan correctness metric from 28.89% to 49.63% over SayCan, whilst achieving competitive executability. Our improvements transfer to a Boston Dynamics Spot robot initialized with a set of skills (specified in language) and associated preconditions, where CAPE improves correctness by 76.49% with higher executability compared to SayCan. Our approach enables embodied agents to follow natural language commands and robustly recover from failures.

IROS Conference 2024 Conference Paper

Lang2LTL-2: Grounding Spatiotemporal Navigation Commands Using Large Language and Vision-Language Models

Jason Xinyu Liu
Ankit Shah
George Konidaris 0001
Stefanie Tellex
David Paulius

Grounding spatiotemporal navigation commands to structured task specifications enables autonomous robots to understand a broad range of natural language and solve long-horizon tasks with safety guarantees. Prior works mostly focus on grounding spatial or temporally extended language for robots. We propose Lang2LTL-2, a modular system that leverages pretrained large language and vision-language models and multimodal semantic information to ground spatiotemporal navigation commands in novel city-scaled environments without retraining. Lang2LTL-2 achieves 93.53% language grounding accuracy on a dataset of 21,780 semantically diverse natural language commands in unseen environments. We run an ablation study to validate the need for different modalities. We also show that a physical robot equipped with the same system without modification can execute 50 semantically diverse natural language commands in both indoor and outdoor environments.

ICRA Conference 2024 Conference Paper

Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents

Ziyi Yang
Shreyas Sundara Raman
Ankit Shah
Stefanie Tellex

Recent advancements in large language models (LLMs) have enabled a new research domain, LLM agents, for solving robotics and planning tasks by leveraging the world knowledge and general reasoning abilities of LLMs obtained during pretraining. However, while considerable effort has been made to teach the robot the "dos", the "don’ts" received relatively less attention. We argue that, for any practical usage, it is as crucial to teach the robot the "don’ts": conveying explicit instructions about prohibited actions, assessing the robot’s comprehension of these restrictions, and, most importantly, ensuring compliance. Moreover, verifiable safe operation is essential for deployments that satisfy worldwide standards such as ISO 61508, which defines standards for safely deploying robots in industrial factory environments worldwide. Aiming at deploying the LLM agents in a collaborative environment, we propose a queryable safety constraint module based on linear temporal logic (LTL) that simultaneously enables natural language (NL) to temporal constraints encoding, safety violation reasoning and explaining, and unsafe action pruning. To demonstrate the effectiveness of our system, we conducted experiments in VirtualHome environment and on a real robot. The experimental results show that our system strictly adheres to the safety constraints and scales well with complex safety constraints, highlighting its potential for practical utility.

ICRA Conference 2024 Conference Paper

Skill Transfer for Temporal Task Specification

Jason Xinyu Liu
Ankit Shah
Eric Rosen
Mingxi Jia
George Konidaris 0001
Stefanie Tellex

Deploying robots in real-world environments, such as households and manufacturing lines, requires generalization across novel task specifications without violating safety constraints. Linear temporal logic (LTL) is a widely used task specification language with a compositional grammar that naturally induces commonalities among tasks while preserving safety guarantees. However, most prior work on reinforcement learning with LTL specifications treats every new task independently, thus requiring large amounts of training data to generalize. We propose LTL-Transfer, a zero-shot transfer algorithm that composes task-agnostic skills learned during training to safely satisfy a wide variety of novel LTL task specifications. Experiments in Minecraft-inspired domains show that after training on only 50 tasks, LTL-Transfer can solve over 90% of 100 challenging unseen tasks and 100% of 300 commonly used novel tasks without violating any safety constraints. We deployed LTL-Transfer at the task-planning level of a quadruped mobile manipulator to demonstrate its zero-shot transfer ability for fetch-and-deliver and navigation tasks.

ICRA Conference 2023 Conference Paper

A System for Generalized 3D Multi-Object Search

Kaiyu Zheng
Anirudha Paul
Stefanie Tellex

Searching for objects is a fundamental skill for robots. As such, we expect object search to eventually become an off-the-shelf capability for robots, similar to e.g., object detection and SLAM. In contrast, however, no system for 3D object search exists that generalizes across real robots and environments. In this paper, building upon a recent theoretical framework that exploited the octree structure for representing belief in 3D, we present GenMOS (Generalized Multi-Object Search), the first general-purpose system for multi-object search (MOS) in a 3D region that is robot-independent and environment-agnostic. GenMOS takes as input point cloud observations of the local region, object detection results, and localization of the robot's view pose, and outputs a 6D viewpoint to move to through online planning. In particular, GenMOS uses point cloud observations in three ways: (1) to simulate occlusion; (2) to inform occupancy and initialize octree belief; and (3) to sample a belief-dependent graph of view positions that avoid obstacles. We evaluate our system both in simulation and on two real robot platforms. Our system enables, for example, a Boston Dynamics Spot robot to find a toy cat hidden underneath a couch in under one minute. We further integrate 3D local search with 2D global search to handle larger areas, demonstrating the resulting system in a 25 m² lobby area.

IROS Conference 2023 Conference Paper

Improved Inference of Human Intent by Combining Plan Recognition and Language Feedback

Ifrah Idrees
Tian Yun 0001
Naveen Sharma
Yunxin Deng
Nakul Gopalan
George Konidaris 0001
Stefanie Tellex

Conversational assistive robots can aid people, especially those with cognitive impairments, to accomplish various tasks such as cooking meals, performing exercises, or operating machines. However, to interact with people effectively, robots must recognize human plans and goals from noisy observations of human actions, even when the user acts sub-optimally. Previous works on Plan and Goal Recognition (PGR) as planning have used hierarchical task networks (HTN) to model the actor/human. However, these techniques are insufficient as they do not have user engagement via natural modes of interaction such as language. Moreover, they have no mechanisms to let users, especially those with cognitive impairments, know of a deviation from their original plan or about any sub-optimal actions taken towards their goal. We propose a novel framework for plan and goal recognition in partially observable domains—Dialogue for Goal Recognition (D4GR) enabling a robot to rectify its belief in human progress by asking clarification questions about noisy sensor data and sub-optimal human actions. We evaluate the performance of D4GR over two simulated domains—kitchen and blocks domain. With language feedback and the world state information in a hierarchical task model, we show that D4GR framework for the highest sensor noise performs 1% better than HTN in goal accuracy in both domains. For plan accuracy, D4GR outperforms by 4% in the kitchen domain and 2% in the blocks domain in comparison to HTN. The ALWAYS-ASK oracle outperforms our policy by 3% in goal recognition and 7% in plan recognition. D4GR does so by asking 68% fewer questions than an oracle baseline. We also demonstrate a real-world robot scenario in the kitchen domain, validating the improved plan and goal recognition of D4GR in a realistic setting.

IROS Conference 2023 Conference Paper

Language-Conditioned Observation Models for Visual Object Search

Thao Nguyen
Vladislav Hrosinkov
Eric Rosen
Stefanie Tellex

Object search is a challenging task because when given complex language descriptions (e.g., “find the white cup on the table”), the robot must move its camera through the environment and recognize the described object. Previous works map language descriptions to a set of fixed object detectors with predetermined noise models, but these approaches are challenging to scale because new detectors need to be made for each object. In this work, we bridge the gap in realistic object search by posing the search problem as a partially observable Markov decision process (POMDP) where the object detector and visual sensor noise in the observation model is determined by a single Deep Neural Network conditioned on complex language descriptions. We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise. With an LCOM, any language description of an object can be used to generate an appropriate object detector and noise model, and training an LCOM only requires readily available supervised image-caption datasets. We empirically evaluate our method by comparing against a state-of-the-art object search algorithm in simulation, and demonstrate that planning with our observation model yields a significantly higher average task completion rate (from 0.46 to 0.66) and more efficient and quicker object search than with a fixed-noise model. We demonstrate our method on a Boston Dynamics Spot robot, enabling it to handle complex natural language object descriptions and efficiently find objects in a room-scale environment.

ICML Conference 2023 Conference Paper

RLang: A Declarative Language for Describing Partial World Knowledge to Reinforcement Learning Agents

Rafael Rodríguez-Sánchez 0002
Benjamin A. Spiegel
Jennifer Wang
Roma Patel
Stefanie Tellex
George Konidaris 0001

We introduce RLang, a domain-specific language (DSL) for communicating domain knowledge to an RL agent. Unlike existing RL DSLs that ground to $\textit{single}$ elements of a decision-making formalism (e.g., the reward function or policy), RLang can specify information about every element of a Markov decision process. We define precise syntax and grounding semantics for RLang, and provide a parser that grounds RLang programs to an algorithm-agnostic $\textit{partial}$ world model and policy that can be exploited by an RL agent. We provide a series of example RLang programs demonstrating how different RL methods can exploit the resulting knowledge, encompassing model-free and model-based tabular algorithms, policy gradient and value-based methods, hierarchical approaches, and deep methods.

IROS Conference 2023 Conference Paper

Skill Generalization with Verbs

Rachel Ma
Lyndon Lam
Benjamin A. Spiegel
Aditya Ganeshan
Roma Patel
Ben Abbatematteo
David Paulius
Stefanie Tellex

It is imperative that robots can understand natural language commands issued by humans. Such commands typically contain verbs that signify what action should be performed on a given object and that are applicable to many objects. We propose a method for generalizing manipulation skills to novel objects using verbs. Our method learns a probabilistic classifier that determines whether a given object trajectory can be described by a specific verb. We show that this classifier accurately generalizes to novel object categories with an average accuracy of 76.69% across 13 object categories and 14 verbs. We then perform policy search over the object kinematics to find an object trajectory that maximizes classifier prediction for a given verb. Our method allows a robot to generate a trajectory for a novel object based on a verb, which can then be used as input to a motion planner. We show that our model can generate trajectories that are usable for executing five verb commands applied to novel instances of two different object categories on a real robot.

PRL Workshop 2023 Workshop Paper

Task Scoping: Generating Task-Specific Simplifications of Open-Scope Planning Problems

Michael Fishman
Nishanth Kumar
Cameron Allen
Natasha Danas
Michael Littman
Stefanie Tellex
George Konidaris

A general-purpose agent must learn an open-scope world model: one rich enough to tackle any of the wide range of tasks it may be asked to solve over its operational lifetime. This stands in contrast with typical planning approaches, where the scope of a model is limited to a specific family of tasks that share significant structure. Unfortunately, planning to solve any specific task within an open-scope model is computationally intractable, even for state-of-the-art methods, due to the many states and actions that are necessarily present in the model but irrelevant to that problem. We propose task scoping: a method that exploits knowledge of the initial state, goal conditions, and transition system to automatically and efficiently remove provably irrelevant variables and actions from grounded planning problems. Our approach leverages causal link analysis and backwards reachability over state variables (rather than states) along with operator merging (when effects on relevant variables are identical). Using task scoping as a pre-planning step can shrink the search space by orders of magnitude and dramatically decrease planning time. We empirically demonstrate that these improvements occur across a variety of open-scope domains, including Minecraft, where our approach reduces search time by a factor of 75 for a state-of-the-art numeric planner, even after including the time required for task scoping itself.

ICRA Conference 2022 Conference Paper

Generalizing to New Domains by Mapping Natural Language to Lifted LTL

Eric Hsiung
Hiloni Mehta
Junchi Chu
Jason Xinyu Liu
Roma Patel
Stefanie Tellex
George Konidaris 0001

Recent work on using natural language to specify commands to robots has grounded that language to LTL. However, mapping natural language task specifications to LTL task specifications using language models requires probability distributions over finite vocabulary. Existing state-of-the-art methods have extended this finite vocabulary to include unseen terms from the input sequence to improve output generalization. However, novel out-of-vocabulary atomic propositions cannot be generated using these methods. To overcome this, we introduce an intermediate contextual query representation which can be learned from single positive task specification examples, associating a contextual query with an LTL template. We demonstrate that this intermediate representation allows for generalization over unseen object references, assuming accurate groundings are available. We compare our method of mapping natural language task specifications to intermediate contextual queries against state-of-the-art CopyNet models capable of translating natural language to LTL, by evaluating whether correct LTL for manipulation and navigation task specifications can be output, and show that our method outperforms the CopyNet model on unseen object references. We demonstrate that the grounded LTL our method outputs can be used for planning in a simulated OO-MDP environment. Finally, we discuss some common failure modes encountered when translating natural language task specifications to grounded LTL.

ICRA Conference 2022 Conference Paper

Towards Optimal Correlational Object Search

Kaiyu Zheng
Rohan Chitnis
Yoonchang Sung
George Konidaris 0001
Stefanie Tellex

In realistic applications of object search, robots will need to locate target objects in complex environments while coping with unreliable sensors, especially for small or hard-to-detect objects. In such settings, correlational information can be valuable for planning efficiently. Previous approaches that consider correlational information typically resort to ad-hoc, greedy search strategies. We introduce the Correlational Object Search POMDP (COS-POMDP), which models correlations while preserving optimal solutions with a reduced state space. We propose a hierarchical planning algorithm to scale up COS-POMDPs for practical domains. Our evaluation, conducted with the AI2-THOR household simulator and the YOLOv5 object detector, shows that our method finds objects more successfully and efficiently compared to baselines, particularly for hard-to-detect objects such as scrub brush and remote control.

ICRA Conference 2022 Conference Paper

Using Language to Generate State Abstractions for Long-Range Planning in Outdoor Environments

Matthew Berg
George Konidaris 0001
Stefanie Tellex

Robots that process navigation instructions in large outdoor environments will need to operate at different levels of abstraction. For example, a land-surveying aerial robot receiving the instruction “go to Boston and go through the state forest on the way” must reason about a long-range goal like “go to Boston” while also processing a finer-grained constraint like “go through the state forest.” Existing approaches struggle to plan such commands because of the immense number of locations and constraints that can be expressed in language. We introduce a hierarchical representation of outdoor environments and a planning approach that dynamically compacts the robot's state space to enable tractable planning in city and state-scale environments. Our approach leverages natural abstractions in real-world map data, coupled with abstractions generated from users' instructions, to generate filtered environment views that accelerate planning while supporting a robot's ability to obey complex temporal goals and constraints at different levels of abstraction. We evaluate our approach on seven templates of LTLJ formulas and in an 80 kilometer-radius environment containing over 250,000 locations downloaded from OpenStreetMap. The results show our approach enables planning in seconds or minutes in a large outdoor environment while still satisfying the task specification.

IROS Conference 2021 Conference Paper

Bootstrapping Motor Skill Learning with Motion Planning

Ben Abbatematteo
Eric Rosen
Stefanie Tellex
George Konidaris 0001

Learning a robot motor skill from scratch is impractically slow; so much so that in practice, learning must typically be bootstrapped using human demonstration. However, relying on human demonstration necessarily degrades the autonomy of robots that must learn a wide variety of skills over their operational lifetimes. We propose using kinematic motion planning as a completely autonomous, sample efficient way to bootstrap motor skill learning for object manipulation. We demonstrate the use of motion planners to bootstrap motor skills in two complex object manipulation scenarios with different policy representations: opening a drawer with a dynamic movement primitive representation, and closing a microwave door with a deep neural network policy. We also show how our method can bootstrap a motor skill for the challenging dynamic task of learning to hit a ball off a tee, where a kinematic plan based on treating the scene as static is insufficient to solve the task, but sufficient to bootstrap a more dynamic policy. In all three cases, our method is competitive with human-demonstrated initialization, and significantly out-performs starting with a random policy. This approach enables robots to efficiently and autonomously learn motor policies for dynamic tasks without human demonstration.

ICRA Conference 2021 Conference Paper

Learning Collaborative Pushing and Grasping Policies in Dense Clutter

Bingjie Tang
Matt Corsaro
George Konidaris 0001
Stefanos Nikolaidis
Stefanie Tellex

Robots must reason about pushing and grasping in order to engage in flexible manipulation in cluttered environments. Earlier works on learning pushing and grasping only consider each operation in isolation or are limited to top-down grasping and bin-picking. We train a robot to learn joint planar pushing and 6-degree-of-freedom (6-DoF) grasping policies by self-supervision. Two separate deep neural networks are trained to map from 3D visual observations to actions with a Q-learning framework. With collaborative pushes and expanded grasping action space, our system can deal with cluttered scenes with a wide variety of objects (e.g., grasping a plate from the side after pushing away surrounding obstacles). We compare our system to the state-of-the-art baseline model VPG [1] in simulation and outperform it with 10% higher action efficiency and 20% higher grasp success rate. We then demonstrate our system on a KUKA LBR iiwa arm with a Robotiq 3-finger gripper.

IROS Conference 2021 Conference Paper

Learning to Detect Multi-Modal Grasps for Dexterous Grasping in Dense Clutter

Matt Corsaro
Stefanie Tellex
George Konidaris 0001

We propose an approach to multi-modal grasp detection that jointly predicts the probabilities that several types of grasps succeed at a given grasp pose. Given a partial point cloud of a scene, the algorithm proposes a set of feasible grasp candidates, then estimates the probabilities that a grasp of each type would succeed at each candidate pose. Predicting grasp success probabilities directly from point clouds makes our approach agnostic to the number and placement of depth sensors at execution time. We evaluate our system both in simulation and on a real robot with a Robotiq 3-Finger Adaptive Gripper and compare our network against several baselines that perform fewer types of grasps. Our experiments show that a system that explicitly models grasp type achieves an object retrieval rate 8.5% higher in a complex cluttered environment than our highest-performing baseline.

IROS Conference 2021 Conference Paper

Multi-Resolution POMDP Planning for Multi-Object Search in 3D

Kaiyu Zheng
Yoonchang Sung
George Konidaris 0001
Stefanie Tellex

Robots operating in households must find objects on shelves, under tables, and in cupboards. In such environments, it is crucial to search efficiently at 3D scale while coping with limited field of view and the complexity of searching for multiple objects. Principled approaches to object search frequently use Partially Observable Markov Decision Process (POMDP) as the underlying framework for computing search strategies, but constrain the search space in 2D. In this paper, we present a POMDP formulation for multi-object search in a 3D region with a frustum-shaped field-of-view. To efficiently solve this POMDP, we propose a multi-resolution planning algorithm based on online Monte-Carlo tree search. In this approach, we design a novel octree-based belief representation to capture uncertainty of the target objects at different resolution levels, then derive abstract POMDPs at lower resolutions with dramatically smaller state and observation spaces. Evaluation in a simulated 3D domain shows that our approach finds objects more efficiently and successfully compared to a set of baselines without resolution hierarchy in larger instances under the same computational requirement. We demonstrate our approach on a mobile robot to find objects placed at different heights in two 10 m² × 2 m regions by moving its base and actuating its torso.

IROS Conference 2020 Conference Paper

Building Plannable Representations with Mixed Reality

Eric Rosen
Nishanth Kumar
Nakul Gopalan
Daniel Ullman 0002
George Konidaris 0001
Stefanie Tellex

We propose Action-Oriented Semantic Maps (AOSMs), a representation that enables a robot to acquire object manipulation behaviors and semantic information about the environment from a human teacher with a Mixed Reality Head-Mounted Display (MR-HMD). AOSMs are a representation that captures both: a) high-level object manipulation actions in an object class's local frame, and b) semantic representations of objects in the robot's global map that are grounded for navigation. Humans can use a MR-HMD to teach the agent the information necessary for planning object manipulation and navigation actions by interacting with virtual 3D meshes overlaid on the physical workspace. We demonstrate that our system enables users to quickly and accurately teach a robot the knowledge required to autonomously plan and execute three household tasks: picking up a bottle and throwing it in the trash, closing a sink faucet, and flipping a light switch off.

ICRA Conference 2020 Conference Paper

Grounding Language to Landmarks in Arbitrary Outdoor Environments

Matthew Berg
Deniz Bayazit
Rebecca Mathew
Ariel Rotter-Aboyoun
Ellie Pavlick
Stefanie Tellex

Robots operating in outdoor, urban environments need the ability to follow complex natural language commands which refer to never-before-seen landmarks. Existing approaches to this problem are limited because they require training a language model for the landmarks of a particular environment before a robot can understand commands referring to those landmarks. To generalize to new environments outside of the training set, we present a framework that parses references to landmarks, then assesses semantic similarities between the referring expression and landmarks in a predefined semantic map of the world, and ultimately translates natural language commands to motion plans for a drone. This framework allows the robot to ground natural language phrases to landmarks in a map when both the referring expressions to landmarks and the landmarks themselves have not been seen during training. We test our framework with a 14-person user evaluation demonstrating an end-to-end accuracy of 76.19% in an unseen environment. Subjective measures show that users find our system to have high performance and low workload. These results demonstrate our approach enables untrained users to control a robot in large unseen outdoor environments with unconstrained natural language.

IROS Conference 2020 Conference Paper

Mixed Reality as a Bidirectional Communication Interface for Human-Robot Interaction

Eric Rosen
David Whitney
Michael Fishman 0001
Daniel Ullman 0002
Stefanie Tellex

We present a decision-theoretic model and robot system that interprets multimodal human communication to disambiguate item references by asking questions via a mixed reality (MR) interface. Existing approaches have either chosen to use physical behaviors, like pointing and eye gaze, or virtual behaviors, like mixed reality. However, there is a gap of research on how MR compares to physical actions for reducing robot uncertainty. We test the hypothesis that virtual deictic gestures are better for human-robot interaction (HRI) than physical behaviors. To test this hypothesis, we propose the Physio-Virtual Deixis Partially Observable Markov Decision Process (PVD-POMDP), which interprets multimodal observations (speech, eye gaze, and pointing gestures) from the human and decides when and how to ask questions (either via physical or virtual deictic gestures) in order to recover from failure states and cope with sensor noise. We conducted a between-subjects user study with 80 participants distributed across three conditions of robot communication: no feedback control, physical feedback, and MR feedback. We tested performance of each condition with objective measures (accuracy, time), as well as evaluated user experience with subjective measures (usability, trust, workload). We found the MR feedback condition was 10% more accurate than the physical condition and a speedup of 160%. We also found that the feedback conditions significantly outperformed the no feedback condition in all subjective metrics.

AAAI Conference 2020 Short Paper

Task Scoping for Efficient Planning in Open Worlds (Student Abstract)

Nishanth Kumar
Michael Fishman
Natasha Danas
Stefanie Tellex
Michael Littman
George Konidaris

We propose an abstraction method for open-world environments expressed as Factored Markov Decision Processes (FMDPs) with very large state and action spaces. Our method prunes state and action variables that are irrelevant to the optimal value function on the state subspace the agent would visit when following any optimal policy from the initial state. This method thus enables tractable fast planning within large open-world FMDPs.

IROS Conference 2019 Conference Paper

Advanced Autonomy on a Low-Cost Educational Drone Platform

Luke Eller
Théo Guérin
Baichuan Huang
Garrett Warren
Sophie Yang
Josh Roy
Stefanie Tellex

PiDrone is a quadrotor platform created to accompany an introductory robotics course. Students build an autonomous flying robot from scratch and learn to program it through assignments and projects. Existing educational robots do not have significant autonomous capabilities, such as high-level planning and mapping. We present a hardware and software framework for an autonomous aerial robot, in which all software for autonomy can run onboard the drone, implemented in Python. We present an Unscented Kalman Filter (UKF) for accurate state estimation. Next, we present an implementation of Monte Carlo (MC) Localization and FastSLAM for Simultaneous Localization and Mapping (SLAM). The performance of UKF, localization, and SLAM is tested and compared to ground truth, provided by a motion-capture system. Our evaluation demonstrates that our autonomous educational framework runs quickly and accurately on a Raspberry Pi in Python, making it ideal for use in educational settings.

ICRA Conference 2019 Conference Paper

End-User Robot Programming Using Mixed Reality

Samir Yitzhak Gadre
Eric Rosen
Gary Chien
Elizabeth Phillips
Stefanie Tellex
George Konidaris 0001

Mixed Reality (MR) is a promising interface for robot programming because it can project an immersive 3D visualization of a robot's intended movement onto the real world. MR can also support hand gestures, which provide an intuitive way for users to construct and modify robot motions. We present a Mixed Reality Head-Mounted Display (MR-HMD) interface that enables end-users to easily create and edit robot motions using waypoints. We describe a user study where 20 participants were asked to program a robot arm using 2D and MR interfaces to perform two pick-and-place tasks. In the primitive task, participants created typical pick-and-place programs. In the adapted task, participants adapted their primitive programs to address a more complex pick-and-place scenario, which included obstacles and conditional reasoning. Compared to the 2D interface, a higher number of users were able to complete both tasks in significantly less time, and reported experiencing lower cognitive workload, higher usability, and higher naturalness with the MR-HMD interface.

ICRA Conference 2019 Conference Paper

Flight, Camera, Action! Using Natural Language and Mixed Reality to Control a Drone

Baichuan Huang
Deniz Bayazit
Daniel Ullman 0002
Nakul Gopalan
Stefanie Tellex

With increasing autonomy, robots like drones are increasingly accessible to untrained users. Most users control drones using a low-level interface, such as a radio-controlled (RC) controller. For a wider adoption of these technologies by the public, a much higher-level interface, such as natural language or mixed reality (MR), allows the automation of the control of the agent in a goal-oriented setting. We present an interface that uses natural language grounding within an MR environment to solve high-level task and navigational instructions given to an autonomous drone. To the best of our knowledge, this is the first work to perform fully autonomous language grounding in an MR setting for a robot. Given a map, our interface first grounds natural language commands to reward specifications within a Markov Decision Process (MDP) framework. Then, it passes the reward specification to an MDP solver. Finally, the drone performs the desired operations in the real world while planning and localizing itself. Our approach uses MR to provide a set of known virtual landmarks, enabling the drone to understand commands referring to objects without being equipped with object detectors for multiple novel objects or a predefined environment model. We conducted an exploratory user study to assess users' experience of our MR interface with and without natural language, as compared to a web interface. We found that users were able to command the drone more quickly via both MR interfaces as compared to the web interface, with roughly equal system usability scores across all three interfaces.

IROS Conference 2019 Conference Paper

Grounding Language Attributes to Objects using Bayesian Eigenobjects

Vanya Cohen
Benjamin Burchfiel
Thao Nguyen
Nakul Gopalan
Stefanie Tellex
George Konidaris 0001

We develop a system to disambiguate object instances within the same class based on simple physical descriptions. The system takes as input a natural language phrase and a depth image containing a segmented object and predicts how similar the observed object is to the object described by the phrase. Our system is designed to learn from only a small amount of human-labeled language data and generalize to viewpoints not represented in the language-annotated depth image training set. By decoupling 3D shape representation from language representation, this method is able to ground language to novel objects using a small amount of language-annotated depth-data and a larger corpus of unlabeled 3D object meshes, even when these objects are partially observed from unusual viewpoints. Our system is able to disambiguate between novel objects, observed via depth images, based on natural language descriptions. Our method also enables viewpoint transfer; trained on human-annotated data on a small set of depth images captured from frontal viewpoints, our system successfully predicted object attributes from rear views despite having no such depth images in its training set. Finally, we demonstrate our approach on a Baxter robot, enabling it to pick specific objects based on human-provided natural language descriptions.

ICRA Conference 2019 Conference Paper

Multi-Object Search using Object-Oriented POMDPs

Arthur Wandzel
Yoonseon Oh
Michael Fishman 0001
Nishanth Kumar
Lawson L. S. Wong
Stefanie Tellex

A core capability of robots is to reason about multiple objects under uncertainty. Partially Observable Markov Decision Processes (POMDPs) provide a means of reasoning under uncertainty for sequential decision making, but are computationally intractable in large domains. In this paper, we propose Object-Oriented POMDPs (OO-POMDPs), which represent the state and observation spaces in terms of classes and objects. The structure afforded by OO-POMDPs supports a factorization of the agent's belief into independent object distributions, which enables the size of the belief to scale linearly rather than exponentially in the number of objects. We formulate a novel Multi-Object Search (MOS) task as an OO-POMDP for mobile robotics domains in which the agent must find the locations of multiple objects. Our solution exploits the structure of OO-POMDPs by featuring human language to selectively update the belief at task onset. Using this structure, we develop a new algorithm for efficiently solving OO-POMDPs: Object-Oriented Partially Observable Monte-Carlo Planning (OO-POMCP). We show that OO-POMCP with grounded language commands is sufficient for solving challenging MOS tasks both in simulation and on a physical mobile robot.
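The scaling claim in this abstract can be illustrated with a small sketch. It assumes a grid of cells, one independent location distribution per object, and a simple detector model; the function names, `hit_rate`, and `false_alarm` are hypothetical and are not taken from the paper.

```python
def factored_belief_size(num_cells, num_objects):
    """Entries stored when each object keeps its own distribution over cells."""
    return num_cells * num_objects


def joint_belief_size(num_cells, num_objects):
    """Entries stored by one joint distribution over all object placements."""
    return num_cells ** num_objects


def update_object_belief(belief, observed_cell, detected,
                         hit_rate=0.9, false_alarm=0.05):
    """Bayes update of ONE object's location belief after looking at one cell.

    belief: list of probabilities, one per cell (assumed to sum to 1).
    detected: whether the (hypothetical) detector fired for this object.
    """
    posterior = []
    for cell, prior in enumerate(belief):
        if cell == observed_cell:
            # Object really is in the observed cell.
            likelihood = hit_rate if detected else 1.0 - hit_rate
        else:
            # Object is elsewhere; a detection would be a false alarm.
            likelihood = false_alarm if detected else 1.0 - false_alarm
        posterior.append(prior * likelihood)
    normalizer = sum(posterior)
    return [p / normalizer for p in posterior]


if __name__ == "__main__":
    print(factored_belief_size(100, 3))  # 300
    print(joint_belief_size(100, 3))     # 1000000
    uniform = [0.25, 0.25, 0.25, 0.25]
    print(update_object_belief(uniform, observed_cell=0, detected=False))
```

Because each object's belief is updated on its own, adding an object adds one more list rather than multiplying the size of a joint table.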

ICRA Conference 2019 Conference Paper

Scanning the Internet for ROS: A View of Security in Robotics Research

Nicholas DeMarinis
Stefanie Tellex
Vasileios P. Kemerlis
George Konidaris 0001
Rodrigo Fonseca

Security is particularly important in robotics, as robots can directly perceive and affect the physical world. We describe the results of a scan of the entire IPv4 address space of the Internet for instances of the Robot Operating System (ROS), a widely used robotics software platform. We identified a number of hosts supporting ROS that are exposed to the public Internet, thereby allowing anyone to access robotic sensors and actuators. As a proof of concept, and with the consent of the relevant researchers, we were able to read image sensor information from and actuate a physical robot present in a research lab in an American university. This paper gives an overview of our findings, including our methodology, the geographic distribution of publicly-accessible platforms, the sorts of sensor and actuator data that is available, and the different kinds of robots and sensors that our scan uncovered. Additionally, we offer recommendations on best practices to mitigate these security issues in the future.

ICRA Conference 2019 Conference Paper

Teaching Robots To Draw

Atsunobu Kotani
Stefanie Tellex

In this paper, we introduce an approach which enables manipulator robots to write handwritten characters or line drawings. Given an image of just-drawn handwritten characters, the robot infers a plan to replicate the image with a writing utensil, and then reproduces the image. Our approach draws each target stroke in one continuous drawing motion and does not rely on handcrafted rules or on predefined paths of characters. Instead, it learns to write from a dataset of demonstrations. We evaluate our approach in both simulation and on two real robots. Our model can draw handwritten characters in a variety of languages which are disjoint from the training set, such as Greek, Tamil, or Hindi, and also reproduce any stroke-based drawing from an image of the drawing.

AAMAS Conference 2018 Conference Paper

Deep Abstract Q-Networks

Melrose Roderick
Christopher Grimm
Stefanie Tellex

We examine the problem of learning and planning on high-dimensional domains with long horizons and sparse rewards. Recent approaches have shown great successes in many Atari 2600 domains. However, domains with long horizons and sparse rewards, such as Montezuma’s Revenge and Venture, remain challenging for existing methods. Methods using abstraction [5, 13] have been shown to be useful in tackling long-horizon problems. We combine recent techniques of deep reinforcement learning with existing model-based approaches using an expert-provided state abstraction. We construct toy domains that elucidate the problem of long horizons, sparse rewards and high-dimensional inputs, and show that our algorithm significantly outperforms previous methods on these domains. Our abstraction-based approach outperforms Deep Q-Networks [11] on Montezuma’s Revenge and Venture, and exhibits backtracking behavior that is absent from previous methods.

ICRA Conference 2018 Conference Paper

Learning to Parse Natural Language to Grounded Reward Functions with Weak Supervision

Edward C. Williams
Nakul Gopalan
Mina Rhee
Stefanie Tellex

In order to intuitively and efficiently collaborate with humans, robots must learn to complete tasks specified using natural language. We represent natural language instructions as goal-state reward functions specified using lambda calculus. Using reward functions as language representations allows robots to plan efficiently in stochastic environments. To map sentences to such reward functions, we learn a weighted linear Combinatory Categorial Grammar (CCG) semantic parser. The parser, including both parameters and the CCG lexicon, is learned from a validation procedure that does not require execution of a planner, annotating reward functions, or labeling parse trees, unlike prior approaches. To learn a CCG lexicon and parse weights, we use coarse lexical generation and validation-driven perceptron weight updates using the approach of Artzi and Zettlemoyer [4]. We present results on the Cleanup World domain [18] to demonstrate the potential of our approach. We report an F1 score of 0.82 on a collected corpus of 23 tasks containing combinations of nested referential expressions, comparators and object properties with 2037 corresponding sentences. Our goal-condition learning approach enables an improvement of orders of magnitude in computation time over a baseline that performs planning during learning, while achieving comparable results. Further, we conduct an experiment with just 6 labeled demonstrations to show the ease of teaching a robot behaviors using our method. We show that parsing models learned from small data sets can generalize to commands not seen during training.

IROS Conference 2018 Conference Paper

ROS Reality: A Virtual Reality Framework Using Consumer-Grade Hardware for ROS-Enabled Robots

David Whitney
Eric Rosen
Daniel Ullman 0002
Elizabeth Phillips
Stefanie Tellex

Virtual reality (VR) systems let users intuitively interact with 3D environments and have been used extensively for robotic teleoperation tasks. While more immersive than their 2D counterparts, early VR systems were expensive and required specialized hardware. Fortunately, there has been a recent proliferation of consumer-grade VR systems at affordable price points. These systems are inexpensive, relatively portable, and can be integrated into existing robotic frameworks. Our group has designed a VR teleoperation package for the Robot Operating System (ROS), ROS Reality, that can be easily integrated into such frameworks. ROS Reality is an open-source, over-the-Internet teleoperation interface between any ROS-enabled robot and any Unity-compatible VR headset. We completed a pilot study to test the efficacy of our system, with expert human users controlling a Baxter robot via ROS Reality to complete 24 dexterous manipulation tasks, compared to the same users controlling the robot via direct kinesthetic handling. This study provides insight into the feasibility of robotic teleoperation tasks in VR with current consumer-grade resources and exposes issues that need to be addressed in these VR systems. In addition, this paper presents a description of ROS Reality, its components, and architecture. We hope this system will be adopted by other research groups to allow for easy integration of VR teleoperated robots into future experiments.

ICAPS Conference 2017 Conference Paper

Planning with Abstract Markov Decision Processes

Nakul Gopalan
Marie desJardins
Michael L. Littman
James MacGlashan
Shawn Squire
Stefanie Tellex
John Winder
Lawson L. S. Wong

Robots acting in human-scale environments must plan under uncertainty in large state–action spaces and face constantly changing reward functions as requirements and goals change. Planning under uncertainty in large state–action spaces requires hierarchical abstraction for efficient computation. We introduce a new hierarchical planning framework called Abstract Markov Decision Processes (AMDPs) that can plan in a fraction of the time needed for complex decision making in ordinary MDPs. AMDPs provide abstract states, actions, and transition dynamics in multiple layers above a base-level “flat” MDP. AMDPs decompose problems into a series of subtasks with both local reward and local transition functions used to create policies for subtasks. The resulting hierarchical planning method is independently optimal at each level of abstraction, and is recursively optimal when the local reward and transition functions are correct. We present empirical results showing significantly improved planning speed, while maintaining solution quality, in the Taxi domain and in a mobile-manipulation robotics problem. Furthermore, our approach allows specification of a decision-making model for a mobile-manipulation problem on a Turtlebot, spanning from low-level control actions operating on continuous variables all the way up through high-level object manipulation tasks.

RLDM Conference 2017 Conference Abstract

Planning with Abstract Markov Decision Processes

Nakul Gopalan
Michael Littman
Shawn Squire
Stefanie Tellex
John Winder
Lawson Wong

Robots acting in human-scale environments must plan under uncertainty in large state–action spaces and face constantly changing reward functions as requirements and goals change. Planning under uncertainty in large state–action spaces requires hierarchical abstraction for efficient computation. We (Gopalan et al. 2017 In Press) introduce a new hierarchical planning framework called Abstract Markov Decision Processes (AMDPs) that can plan in a fraction of the time needed for complex decision making in ordinary MDPs. AMDPs provide abstract states, actions, and transition dynamics in multiple layers above a base-level “flat” MDP. AMDPs decompose problems into a series of subtasks with both local reward and local transition functions used to create policies for subtasks. The resulting hierarchical planning method is independently optimal at each level of abstraction, and is recursively optimal when the local reward and transition functions are correct. We present empirical results showing significantly improved planning speed, while maintaining solution quality, in the Taxi domain and in a mobile-manipulation robotics problem. Furthermore, our approach allows specification of a decision-making model for a mobile-manipulation problem on a Turtlebot, spanning from low-level control actions operating on continuous variables all the way up through high-level object manipulation tasks.

ICRA Conference 2017 Conference Paper

Reducing errors in object-fetching interactions through social feedback

David Whitney
Eric Rosen
James MacGlashan
Lawson L. S. Wong
Stefanie Tellex

Fetching items is an important problem for a social robot. It requires a robot to interpret a person's language and gesture and use these noisy observations to infer what item to deliver. If the robot could ask questions, it would help the robot be faster and more accurate in its task. Existing approaches either do not ask questions, or rely on fixed question-asking policies. To address this problem, we propose a model that makes assumptions about cooperation between agents to perform richer signal extraction from observations. This work defines a mathematical framework for an item-fetching domain that allows a robot to increase the speed and accuracy of its ability to interpret a person's requests by reasoning about its own uncertainty as well as processing implicit information (implicatures). We formalize the item-delivery domain as a Partially Observable Markov Decision Process (POMDP), and approximately solve this POMDP in real time. Our model improves speed and accuracy of fetching tasks by asking relevant clarifying questions only when necessary. To measure our model's improvements, we conducted a real world user study with 16 participants. Our method achieved greater accuracy and a faster interaction time compared to state-of-the-art baselines. Our model is 2.17 seconds faster (25% faster) than a state-of-the-art baseline, while being 2.1% more accurate.

ICRA Conference 2016 Conference Paper

Interpreting multimodal referring expressions in real time

David Whitney
Miles Eldon
John Oberlin
Stefanie Tellex

Humans communicate about objects using language, gesture, and context, fusing information from multiple modalities over time. Robots need to interpret this communication in order to collaborate with humans on shared tasks. Processing communicative input incrementally has the potential to increase the speed and accuracy of a robot's reaction. It also enables the robot to incorporate the relative timing of words and gestures into the understanding process. To address this problem, we define a multimodal Bayes filter for interpreting a person's referential expressions to objects. Our approach outputs a distribution over the referent object at 14Hz, updating dynamically as it receives new observations of the person's spoken words and gestures. We collected a new dataset of people referring to one of four objects in a tabletop setting and demonstrate that our approach is able to infer the correct object with 90% accuracy. Additionally, we augment and improve our filter in a simulated home kitchen domain by learning contextual knowledge in an unsupervised manner from existing written text, increasing our maximum accuracy to 96%, even with an increase in the number of objects from four to seventy.
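A discrete Bayes filter of the kind this abstract describes can be sketched as follows. This is an illustrative toy, not the authors' model: the object names, the `stay_prob` transition assumption, and the likelihood values are all hypothetical stand-ins for the paper's language and gesture observation models.

```python
def bayes_filter_step(belief, likelihoods, stay_prob=0.9):
    """One predict-then-update step over candidate referent objects.

    belief: dict mapping object name -> probability (assumed to sum to 1).
    likelihoods: dict mapping object name -> P(observation | referent is object);
                 objects missing from the dict are treated as likelihood 1.0.
    stay_prob: assumed probability that the person keeps referring to the same
               object between time steps.
    """
    n = len(belief)
    if n == 1:
        return dict(belief)

    # Predict: with probability (1 - stay_prob) the referent switches
    # uniformly to one of the other objects.
    switch = (1.0 - stay_prob) / (n - 1)
    total = sum(belief.values())
    predicted = {obj: stay_prob * p + switch * (total - p)
                 for obj, p in belief.items()}

    # Update: weight each hypothesis by how well it explains the observation.
    unnormalized = {obj: predicted[obj] * likelihoods.get(obj, 1.0)
                    for obj in predicted}
    normalizer = sum(unnormalized.values())
    if normalizer == 0.0:
        return predicted  # observation ruled out everything; keep the prediction
    return {obj: v / normalizer for obj, v in unnormalized.items()}


if __name__ == "__main__":
    objects = ["bowl", "spoon", "marker", "brush"]
    belief = {obj: 1.0 / len(objects) for obj in objects}
    # Hypothetical speech observation that favors "bowl".
    speech = {"bowl": 0.7, "spoon": 0.1, "marker": 0.1, "brush": 0.1}
    belief = bayes_filter_step(belief, speech)
    # Hypothetical pointing gesture that favors "bowl" and "spoon".
    gesture = {"bowl": 0.4, "spoon": 0.4, "marker": 0.1, "brush": 0.1}
    belief = bayes_filter_step(belief, gesture)
    print(max(belief, key=belief.get))
```

Calling the step function once per incoming word or gesture frame yields a belief that is refreshed incrementally, which is the property the abstract highlights.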

ICAPS Conference 2015 Conference Paper

Goal-Based Action Priors

David Abel
D. Ellis Hershkowitz
Gabriel Barth-Maron
Stephen Brawner
Kevin O'Farrell
James MacGlashan
Stefanie Tellex

Robots that interact with people must flexibly respond to requests by planning in stochastic state spaces that are often too large to solve for optimal behavior. In this work, we develop a framework for goal and state dependent action priors that can be used to prune away irrelevant actions based on the robot’s current goal, thereby greatly accelerating planning in a variety of complex stochastic environments. Our framework allows these goal-based action priors to be specified by an expert or to be learned from prior experience in related problems. We evaluate our approach in the video game Minecraft, whose complexity makes it an effective robot simulator. We also evaluate our approach in a robot cooking domain that is executed on a two-handed manipulator robot. In both cases, goal-based action priors enhance baseline planners by dramatically reducing the time taken to find a near-optimal plan.

ICRA Conference 2014 Conference Paper

A natural language planner interface for mobile manipulators

Thomas M. Howard
Stefanie Tellex
Nicholas Roy

Natural language interfaces for robot control aspire to find the best sequence of actions that reflect the behavior intended by the instruction. This is difficult because of the diversity of language, variety of environments, and heterogeneity of tasks. Previous work has demonstrated that probabilistic graphical models constructed from the parse structure of natural language can be used to identify motions that most closely resemble verb phrases. Such approaches however quickly succumb to computational bottlenecks imposed by construction and search the space of possible actions. Planning constraints, which define goal regions and separate the admissible and inadmissible states in an environment model, provide an interesting alternative to represent the meaning of verb phrases. In this paper we present a new model called the Distributed Correspondence Graph (DCG) to infer the most likely set of planning constraints from natural language instructions. A trajectory planner then uses these planning constraints to find a sequence of actions that resemble the instruction. Separating the problem of identifying the action encoded by the language into individual steps of planning constraint inference and motion planning enables us to avoid computational costs associated with generation and evaluation of many trajectories. We present experimental results from comparative experiments that demonstrate improvements in efficiency in natural language understanding without loss of accuracy.

ICRA Conference 2014 Conference Paper

Learning spatial-semantic representations from natural language descriptions and scene classifications

Sachithra Hemachandra
Matthew R. Walter
Stefanie Tellex
Seth J. Teller

We describe a semantic mapping algorithm that learns human-centric environment models by interpreting natural language utterances. Underlying the approach is a coupled metric, topological, and semantic representation of the environment that enables the method to fuse information from natural language descriptions with low-level metric and appearance data. We extend earlier work with a novel formulation that incorporates spatial layout into a topological representation of the environment. We also describe a factor graph formulation of the semantic properties that encodes human-centric concepts such as type and colloquial name for each mapped region. The algorithm infers these properties by combining the user's natural language descriptions with image- and laser-based scene classification. We also propose a mechanism to more effectively ground natural language descriptions of distant regions using semantic cues from other modalities. We describe how the algorithm employs this learned semantic information to propose valid topological hypotheses, leading to more accurate topological and metric maps. We demonstrate that integrating language with other sensor data increases the accuracy of the achieved spatial-semantic representation of the environment.

AAAI Conference 2011 Conference Paper

Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation

Stefanie Tellex
Thomas Kollar
Steven Dickerson
Matthew Walter
Ashis Banerjee
Seth Teller
Nicholas Roy

This paper describes a new model for understanding natural language commands given to autonomous systems that perform navigation and mobile manipulation in semi-structured environments. Previous approaches have used models with fixed structure to infer the likelihood of a sequence of actions given the environment and the command. In contrast, our framework, called Generalized Grounding Graphs (G3), dynamically instantiates a probabilistic graphical model for a particular natural language command according to the command’s hierarchical and compositional semantic structure. Our system performs inference in the model to successfully find and execute plans corresponding to natural language commands such as “Put the tire pallet on the truck.” The model is trained using a corpus of commands collected using crowdsourcing. We pair each command with robot actions and use the corpus to learn the parameters of the model. We evaluate the robot’s performance by inferring plans from natural language commands, executing each plan in a realistic robot simulator, and asking users to evaluate the system’s performance. We demonstrate that our system can successfully follow many natural language commands from the corpus.

IROS Conference 2010 Conference Paper

Natural language command of an autonomous micro-air vehicle

Albert S. Huang
Stefanie Tellex
Abraham Bachrach
Thomas Kollar
Deb Roy
Nicholas Roy

Natural language is a flexible and intuitive modality for conveying directions and commands to a robot but presents a number of computational challenges. Diverse words and phrases must be mapped into structures that the robot can understand, and elements in those structures must be grounded in an uncertain environment. In this paper we present a micro-air vehicle (MAV) capable of following natural language directions through a previously mapped and labeled environment. We extend our previous work in understanding 2D natural language directions to three dimensions, accommodating new verb modifiers such as go up and go down, and commands such as turn around and face the windows. We demonstrate the robot following directions created by a human for another human, and interactively executing commands in the context of surveillance and search and rescue in confined spaces. In an informal study, 71% of the paths computed from directions given by one user terminated within 10m of the desired destination.