Arrow Research search

Author name cluster

Thomas Kollar

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers
2 author rows

Possible papers

ICRA Conference 2025 Conference Paper

GHIL-Glue: Hierarchical Control with Filtered Subgoal Images

  • Kyle Beltran Hatch
  • Ashwin Balakrishna
  • Oier Mees
  • Suraj Nair 0003
  • Seohong Park
  • Blake Wulfe
  • Masha Itkina
  • Benjamin Eysenbach

Image and video generative models that are pretrained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate sub-goals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photo-realistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively “glue together” language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments. Code, model checkpoints, videos, and supplementary materials can be found at https://ghil-glue.github.io.
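
The interface described here can be pictured as a selection step between the generative planner and the low-level policy. Below is a minimal sketch, assuming a hypothetical `progress_scorer` model that rates (observation, subgoal) pairs; it is illustrative, not GHIL-Glue's actual filtering code.

```python
import numpy as np

def filter_subgoals(candidate_subgoals, observation, progress_scorer):
    # Score each generated subgoal image against the current observation
    # and keep the one judged most likely to advance the task. The scorer
    # interface is an assumption, not GHIL-Glue's exact API.
    scores = [progress_scorer(observation, g) for g in candidate_subgoals]
    return candidate_subgoals[int(np.argmax(scores))]
```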

ICLR Conference 2025 Conference Paper

Language models scale reliably with over-training and on downstream tasks

  • Samir Yitzhak Gadre
  • Georgios Smyrnis
  • Vaishaal Shankar
  • Suchin Gururangan
  • Mitchell Wortsman
  • Rulin Shao
  • Jean Mercat
  • Alex Fang

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32× over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300× less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20× less compute.
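
The second contribution lends itself to a worked example: given (validation loss, downstream error) pairs from small runs, one can fit a decaying law and extrapolate. The sketch below uses a generic power-law form and toy numbers; the paper's exact parameterization and fitting procedure may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def downstream_law(loss, eps, k, gamma):
    # Hypothetical power-law form: average top-1 error shrinks as
    # validation loss decreases. The paper's exact form may differ.
    return eps - k * loss ** (-gamma)

# Toy data standing in for (validation loss, average top-1 error) pairs.
losses = np.array([3.5, 3.2, 3.0, 2.8, 2.6])
errors = np.array([0.72, 0.68, 0.65, 0.62, 0.59])

params, _ = curve_fit(downstream_law, losses, errors, p0=(1.0, 1.0, 1.0))
print("predicted error at loss 2.4:", downstream_law(2.4, *params))
```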

ICLR Conference 2025 Conference Paper

Should VLMs be Pre-trained with Image Data?

  • Sedrick Keh
  • Jean Mercat
  • Samir Yitzhak Gadre
  • Kushal Arora
  • Igor Vasiljevic
  • Benjamin Burchfiel
  • Shuran Song
  • Russ Tedrake

Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs which integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amounts of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. On an average of 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.

ICML Conference 2025 Conference Paper

Understanding Complexity in VideoQA via Visual Program Generation

  • Cristóbal Eyzaguirre
  • Igor Vasiljevic
  • Achal Dave
  • Jiajun Wu 0001
  • Rares Ambrus
  • Thomas Kollar
  • Juan Carlos Niebles
  • Pavel Tokmakov

We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
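
As an illustration of code complexity as a difficulty proxy, a crude stand-in metric is to count primitive calls in the visual program a code-generation model emits for a question. The sketch below assumes the generated program is valid Python; the paper's fine-grained primitive analysis is more involved.

```python
import ast

def code_complexity(program_source: str) -> int:
    # Count primitive calls in the generated visual program as a
    # stand-in complexity measure for the originating question.
    tree = ast.parse(program_source)
    return sum(isinstance(node, ast.Call) for node in ast.walk(tree))

# Hypothetical visual program emitted by a code-generation model.
prog = "clips = find(video, 'dog'); count(filter(clips, after('run')))"
print(code_complexity(prog))  # 4 primitive calls in this toy program
```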

NeurIPS Conference 2024 Conference Paper

A Critical Evaluation of AI Feedback for Aligning Large Language Models

  • Archit Sharma
  • Sedrick Keh
  • Eric Mitchell
  • Chelsea Finn
  • Kushal Arora
  • Thomas Kollar

Learning from AI feedback (LAIF) is a popular paradigm for improving the instruction-following abilities of powerful pre-trained language models. LAIF first performs supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with reinforcement learning (RL) or direct preference optimization (DPO), using feedback from a critic model. While recent popular open-source models have demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely due to the widespread practice of using a weaker teacher model (e.g., GPT-3.5) for SFT data collection than the critic (e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing LAIF pipelines. More generally, we find that the gains from LAIF vary substantially across base model families, test-time evaluation protocols, and critic models. Finally, we provide a mechanistic explanation for when SFT may outperform the full two-step LAIF pipeline as well as suggestions for making LAIF maximally useful in practice.
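
For reference, the DPO step being questioned optimizes the standard preference objective over chosen/rejected responses under the policy and a frozen reference model. A minimal PyTorch rendering on sequence log-probabilities (a sketch, not the paper's training code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO objective: push the policy's log-ratio for the
    # chosen response above that of the rejected response, relative
    # to a frozen reference model.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```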

NeurIPS Conference 2024 Conference Paper

DataComp-LM: In search of the next generation of training sets for language models

  • Jeffrey Li
  • Alex Fang
  • Georgios Smyrnis
  • Maor Ivgi
  • Matt Jordan
  • Samir Gadre
  • Hritik Bansal
  • Etash Guha

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 63% 5-shot accuracy on MMLU with 2T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6 percentage point improvement on MMLU while being trained with half the compute. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation. We release the DCLM benchmark, framework, models, and datasets at https://www.datacomp.ai/dclm/
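
Model-based filtering, the step the baseline experiments identify as key, reduces to scoring raw documents with a learned quality classifier and keeping the top fraction. A minimal sketch, assuming a hypothetical `quality_scorer`; DCLM's actual pipeline is far larger:

```python
def model_based_filter(documents, quality_scorer, keep_fraction=0.1):
    # `quality_scorer` is a hypothetical classifier scoring how much a
    # document resembles a high-quality reference corpus; keep only the
    # highest-scoring fraction of the raw pool.
    scored = sorted(documents, key=quality_scorer, reverse=True)
    return scored[: max(1, int(len(scored) * keep_fraction))]
```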

IROS Conference 2024 Conference Paper

Language-Embedded Gaussian Splats (LEGS): Incrementally Building Room-Scale Representations with a Mobile Robot

  • Justin Yu
  • Kush Hari
  • Kishore Srinivas
  • Karim El-Refai
  • Adam Rashid
  • Chung Min Kim
  • Justin Kerr
  • Richard Cheng

Building semantic 3D maps is valuable for searching for objects of interest in offices, warehouses, stores, and homes. We present a mapping system that incrementally builds a Language-Embedded Gaussian Splat (LEGS): a detailed 3D scene representation that encodes both appearance and semantics in a unified representation. LEGS is trained online as a robot traverses its environment to enable localization of open-vocabulary object queries. We evaluate LEGS on 4 room-scale scenes where we query for objects in the scene to assess how LEGS can capture semantic meaning. We compare LEGS to LERF [1] and find that while both systems have comparable object query success rates, LEGS trains over 3.5x faster than LERF. Results suggest that a multi-camera setup and incremental bundle adjustment can boost visual reconstruction quality in constrained robot trajectories, and suggest LEGS can localize open-vocabulary and long-tail object queries with up to 66% accuracy. See project website at: berkeleyautomation.github.io/LEGS
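
Open-vocabulary localization against a language-embedded map can be sketched as a cosine-similarity lookup between a text embedding and per-Gaussian language features. The names below are illustrative assumptions, not the LEGS codebase's API:

```python
import numpy as np

def localize_query(gaussian_embeds, gaussian_positions, text_embed):
    # Rank language-embedded Gaussians against a text query by cosine
    # similarity and return the best match's 3D position. The actual
    # LEGS scoring and relevancy computation may differ.
    g = gaussian_embeds / np.linalg.norm(gaussian_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    return gaussian_positions[int(np.argmax(g @ t))]
```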

IROS Conference 2024 Conference Paper

MANIP: A Modular Architecture for Integrating Interactive Perception for Robot Manipulation

  • Justin Yu
  • Tara Sadjadpour
  • Abby O'Neill
  • Mehdi Khfifi
  • Lawrence Yunliang Chen
  • Richard Cheng
  • Muhammad Zubair Irshad
  • Ashwin Balakrishna

We propose a modular systems architecture, MANIP, that can facilitate the design and development of robot manipulation systems by systematically combining learned subpolicies with well-established procedural algorithmic primitives such as Inverse Kinematics, Kalman Filters, RANSAC outlier rejection, PID modules, etc. (aka "Good Old Fashioned Engineering (GOFE)"). The MANIP architecture grew from our lab's experience developing robot systems for folding clothes, routing cables, and untangling knots. To address failure modes, MANIP can facilitate inclusion of "interactive perception" subpolicies that execute robot actions to modify system state to bring the system into alignment with the training distribution and/or to disambiguate system state when system state confidence is low. We demonstrate how MANIP can be applied with 3 case studies and then describe a detailed case study in cable tracing with experiments that suggest MANIP can improve performance by up to 88%. Code and details are available at: https://berkeleyautomation.github.io/MANIP/

IROS Conference 2024 Conference Paper

Multi-Modal Representation Learning with Tactile Data

  • Hyung-Gun Chi
  • Jose A. Barreiros
  • Jean Mercat
  • Karthik Ramani
  • Thomas Kollar

Advancements in embodied language models like PaLM-E and RT-2 have significantly enhanced language-conditioned robotic manipulation. However, these advances remain predominantly focused on vision and language, often overlooking the pivotal role of tactile feedback which is advantageous in contact-rich interactions. Our research introduces a novel approach that synergizes tactile information with vision and language. We present the Multi-Modal Wand (MMWand) dataset enriched with linguistic descriptions and tactile data. By integrating tactile feedback, we aim to bridge the divide between human linguistic understanding and robotic sensory interpretation. Our multi-modal representation model is trained on these datasets by employing the multi-modal embedding alignment principle from ImageBind which has shown promising results, emphasizing the potential of tactile data in robotic applications. The validation of our approach in downstream robotics tasks, such as texture-based object classification, cross-modality retrieval, and the dense reward function for visuomotor control, attests to its effectiveness. Our contributions underscore the importance of tactile feedback in multi-modal robotic learning and its potential to enhance robotic tasks. The MMWand dataset is publicly available at https://hyung-gun.me/mmwand/.
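
The ImageBind-style alignment mentioned here is typically an InfoNCE-style contrastive loss that pulls paired tactile and vision embeddings together. A minimal PyTorch sketch; the paper's exact training setup may differ:

```python
import torch
import torch.nn.functional as F

def alignment_loss(tactile_embeds, vision_embeds, temperature=0.07):
    # Symmetric InfoNCE: each tactile embedding should match its paired
    # vision embedding more than any other sample in the batch.
    t = F.normalize(tactile_embeds, dim=-1)
    v = F.normalize(vision_embeds, dim=-1)
    logits = t @ v.T / temperature
    labels = torch.arange(len(t))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2
```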

ICRA Conference 2024 Conference Paper

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

  • Abby O'Neill
  • Abdul Rehman
  • Abhiram Maddukuri
  • Abhishek Gupta 0004
  • Abhishek Padalkar
  • Abraham Lee
  • Acorn Pooley
  • Agrim Gupta

Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is robotics-transformer-x.github.io.

ICML Conference 2024 Conference Paper

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

  • Siddharth Karamcheti
  • Suraj Nair 0003
  • Ashwin Balakrishna
  • Percy Liang
  • Thomas Kollar
  • Dorsa Sadigh

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance – a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; evaluations that provide fine-grained insight into VLM capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and training from base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible training code, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open VLMs.

ICRA Conference 2023 Conference Paper

AutoBag: Learning to Open Plastic Bags and Insert Objects

  • Lawrence Yunliang Chen
  • Baiyu Shi
  • Daniel Seita
  • Richard Cheng
  • Thomas Kollar
  • David Held
  • Ken Goldberg

Thin plastic bags are ubiquitous in retail stores, healthcare, food handling, recycling, homes, and school lunchrooms. They are challenging both for perception (due to specularities and occlusions) and for manipulation (due to the dynamics of their 3D deformable structure). We formulate the task of “bagging”: manipulating common plastic shopping bags with two handles from an unstructured initial state to an open state where at least one solid object can be inserted into the bag and lifted for transport. We propose a self-supervised learning framework where a dual-arm robot learns to recognize the handles and rim of plastic bags using UV-fluorescent markings; at execution time, the robot does not use UV markings or UV light. We propose the AutoBag algorithm, where the robot uses the learned perception model to open a plastic bag through iterative manipulation. We present novel metrics to evaluate the quality of a bag state and new motion primitives for reorienting and opening bags based on visual observations. In physical experiments, a YuMi robot using AutoBag is able to open bags and achieve a success rate of 16/30 for inserting at least one item across a variety of initial bag configurations. Supplementary material is available at https://sites.google.com/view/autobag.

IROS Conference 2023 Conference Paper

Bagging by Learning to Singulate Layers Using Interactive Perception

  • Lawrence Yunliang Chen
  • Baiyu Shi
  • Roy Lin
  • Daniel Seita
  • Ayah Ahmad
  • Richard Cheng
  • Thomas Kollar
  • David Held

Many fabric handling and 2D deformable material tasks in homes and industries require singulating layers of material such as opening a bag or arranging garments for sewing. In contrast to methods requiring specialized sensing or end effectors, we use only visual observations with ordinary parallel jaw grippers. We propose SLIP: Singulating Layers using Interactive Perception, and apply SLIP to the task of autonomous bagging. We develop SLIP-Bagging, a bagging algorithm that manipulates a plastic or fabric bag from an unstructured state and uses SLIP to grasp the top layer of the bag to open it for object insertion. In physical experiments, a YuMi robot achieves a success rate of 67% to 81% across bags of a variety of materials, shapes, and sizes, significantly improving in success rate and generality over prior work. Experiments also suggest that SLIP can be applied to tasks such as singulating layers of folded cloth and garments. Supplementary material is available at https://sites.google.com/view/slip-bagging/.

ICRA Conference 2023 Conference Paper

SGTM 2.0: Autonomously Untangling Long Cables using Interactive Perception

  • Kaushik Shivakumar
  • Vainavi Viswanath
  • Anrui Gu
  • Yahav Avigal
  • Justin Kerr
  • Jeffrey Ichnowski
  • Richard Cheng
  • Thomas Kollar

Cables are commonplace in homes, hospitals, and industrial warehouses and are prone to tangling. This paper extends prior work on autonomously untangling long cables by introducing novel uncertainty quantification metrics and actions that interact with the cable to reduce perception uncertainty. We present Sliding and Grasping for Tangle Manipulation 2.0 (SGTM 2.0), a system that autonomously untangles cables approximately 3 meters in length with a bilateral robot using estimates of uncertainty at each step to inform actions. By interactively reducing uncertainty, SGTM 2.0 significantly reduces run-time. Physical experiments with 84 trials suggest that SGTM 2.0 can achieve 83% untangling success on cables with 1 or 2 overhand and figure-8 knots, and 70% termination detection success across these configurations, outperforming SGTM 1.0 by 43% in untangling accuracy and 200% in completion time. Supplementary material, visualizations, and videos can be found at sites.google.com/view/sgtm2.
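
The uncertainty-driven loop can be summarized as: perceive, and if confidence is low, take an interactive action to disambiguate before committing to an untangling grasp. A schematic sketch with illustrative names, not SGTM 2.0's actual action set:

```python
def next_action(perception, observation, uncertainty_threshold=0.5):
    # If the perception estimate is too uncertain, interact with the
    # cable (e.g., slide along it) to reduce uncertainty; otherwise
    # proceed with the untangling grasp. Names are illustrative.
    estimate = perception(observation)
    if estimate.uncertainty > uncertainty_threshold:
        return "slide_to_disambiguate"
    return "untangle_grasp"
```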

ICRA Conference 2022 Conference Paper

CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation

  • Muhammad Zubair Irshad
  • Thomas Kollar
  • Michael Laskey
  • Kevin Stone
  • Zsolt Kira

This paper studies the complex task of simultaneous multi-object 3D reconstruction, 6D pose and size estimation from a single-view RGB-D observation. In contrast to instance-level pose estimation, we focus on a more challenging problem where CAD models are not available at inference time. Existing approaches mainly follow a complex multi-stage pipeline which first localizes and detects each object instance in the image and then regresses to either their 3D meshes or 6D poses. These approaches suffer from high-computational cost and low performance in complex multi-object scenarios, where occlusions can be present. Hence, we present a simple one-stage approach to predict both the 3D shape and estimate the 6D pose and size jointly in a bounding-box free manner. In particular, our method treats object instances as spatial centers where each center denotes the complete shape of an object along with its 6D pose and size. Through this per-pixel representation, our approach can reconstruct in real-time (40 FPS) multiple novel object instances and predict their 6D pose and sizes in a single forward pass. Through extensive experiments, we demonstrate that our approach significantly outperforms all shape completion and categorical 6D pose and size estimation baselines on multi-object ShapeNet and NOCS datasets respectively with a 12.6% absolute improvement in mAP for 6D pose for novel real-world object instances.
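
The per-pixel center representation implies a simple readout at inference: find peaks in the predicted center heatmap and take the latent code at each peak, which would decode to shape, pose, and size. A minimal sketch of that peak-picking step, assuming dense `heatmap` and `latent_map` network outputs (names are illustrative):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def decode_centers(heatmap, latent_map, threshold=0.5):
    # Keep pixels that are local maxima of the center heatmap and above
    # a confidence threshold, then read out the latent code at each peak.
    peaks = (heatmap == maximum_filter(heatmap, size=5)) & (heatmap > threshold)
    ys, xs = np.nonzero(peaks)
    return [latent_map[y, x] for y, x in zip(ys, xs)]
```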

ICRA Conference 2020 Conference Paper

A Mobile Manipulation System for One-Shot Teaching of Complex Tasks in Homes

  • Max Bajracharya
  • James Borders
  • Daniel M. Helmick
  • Thomas Kollar
  • Michael Laskey
  • John Leichty
  • Jeremy Ma
  • Umashankar Nagarajan

We describe a mobile manipulation hardware and software system capable of autonomously performing complex human-level tasks in real homes, after being taught the task with a single demonstration from a person in virtual reality. This is enabled by a highly capable mobile manipulation robot, whole-body task space hybrid position/force control, teaching of parameterized primitives linked to a robust learned dense visual embeddings representation of the scene, and a task graph of the taught behaviors. We demonstrate the robustness of the approach by presenting results for performing a variety of tasks, under different environmental conditions, in multiple real homes. Our approach achieves 85% overall success rate on three tasks that consist of an average of 45 behaviors each. The video is available at: https://youtu.be/HSyAGMGikLk.

AAAI Conference 2018 Conference Paper

Multi-Task Learning For Parsing The Alexa Meaning Representation Language

  • Vittorio Perera
  • Tagyoung Chung
  • Thomas Kollar
  • Emma Strubell

The Alexa Meaning Representation Language (AMRL) is a compositional graph-based semantic representation that includes fine-grained types, properties, actions, and roles and can represent a wide variety of spoken language. AMRL increases the ability of virtual assistants to represent more complex requests, including logical and conditional statements as well as ones with nested clauses. Due to this representational capacity, the acquisition of large scale data resources is challenging, which limits the accuracy of resulting models. This paper has two primary contributions. The first contribution is a linearization of the AMRL parses that aligns it to a related task of spoken language understanding (SLU) and a deep neural network architecture that uses multi-task learning to predict AMRL fine-grained types, properties and intents. The second contribution is a deep neural network architecture that leverages embeddings from the large-scale data resources that are available for SLU. When combined, these contributions enable the training of accurate models of AMRL parsing, even in the presence of data sparsity. The proposed models, which use the linearized AMRL parse, multi-task learning, residual connections and embeddings from SLU, decrease the error rates in the prediction of the full AMRL parse by 3.56% absolute.

ICRA Conference 2013 Conference Paper

Imitation learning for natural language direction following through unknown environments

  • Felix Duvallet
  • Thomas Kollar
  • Anthony Stentz

The use of spoken instructions in human-robot teams holds the promise of enabling untrained users to effectively control complex robotic systems in a natural and intuitive way. Providing robots with the capability to understand natural language directions would enable effortless coordination in human-robot teams that operate in non-specialized unknown environments. However, natural language direction following through unknown environments requires understanding the meaning of language, using a partial semantic world model to generate actions in the world, and reasoning about the environment and landmarks that have not yet been detected. We address the problem of robots following natural language directions through complex unknown environments. By exploiting the structure of spatial language, we can frame direction following as a problem of sequential decision making under uncertainty. We learn a policy which predicts a sequence of actions that follow the directions by exploring the environment and discovering landmarks, backtracking when necessary, and explicitly declaring when it has reached the destination. We use imitation learning to train the policy, using demonstrations of people following directions. By training explicitly in unknown environments, we can generalize to situations that have not been encountered previously.

ICRA Conference 2013 Conference Paper

Learning environmental knowledge from task-based human-robot dialog

  • Thomas Kollar
  • Vittorio Perera
  • Daniele Nardi
  • Manuela Veloso

This paper presents an approach for learning environmental knowledge from task-based human-robot dialog. Previous approaches to dialog use domain knowledge to constrain the types of language people are likely to use. In contrast, by introducing a joint probabilistic model over speech, the resulting semantic parse and the mapping from each element of the parse to a physical entity in the building (e.g., grounding), our approach is flexible to the ways that untrained people interact with robots, is robust to speech to text errors and is able to learn referring expressions for physical locations in a map (e.g., to create a semantic map). Our approach has been evaluated by having untrained people interact with a service robot. Starting with an empty semantic map, our approach is able to ask 50% fewer questions than a baseline approach, thereby enabling more effective and intuitive human-robot dialog.

IROS Conference 2012 Conference Paper

CoBots: Collaborative robots servicing multi-floor buildings

  • Manuela Veloso
  • Joydeep Biswas
  • Brian Coltin
  • Stephanie Rosenthal
  • Thomas Kollar
  • Çetin Meriçli
  • Mehdi Samadi
  • Susana Brandão

In this video we briefly illustrate the progress and contributions made with our mobile, indoor, service robots CoBots (Collaborative Robots), since their creation in 2009. Many researchers, present authors included, aim for autonomous mobile robots that robustly perform service tasks for humans in our indoor environments. The efforts towards this goal have been numerous and successful, and we build upon them. However, there are clearly many research challenges remaining until we can experience intelligent mobile robots that are fully functional and capable in our human environments.

AAMAS Conference 2012 Conference Paper

Enabling Robots to Find and Fetch Objects by Querying the Web

  • Thomas Kollar
  • Mehdi Samadi
  • Manuela Veloso

This paper describes an algorithm that enables a mobile robot to find an arbitrary object and take it to a destination location. Previous approaches have been able to search for a fixed set of objects. In contrast, our approach is able to dynamically construct a cost function to find any object by querying the web. The performance of our approach has been evaluated in a realistic simulator, and has been demonstrated on a companion robot, which can successfully execute plans such as finding a “coffee” and taking it to a destination location like “Gates-Hillman Center, Room 7002.”

AAAI Conference 2012 Conference Paper

Using the Web to Interactively Learn to Find Objects

  • Mehdi Samadi
  • Thomas Kollar
  • Manuela Veloso

In order for robots to intelligently perform tasks with humans, they must be able to access a broad set of background knowledge about the environments in which they operate. Unlike other approaches, which tend to manually define the knowledge of the robot, our approach enables robots to actively query the World Wide Web (WWW) to learn background knowledge about the physical environment. We show that our approach is able to search the Web to infer the probability that an object, such as a “coffee,” can be found in a location, such as a “kitchen.” Our approach, called ObjectEval, is able to dynamically instantiate a utility function using this probability, enabling robots to find arbitrary objects in indoor environments. Our experimental results show that the interactive version of ObjectEval visits 28% fewer locations than the version trained offline and 71% fewer locations than a baseline approach which uses no background knowledge.
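
The core quantity, the probability that an object is found in a given location, can be approximated from web co-occurrence counts. A toy sketch, where `hits` is a hypothetical wrapper around a web search API; ObjectEval's actual features and training are richer:

```python
def location_probability(hits, obj, location, candidate_locations):
    # Estimate P(location | object) by normalizing web co-occurrence
    # counts over candidate locations. `hits(query)` is assumed to
    # return a result count from some web search API.
    joint = hits(f'"{obj}" "{location}"')
    total = sum(hits(f'"{obj}" "{loc}"') for loc in candidate_locations)
    return joint / total if total else 0.0
```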

ICRA Conference 2011 Conference Paper

Following and interpreting narrated guided tours

  • Sachithra Hemachandra
  • Thomas Kollar
  • Nicholas Roy
  • Seth J. Teller

We describe a robotic tour-taking capability enabling a robot to acquire local knowledge of a human-occupied environment. A tour-taking robot autonomously follows a human guide through an environment, interpreting the guide's spoken utterances and the shared spatiotemporal context in order to acquire a spatially segmented and semantically labeled metrical-topological representation of the environment. The described tour-taking capability enables scalable deployment of mobile robots into human-occupied environments, and natural human-robot interaction for commanded mobility. Our primary contributions are an efficient, socially acceptable autonomous tour-following behavior and a tour interpretation algorithm that partitions a map into spaces labeled according to the guide's utterances. The tour-taking behavior is demonstrated in a multi-floor office building and evaluated by assessing the comfort of the tour guides, and by comparing the robot's map partitions to those produced by humans.

AAAI Conference 2011 Conference Paper

Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation

  • Stefanie Tellex
  • Thomas Kollar
  • Steven Dickerson
  • Matthew Walter
  • Ashis Banerjee
  • Seth Teller
  • Nicholas Roy

This paper describes a new model for understanding natural language commands given to autonomous systems that perform navigation and mobile manipulation in semi-structured environments. Previous approaches have used models with fixed structure to infer the likelihood of a sequence of actions given the environment and the command. In contrast, our framework, called Generalized Grounding Graphs (G3), dynamically instantiates a probabilistic graphical model for a particular natural language command according to the command’s hierarchical and compositional semantic structure. Our system performs inference in the model to successfully find and execute plans corresponding to natural language commands such as “Put the tire pallet on the truck.” The model is trained using a corpus of commands collected using crowdsourcing. We pair each command with robot actions and use the corpus to learn the parameters of the model. We evaluate the robot’s performance by inferring plans from natural language commands, executing each plan in a realistic robot simulator, and asking users to evaluate the system’s performance. We demonstrate that our system can successfully follow many natural language commands from the corpus.

ICRA Conference 2010 Conference Paper

Indoor scene recognition through object detection

  • Pablo Espinace
  • Thomas Kollar
  • Alvaro Soto
  • Nicholas Roy

Scene recognition is a highly valuable perceptual ability for an indoor mobile robot; however, current approaches for scene recognition present a significant drop in performance for the case of indoor scenes. We believe that this can be explained by the high appearance variability of indoor environments. This stresses the need to include high-level semantic information in the recognition process. In this work we propose a new approach for indoor scene recognition based on a generative probabilistic hierarchical model that uses common objects as an intermediate semantic representation. Under this model, we use object classifiers to associate low-level visual features to objects, and at the same time, we use contextual relations to associate objects to scenes. As a further contribution, we improve the performance of current state-of-the-art category-level object classifiers by including geometrical information obtained from a 3D range sensor that facilitates the implementation of a focus of attention mechanism within a Monte Carlo sampling scheme. We test our approach using real data, showing significant advantages with respect to previous state-of-the-art methods.
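
The object-to-scene stage of such a generative model can be illustrated with a naive-Bayes-style scoring rule, P(scene | objects) ∝ P(scene) Π P(object | scene). A toy sketch, not the paper's full hierarchical model:

```python
import numpy as np

def scene_posterior(detected_objects, p_scene, p_obj_given_scene):
    # Score each scene by its log-prior plus the log-likelihood of the
    # detected objects under that scene; return the best-scoring scene.
    scores = {
        scene: np.log(prior) + sum(
            np.log(p_obj_given_scene[scene].get(o, 1e-6))
            for o in detected_objects)
        for scene, prior in p_scene.items()
    }
    return max(scores, key=scores.get)
```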

IROS Conference 2010 Conference Paper

Natural language command of an autonomous micro-air vehicle

  • Albert S. Huang
  • Stefanie Tellex
  • Abraham Bachrach
  • Thomas Kollar
  • Deb Roy
  • Nicholas Roy

Natural language is a flexible and intuitive modality for conveying directions and commands to a robot but presents a number of computational challenges. Diverse words and phrases must be mapped into structures that the robot can understand, and elements in those structures must be grounded in an uncertain environment. In this paper we present a micro-air vehicle (MAV) capable of following natural language directions through a previously mapped and labeled environment. We extend our previous work in understanding 2D natural language directions to three dimensions, accommodating new verb modifiers such as go up and go down, and commands such as turn around and face the windows. We demonstrate the robot following directions created by a human for another human, and interactively executing commands in the context of surveillance and search and rescue in confined spaces. In an informal study, 71% of the paths computed from directions given by one user terminated within 10m of the desired destination.

ICRA Conference 2009 Conference Paper

Utilizing object-object and object-scene context when planning to find things

  • Thomas Kollar
  • Nicholas Roy

In this paper, our goal is to search for a novel object, where we have a prior map of the environment and knowledge of some of the objects in it, but no information about the location of the specific novel object. We develop a probabilistic model over possible object locations that utilizes object-object and object-scene context. This model can be queried for any of over 25,000 naturally occurring objects in the world and is trained from labeled data acquired from the captions of photos on the Flickr website. We show that these simple models based on object co-occurrences perform surprisingly well at localizing arbitrary objects in an office setting. In addition, we show how to compute paths that minimize the expected distance to the query object and show that this approach performs better than a greedy approach. Finally, we give preliminary results for grounding our approach in object classifiers.

ICRA Conference 2009 Conference Paper

Where to go: Interpreting natural directions using global inference

  • Yuan Wei
  • Emma Brunskill
  • Thomas Kollar
  • Nicholas Roy

An important component of human-robot interaction is that people need to be able to instruct robots to move to other locations using naturally given directions. When giving directions, people often make mistakes such as labelling errors (e.g., left vs. right) and errors of omission (skipping important decision points in a sequence). Furthermore, people often use multiple levels of granularity in specifying directions, referring to locations using single object landmarks, multiple landmarks in a given location, or identifying large regions as a single location. The challenge is to identify the correct path to a destination from a sequence of noisy, possibly erroneous directions. In our work we cast this problem as probabilistic inference: given a set of directions, an agent should automatically find the path with the geometry and physical appearance to maximize the likelihood of those directions. We use a specific variant of a Markov Random Field (MRF) to represent our model, and gather multi-granularity representation information using existing large tagged datasets. On a dataset of route directions collected in a large third floor university building, we found that our algorithm correctly inferred the true final destination in 47 out of the 55 cases successfully followed by human volunteers. These results suggest that our algorithm is performing well relative to human users. In the future this work will be included in a broader system for autonomously constructing environmental representations that support natural human-robot interaction for direction giving.

AAAI Conference 2008 Conference Paper

Efficient Optimization of Information-Theoretic Exploration in SLAM

  • Thomas Kollar

We present a novel method for information-theoretic exploration, leveraging recent work on mapping and localization. We describe exploration as the constrained optimization problem of computing a trajectory to minimize posterior map error, subject to the constraints of traveling through a set of sensing locations to ensure map coverage. This trajectory is found by reducing the map to a skeleton graph and searching for a minimum entropy tour through the graph. We describe how a specific factorization of the map covariance allows the reuse of EKF updates during the optimization, giving an efficient gradient ascent search for the maximum information gain tour through sensing locations, where each tour naturally incorporates revisiting well-known map regions. By generating incrementally larger tours as the exploration finds new regions of the environment, we demonstrate that our approach can perform autonomous exploration with improved accuracy.

IROS Conference 2007 Conference Paper

Collision detection in legged locomotion using supervised learning

  • Finale Doshi
  • Emma Brunskill
  • Alexander C. Shkolnik
  • Thomas Kollar
  • Khashayar Rohanimanesh
  • Russ Tedrake
  • Nicholas Roy

We propose a fast approach for detecting collision-free swing-foot trajectories for legged locomotion over extreme terrains. Instead of simulating the swing trajectories and checking for collisions along them, our approach uses machine learning techniques to predict whether a swing trajectory is collision-free. Using a set of local terrain features, we apply supervised learning to train a classifier to predict collisions. Both in simulation and on a real quadruped platform, our results show that our classifiers can improve the accuracy of collision detection compared to a real-time geometric approach without significantly increasing the computation time.
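
The approach amounts to standard supervised classification over local terrain features. A self-contained sketch with synthetic data; the paper's feature set and classifier family may differ:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))              # local terrain features (illustrative)
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic collision labels

# Train on most of the data, check collision-prediction accuracy on the rest.
clf = GradientBoostingClassifier().fit(X[:400], y[:400])
print("held-out accuracy:", clf.score(X[400:], y[400:]))
```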

IROS Conference 2007 Conference Paper

Topological mapping using spectral clustering and classification

  • Emma Brunskill
  • Thomas Kollar
  • Nicholas Roy

In this work we present an online method for generating topological maps from raw sensor information. We first describe an algorithm to automatically decompose a map into submap segments using a graph partitioning technique known as spectral clustering. We then describe how to train a classifier to recognize graph submaps from laser signatures using the AdaBoost machine learning algorithm. We demonstrate that we can perform topological mapping by incrementally segmenting the world as the robot moves through its environment, and we can close the loop when the learned classifier recognizes that the robot has returned to a previously visited location.
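
The first stage, decomposing a map graph into submaps, maps directly onto off-the-shelf spectral clustering over an affinity matrix. A toy sketch; the paper's affinity construction from raw sensor data is more involved:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy affinity matrix over four map nodes: two tightly connected pairs
# joined by a weak link, standing in for a real map's adjacency weights.
affinity = np.array([[1.0, 0.9, 0.1, 0.0],
                     [0.9, 1.0, 0.1, 0.0],
                     [0.1, 0.1, 1.0, 0.8],
                     [0.0, 0.0, 0.8, 1.0]])

labels = SpectralClustering(n_clusters=2,
                            affinity="precomputed").fit_predict(affinity)
print(labels)  # e.g., [0 0 1 1]: two submap segments
```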

ICRA Conference 2006 Conference Paper

Using Reinforcement Learning to Improve Exploration Trajectories for Error Minimization

  • Thomas Kollar
  • Nicholas Roy

The mapping and localization problems have received considerable attention in robotics recently. The exploration problem that drives mapping has started to generate similar attention, as the ease of construction and quality of map is strongly dependent on the strategy used to acquire sensor data for the map. Most exploration strategies concentrate on selecting the next best measurement to take, trading off information gathering for regular relocalization. What has not been studied so far is the effect the robot controller has on the map quality while executing exploration plans. Certain kinds of robot motion (e.g., sharp turns) are hard to estimate correctly, and increase the likelihood of errors in the mapping process. We show how reinforcement learning can be used to generate good motion control while executing a simple information gathering exploration strategy. We show that the learned policy reduces the overall map uncertainty by reducing the amount of uncertainty generated by robot motion.