Author name cluster

Pete Florence

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers

2 author rows

ICRA Conference 2024 Conference Paper

RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Pierre Sermanet
Tianli Ding
Jeffrey Zhao
Fei Xia 0002
Debidatta Dwibedi
Keerthana Gopalakrishnan
Christine Chan
Gabriel Dulac-Arnold

We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2. 2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple embodiments (robot, human, human with grasping tool). With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We explore the economics of collection costs and find that for a fixed budget it is beneficial to take advantage of the cheaper human collection along with robot collection. We release a large and highly diverse (29, 520 unique instructions) dataset dubbed RoboVQA containing 829, 502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making it deployable with human oversight even if imperfect while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa trained on our dataset that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings with a cognitive intervention rate 46% lower than the zeroshot state of the art visual language model (VLM) baseline and is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a lot of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs with an average error rate reduction of 19% across all VQA tasks. Thanks to video conditioning and dataset diversity, the model can be used as general video value functions (e. g. success and affordance) in situations where actions needs to be recognized rather than states, expanding capabilities and environment understanding for robots. Data and videos are available at robovqa. github.io

Details

ICLR Conference 2024 Conference Paper

Video Language Planning

Yilun Du
Sherry Yang 0001
Pete Florence
Fei Xia 0002
Ayzaan Wahid
Brian Ichter
Pierre Sermanet
Tianhe Yu

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains -- from multi-object rearrangement, to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).

Details

ICRA Conference 2023 Conference Paper

Code as Policies: Language Model Programs for Embodied Control

Jacky Liang
Wenlong Huang
Fei Xia 0002
Peng Xu 0010
Karol Hausman
Brian Ichter
Pete Florence
Andy Zeng 0001

Large language models (LLMs) trained on code-completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e. g. , from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code respectively. By chaining classic logic structures and referencing third-party libraries (e. g. , NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e. g. , velocities) to ambiguous descriptions (‘faster’) depending on context (i. e. , behavioral commonsense). This paper presents Code as Policies: a robot-centric formulation of language model generated programs (LMPs) that can represent reactive policies (e. g. , impedance controllers), as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code-gen (recursively defining undefined functions), which can write more complex code and also improves state-of-the-art to solve 39. 8% of problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io

Details

NeurIPS Conference 2023 Conference Paper

Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents

Wenlong Huang
Fei Xia
Dhruv Shah
Danny Driess
Andy Zeng
Yao Lu
Pete Florence
Igor Mordatch

Recent progress in large language models (LLMs) has demonstrated the ability to learn and leverage Internet-scale knowledge through pre-training with autoregressive models. Unfortunately, applying such models to settings with embodied agents, such as robots, is challenging due to their lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints that robots may require. On the other hand, language-conditioned robotic policies that learn from interaction data can provide the necessary grounding that allows the agent to be correctly situated in the real world, but such policies are limited by the lack of high-level semantic understanding due to the limited breadth of the interaction data available for training them. Thus, if we want to make use of the semantic knowledge in a language model while still situating it in an embodied setting, we must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment. We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives. We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting by leveraging the knowledge of both models.

PDF Details

ICML Conference 2023 Conference Paper

PaLM-E: An Embodied Multimodal Language Model

Danny Driess
Fei Xia 0002
Mehdi S. M. Sajjadi
Corey Lynch
Aakanksha Chowdhery
Brian Ichter
Ayzaan Wahid
Jonathan Tompson

Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e. g. for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multimodal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.

Details

ICLR Conference 2023 Conference Paper

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Andy Zeng 0001
Maria Attarian
Brian Ichter
Krzysztof Choromanski
Adrian Wong
Stefan Welker
Federico Tombari
Aveek Purohit

We investigate how multimodal prompt engineering can use language as the intermediate representation to combine complementary knowledge from different pretrained (potentially multimodal) language models for a variety of tasks. This approach is both distinct from and complementary to the dominant paradigm of joint multimodal training. It also recalls a traditional systems-building view as in classical NLP pipelines, but with prompting large pretrained multimodal models. We refer to these as Socratic Models (SMs): a modular class of systems in which multiple pretrained models may be composed zero-shot via multimodal-informed prompting to capture new multimodal capabilities, without additional finetuning. We show that these systems provide competitive state-of-the-art performance for zero-shot image captioning and video-to-text retrieval, and also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes), and (iii) robot perception and planning. We hope this work provides (a) results for stronger zero-shot baseline performance with analysis also highlighting their limitations, (b) new perspectives for building multimodal systems powered by large pretrained models, and (c) practical application advantages in certain regimes limited by data scarcity, training compute, or model access.

Details

ICRA Conference 2023 Conference Paper

Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations

Negin Heravi
Ayzaan Wahid
Corey Lynch
Pete Florence
Travis Armstrong
Jonathan Tompson
Pierre Sermanet
Jeannette Bohg

Perceptual understanding of the scene and the relationship between its different components is important for successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most of the current methodologies learn task specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large, labeled datasets for each task that are expensive to collect in the real-world. Using self-supervised learning to obtain representations from unlabeled data can mitigate this problem. However, current self-supervised representation learning methods are mostly object agnostic, and we demonstrate that the resulting representations are insufficient for general purpose robotics tasks as they fail to capture the complexity of scenes with many components. In this paper, we show the effectiveness of using object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction. We show that our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object agnostic techniques as well as methods trained on raw RGB images. Our results show a 20% increase in performance in low data regimes (1000 trajectories) in policy training using implicit behavioral cloning (IBC). Furthermore, our method outperforms the baselines for the task of object localization in multi-object scenes. Further qualitative results are available at https://sites.google.com/view/slots4robots.

Details

ICRA Conference 2022 Conference Paper

Implicit Kinematic Policies: Unifying Joint and Cartesian Action Spaces in End-to-End Robot Learning

Aditya Ganapathi
Pete Florence
Jake Varley
Kaylee Burns
Ken Goldberg
Andy Zeng 0001

Action representation is an important yet often overlooked aspect in end-to-end robot learning with deep networks. Choosing one action space over another (e. g. target joint positions, or Cartesian end-effector poses) can result in surprisingly stark performance differences between various downstream tasks - and as a result, considerable research has been devoted to finding the right action space for a given application. However, in this work, we instead investigate how our models can discover and learn for themselves which action space to use. Leveraging recent work on implicit behavioral cloning, which takes both observations and actions as input, we demonstrate that it is possible to present the same action in multiple different spaces to the same policy - allowing it to learn inductive patterns from each space. Specifically, we study the benefits of combining Cartesian and joint action spaces in the context of learning manipulation skills. To this end, we present Implicit Kinematic Policies (IKP), which incorporates the kinematic chain as a differentiable module within the deep network. Quantitative experiments across several simulated continuous control tasks-from scooping piles of small objects, to lifting boxes with elbows, to precise block insertion with miscalibrated robots-suggest IKP not only learns complex prehensile and non-prehensile manipulation from pixels better than baseline alternatives, but also can learn to compensate for small joint encoder offset errors. Finally, we also run qualitative experiments on a real UR5e to demonstrate the feasibility of our algorithm on a physical robotic system with real data. See https://tinyurl.com/4wz3nf86 for code and supplementary material.

Details

ICRA Conference 2022 Conference Paper

NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields

Yen-Chen Lin
Pete Florence
Jonathan T. Barron
Tsung-Yi Lin
Alberto Rodriguez 0003
Phillip Isola

Thin, reflective objects such as forks and whisks are common in our daily lives, but they are particularly chal-lenging for robot perception because it is hard to reconstruct them using commodity RGB-D cameras or multi-view stereo techniques. While traditional pipelines struggle with objects like these, Neural Radiance Fields (NeRFs) have recently been shown to be remarkably effective for performing view synthesis on objects with thin structures or reflective materials. In this paper we explore the use of NeRF as a new source of supervision for robust robot vision systems. In particular, we demonstrate that a NeRF representation of a scene can be used to train dense object descriptors. We use an optimized NeRF to extract dense correspondences between multiple views of an object, and then use these correspondences as training data for learning a view-invariant representation of the object. NeRF's usage of a density field allows us to reformulate the correspondence problem with a novel distribution-of-depths formulation, as opposed to the conventional approach of using a depth map. Dense correspondence models supervised with our method significantly outperform off-the-shelf learned descriptors by 106% (PCK@3px metric, more than doubling performance) and outperform our baseline supervised with multi-view stereo by 29%. Furthermore, we demonstrate the learned dense descriptors enable robots to perform accurate 6-degree of freedom (6-DoF) pick and place of thin and reflective objects.

Details

NeurIPS Conference 2022 Conference Paper

Reinforcement Learning with Neural Radiance Fields

Danny Driess
Ingmar Schubert
Pete Florence
Yunzhu Li
Marc Toussaint

It is a long-standing problem to find effective representations for training reinforcement learning (RL) agents. This paper demonstrates that learning state representations with supervision from Neural Radiance Fields (NeRFs) can improve the performance of RL compared to other learned representations or even low-dimensional, hand-engineered state information. Specifically, we propose to train an encoder that maps multiple image observations to a latent space describing the objects in the scene. The decoder built from a latent-conditioned NeRF serves as the supervision signal to learn the latent space. An RL algorithm then operates on the learned latent space as its state representation. We call this NeRF-RL. Our experiments indicate that NeRF as supervision leads to a latent space better suited for the downstream RL tasks involving robotic object manipulations like hanging mugs on hooks, pushing objects, or opening doors. Video: https: //dannydriess. github. io/nerf-rl

PDF Details

ICRA Conference 2022 Conference Paper

VIRDO: Visio-tactile Implicit Representations of Deformable Objects

Youngsun Wi
Pete Florence
Andy Zeng 0001
Nima Fazeli

Deformable object manipulation requires computationally efficient representations that are compatible with robotic sensing modalities. In this paper, we present VIRDO: an implicit, multi-modal, and continuous representation for deformable-elastic objects. VIRDO operates directly on visual (point cloud) and tactile (reaction forces) modalities and learns rich latent embeddings of contact locations and forces to predict object deformations subject to external contacts. Here, we demonstrate VIRDOs ability to: i) produce high-fidelity cross-modal reconstructions with dense unsupervised correspondences, ii) generalize to unseen contact formations, and iii) state-estimation with partial visio-tactile feedback. https://github.com/MMintLab/VIRDO

Details

IROS Conference 2021 Conference Paper

iNeRF: Inverting Neural Radiance Fields for Pose Estimation

Yen-Chen Lin
Pete Florence
Jonathan T. Barron
Alberto Rodriguez 0003
Phillip Isola
Tsung-Yi Lin

We present iNeRF, a framework that performs mesh-free pose estimation by "inverting" a Neural Radiance Field (NeRF). NeRFs have been shown to be remarkably effective for the task of view synthesis — synthesizing photorealistic novel views of real-world scenes or objects. In this work, we investigate whether we can apply analysis-by-synthesis via NeRF for mesh-free, RGB-only 6DoF pose estimation – given an image, find the translation and rotation of a camera relative to a 3D object or scene. Our method assumes that no object mesh models are available during either training or test time. Starting from an initial pose estimate, we use gradient descent to minimize the residual between pixels rendered from a NeRF and pixels in an observed image. In our experiments, we first study 1) how to sample rays during pose refinement for iNeRF to collect informative gradients and 2) how different batch sizes of rays affect iNeRF on a synthetic dataset. We then show that for complex real-world scenes from the LLFF dataset [21], iNeRF can improve NeRF by estimating the camera poses of novel images and using these images as additional training data for NeRF. Finally, we show iNeRF can perform categorylevel object pose estimation, including object instances not seen during training, with RGB images by inverting a NeRF model inferred from a single view.

Details

ICRA Conference 2021 Conference Paper

Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks

Daniel Seita
Pete Florence
Jonathan Tompson
Erwin Coumans
Vikas Sindhwani
Ken Goldberg
Andy Zeng 0001

Rearranging and manipulating deformable objects such as cables, fabrics, and bags is a long-standing challenge in robotic manipulation. The complex dynamics and high-dimensional configuration spaces of deformables, compared to rigid objects, make manipulation difficult not only for multi-step planning, but even for goal specification. Goals cannot be as easily specified as rigid object poses, and may involve complex relative spatial relations such as "place the item inside the bag". In this work, we develop a suite of simulated benchmarks with 1D, 2D, and 3D deformable structures, including tasks that involve image-based goal-conditioning and multi-step deformable manipulation. We propose embedding goal-conditioning into Transporter Networks, a recently proposed model architecture for learning robotic manipulation that rearranges deep features to infer displacements that can represent pick and place actions. We demonstrate that goal-conditioned Transporter Networks enable agents to manipulate deformable structures into flexibly specified configurations without test-time visual anchors for target locations. We also significantly extend prior results using Transporter Networks for manipulating deformable objects by testing on tasks with 2D and 3D deformables. Supplementary material is available at https://berkeleyautomation.github.io/bags/.

Details

ICRA Conference 2018 Conference Paper

Label Fusion: A Pipeline for Generating Ground Truth Labels for Real RGBD Data of Cluttered Scenes

Pat Marion
Pete Florence
Lucas Manuelli
Russ Tedrake

Deep neural network (DNN) architectures have been shown to outperform traditional pipelines for object segmentation and pose estimation using RGBD data, but the performance of these DNN pipelines is directly tied to how representative the training data is of the true data. Hence a key requirement for employing these methods in practice is to have a large set of labeled data for your specific robotic manipulation task, a requirement that is not generally satisfied by existing datasets. In this paper we develop a pipeline to rapidly generate high quality RGBD data with pixelwise labels and object poses. We use an RGBD camera to collect video of a scene from multiple viewpoints and leverage existing reconstruction techniques to produce a 3D dense reconstruction. We label the 3D reconstruction using a human assisted ICP-fitting of object meshes. By reprojecting the results of labeling the 3D scene we can produce labels for each RGBD image of the scene. This pipeline enabled us to collect over 1, 000, 000 labeled object instances in just a few days. We use this dataset to answer questions related to how much training data is required, and of what quality the data must be, to achieve high performance from a DNN architecture. Our dataset and annotation pipeline are available at labelfusion. csail. mit.edu.

Details

ICRA Conference 2018 Conference Paper

NanoMap: Fast, Uncertainty-Aware Proximity Queries with Lazy Search Over Local 3D Data

Pete Florence
John Carter
Jake Ware
Russ Tedrake

We would like robots to be able to safely navigate at high speed, efficiently use local 3D information, and robustly plan motions that consider pose uncertainty of measurements in a local map structure. This is hard to do with previously existing mapping approaches, like occupancy grids, that are focused on incrementally fusing 3D data into a common world frame. In particular, both their fragile sensitivity to state estimation errors and computational cost can be limiting. We develop an alternative framework, NanoMap, which alleviates the need for global map fusion and enables a motion planner to efficiently query pose-uncertainty-aware local 3D geometric information. The key idea of NanoMap is to store a history of noisy relative pose transforms and search over a corresponding set of depth sensor measurements for the minimum-uncertainty view of a queried point in space. This approach affords a variety of capabilities not offered by traditional mapping techniques: (a) the pose uncertainty associated with 3D data can be incorporated in motion planning, (b) poses can be updated (i. e. , from loop closures) with minimal computational effort, and (c) 3D data can be fused lazily for the purpose of planning. We provide an open-source implementation of NanoMap, and analyze its capabilities and computational efficiency in simulation experiments. Finally, we demonstrate in hardware its effectiveness for fast 3D obstacle avoidance onboard a quadrotor flying up to 10 m/s.

Details

ICRA Conference 2016 Conference Paper

Aggressive quadrotor flight through cluttered environments using mixed integer programming

Benoit Landry
Robin Deits
Pete Florence
Russ Tedrake

Quadrotor flight has typically been limited to sparse environments due to numerical complications that arise when dealing with large numbers of obstacles. We hypothesized that it would be possible to plan and robustly execute trajectories in obstacle-dense environments using the novel Iterative Regional Inflation by Semidefinite programming algorithm (IRIS), mixed-integer semidefinite programs (MISDP), and model-based control. Unlike sampling-based approaches, the planning algorithm first introduced by Deits theoretically guarantees non-penetration of the trajectories even with small obstacles such as strings. We present experimental validation of this claim by aggressively flying a small quadrotor (34g, 92mm rotor to rotor) in a series of indoor environments including a cubic meter volume containing 20 interwoven strings, and present the control architecture we developed to do so.

Details