Arrow Research search

Author name cluster

Danfei Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

TMLR Journal 2026 Journal Article

Continual Robot Learning via Language-Guided Skill Acquisition

  • Shuo Cheng
  • Zhaoyi Li
  • Kelin Yu
  • Danfei Xu

To support daily human tasks, robots need to tackle complex, long-horizon tasks and continuously acquire new skills to handle new problems. Deep Reinforcement Learning (DRL) offers potential for learning fine-grained skills but relies heavily on human-defined rewards and faces challenges with long-horizon goals. Task and Motion Planning (TAMP) is adept at handling long-horizon tasks but often needs tailored domain-specific skills, resulting in practical limitations and inefficiencies. To overcome these complementary limitations, we propose LG-SAIL (Language Models Guided Sequential, Adaptive, and Incremental Skill Learning), a framework that leverages Large Language Models (LLMs) to synergistically integrate TAMP and DRL for continuous skill learning in long-horizon tasks. Our framework achieves automatic task decomposition, operator creation, and dense reward generation for efficiently acquiring the desired skills. To facilitate new skill learning, our framework maintains a symbolic skill library and utilizes the existing model from semantically related skills to warm-start the training. LG-SAIL demonstrates superior performance compared to baselines in six challenging simulated task domains across two benchmarks. Furthermore, we demonstrate the ability to reuse learned skills to expedite learning in new task domains, and deploy the system on a physical robot platform. More results on the website: https://sites.google.com/view/continuallearning.

ICRA Conference 2025 Conference Paper

DreamDrive: Generative 4D Scene Modeling from Street View Images

  • Jiageng Mao
  • Boyi Li 0001
  • Boris Ivanovic
  • Yuxiao Chen 0008
  • Yan Wang 0051
  • Yurong You
  • Chaowei Xiao
  • Danfei Xu

Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and in-the-wild driving data demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.

NeurIPS Conference 2025 Conference Paper

EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data

  • Ryan Punamiya
  • Dhruv Patel
  • Patcharapong Aphiwetsa
  • Pranav Kuppili
  • Lawrence Zhu
  • Simar Kareer
  • Judy Hoffman
  • Danfei Xu

Egocentric human experience data presents a vast resource for scaling up end-to-end imitation learning for robotic manipulation. However, significant domain gaps in visual appearance, sensor modalities, and kinematics between human and robot impede knowledge transfer. This paper presents EgoBridge, a unified co-training framework that explicitly aligns the policy latent spaces between human and robot data using domain adaptation. Through a measure of discrepancy on the joint policy latent features and actions based on Optimal Transport (OT), we learn observation representations that not only align between the human and robot domains but also preserve the action-relevant information critical for policy learning. EgoBridge achieves a significant absolute policy success rate improvement of 44% over human-augmented cross-embodiment baselines in three real-world single-arm and bimanual manipulation tasks. EgoBridge also generalizes to new objects, scenes, and tasks seen only in human data, where baselines fail entirely. Videos and additional information can be found at https://ego-bridge.github.io/

ICRA Conference 2025 Conference Paper

EgoMimic: Scaling Imitation Learning via Egocentric Video

  • Simar Kareer
  • Dhruv Patel
  • Ryan Punamiya
  • Pranay Mathur
  • Shuo Cheng
  • Chen Wang 0053
  • Judy Hoffman
  • Danfei Xu

The scale and diversity of demonstration data required for imitation learning is a significant challenge. We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data, specifically egocentric human videos paired with 3D hand tracking. EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvement on a diverse set of long-horizon, single-arm and bimanual manipulation tasks over state-of-the-art imitation learning methods and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data. Videos and additional information can be found at https://egomimic.github.io/

NeurIPS Conference 2025 Conference Paper

Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

  • Shuo Cheng
  • Liqian Ma
  • Zhenyang Chen
  • Ajay Mandlekar
  • Caelan Garrett
  • Danfei Xu

Behavior cloning has shown promise for robot manipulation, but real-world demonstrations are costly to acquire at scale. While simulated data offers a scalable alternative, particularly with advances in automated demonstration generation, transferring policies to the real world is hampered by various gaps between the simulation and real domains. In this work, we propose a unified sim-and-real co-training framework for learning generalizable manipulation policies that primarily leverages simulation and only requires a few real-world demonstrations. Central to our approach is learning a domain-invariant, task-relevant feature space. Our key insight is that aligning the joint distributions of observations and their corresponding actions across domains provides a richer signal than aligning observations (marginals) alone. We achieve this by embedding an Optimal Transport (OT)-inspired loss within the co-training framework, and extend this to an Unbalanced OT formulation to handle the imbalance between abundant simulation data and limited real-world examples. We validate our method on challenging manipulation tasks, showing it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate and even generalize to scenarios seen only in simulation.
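To make the abstract's central idea concrete, the sketch below computes an entropic-regularized Optimal Transport cost between joint (feature, action) samples from simulation and from real demonstrations, the kind of quantity one could add as an auxiliary alignment loss during co-training. This is an illustrative sketch only: the function name, array shapes, uniform marginals, and the plain Sinkhorn solver are assumptions, not the paper's implementation (which also describes an Unbalanced OT variant).

```python
import numpy as np

def sinkhorn_alignment_cost(sim_feats, sim_actions, real_feats, real_actions,
                            eps=0.05, n_iters=200):
    """Entropic-regularized OT cost between joint (feature, action) samples.

    sim_feats/real_feats: (n, d_f) and (m, d_f) policy latent features.
    sim_actions/real_actions: (n, d_a) and (m, d_a) corresponding actions.
    Returns a scalar that could serve as an auxiliary alignment loss.
    """
    x = np.concatenate([sim_feats, sim_actions], axis=1)    # joint sim samples
    y = np.concatenate([real_feats, real_actions], axis=1)  # joint real samples
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)      # pairwise squared distances
    C = C / C.mean()                                         # rescale for numerical stability
    a = np.full(len(x), 1.0 / len(x))                        # uniform marginal over sim samples
    b = np.full(len(y), 1.0 / len(y))                        # uniform marginal over real samples
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):                                 # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                          # approximate transport plan
    return float((P * C).sum())

# Toy usage with random stand-ins for learned features and actions.
rng = np.random.default_rng(0)
sim_f, sim_a = rng.normal(size=(64, 16)), rng.normal(size=(64, 7))
real_f, real_a = rng.normal(loc=0.5, size=(32, 16)), rng.normal(size=(32, 7))
print(sinkhorn_alignment_cost(sim_f, sim_a, real_f, real_a))
```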

NeurIPS Conference 2025 Conference Paper

Generative Trajectory Stitching through Diffusion Composition

  • Yunhao Luo
  • Utkarsh Mishra
  • Yilun Du
  • Danfei Xu

Effective trajectory stitching for long-horizon planning is a significant challenge in robotic decision-making. While diffusion models have shown promise in planning, they are limited to solving tasks similar to those seen in their training data. We propose CompDiffuser, a novel generative approach that can solve new tasks by learning to compositionally stitch together shorter trajectory chunks from previously seen tasks. Our key insight is modeling the trajectory distribution by subdividing it into overlapping chunks and learning their conditional relationships through a single bidirectional diffusion model. This allows information to propagate between segments during generation, ensuring physically consistent connections. We conduct experiments on benchmark tasks of various difficulties, covering different environment sizes, agent state dimensions, trajectory types, and training data quality, and show that CompDiffuser significantly outperforms existing methods.

ICLR Conference 2025 Conference Paper

LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models

  • Ziqi Lu
  • Heng Yang
  • Danfei Xu
  • Boyi Li 0001
  • Boris Ivanovic
  • Marco Pavone 0001
  • Yue Wang 0041

Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging circumstances, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline to specialize the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and fine-tune the models using low-rank adaptation (LoRA) on the pseudo-labeled data. Our method does not require any external priors or manual labels. It completes the self-calibration process on a single standard GPU within just 5 minutes. Each low-rank adapter requires only 18 MB of storage. We evaluated our method on more than 160 scenes from the Replica, TUM and Waymo Open datasets, achieving up to 88% performance improvement on 3D reconstruction, multi-view pose estimation and novel-view rendering. For more details, please visit our project page at https://520xyxyzq.github.io/lora3d/.

ICRA Conference 2025 Conference Paper

RAIL: Reachability-Aided Imitation Learning for Safe Policy Execution

  • Wonsuhk Jung
  • Dennis Anthony
  • Utkarsh A. Mishra
  • Nadun Ranawaka Arachchige
  • Matthew Bronars
  • Danfei Xu
  • Shreyas Kousik

Imitation learning (IL) has shown great success in learning complex robot manipulation tasks. However, there remains a need for practical safety methods to justify widespread deployment. In particular, it is important to certify that a system obeys hard constraints on unsafe behavior in settings where it is unacceptable to design a tradeoff between performance and safety by tuning the policy (i.e., soft constraints). This leads to the question: how does enforcing hard constraints impact the performance (meaning safely completing tasks) of an IL policy? To answer this question, this paper builds a reachability-based safety filter to enforce hard constraints on IL, which we call Reachability-Aided Imitation Learning (RAIL). Through evaluations with state-of-the-art IL policies on mobile-robot and manipulation tasks, we make two key findings. First, the highest-performing policies are sometimes only so because they frequently violate constraints, and lose significant performance under hard constraints. Second, surprisingly, hard constraints on the lower-performing policies can occasionally increase their ability to perform tasks safely. Finally, hardware evaluation confirms the method can operate in real time. More results can be found at our website: https://safe-robotics-lab-gt.github.io/rail/.

ICLR Conference 2025 Conference Paper

STORM: Spatio-TempOral Reconstruction Model For Large-Scale Outdoor Scenes

  • Jiawei Yang 0002
  • Jiahui Huang
  • Boris Ivanovic
  • Yuxiao Chen 0008
  • Yan Wang 0051
  • Boyi Li 0001
  • Yurong You
  • Apoorva Sharma

We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degenerated quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations—parameterized by 3D Gaussians and their velocities—in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding. For more details, please visit our project at https://jiawei-yang.github.io/STORM/.

ICLR Conference 2025 Conference Paper

What Matters in Learning from Large-Scale Datasets for Robot Manipulation

  • Vaibhav Saxena
  • Matthew Bronars
  • Nadun Ranawaka Arachchige
  • Kuancheng Wang
  • Woo-Chul Shin
  • Soroush Nasiriany
  • Ajay Mandlekar
  • Danfei Xu

Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally capable robots. As a result, thousands of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a systematic understanding of what data should be collected to improve the utility of a robotics dataset and facilitate downstream policy learning. In this work, we conduct a large-scale dataset composition study to answer this question. We develop a data generation framework to procedurally emulate common sources of diversity in existing datasets (such as sensor placements and object types and arrangements), and use it to generate large-scale robot datasets with controlled compositions, enabling a suite of dataset composition studies that would be prohibitively expensive in the real world. We focus on two practical settings: (1) what types of diversity should be emphasized when future researchers collect large-scale datasets for robotics, and (2) how current practitioners should retrieve relevant demonstrations from existing datasets to maximize downstream policy performance on tasks of interest. Our study yields several critical insights -- for example, we find that camera poses and spatial arrangements are crucial dimensions for both diversity in collection and alignment in retrieval. In real-world robot learning settings, we find that not only do our insights from simulation carry over, but our retrieval strategies on existing datasets such as DROID allow us to consistently outperform existing training strategies by up to 70%.

TMLR Journal 2024 Journal Article

C3DM: Constrained-Context Conditional Diffusion Models for Imitation Learning

  • Vaibhav Saxena
  • Yotto Koga
  • Danfei Xu

Behavior Cloning (BC) methods are effective at learning complex manipulation tasks. However, they are prone to spurious correlations - expressive models may focus on distractors that are irrelevant to action prediction - and are thus fragile in real-world deployment. Prior methods have addressed this challenge by exploring different model architectures and action representations. However, none were able to balance sample efficiency and robustness against distractors for solving manipulation tasks with a complex action space. We present the Constrained-Context Conditional Diffusion Model (C3DM), a diffusion model policy for solving 6-DoF robotic manipulation tasks with robustness to distractions that can learn deployable robot policies from as few as five demonstrations. A key component of C3DM is a fixation step that helps the action denoiser focus on task-relevant regions around a predicted fixation point while ignoring distractors in the context. We empirically show that C3DM is robust to out-of-distribution distractors, and consistently achieves high success rates on a wide array of tasks, ranging from table-top manipulation to industrial kitting, which require varying levels of precision and robustness to distractors.

ICLR Conference 2024 Conference Paper

EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision

  • Jiawei Yang 0002
  • Boris Ivanovic
  • Or Litany
  • Xinshuo Weng
  • Seung Wook Kim 0001
  • Boyi Li 0001
  • Tong Che
  • Danfei Xu

We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings. See the project page for code, data, and pre-trained model requests: https://emernerf.github.io

NeurIPS Conference 2024 Conference Paper

Large Spatial Model: End-to-end Unposed Images to Semantic 3D

  • Zhiwen Fan
  • Jian Zhang
  • Wenyan Cong
  • Peihao Wang
  • Renjie Li
  • Kairun Wen
  • Shijie Zhou
  • Achuta Kadambi

Reconstructing and understanding 3D structures from a limited number of images is a classical problem in computer vision. Traditional approaches typically decompose this task into multiple subtasks, involving several stages of complex mappings between different data representations. For example, dense reconstruction using Structure-from-Motion (SfM) requires transforming images into key points, optimizing camera parameters, and estimating structures. Following this, accurate sparse reconstructions are necessary for further dense modeling, which is then input into task-specific neural networks. This multi-stage paradigm leads to significant processing times and engineering complexity. In this work, we introduce the Large Spatial Model (LSM), which directly processes unposed RGB images into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward pass and can synthesize versatile label maps by interacting through language at novel views. Built on a general Transformer-based framework, LSM predicts global geometry via pixel-aligned point maps. To improve spatial attribute regression, we adopt local context aggregation with multi-scale fusion, enhancing the accuracy of fine local details. To address the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder parameterizes a set of semantic anisotropic Gaussians, allowing supervised end-to-end learning. Comprehensive experiments on various tasks demonstrate that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.

ICRA Conference 2024 Conference Paper

Open X-Embodiment: Robotic Learning Datasets and RT-X Models: Open X-Embodiment Collaboration

  • Abby O'Neill
  • Abdul Rehman
  • Abhiram Maddukuri
  • Abhishek Gupta 0004
  • Abhishek Padalkar
  • Abraham Lee
  • Acorn Pooley
  • Agrim Gupta

Large, high-capacity models trained on diverse datasets have shown remarkable success in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is robotics-transformer-x.github.io.

ICRA Conference 2023 Conference Paper

BITS: Bi-level Imitation for Traffic Simulation

  • Danfei Xu
  • Yuxiao Chen 0008
  • Boris Ivanovic
  • Marco Pavone 0001

Simulation is the key to scaling up validation and verification for robotic systems such as autonomous vehicles. Despite advances in high-fidelity physics and sensor simulation, a critical gap remains in simulating realistic behaviors of road users. This is because devising first principle models for human-like behaviors is generally infeasible. In this work, we take a data-driven approach to generate traffic behaviors from real-world driving logs. The method achieves high sample efficiency and behavior diversity by exploiting the bi-level hierarchy of high-level intent inference and low-level driving behavior imitation. The method also incorporates a planning module to obtain stable long-horizon behaviors. We empirically validate our method with scenarios from two large-scale driving datasets and show our method achieves balanced traffic simulation performance in realism, diversity, and long-horizon stability. We also explore ways to evaluate behavior realism and introduce a suite of evaluation metrics for traffic simulation. Finally, as part of our core contributions, we develop and open source a software tool that unifies data formats across different driving datasets and converts scenes from existing datasets into interactive simulation environments. For video results and code release, see https://bit.ly/3L9uzj3.

ICRA Conference 2023 Conference Paper

Guided Conditional Diffusion for Controllable Traffic Simulation

  • Ziyuan Zhong
  • Davis Rempe
  • Danfei Xu
  • Yuxiao Chen 0008
  • Sushant Veer
  • Tong Che
  • Baishakhi Ray
  • Marco Pavone 0001

Controllable and realistic traffic simulation is critical for developing and verifying autonomous vehicles. Typical heuristic-based traffic models offer flexible control to make vehicles follow specific trajectories and traffic rules. On the other hand, data-driven approaches generate realistic and human-like behaviors, improving transfer from simulated to real-world traffic. However, to the best of our knowledge, no traffic model offers both controllability and realism. In this work, we develop a conditional diffusion model for controllable traffic generation (CTG) that allows users to control desired properties of trajectories at test time (e.g., reach a goal or follow a speed limit) while maintaining realism and physical feasibility through enforced dynamics. The key technical idea is to leverage recent advances from diffusion modeling and differentiable logic to guide generated trajectories to meet rules defined using signal temporal logic (STL). We further extend guidance to multi-agent settings and enable interaction-based rules like collision avoidance. CTG is extensively evaluated on the nuScenes dataset for diverse and composite rules, demonstrating improvement over strong baselines in terms of the controllability-realism tradeoff. Demo videos can be found at https://aiasd.github.io/ctg.github.io

ICRA Conference 2023 Conference Paper

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

  • Ishika Singh
  • Valts Blukis
  • Arsalan Mousavian
  • Ankit Goyal 0001
  • Danfei Xu
  • Jonathan Tremblay
  • Dieter Fox
  • Jesse Thomason

Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state-of-the-art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website at progprompt.github.io
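As a rough illustration of the "program-like specifications plus example programs" prompting idea described in the abstract above, the snippet below assembles such a prompt as a plain string. The action names, object list, and example program are hypothetical placeholders, not the actual ProgPrompt prompt or its API.

```python
# Illustrative only: a program-like planning prompt in the spirit of the abstract.
# The action set, objects, and example program are made-up placeholders.
ACTIONS = ["grab(obj)", "put(obj, location)", "open(obj)", "close(obj)", "switch_on(obj)"]
OBJECTS = ["salmon", "microwave", "fridge", "plate", "sink"]

EXAMPLE_PROGRAM = '''def throw_away_leftovers():
    # 1: grab the plate holding the leftovers
    grab("plate")
    # 2: open the garbage can
    open("garbage_can")
    # 3: put the leftovers in the garbage can
    put("leftovers", "garbage_can")'''

def build_prompt(task_name: str) -> str:
    """Concatenate action specs, available objects, an example program, and the
    header of the new task, leaving the LLM to complete the function body."""
    action_specs = "\n".join(f"from actions import {a}" for a in ACTIONS)
    object_list = f"objects = {OBJECTS}"
    return f"{action_specs}\n\n{object_list}\n\n{EXAMPLE_PROGRAM}\n\ndef {task_name}():\n"

print(build_prompt("microwave_salmon"))
```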

ICRA Conference 2021 Conference Paper

Deep Affordance Foresight: Planning Through What Can Be Done in the Future

  • Danfei Xu
  • Ajay Mandlekar
  • Roberto Martín-Martín
  • Yuke Zhu
  • Silvio Savarese
  • Li Fei-Fei 0001

Planning in realistic environments requires searching in large planning spaces. Affordances are a powerful concept to simplify this search, because they model what actions can be successful in a given situation. However, the classical notion of affordance is not suitable for long-horizon planning because it only informs the robot about the immediate outcome of actions instead of what actions are best for achieving a long-term goal. In this paper, we introduce a new affordance representation that enables the robot to reason about the long-term effects of actions by modeling what actions are afforded in the future. Based on the new representation, we develop a learning-to-plan method, Deep Affordance Foresight (DAF), that learns partial environment models of affordances of parameterized motor skills through trial-and-error. We evaluate DAF on two challenging manipulation domains and show that it can effectively learn to carry out multi-step tasks, share learned affordance representations among different tasks, and learn to plan with high-dimensional image inputs.

IROS Conference 2021 Conference Paper

Generalization Through Hand-Eye Coordination: An Action Space for Learning Spatially-Invariant Visuomotor Control

  • Chen Wang 0053
  • Rui Wang
  • Ajay Mandlekar
  • Li Fei-Fei 0001
  • Silvio Savarese
  • Danfei Xu

Imitation Learning (IL) is an effective framework to learn visuomotor skills from offline demonstration data. However, IL methods often fail to generalize to new scene configurations not covered by training data. On the other hand, humans can manipulate objects in varying conditions. Key to such capability is hand-eye coordination, a cognitive ability that enables humans to adaptively direct their movements at task-relevant objects and be invariant to the objects’ absolute spatial location. In this work, we present a learnable action space, Hand-eye Action Networks (HAN), that learns coordinated hand-eye movements from human teleoperated demonstrations. Through a set of challenging multi-stage manipulation tasks, we show that a visuomotor policy equipped with HAN is able to inherit the key spatial invariance property of hand-eye coordination and achieve generalization to new scene configurations. Additional materials available at https://sites.google.com/stanford.edu/han

ICRA Conference 2020 Conference Paper

6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints

  • Chen Wang 0053
  • Roberto Martín-Martín
  • Danfei Xu
  • Jun Lv
  • Cewu Lu
  • Li Fei-Fei 0001
  • Silvio Savarese
  • Yuke Zhu

We present 6-PACK, a deep learning approach to category-level 6D object pose tracking on RGB-D data. Our method tracks in real time novel object instances of known object categories such as bowls, laptops, and mugs. 6-PACK learns to compactly represent an object by a handful of 3D keypoints, based on which the interframe motion of an object instance can be estimated through keypoint matching. These keypoints are learned end-to-end without manual supervision in order to be most effective for tracking. Our experiments show that our method substantially outperforms existing methods on the NOCS category-level 6D pose estimation benchmark and supports a physical robot to perform simple vision-based closed-loop manipulation tasks. Our code and video are available at https://sites.google.com/view/6packtracking.

IROS Conference 2019 Conference Paper

Continuous Relaxation of Symbolic Planner for One-Shot Imitation Learning

  • De-An Huang
  • Danfei Xu
  • Yuke Zhu
  • Animesh Garg
  • Silvio Savarese
  • Li Fei-Fei 0001
  • Juan Carlos Niebles

We address one-shot imitation learning, where the goal is to execute a previously unseen task based on a single demonstration. While there has been exciting progress in this direction, most of the approaches still require a few hundred tasks for meta-training, which limits the scalability of the approaches. Our main contribution is to formulate one-shot imitation learning as a symbolic planning problem along with the symbol grounding problem. This formulation disentangles the policy execution from the inter-task generalization and leads to better data efficiency. The key technical challenge is that the symbol grounding is prone to error with limited training data and leads to subsequent symbolic planning failures. We address this challenge by proposing a continuous relaxation of the discrete symbolic planner that directly plans on the probabilistic outputs of the symbol grounding model. Our continuous relaxation of the planner can still leverage the information contained in the probabilistic symbol grounding and significantly improve over the baseline planner for the one-shot imitation learning tasks without using large training data.

NeurIPS Conference 2019 Conference Paper

Regression Planning Networks

  • Danfei Xu
  • Roberto Martín-Martín
  • De-An Huang
  • Yuke Zhu
  • Silvio Savarese
  • Li Fei-Fei

Recent learning-to-plan methods have shown promising results on planning directly from observation space. Yet, their ability to plan for long-horizon tasks is limited by the accuracy of the prediction model. On the other hand, classical symbolic planners show remarkable capabilities in solving long-horizon tasks, but they require predefined symbolic rules and symbolic states, restricting their real-world applicability. In this work, we combine the benefits of these two paradigms and propose a learning-to-plan method that can directly generate a long-term symbolic plan conditioned on high-dimensional observations. We borrow the idea of regression (backward) planning from classical planning literature and introduce Regression Planning Networks (RPN), a neural network architecture that plans backward starting at a task goal and generates a sequence of intermediate goals that reaches the current observation. We show that our model not only inherits many favorable traits from symbolic planning --including the ability to solve previously unseen tasks-- but also can learn from visual inputs in an end-to-end manner. We evaluate the capabilities of RPN in a grid world environment and a simulated 3D kitchen environment featuring complex visual scenes and long task horizons, and show that it achieves near-optimal performance in completely new task instances.
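The "regression (backward) planning" idea that RPN borrows can be illustrated with a tiny symbolic backward-chainer over precondition/effect operators: start from the goal, pick an operator that achieves an unsatisfied subgoal, and regress its preconditions until the current state satisfies everything. The operators and predicates below are invented for illustration; RPN itself learns to produce such intermediate goals directly from high-dimensional observations rather than from hand-written rules.

```python
# Minimal backward (regression) planning sketch over symbolic operators.
# Operators and predicates are hypothetical stand-ins, not from the paper.
OPERATORS = {
    "cook(meat)":  {"pre": {"in(meat, pot)", "on(pot, stove)"}, "eff": {"cooked(meat)"}},
    "place(meat)": {"pre": {"holding(meat)"},                   "eff": {"in(meat, pot)"}},
    "move(pot)":   {"pre": {"holding(pot)"},                    "eff": {"on(pot, stove)"}},
    "pick(meat)":  {"pre": set(),                               "eff": {"holding(meat)"}},
    "pick(pot)":   {"pre": set(),                               "eff": {"holding(pot)"}},
}

def regress(goal, state, used=frozenset()):
    """Pick an unsatisfied subgoal, find an operator achieving it, and regress
    that operator's preconditions until the current state satisfies the goal."""
    unsatisfied = goal - state
    if not unsatisfied:
        return []
    subgoal = next(iter(sorted(unsatisfied)))
    for name, op in OPERATORS.items():
        if subgoal in op["eff"] and name not in used:       # simple loop guard
            rest = regress((goal - op["eff"]) | op["pre"], state, used | {name})
            if rest is not None:
                return rest + [name]                         # append achiever after its subplan
    return None

state = {"on(pot, stove)"}                  # current (observed) symbolic state
goal = {"cooked(meat)"}
print(regress(goal, state))                 # ['pick(meat)', 'place(meat)', 'cook(meat)']
```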

ICRA Conference 2018 Conference Paper

Neural Task Programming: Learning to Generalize Across Hierarchical Tasks

  • Danfei Xu
  • Suraj Nair 0003
  • Yuke Zhu
  • Julian Gao
  • Animesh Garg
  • Li Fei-Fei 0001
  • Silvio Savarese

In this work, we propose a novel robot learning framework called Neural Task Programming (NTP), which bridges the idea of few-shot learning from demonstration and neural program induction. NTP takes as input a task specification (e.g., a video demonstration of a task) and recursively decomposes it into finer sub-task specifications. These specifications are fed to a hierarchical neural program, where bottom-level programs are callable subroutines that interact with the environment. We validate our method in three robot manipulation tasks. NTP achieves strong generalization across sequential tasks that exhibit hierarchical and compositional structures. The experimental results show that NTP learns to generalize well towards unseen tasks with increasing lengths, variable topologies, and changing objectives. stanfordvl.github.io/ntp/.

ICRA Conference 2016 Conference Paper

Multi-sensor surface analysis for robotic ironing

  • Yinxiao Li
  • Xiuhan Hu
  • Danfei Xu
  • Yonghao Yue
  • Eitan Grinspun
  • Peter K. Allen

Robotic manipulation of deformable objects remains a challenging task. One such task is to iron a piece of cloth autonomously. Given a roughly flattened cloth, the goal is to produce an ironing plan that lets a robot iteratively apply a regular iron to remove all the major wrinkles. We present a novel solution to analyze the cloth surface by fusing two surface scan techniques: a curvature scan and a discontinuity scan. The curvature scan can estimate the height deviation of the cloth surface, while the discontinuity scan can effectively detect sharp surface features, such as wrinkles. We use this information to detect the regions that need to be pulled and extended before ironing, and the other regions where we want to detect wrinkles and apply ironing to remove the wrinkles. We demonstrate that our hybrid scan technique is able to capture and classify wrinkles over the surface robustly. Given detected wrinkles, we enable a robot to iron them using shape features. Experimental results show that using our wrinkle analysis algorithm, our robot is able to iron the cloth surface and effectively remove the wrinkles.

IROS Conference 2015 Conference Paper

Folding deformable objects using predictive simulation and trajectory optimization

  • Yinxiao Li
  • Yonghao Yue
  • Danfei Xu
  • Eitan Grinspun
  • Peter K. Allen

Robotic manipulation of deformable objects remains a challenging task. One such task is folding a garment autonomously. Given start and end folding positions, what is an optimal trajectory to move the robotic arm to fold a garment? Certain trajectories will cause the garment to move, creating wrinkles and gaps, while other trajectories will fail altogether. We present a novel solution to find an optimal trajectory that avoids such problematic scenarios. The trajectory is optimized by minimizing a quadratic objective function in an off-line simulator, which includes the material properties of the garment and the frictional force on the table. The function measures the dissimilarity between a user-folded shape and the folded garment in simulation, which is then used as an error measurement to create an optimal trajectory. We demonstrate that our two-arm robot can follow the optimized trajectories, achieving accurate and efficient manipulation of deformable objects.

ICRA Conference 2015 Conference Paper

Regrasping and unfolding of garments using predictive thin shell modeling

  • Yinxiao Li
  • Danfei Xu
  • Yonghao Yue
  • Yan Wang 0059
  • Shih-Fu Chang
  • Eitan Grinspun
  • Peter K. Allen

Deformable objects such as garments are highly unstructured, making them difficult to recognize and manipulate. In this paper, we propose a novel method to teach a two-arm robot to efficiently track the states of a garment from an unknown state to a known state by iterative regrasping. The problem is formulated as a constrained weighted evaluation metric for the two desired grasping points during regrasping, which can also serve as a convergence criterion. The result is then adopted as an estimate to initialize a regrasp, which is then considered a new state for evaluation. The process stops when the predicted thin shell conclusively agrees with the reconstruction. We show experimental results for regrasping a number of different garments, including sweaters, knitwear, pants, and leggings.

IROS Conference 2014 Conference Paper

Topometric localization on a road network

  • Danfei Xu
  • Hernán Badino
  • Daniel F. Huber

Current GPS-based devices have difficulty localizing in cases where the GPS signal is unavailable or insufficiently accurate. This paper presents an algorithm for localizing a vehicle on an arbitrary road network using vision, road curvature estimates, or a combination of both. The method uses an extension of topometric localization, which is a hybrid between topological and metric localization. The extension enables localization on a network of roads rather than just a single, non-branching route. The algorithm, which does not rely on GPS, is able to localize reliably in situations where GPS-based devices fail, including “urban canyons” in downtown areas and along ambiguous routes with parallel roads. We demonstrate the algorithm experimentally on several road networks in urban, suburban, and highway scenarios. We also evaluate the road curvature descriptor and show that it is effective when imagery is sparsely available.

ICRA Conference 2013 Conference Paper

Tactile identification of objects using Bayesian exploration

  • Danfei Xu
  • Gerald E. Loeb
  • Jeremy A. Fishel

In order to endow robots with human-like tactile sensory abilities, they must be provided with tactile sensors and intelligent algorithms to select and control useful exploratory movements and interpret data from all available sensors. Current robotic systems do not possess such sensors or algorithms. In this study we integrate multimodal tactile sensing (force, vibration and temperature) from the BioTac® with a Shadow Dexterous Hand and program the robot to make exploratory movements similar to those humans make when identifying objects by their compliance, texture, and thermal properties. Signal processing strategies were developed to provide measures of these perceptual properties. When identifying an object, exploratory movements are intelligently selected using a process we have previously developed called Bayesian exploration [1], whereby exploratory movements that provide the most disambiguation between likely candidates of objects are automatically selected. The exploration algorithm was augmented with reinforcement learning whereby its internal representations of objects evolved according to its cumulative experience with them. This allowed the algorithm to compensate for drift in the performance of the anthropomorphic robot hand and the ambient conditions of testing, improving accuracy while reducing the number of exploratory movements required to identify an object. The robot correctly identified 10 different objects on 99 out of 100 presentations.
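A toy sketch of the Bayesian-exploration idea described in the abstract: maintain a belief over candidate objects, choose the exploratory movement whose measurement is expected to disambiguate the most (here, via expected posterior entropy under Gaussian measurement models), and update the belief with each observation. The object properties, noise model, and movement names are illustrative assumptions, not the paper's calibrated sensor models or its reinforcement-learning extension.

```python
import numpy as np

# Hypothetical per-object expected readings for each exploratory movement
# (rows: objects, columns: movements), plus per-movement measurement noise.
MOVEMENTS = ["press", "slide", "hold"]          # compliance, texture, thermal cues
MEANS = np.array([[0.2, 0.7, 0.4],              # object A
                  [0.2, 0.3, 0.9],              # object B
                  [0.8, 0.3, 0.4]])             # object C
SIGMA = np.array([0.10, 0.15, 0.10])

def likelihood(z, movement_idx):
    """Gaussian likelihood of reading z for every object under one movement."""
    mu, s = MEANS[:, movement_idx], SIGMA[movement_idx]
    return np.exp(-0.5 * ((z - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def expected_posterior_entropy(belief, movement_idx, n_samples=2000, seed=0):
    """Monte-Carlo estimate of the posterior entropy after one movement."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        obj = rng.choice(len(belief), p=belief)                  # sample a hypothetical true object
        z = rng.normal(MEANS[obj, movement_idx], SIGMA[movement_idx])
        post = belief * likelihood(z, movement_idx)
        total += entropy(post / post.sum())
    return total / n_samples

belief = np.full(3, 1 / 3)                                       # uniform prior over objects
scores = [expected_posterior_entropy(belief, i) for i in range(len(MOVEMENTS))]
best = int(np.argmin(scores))                                    # most disambiguating movement
print("next movement:", MOVEMENTS[best])

# Belief update after actually observing a reading z from that movement.
z_observed = 0.75
belief = belief * likelihood(z_observed, best)
belief /= belief.sum()
print("posterior:", belief.round(3))
```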