Arrow Research search

Author name cluster

Aravind Rajeswaran

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
2 author rows

Possible papers

31

ICML Conference 2025 Conference Paper

From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs

  • Ang Cao
  • Sergio Arnaud
  • Oleksandr Maksymets
  • Jianing Yang
  • Ayush Jain
  • Ada Martin
  • Vincent-Pierre Berges
  • Paul McVay

3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes–a six-order-of-magnitude gap that severely limits performance. We introduce LIFT-GS, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25. 7% mAP on open-vocabulary instance segmentation (vs. 20. 2% prior SOTA) and consistent 10-30% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2$\times$, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: https: //liftgs. github. io.

ICML Conference 2025 Conference Paper

LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D

  • Paul McVay
  • Sergio Arnaud
  • Ada Martin
  • Arjun Majumdar
  • Krishna Murthy Jatavallabhula
  • Phillip Thomas
  • Ruslan Partsey
  • Daniel Dugas

We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp. " LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models and dataset can be found at the project website: locate3d. atmeta. com

IROS Conference 2024 Conference Paper

From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control

  • Yide Shentu
  • Philipp Wu
  • Aravind Rajeswaran
  • Pieter Abbeel

Hierarchical control for robotics has long been plagued by the need to have a well defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has been emerging as a prospective interface layer. However, this has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e. g. performing a dance routine). Further, it makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method – Latent Codes as Bridges (LCB) – as an alternate architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and Calvin, two common language based benchmarks for embodied agents, we find that LCB outperforms baselines (including those w/ GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.

ICRA Conference 2024 Conference Paper

MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation

  • Patrick Lancaster
  • Nicklas Hansen 0001
  • Aravind Rajeswaran
  • Vikash Kumar

Robotic systems that aspire to operate in uninstrumented real-world environments must perceive the world directly via onboard sensing. Vision-based learning systems aim to eliminate the need for environment instrumentation by building an implicit understanding of the world based on raw pixels, but navigating the contact-rich high-dimensional search space from solely sparse visual reward signals significantly exacerbates the challenge of exploration. The applicability of such systems is thus typically restricted to simulated or heavily engineered environments since agent exploration in the real-world without the guidance of explicit state estimation and dense rewards can lead to unsafe behavior and safety faults that are catastrophic. In this study, we isolate the root causes behind these limitations to develop a system, called MoDem-V2, capable of learning contact-rich manipulation directly in the uninstrumented real world. Building on the latest algorithmic advancements in model-based reinforcement learning (MBRL), demo-bootstrapping, and effective exploration, MoDem-V2 can acquire contact-rich dexterous manipulation skills directly in the real world. We identify key ingredients for leveraging demonstrations in model learning while respecting real-world safety considerations – exploration centering, agency handover, and actor-critic ensembles. We empirically demonstrate the contribution of these ingredients in four complex visuo-motor manipulation problems in both simulation and the real world. To the best of our knowledge, our work presents the first successful system for demonstration-augmented visual MBRL trained directly in the real world. Visit sites. google.com/view/modem-v2 for videos and more details.

RLC Conference 2024 Conference Paper

Policy Architectures for Compositional Generalization in Control

  • Allan Zhou
  • Vikash Kumar
  • Chelsea Finn
  • Aravind Rajeswaran

Many tasks in control, robotics, and planning can be specified using desired goal configurations for various entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. However, current approaches struggle to learn and generalize as task complexity increases, such as variations in number of environment entities or compositions of goals. In this work, we introduce a framework for modeling entity-based compositional structure in tasks, and create suitable policy designs that can leverage this structure. Our policies, which utilize architectures like Deep Sets and Self Attention, are flexible and can be trained end-to-end without requiring any action primitives. When trained using standard reinforcement and imitation learning methods on a suite of simulated robot manipulation tasks, we find that these architectures achieve significantly higher success rates with less data. We also find these architectures enable broader and compositional generalization, producing policies that extrapolate to different numbers of entities than seen in training, and stitch together (i. e. compose) learned skills in novel ways.

RLJ Journal 2024 Journal Article

Policy Architectures for Compositional Generalization in Control

  • Allan Zhou
  • Vikash Kumar
  • Chelsea Finn
  • Aravind Rajeswaran

Many tasks in control, robotics, and planning can be specified using desired goal configurations for various entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. However, current approaches struggle to learn and generalize as task complexity increases, such as variations in number of environment entities or compositions of goals. In this work, we introduce a framework for modeling entity-based compositional structure in tasks, and create suitable policy designs that can leverage this structure. Our policies, which utilize architectures like Deep Sets and Self Attention, are flexible and can be trained end-to-end without requiring any action primitives. When trained using standard reinforcement and imitation learning methods on a suite of simulated robot manipulation tasks, we find that these architectures achieve significantly higher success rates with less data. We also find these architectures enable broader and compositional generalization, producing policies that extrapolate to different numbers of entities than seen in training, and stitch together (i.e. compose) learned skills in novel ways.

RLC Conference 2024 Conference Paper

Semi-Supervised One Shot Imitation Learning

  • Philipp Wu
  • Kourosh Hakhamaneshi
  • Yuqing Du
  • Igor Mordatch
  • Aravind Rajeswaran
  • Pieter Abbeel

One-shot Imitation Learning (OSIL) aims to imbue AI agents with the ability to learn a new task from a single demonstration. To supervise the learning, OSIL requires a prohibitively large number of paired expert demonstrations: trajectories corresponding to different variations of the same semantic task. To overcome this limitation, we introduce the semi-supervised OSIL problem setting, where the learning agent is presented with a large dataset of tasks with only one demonstration each (unpaired dataset), along with a small dataset of tasks with multiple demonstrations (paired dataset). This presents a more realistic and practical embodiment of few-shot learning and requires the agent to effectively leverage weak supervision. Subsequently, we develop an algorithm applicable to this semi-supervised OSIL setting. Our approach first learns an embedding space where different tasks cluster uniquely. We utilize this embedding space and the clustering it supports to self-generate pairings between trajectories in the large unpaired dataset. Through empirical results, we demonstrate that OSIL models trained on such self-generated pairings (labels) are competitive with OSIL models trained with ground-truth labels, presenting a major advancement in the label-efficiency of OSIL.

RLJ Journal 2024 Journal Article

Semi-Supervised One Shot Imitation Learning

  • Philipp Wu
  • Kourosh Hakhamaneshi
  • Yuqing Du
  • Igor Mordatch
  • Aravind Rajeswaran
  • Pieter Abbeel

One-shot Imitation Learning (OSIL) aims to imbue AI agents with the ability to learn a new task from a single demonstration. To supervise the learning, OSIL requires a prohibitively large number of paired expert demonstrations: trajectories corresponding to different variations of the same semantic task. To overcome this limitation, we introduce the semi-supervised OSIL problem setting, where the learning agent is presented with a large dataset of tasks with only one demonstration each (unpaired dataset), along with a small dataset of tasks with multiple demonstrations (paired dataset). This presents a more realistic and practical embodiment of few-shot learning and requires the agent to effectively leverage weak supervision. Subsequently, we develop an algorithm applicable to this semi-supervised OSIL setting. Our approach first learns an embedding space where different tasks cluster uniquely. We utilize this embedding space and the clustering it supports to self-generate pairings between trajectories in the large unpaired dataset. Through empirical results, we demonstrate that OSIL models trained on such self-generated pairings (labels) are competitive with OSIL models trained with ground-truth labels, presenting a major advancement in the label-efficiency of OSIL.

ICRA Conference 2024 Conference Paper

What Do We Learn from a Large-Scale Study of Pre-Trained Visual Representations in Sim and Real Environments?

  • Sneha Silwal
  • Karmesh Yadav
  • Tingfan Wu
  • Jay Vakil
  • Arjun Majumdar
  • Sergio Arnaud
  • Claire Chen
  • Vincent-Pierre Berges

We present a large empirical investigation on the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study involves five different PVRs, each trained for five distinct manipulation or indoor navigation tasks. We performed this evaluation using three different robots and two different policy learning paradigms. From this e ort, we can arrive at three insights: 1) the performance trends of PVRs in the simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data-augmentation and fine-tuning, also transfer to the real-world performance. See project website 1 for additional details and visuals.

ICML Conference 2023 Conference Paper

Masked Trajectory Models for Prediction, Representation, and Control

  • Philipp Wu
  • Arjun Majumdar
  • Kevin Stone
  • Yixin Lin
  • Igor Mordatch
  • Pieter Abbeel
  • Aravind Rajeswaran

We introduce Masked Trajectory Models (MTM) as a generic abstraction for sequential decision making. MTM takes a trajectory, such as a state-action sequence, and aims to reconstruct the trajectory conditioned on random subsets of the same trajectory. By training with a highly randomized masking pattern, MTM learns versatile networks that can take on different roles or capabilities, by simply choosing appropriate masks at inference time. For example, the same MTM network can be used as a forward dynamics model, inverse dynamics model, or even an offline RL agent. Through extensive experiments in several continuous control tasks, we show that the same MTM network – i. e. same weights – can match or outperform specialized networks trained for the aforementioned capabilities. Additionally, we find that state representations learned by MTM can significantly accelerate the learning speed of traditional RL algorithms. Finally, in offline RL benchmarks, we find that MTM is competitive with specialized offline RL algorithms, despite MTM being a generic self-supervised learning method without any explicit RL components. Code is available at https: //github. com/facebookresearch/mtm.

ICLR Conference 2023 Conference Paper

MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations

  • Nicklas Hansen 0001
  • Yixin Lin
  • Hao Su 0001
  • Xiaolong Wang 0004
  • Vikash Kumar
  • Aravind Rajeswaran

Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which forms the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 160%-250% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100k interaction steps, 5 demonstrations). Code and videos are available at https://nicklashansen.github.io/modemrl.

ICML Conference 2023 Conference Paper

On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline

  • Nicklas Hansen 0001
  • Zhecheng Yuan
  • Yanjie Ze
  • Tongzhou Mu
  • Aravind Rajeswaran
  • Hao Su 0001
  • Huazhe Xu
  • Xiaolong Wang 0004

In this paper, we examine the effectiveness of pre-training for visuo-motor control tasks. We revisit a simple Learning-from-Scratch (LfS) baseline that incorporates data augmentation and a shallow ConvNet, and find that this baseline is surprisingly competitive with recent approaches (PVR, MVP, R3M) that leverage frozen visual representations trained on large-scale vision datasets – across a variety of algorithms, task domains, and metrics in simulation and on a real robot. Our results demonstrate that these methods are hindered by a significant domain gap between the pre-training datasets and current benchmarks for visuo-motor control, which is alleviated by finetuning. Based on our findings, we provide recommendations for future research in pre-training for control and hope that our simple yet strong baseline will aid in accurately benchmarking progress in this area. Code: https: //github. com/gemcollector/learning-from-scratch.

ICRA Conference 2023 Conference Paper

Real World Offline Reinforcement Learning with Realistic Data Source

  • Gaoyue Zhou
  • Liyiming Ke
  • Siddhartha S. Srinivasa
  • Abhinav Gupta 0001
  • Aravind Rajeswaran
  • Vikash Kumar

Offline reinforcement learning (ORL) holds great promise for robot learning due to its ability to learn from arbitrary pre-generated experience. However, current ORL benchmarks are almost entirely in simulation and utilize contrived datasets like replay buffers of online RL agents or sub-optimal trajectories, and thus hold limited relevance for real-world robotics. In this work (Real-ORL), we posit that data collected from safe operations of closely related tasks are more practical data sources for real-world robot learning. Under these settings, we perform an extensive (6500+ trajectories collected over 800+ robot hours and 270+ human labor hour) empirical study evaluating generalization and transfer capabilities of representative ORL methods on four real-world tabletop manipulation tasks. Our study finds that ORL and imitation learning prefer different action spaces, and that ORL algorithms can generalize from leveraging offline heterogeneous data sources and outperform imitation learning. We release our dataset and implementations at URL: https://sites.google.com/view/real-orl.

NeurIPS Conference 2023 Conference Paper

RoboHive: A Unified Framework for Robot Learning

  • Vikash Kumar
  • Rutav Shah
  • Gaoyue Zhou
  • Vincent Moens
  • Vittorio Caggiano
  • Abhishek Gupta
  • Aravind Rajeswaran

We present RoboHive, a comprehensive software platform and ecosystem for research in the field of Robot Learning and Embodied Artificial Intelligence. Our platform encompasses a diverse range of pre-existing and novel environments, including dexterous manipulation with the Shadow Hand, whole-arm manipulation tasks with Franka and Fetch robots, quadruped locomotion, among others. Included environments are organized within and cover multiple domains such as hand manipulation, locomotion, multi-task, multi-agent, muscles, etc. In comparison to prior works, RoboHive offers a streamlined and unified task interface taking dependency on only a minimal set of well-maintained packages, features tasks with high physics fidelity and rich visual diversity, and supports common hardware drivers for real-world deployment. The unified interface of RoboHive offers a convenient and accessible abstraction for algorithmic research in imitation, reinforcement, multi-task, and hierarchical learning. Furthermore, RoboHive includes expert demonstrations and baseline results for most environments, providing a standard for benchmarking and comparisons. Details: https: //sites. google. com/view/robohive

ICRA Conference 2023 Conference Paper

Train Offline, Test Online: A Real Robot Learning Benchmark

  • Gaoyue Zhou
  • Victoria Dean
  • Mohan Kumar Srirama
  • Aravind Rajeswaran
  • Jyothish Pari
  • Kyle Beltran Hatch
  • Aryan Jain
  • Tianhe Yu

Three challenges limit the progress of robot learning research: robots are expensive (few labs can participate), everyone uses different robots (findings do not generalize across labs), and we lack internet-scale robotics data. We take on these challenges via a new benchmark: Train Offline, Test Online (TOTO). TOTO provides remote users with access to shared robots for evaluating methods on common tasks and an open-source dataset of these tasks for offline training. Its manipulation task suite requires challenging generalization to unseen objects, positions, and lighting. We present initial results on TOTO comparing five pretrained visual representations and four offline policy learning baselines, remotely contributed by five institutions. The real promise of TOTO, however, lies in the future: we release the benchmark for additional submissions from any user, enabling easy, direct comparison to several methods without the need to obtain hardware or collect data.

NeurIPS Conference 2023 Conference Paper

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

  • Arjun Majumdar
  • Karmesh Yadav
  • Sergio Arnaud
  • Jason Ma
  • Claire Chen
  • Sneha Silwal
  • Aryan Jain
  • Vincent-Pierre Berges

We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual ‘foundation models’ for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data size and diversity, we combine over 4, 000 hours of egocentric videos from 7 different sources (over 4. 3M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Next, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving competitive or superior performance than the best known results on all of the benchmarks in CortexBench. Finally, we present real-world hardware experiments, in which VC-1 and VC-1 (adapted) outperform the strongest pre-existing PVR. Overall, this paper presents no new techniques but a rigorous systematic evaluation, a broad set of findings about PVRs (that in some cases, refute those made in narrow domains in prior work), and open-sourced code and models (that required over 10, 000 GPU-hours to train) for the benefit of the research community.

ICML Conference 2022 Conference Paper

The Unsurprising Effectiveness of Pre-Trained Vision Models for Control

  • Simone Parisi
  • Aravind Rajeswaran
  • Senthil Purushwalkam
  • Abhinav Gupta 0001

Recent years have seen the emergence of pre-trained representations as a powerful abstraction for AI applications in computer vision, natural language, and speech. However, policy learning for control is still dominated by a tabula-rasa learning paradigm, with visuo-motor policies often trained from scratch using data from deployment environments. In this context, we revisit and study the role of pre-trained visual representations for control, and in particular representations trained on large-scale computer vision datasets. Through extensive empirical evaluation in diverse control domains (Habitat, DeepMind Control, Adroit, Franka Kitchen), we isolate and study the importance of different representation training methods, data augmentations, and feature hierarchies. Overall, we find that pre-trained visual representations can be competitive or even better than ground-truth state representations to train control policies. This is in spite of using only out-of-domain data from standard vision datasets, without any in-domain data from the deployment environments.

ICML Conference 2022 Conference Paper

Translating Robot Skills: Learning Unsupervised Skill Correspondences Across Robots

  • Tanmay Shankar
  • Yixin Lin
  • Aravind Rajeswaran
  • Vikash Kumar
  • Stuart Anderson
  • Jean Oh

In this paper, we explore how we can endow robots with the ability to learn correspondences between their own skills, and those of morphologically different robots in different domains, in an entirely unsupervised manner. We make the insight that different morphological robots use similar task strategies to solve similar tasks. Based on this insight, we frame learning skill correspondences as a problem of matching distributions of sequences of skills across robots. We then present an unsupervised objective that encourages a learnt skill translation model to match these distributions across domains, inspired by recent advances in unsupervised machine translation. Our approach is able to learn semantically meaningful correspondences between skills across multiple robot-robot and human-robot domain pairs despite being completely unsupervised. Further, the learnt correspondences enable the transfer of task strategies across robots and domains. We present dynamic visualizations of our results at https: //sites. google. com/view/translatingrobotskills/home.

NeurIPS Conference 2022 Conference Paper

Unsupervised Reinforcement Learning with Contrastive Intrinsic Control

  • Michael Laskin
  • Hao Liu
  • Xue Bin Peng
  • Denis Yarats
  • Aravind Rajeswaran
  • Pieter Abbeel

We introduce Contrastive Intrinsic Control (CIC), an unsupervised reinforcement learning (RL) algorithm that maximizes the mutual information between state-transitions and latent skill vectors. CIC utilizes contrastive learning between state-transitions and skills vectors to learn behaviour embeddings and maximizes the entropy of these embeddings as an intrinsic reward to encourage behavioural diversity. We evaluate our algorithm on the Unsupervised RL Benchmark (URLB) in the asymptotic state-based setting, which consists of a long reward-free pre-training phase followed by a short adaptation phase to downstream tasks with extrinsic rewards. We find that CIC improves over prior exploration algorithms in terms of adaptation efficiency to downstream tasks on state-based URLB.

NeurIPS Conference 2021 Conference Paper

COMBO: Conservative Offline Model-Based Policy Optimization

  • Tianhe Yu
  • Aviral Kumar
  • Rafael Rafailov
  • Aravind Rajeswaran
  • Sergey Levine
  • Chelsea Finn

Model-based reinforcement learning (RL) algorithms, which learn a dynamics model from logged experience and perform conservative planning under the learned model, have emerged as a promising paradigm for offline reinforcement learning (offline RL). However, practical variants of such model-based algorithms rely on explicit uncertainty quantification for incorporating conservatism. Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We empirically find that uncertainty estimation is not accurate and leads to poor performance in certain scenarios in offline model-based RL. We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that trains a value function using both the offline dataset and data generated using rollouts under the model while also additionally regularizing the value function on out-of-support state-action tuples generated via model rollouts. This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation. Theoretically, we show that COMBO satisfies a policy improvement guarantee in the offline setting. Through extensive experiments, we find that COMBO attains greater performance compared to prior offline RL on problems that demand generalization to related but previously unseen tasks, and also consistently matches or outperforms prior offline RL methods on widely studied offline RL benchmarks, including image-based tasks.

NeurIPS Conference 2021 Conference Paper

Decision Transformer: Reinforcement Learning via Sequence Modeling

  • Lili Chen
  • Kevin Lu
  • Aravind Rajeswaran
  • Kimin Lee
  • Aditya Grover
  • Misha Laskin
  • Pieter Abbeel
  • Aravind Srinivas

We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

NeurIPS Conference 2021 Conference Paper

Reinforcement Learning with Latent Flow

  • Wenling Shang
  • Xiaofei Wang
  • Aravind Srinivas
  • Aravind Rajeswaran
  • Yang Gao
  • Pieter Abbeel
  • Misha Laskin

Temporal information is essential to learning effective policies with Reinforcement Learning (RL). However, current state-of-the-art RL algorithms either assume that such information is given as part of the state space or, when learning from pixels, use the simple heuristic of frame-stacking to implicitly capture temporal information present in the image observations. This heuristic is in contrast to the current paradigm in video classification architectures, which utilize explicit encodings of temporal information through methods such as optical flow and two-stream architectures to achieve state-of-the-art performance. Inspired by leading video classification architectures, we introduce the Flow of Latents for Reinforcement Learning (Flare), a network architecture for RL that explicitly encodes temporal information through latent vector differences. We show that Flare recovers optimal performance in state-based RL without explicit access to the state velocity, solely with positional state information. Flare is the most sample efficient model-free pixel-based RL algorithm on the DeepMind Control suite when evaluated on the 500k and 1M step benchmarks across 5 challenging control tasks, and, when used with Rainbow DQN, outperforms the competitive baseline on Atari games at 100M time step benchmark across 8 challenging games.

NeurIPS Conference 2021 Conference Paper

Visual Adversarial Imitation Learning using Variational Models

  • Rafael Rafailov
  • Tianhe Yu
  • Aravind Rajeswaran
  • Chelsea Finn

Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges including representation learning for visual observations, sample complexity due to high dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at https: //sites. google. com/view/variational-mail

ICML Conference 2020 Conference Paper

A Game Theoretic Framework for Model Based Reinforcement Learning

  • Aravind Rajeswaran
  • Igor Mordatch
  • Vikash Kumar

Designing stable and efficient algorithms for model-based reinforcement learning (MBRL) with function approximation has remained challenging despite growing interest in the field. To help expose the practical challenges in MBRL and simplify algorithm design from the lens of abstraction, we develop a new framework that casts MBRL as a game between: (1) a policy player, which attempts to maximize rewards under the learned model; (2) a model player, which attempts to fit the real-world data collected by the policy player. We show that a near-optimal policy for the environment can be obtained by finding an approximate equilibrium for aforementioned game, and we develop two families of algorithms to find the game equilibrium by drawing upon ideas from Stackelberg games. Experimental studies suggest that the proposed algorithms achieve state of the art sample efficiency, match the asymptotic performance of model-free policy gradient, and scale gracefully to high-dimensional tasks like dexterous hand manipulation. Project page: \url{https: //sites. google. com/view/mbrl-game}.

NeurIPS Conference 2020 Conference Paper

MOReL: Model-Based Offline Reinforcement Learning

  • Rahul Kidambi
  • Aravind Rajeswaran
  • Praneeth Netrapalli
  • Thorsten Joachims

In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. This serves as an extreme test for an agent's ability to effectively use historical data which is known to be critical for efficient RL. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP using the offline dataset; (b) learning a near-optimal policy in this pessimistic MDP. The design of the pessimistic MDP is such that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the pessimistic MDP. This enables the pessimistic MDP to serve as a good surrogate for purposes of policy evaluation and learning. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Empirically, MOReL matches or exceeds state-of-the-art results on widely used offline RL benchmarks. Overall, the modular design of MOReL enables translating advances in its components (for e. g. , in model learning, planning etc. ) to improvements in offline RL.

ICRA Conference 2019 Conference Paper

Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost

  • Henry Zhu
  • Abhishek Gupta 0004
  • Aravind Rajeswaran
  • Sergey Levine
  • Vikash Kumar

Dexterous multi-fingered robotic hands can perform a wide range of manipulation skills, making them an appealing component for general-purpose robotic manipulators. However, such hands pose a major challenge for autonomous control, due to the high dimensionality of their configuration space and complex intermittent contact interactions. In this work, we propose deep reinforcement learning (deep RL) as a scalable solution for learning complex, contact rich behaviors with multi-fingered hands. Deep RL provides an end-to-end approach to directly map sensor readings to actions, without the need for task specific models or policy classes. We show that contact-rich manipulation behavior with multi-fingered hands can be learned by directly training with model-free deep RL algorithms in the real world, with minimal additional assumption and without the aid of simulation. We learn to perform a variety of tasks on two different low-cost hardware platforms entirely from scratch, and further study how the learning can be accelerated by using a small number of human demonstrations. Our experiments demonstrate that complex multi-fingered manipulation skills can be learned in the real world in about 4-7 hours for most tasks, and that demonstrations can decrease this to 2-3 hours, indicating that direct deep RL training in the real world is a viable and practical alternative to simulation and model-based control. https:// sites.google.com/view/deeprl-handmanipulation.

ICRA Conference 2019 Conference Paper

Learning Deep Visuomotor Policies for Dexterous Hand Manipulation

  • Divye Jain
  • Andrew Li
  • Shivam Singhal
  • Aravind Rajeswaran
  • Vikash Kumar
  • Emanuel Todorov

Multi-fingered dexterous hands are versatile and capable of acquiring a diverse set of skills such as grasping, in-hand manipulation, and tool use. To fully utilize their versatility in real-world scenarios, we require algorithms and policies that can control them using on-board sensing capabilities, without relying on external tracking or motion capture systems. Cameras and tactile sensors are the most widely used on-board sensors that do not require instrumentation of the world. In this work, we demonstrate an imitation learning based approach to train deep visuomotor policies for a variety of manipulation tasks with a simulated five fingered dexterous hand. These policies directly control the hand using high dimensional visual observations of the world and propreoceptive observations from the robot, and can be trained efficiently with a few hundred expert demonstration trajectories. We also find that using touch sensing information enables faster learning and better asymptotic performance for tasks with high degree of occlusions. Video demonstration of our results are available at: https://sites.google.com/view/hand-vil/

NeurIPS Conference 2019 Conference Paper

Meta-Learning with Implicit Gradients

  • Aravind Rajeswaran
  • Chelsea Finn
  • Sham Kakade
  • Sergey Levine

A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.

ICML Conference 2019 Conference Paper

Online Meta-Learning

  • Chelsea Finn
  • Aravind Rajeswaran
  • Sham M. Kakade
  • Sergey Levine

A central capability of intelligent systems is the ability to continuously build upon previous experiences to speed up and enhance learning of new tasks. Two distinct research paradigms have studied this question. Meta-learning views this problem as learning a prior over model parameters that is amenable for fast adaptation on a new task, but typically assumes the tasks are available together as a batch. In contrast, online (regret based) learning considers a setting where tasks are revealed one after the other, but conventionally trains a single model without task-specific adaptation. This work introduces an online meta-learning setting, which merges ideas from both paradigms to better capture the spirit and practice of continual lifelong learning. We propose the follow the meta leader (FTML) algorithm which extends the MAML algorithm to this setting. Theoretically, this work provides an O(log T) regret guarantee with one additional higher order smoothness assumption (in comparison to the standard online setting). Our experimental evaluation on three different large-scale problems suggest that the proposed algorithm significantly outperforms alternatives based on traditional online learning approaches.

ICLR Conference 2018 Conference Paper

Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

  • Cathy Wu 0002
  • Aravind Rajeswaran
  • Yan Duan
  • Vikash Kumar
  • Alexandre M. Bayen
  • Sham M. Kakade
  • Igor Mordatch
  • Pieter Abbeel

Policy gradient methods have enjoyed great success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exasperated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive a bias-free action-dependent baseline for variance reduction which fully exploits the structural form of the stochastic policy itself and does not make any additional assumptions about the MDP. We demonstrate and quantify the benefit of the action-dependent baseline through both theoretical analysis as well as numerical results, including an analysis of the suboptimality of the optimal state-dependent baseline. The result is a computationally efficient policy gradient algorithm, which scales to high-dimensional control problems, as demonstrated by a synthetic 2000-dimensional target matching task. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks. Finally, we show that the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.

NeurIPS Conference 2017 Conference Paper

Towards Generalization and Simplicity in Continuous Control

  • Aravind Rajeswaran
  • Kendall Lowrey
  • Emanuel Todorov
  • Sham Kakade

The remarkable successes of deep learning in speech recognition and computer vision have motivated efforts to adapt similar techniques to other problem domains, including reinforcement learning (RL). Consequently, RL methods have produced rich motor behaviors on simulated robot tasks, with their success largely attributed to the use of multi-layer neural networks. This work is among the first to carefully study what might be responsible for these recent advancements. Our main result calls this emerging narrative into question by showing that much simpler architectures -- based on linear and RBF parameterizations -- achieve comparable performance to state of the art results. We not only study different policy representations with regard to performance measures at hand, but also towards robustness to external perturbations. We again find that the learned neural network policies --- under the standard training scenarios --- are no more robust than linear (or RBF) policies; in fact, all three are remarkably brittle. Finally, we then directly modify the training scenarios in order to favor more robust policies, and we again do not find a compelling case to favor multi-layer architectures. Overall, this study suggests that multi-layer architectures should not be the default choice, unless a side-by-side comparison to simpler architectures shows otherwise. More generally, we hope that these results lead to more interest in carefully studying the architectural choices, and associated trade-offs, for training generalizable and robust policies.