Arrow Research

Author name cluster

Silvio Savarese

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

72 papers
2 author rows

Possible papers (72)

TMLR Journal 2026 Journal Article

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

  • Kung-Hsiang Huang
  • Akshara Prabhakar
  • Onkar Thorat
  • Divyansh Agarwal
  • Prafulla Kumar Choubey
  • Yixin Mao
  • Silvio Savarese
  • Caiming Xiong

While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

TMLR Journal 2026 Journal Article

LZ Penalty: An Information-Theoretic Repetition Penalty for Autoregressive Language Models

  • Tony A Ginart
  • Naveen Kodali
  • Jason Lee
  • Caiming Xiong
  • Silvio Savarese
  • John Emmons

We introduce the Lempel-Ziv (LZ) penalty, a penalty specialized for reducing degenerate repetitions in autoregressive language models without loss of capability. The penalty is based on the codelengths in the LZ77 universal lossless compression algorithm. Through the lens of the prediction-compression duality, decoding with the LZ penalty has the interpretation of sampling from the residual distribution after removing the information that is highly compressible. We demonstrate the LZ penalty enables open-source reasoning models to operate with greedy decoding without loss of capability and without instances of degenerate repetition. In contrast, both the industry-standard frequency penalty and repetition penalty are ineffective, incurring degenerate repetition rates of up to 4% or more.
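
The abstract describes the penalty only at a high level. As a rough illustration of the idea (not the paper's exact formulation), the sketch below penalizes candidate tokens in proportion to how far they would extend an LZ77-style match against a sliding window of recent context; the helper names, the alpha weight, and the window size are all illustrative assumptions.

    import numpy as np

    def longest_tail_match(tokens, max_len=32):
        # Longest suffix of `tokens` that also appears earlier in `tokens`
        # (a crude stand-in for an LZ77 match length).
        n = len(tokens)
        best = 0
        for length in range(1, min(max_len, n) + 1):
            suffix = tokens[n - length:]
            if any(tokens[i:i + length] == suffix for i in range(n - length)):
                best = length
            else:
                break
        return best

    def lz_penalized_logits(logits, context, alpha=1.0, window=512):
        # Subtract a penalty proportional to how much each candidate token would
        # extend a repeated (highly compressible) pattern. Unoptimized: a real
        # implementation would only score tokens that continue an existing match.
        recent = list(context[-window:])
        penalized = np.asarray(logits, dtype=float).copy()
        for tok in range(len(logits)):
            penalized[tok] -= alpha * longest_tail_match(recent + [tok])
        return penalized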

TMLR Journal 2025 Journal Article

A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

  • Zixuan Ke
  • Fangkai Jiao
  • Yifei Ming
  • Xuan-Phi Nguyen
  • Austin Xu
  • Do Xuan Long
  • Minzhi Li
  • Chengwei Qin

Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM conditions on; and (2) Output level, which covers methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. Finally, we identify emerging trends, such as domain-specific reasoning systems, and open challenges, such as evaluation and data quality. This survey aims to provide AI researchers and practitioners with a comprehensive foundation for advancing reasoning in LLMs, paving the way for more sophisticated and reliable AI systems.

NeurIPS Conference 2025 Conference Paper

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

  • Akshara Prabhakar
  • Zuxin Liu
  • Ming Zhu
  • Jianguo Zhang
  • Tulika Manoj Awalgaonkar
  • Shiyu Wang
  • Zhiwei Liu
  • Haolin Chen

Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models, the xLAM-2-fc-r series, with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on τ-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source both the synthetic data collected and the trained xLAM-2-fc-r models to advance research in AI agents. Dataset: https://huggingface.co/datasets/Salesforce/APIGen-MT-5k & Models: https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4

ICLR Conference 2025 Conference Paper

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

  • Kexun Zhang
  • Weiran Yao
  • Zuxin Liu
  • Yihao Feng
  • Zhiwei Liu 0001
  • Rithesh R. N.
  • Tian Lan 0006
  • Lei Li 0005

Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agents, we propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges.

NeurIPS Conference 2025 Conference Paper

DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs

  • Zhenhailong Wang
  • Senthil Purushwalkam
  • Caiming Xiong
  • Silvio Savarese
  • Heng Ji
  • Ran Xu

We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically determines token length based on the image content, not just resolution, and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures. Furthermore, qualitative analyses show that the adaptive token reduction from DToMe aligns well with human perception and enables users to better control computational costs through flexible integration with additional vision tools and models.
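
As a minimal sketch of the similarity-based merging idea behind content-adaptive token reduction (not the paper's exact DToMe procedure; the threshold and greedy pairing are illustrative assumptions):

    import numpy as np

    def merge_similar_tokens(tokens, threshold=0.9):
        # Greedily average the most similar pair of token embeddings while their
        # cosine similarity exceeds `threshold`; redundant image regions collapse
        # into fewer tokens, while complex images keep more.
        tokens = [t.astype(float) for t in tokens]
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        while len(tokens) > 1:
            best, bi, bj = -1.0, -1, -1
            for i in range(len(tokens)):
                for j in range(i + 1, len(tokens)):
                    s = cos(tokens[i], tokens[j])
                    if s > best:
                        best, bi, bj = s, i, j
            if best < threshold:
                break
            merged = (tokens[bi] + tokens[bj]) / 2.0
            tokens = [t for k, t in enumerate(tokens) if k not in (bi, bj)] + [merged]
        return np.stack(tokens)

Applied to a sequence of ViT patch embeddings, the output length shrinks when many patches are near-duplicates (e.g., blank background) and stays close to the input length for visually complex images.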

ICML Conference 2025 Conference Paper

Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

  • Xu Liu 0014
  • Juncheng Liu
  • Gerald Woo
  • Taha Aksu
  • Yuxuan Liang 0002
  • Roger Zimmermann
  • Chenghao Liu
  • Junnan Li 0001

Achieving effective unified pretraining on large time series corpora remains an open challenge in developing time series foundation models. Existing methods, such as Moirai, introduce multiple projection layers for time series of different frequencies to account for high data heterogeneity. We identify major drawbacks to this human-imposed frequency-level model specialization. First, frequency is not a reliable indicator for grouping pretraining data. Second, time series can display varied distributions even within a short window. Frequency-level specialization overlooks the diversity at this granularity. To address these issues, this paper introduces Moirai-MoE, excluding human-defined data groupings while delegating the modeling of diverse time series patterns to the sparse mixture of experts (MoE) within Transformers. With this design, Moirai-MoE eliminates reliance on heuristics and enables automatic token-level specialization. Extensive evaluations on 39 datasets demonstrate the superiority of Moirai-MoE over state-of-the-art foundation models. This study also conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models.
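
For readers unfamiliar with sparse mixture-of-experts layers, a generic token-level top-k routing sketch follows; it illustrates the mechanism the abstract refers to, not Moirai-MoE's actual layer, and the toy expert transform and gating are assumptions.

    import numpy as np

    def sparse_moe_layer(x, gate_w, expert_ws, k=2):
        # x: (n_tokens, d); gate_w: (d, n_experts); expert_ws: list of (d, d).
        # Each token is routed to its top-k experts, whose outputs are combined
        # with softmax-normalized gate weights (token-level specialization).
        logits = x @ gate_w
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            topk = np.argsort(logits[t])[-k:]
            weights = np.exp(logits[t, topk] - logits[t, topk].max())
            weights /= weights.sum()
            for w, e in zip(weights, topk):
                out[t] += w * np.tanh(x[t] @ expert_ws[e])  # toy expert transform
        return out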

ICML Conference 2025 Conference Paper

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

  • Baohao Liao
  • Yuhui Xu
  • Hanze Dong
  • Junnan Li 0001
  • Christof Monz
  • Silvio Savarese
  • Doyen Sahoo
  • Caiming Xiong

We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model alone (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
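
A minimal sketch of the threshold-based mixture described above, assuming three hypothetical callables standing in for the draft model, the target model, and the process reward model:

    def reward_guided_decode(prompt, draft_step, target_step, reward, threshold=0.7, max_steps=64):
        # The cheap draft model proposes each intermediate step; the process
        # reward model scores it; the expensive target model is invoked only
        # when the score falls below `threshold`.
        steps = []
        for _ in range(max_steps):
            candidate = draft_step(prompt, steps)
            if reward(prompt, steps, candidate) < threshold:
                candidate = target_step(prompt, steps)  # fall back to the strong model
            steps.append(candidate)
            if candidate.endswith("<eos>"):
                break
        return steps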

TMLR Journal 2025 Journal Article

Shared Imagination: LLMs Hallucinate Alike

  • Yilun Zhou
  • Caiming Xiong
  • Silvio Savarese
  • Chien-Sheng Wu

Despite the recent proliferation of large language models (LLMs), their training recipes -- model architecture, pre-training data and optimization algorithm -- are often very similar. This naturally raises the question of the similarity among the resulting models. In this paper, we propose a novel setting, imaginary question answering (IQA), to better understand model similarity. In IQA, we ask one model to generate purely imaginary questions (e.g., on completely made-up concepts in physics) and prompt another model to answer. Surprisingly, despite the total fictionality of these questions, all models can answer each other's questions with remarkable consistency, suggesting a "shared imagination space" in which these models operate during such hallucinations. We conduct a series of investigations into this phenomenon and discuss the implications of such model homogeneity on hallucination detection and computational creativity.

AAAI Conference 2025 Conference Paper

Text2Data: Low-Resource Data Generation with Textual Control

  • Shiyu Wang
  • Yihao Feng
  • Tian Lan
  • Ning Yu
  • Yu Bai
  • Ran Xu
  • Huan Wang
  • Caiming Xiong

Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines. Recognizing the importance of this interface, the machine learning community is investing considerable effort in generating data that is semantically coherent with textual instructions. While strides have been made in text-to-data generation spanning image editing, audio synthesis, video creation, and beyond, low-resource areas characterized by expensive annotations or complex data structures, such as molecules, motion dynamics, and time series, often lack textual labels. This deficiency impedes supervised learning, thereby constraining the application of advanced generative models for text-to-data tasks. In response to these challenges in the low-resource scenario, we propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model. Subsequently, it undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting. Comprehensive experiments demonstrate that Text2Data is able to achieve enhanced performance regarding controllability across various modalities, including molecules, motions and time series, when compared to existing baselines.

NeurIPS Conference 2024 Conference Paper

APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets

  • Zuxin Liu
  • Thai Hoang
  • Jianguo Zhang
  • Ming Zhu
  • Tian Lan
  • Shirley Kokane
  • Juntao Tan
  • Weiran Yao

The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data point in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, improving its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agents. The dataset and models are available on the project homepage https://apigen-pipeline.github.io/.
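
The three hierarchical verification stages lend themselves to a simple filter. The sketch below is an illustration under assumed data and helper names (`api_registry`, `semantic_check`), not the released pipeline:

    import json

    def passes_verification(sample, api_registry, semantic_check):
        # Stage 1: format checking -- the generated call must be well-formed.
        try:
            call = json.loads(sample["generated_call"])
            fn, args = call["name"], call["arguments"]
        except (KeyError, ValueError, TypeError):
            return False
        # Stage 2: actual function execution -- the call must run without error.
        if fn not in api_registry:
            return False
        try:
            result = api_registry[fn](**args)
        except Exception:
            return False
        # Stage 3: semantic verification -- e.g. an LLM judge checks that the
        # execution result actually answers the original query.
        return bool(semantic_check(sample["query"], call, result))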

ICLR Conference 2024 Conference Paper

How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations

  • Tianyu Guo 0004
  • Wei Hu
  • Song Mei
  • Huan Wang 0016
  • Caiming Xiong
  • Silvio Savarese
  • Yu Bai 0017

While large language models based on the transformer architecture have demonstrated remarkable in-context learning (ICL) capabilities, understanding of such capabilities is still at an early stage, where existing theory and mechanistic understanding focus mostly on simple scenarios such as learning simple function classes. This paper takes initial steps toward understanding ICL in more complex scenarios, by studying learning with representations. Concretely, we construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function, composed with a linear function that differs in each instance. By construction, the optimal ICL algorithm first transforms the inputs by the representation function, and then performs linear ICL on top of the transformed dataset. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size. Empirically, we find trained transformers consistently achieve near-optimal ICL performance in this setting, and exhibit the desired dissection where lower layers transform the dataset and upper layers perform linear ICL. Through extensive probing and a new pasting experiment, we further reveal several mechanisms within the trained transformers, such as concrete copying behaviors on both the inputs and the representations, linear ICL capability of the upper layers alone, and a post-ICL representation selection mechanism in a harder mixture setting. These observed mechanisms align well with our theory and may shed light on how transformers perform ICL in more realistic scenarios.
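
A small sketch of the kind of synthetic problem the abstract describes: a fixed nonlinear representation shared across instances, composed with a linear head that changes per instance. The specific representation below is an arbitrary illustrative choice, not the paper's construction.

    import numpy as np

    rng = np.random.default_rng(0)

    def fixed_representation(x, out_dim=8):
        # One fixed nonlinear map, shared by every in-context learning instance.
        W = np.cos(np.outer(np.arange(1, out_dim + 1), np.arange(1, x.size + 1)))
        return np.tanh(W @ x)

    def sample_icl_instance(n_examples=16, d=4):
        # The linear head w differs per instance; labels are y = <w, phi(x)>.
        w = rng.normal(size=8)
        xs = rng.normal(size=(n_examples, d))
        ys = np.array([w @ fixed_representation(x) for x in xs])
        return xs, ys  # the transformer predicts y_n from the (x_i, y_i) prefix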

NeurIPS Conference 2024 Conference Paper

INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness

  • Hung Le
  • Yingbo Zhou
  • Caiming Xiong
  • Silvio Savarese

Large language models (LLMs) for code are typically trained to align with natural language instructions to closely follow their intentions and requirements. However, in many practical scenarios, it becomes increasingly challenging for these models to navigate the intricate boundary between helpfulness and safety, especially against highly complex yet potentially malicious instructions. In this work, we introduce INDICT: a new framework that empowers LLMs with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic. Each critic provides analysis against the given task and corresponding generated response, equipped with external knowledge queried through relevant code snippets and tools like web search and a code interpreter. We engage the dual critic system in both the code generation and code execution stages, providing preemptive and post-hoc guidance respectively to the LLMs. We evaluated INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks, using LLMs from 7B to 70B parameters. We observed that our approach provides an advanced level of critique for both safety and helpfulness analysis, significantly improving the quality of output code (+10% absolute improvements in all models).

NeurIPS Conference 2024 Conference Paper

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

  • Anas Awadalla
  • Le Xue
  • Oscar Lo
  • Manli Shu
  • Hannah Lee
  • Etash Guha
  • Matt Jordan
  • Sheng Shen

Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. We release our data at https://github.com/mlfoundations/MINT-1T.

ICRA Conference 2024 Conference Paper

Online Distribution Shift Detection via Recency Prediction

  • Rachel Luo
  • Rohan Sinha
  • Yixiao Sun
  • Ali Hindy
  • Shengjia Zhao
  • Silvio Savarese
  • Edward Schmerling
  • Marco Pavone 0001

When deploying modern machine learning-enabled robotic systems in high-stakes applications, detecting distribution shift is critical. However, most existing methods for detecting distribution shift are not well-suited to robotics settings, where data often arrives in a streaming fashion and may be very high-dimensional. In this work, we present an online method for detecting distribution shift with guarantees on the false positive rate: when there is no distribution shift, our system is very unlikely (with probability < ε) to falsely issue an alert, so any alerts that are issued should be heeded. Our method is specifically designed for efficient detection even with high-dimensional data, and it empirically achieves up to 11x faster detection in realistic robotics settings compared to prior work, while maintaining a low false negative rate in practice (whenever there is a distribution shift in our experiments, our method indeed emits an alert). We demonstrate our approach in both simulation and hardware for a visual servoing task, and show that our method indeed issues an alert before a failure occurs.

NeurIPS Conference 2024 Conference Paper

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

  • Tianbao Xie
  • Danyang Zhang
  • Jixuan Chen
  • Xiaochuan Li
  • Siheng Zhao
  • Ruisheng Cao
  • Toh J. Hua
  • Zhoujun Cheng

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at this https URL.

ICLR Conference 2024 Conference Paper

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

  • Weiran Yao
  • Shelby Heinecke
  • Juan Carlos Niebles
  • Zhiwei Liu 0001
  • Yihao Feng
  • Le Xue
  • Rithesh R. N.
  • Zeyuan Chen 0001

Recent months have seen the emergence of a powerful new trend in which large language models (LLMs) are augmented to become autonomous language agents capable of performing objective oriented multi-step tasks on their own, rather than merely responding to queries from human users. Most existing language agents, however, are not optimized using environment-specific rewards. Although some agents enable iterative refinement through verbal feedback, they do not reason and plan in ways that are compatible with gradient-based learning from rewards. This paper introduces a principled framework for reinforcing large language agents by learning a retrospective model, which automatically tunes the language agent prompts from environment feedback through policy gradient. Specifically, our proposed agent architecture learns from rewards across multiple environments and tasks, for fine-tuning a pre-trained language model which refines the language agent prompt by summarizing the root cause of prior failed attempts and proposing action plans. Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment.

ICML Conference 2024 Conference Paper

Unified Training of Universal Time Series Forecasting Transformers

  • Gerald Woo
  • Chenghao Liu
  • Akshat Kumar
  • Caiming Xiong
  • Silvio Savarese
  • Doyen Sahoo

Deep learning for time series forecasting has traditionally operated within a one-model-per-dataset framework, limiting its potential to leverage the game-changing impact of large pre-trained models. The concept of universal forecasting, emerging from pre-training on a vast collection of time series datasets, envisions a single Large Time Series Model capable of addressing diverse downstream forecasting tasks. However, constructing such a model poses unique challenges specific to time series data: (i) cross-frequency learning, (ii) accommodating an arbitrary number of variates for multivariate time series, and (iii) addressing the varying distributional properties inherent in large-scale data. To address these challenges, we present novel enhancements to the conventional time series Transformer architecture, resulting in our proposed Masked Encoder-based Universal Time Series Forecasting Transformer (Moirai). Trained on our newly introduced Large-scale Open Time Series Archive (LOTSA) featuring over 27B observations across nine domains, Moirai achieves competitive or superior performance as a zero-shot forecaster when compared to full-shot models. Code, data, and model weights can be found at https://github.com/SalesforceAIResearch/uni2ts.

ICLR Conference 2023 Conference Paper

An Extensible Multi-modal Multi-task Object Dataset with Materials

  • Trevor Standley
  • Ruohan Gao
  • Dawn Chen
  • Jiajun Wu 0001
  • Silvio Savarese

We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2.8 million objects, each with image(s), listing text, mass, price, product ratings, and position in Amazon’s product-category taxonomy. We also design a comprehensive taxonomy of 182 physical materials (e.g., Plastic → Thermoplastic → Acrylic). Objects are annotated with one or more materials from this taxonomy. With the numerous attributes available for each object, we develop a Smart Labeling framework to quickly add new binary labels to all objects with very little manual labeling effort, making the dataset extensible. Each object attribute in our dataset can be included in either the model inputs or outputs, leading to combinatorial possibilities in task configurations. For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, and allows practitioners to efficiently add new tasks and object attributes at scale.

ICML Conference 2023 Conference Paper

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

  • Junnan Li 0001
  • Dongxu Li 0003
  • Silvio Savarese
  • Steven C. H. Hoi

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model’s emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
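
A hedged usage sketch, assuming the Hugging Face transformers port of BLIP-2 and the Salesforce/blip2-opt-2.7b checkpoint (the original release lives in the LAVIS library; the image URL is a placeholder):

    import requests
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

    # Placeholder image; any RGB image works.
    image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
    inputs = processor(images=image, text="Question: what is in the picture? Answer:", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    print(processor.batch_decode(out, skip_special_tokens=True)[0])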

ICLR Conference 2023 Conference Paper

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

  • Erik Nijkamp
  • Bo Pang 0004
  • Hiroaki Hayashi
  • Lifu Tu
  • Huan Wang 0016
  • Yingbo Zhou 0002
  • Silvio Savarese
  • Caiming Xiong

Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.
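
A hedged usage sketch via Hugging Face transformers, assuming one of the released checkpoints (Salesforce/codegen-350M-mono); the prompt and generation settings are illustrative:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
    model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

    prompt = "# return the n-th Fibonacci number\ndef fib(n):"
    inputs = tokenizer(prompt, return_tensors="pt")
    sample = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(sample[0], skip_special_tokens=True))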

ICLR Conference 2023 Conference Paper

Masked Unsupervised Self-training for Label-free Image Classification

  • Junnan Li 0001
  • Silvio Savarese
  • Steven C. H. Hoi

State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However, the zero-shot performance of CLIP-like models is often insufficient for real-world adoption. In this paper, we aim to leverage the abundant unlabeled data from a target domain to improve the performance of a pre-trained zero-shot classifier, by unsupervised finetuning of the pre-trained model. We propose Masked Unsupervised Self-Training (MUST), a new approach which leverages two different and complementary sources of training signals: pseudo-labels and raw images. MUST jointly optimizes three objectives to learn both class-level global features and pixel-level local features, and enforces a regularization between the two. We demonstrate the efficacy of MUST on 8 downstream tasks across a variety of domains, where it improves upon CLIP by a large margin. MUST also outperforms supervised few-shot adaptation methods. It achieves a top-1 accuracy of 77.7% on ImageNet using ViT-B, +9.4% higher than CLIP, and +6.2% higher than 16-shot CLIP adaptation. Our code is available at https://github.com/salesforce/MUST.

JMLR Journal 2023 Journal Article

Merlion: End-to-End Machine Learning for Time Series

  • Aadyot Bhatnagar
  • Paul Kassianik
  • Chenghao Liu
  • Tian Lan
  • Wenzhuo Yang
  • Rowan Cassius
  • Doyen Sahoo
  • Devansh Arpit

We introduce Merlion, an open-source machine learning library for time series. It features a unified interface for many commonly used models and datasets for forecasting and anomaly detection on both univariate and multivariate time series, along with standard pre/post-processing layers. It has several modules to improve ease-of-use, including a no-code visual dashboard, anomaly score calibration to improve interpretability, AutoML for hyperparameter tuning and model selection, and model ensembling. Merlion also provides an evaluation framework that simulates the live deployment of a model in production, and a distributed computing backend to run time series models at industrial scale. This library aims to provide engineers and researchers a one-stop solution to rapidly develop models for their specific time series needs and benchmark them across multiple datasets.
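
A hedged quick-start sketch based on Merlion's published examples; class and method names may differ across library versions, and the CSV file is hypothetical:

    import pandas as pd
    from merlion.utils import TimeSeries
    from merlion.models.defaults import DefaultDetectorConfig, DefaultDetector

    df = pd.read_csv("my_metric.csv", index_col=0, parse_dates=True)  # hypothetical input
    split = len(df) // 2
    train = TimeSeries.from_pd(df.iloc[:split])
    test = TimeSeries.from_pd(df.iloc[split:])

    # Train the default anomaly detector and score the held-out window.
    model = DefaultDetector(DefaultDetectorConfig())
    model.train(train_data=train)
    anomaly_scores = model.get_anomaly_label(time_series=test)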

AAMAS Conference 2023 Conference Paper

Modeling Dynamic Environments with Scene Graph Memory

  • Andrey Kurenkov
  • Michael Lingelbach
  • Tanmay Agarwal
  • Chengshu Li
  • Emily Jin
  • Ruohan Zhang
  • Li Fei-Fei
  • Jiajun Wu

Embodied AI agents operating in dynamic environments often need to predict object locations to make informed decisions. We propose a method for doing this via link prediction on partially observable dynamic graphs. We represent the agent’s accumulated set of observations in a data structure called a Scene Graph Memory (SGM), combine this data structure with a neural net architecture we call Node Edge Predictor (NEP), and show that it can be trained to predict the locations of objects in a variety of environments with diverse object movement dynamics. To evaluate our method, we implement the Dynamic Household Simulator, a novel benchmark which enables sampling of diverse dynamic scene graphs that follow the semantic patterns typically seen in people’s homes. We demonstrate that our method outperforms baselines both in terms of quickly adapting to the dynamics of a new scene and in terms of its overall accuracy.

ICML Conference 2023 Conference Paper

Modeling Dynamic Environments with Scene Graph Memory

  • Andrey Kurenkov
  • Michael Lingelbach
  • Tanmay Agarwal
  • Emily Jin
  • Chengshu Li 0002
  • Ruohan Zhang
  • Li Fei-Fei 0001
  • Jiajun Wu 0001

Embodied AI agents that search for objects in large environments such as households often need to make efficient decisions by predicting object locations based on partial information. We pose this as a new type of link prediction problem: link prediction on partially observable dynamic graphs. Our graph is a representation of a scene in which rooms and objects are nodes, and their relationships are encoded in the edges; only parts of the changing graph are known to the agent at each timestep. This partial observability poses a challenge to existing link prediction approaches, which we address. We propose a novel state representation, Scene Graph Memory (SGM), which captures the agent’s accumulated set of observations, as well as a neural net architecture called a Node Edge Predictor (NEP) that extracts information from the SGM to search efficiently. We evaluate our method in the Dynamic House Simulator, a new benchmark that creates diverse dynamic graphs following the semantic patterns typically seen at homes, and show that NEP can be trained to predict the locations of objects in a variety of environments with diverse object movement dynamics, outperforming baselines both in terms of new scene adaptability and overall accuracy. The codebase and more can be found at www.scenegraphmemory.com.

ICRA Conference 2023 Conference Paper

Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear

  • Ruohan Gao
  • Hao Li 0076
  • Gokul Dharan
  • Zhuzhu Wang
  • Chengshu Li 0002
  • Fei Xia 0002
  • Silvio Savarese
  • Li Fei-Fei 0001

Developing embodied agents in simulation has been a key research topic in recent years. Exciting new tasks, algorithms, and benchmarks have been developed in various simulators. However, most of them assume deaf agents in silent environments, while we humans perceive the world with multiple senses. We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear. Sonicverse models realistic continuous audio rendering in 3D environments in real-time. Together with a new audio-visual VR interface that allows humans to interact with agents with audio, Sonicverse enables a series of embodied AI tasks that need audio-visual perception. For semantic audio-visual navigation in particular, we also propose a new multi-task learning model that achieves state-of-the-art performance. In addition, we demonstrate Sonicverse's realism via sim-to-real transfer, which has not been achieved by other simulators: an agent trained in Sonicverse can successfully perform audio-visual navigation in real-world environments. Sonicverse is available at: https://github.com/StanfordVL/Sonicverse.

NeurIPS Conference 2023 Conference Paper

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

  • Can Qin
  • Shu Zhang
  • Ning Yu
  • Yihao Feng
  • Xinyi Yang
  • Yingbo Zhou
  • Huan Wang
  • Juan Carlos Niebles

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.

NeurIPS Conference 2022 Conference Paper

CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

  • Hung Le
  • Yue Wang
  • Akhilesh Deepak Gotmare
  • Silvio Savarese
  • Steven Chu Hong Hoi

Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model from natural language problem descriptions and ground-truth programs only. Such a paradigm largely ignores some important but potentially useful signals in the problem specification, such as unit tests, which thus results in poor performance when solving complex unseen coding tasks. To address these limitations, we propose CodeRL, a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.

UAI Conference 2022 Conference Paper

Local calibration: metrics and recalibration

  • Rachel Luo
  • Aadyot Bhatnagar
  • Yu Bai 0017
  • Shengjia Zhao
  • Huan Wang 0016
  • Caiming Xiong
  • Silvio Savarese
  • Stefano Ermon

Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability on average across the entire population. However, it is in general impossible to measure the reliability of an individual prediction. In this work, we propose the local calibration error (LCE) to span the gap between average and individual reliability. For each individual prediction, the LCE measures the average reliability of a set of similar predictions, where similarity is quantified by a kernel function on a pretrained feature space and by a binning scheme over predicted model confidences. We show theoretically that the LCE can be estimated sample-efficiently from data, and empirically find that it reveals miscalibration modes that are more fine-grained than the ECE can detect. Our key result is a novel local recalibration method to improve confidence scores for individual predictions and decrease the LCE. Experimentally, we show that our recalibration method produces more accurate confidence scores, which improves downstream fairness and decision making on classification tasks with both image and tabular data.
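
A numpy sketch in the spirit of the local calibration error described above: kernel-weighted reliability over similar predictions within a confidence bin. The RBF kernel, equal-width bins, and argument layout are illustrative assumptions, not the paper's exact estimator.

    import numpy as np

    def local_calibration_error(features, confidences, correct, bandwidth=1.0, n_bins=10):
        # features: (n, d) pretrained features; confidences, correct: (n,) arrays.
        n = len(confidences)
        bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
        lce = np.zeros(n)
        for i in range(n):
            same_bin = bins == bins[i]
            d2 = np.sum((features[same_bin] - features[i]) ** 2, axis=1)
            w = np.exp(-d2 / (2 * bandwidth ** 2))  # similarity in feature space
            local_acc = np.sum(w * correct[same_bin]) / np.sum(w)
            local_conf = np.sum(w * confidences[same_bin]) / np.sum(w)
            lce[i] = abs(local_conf - local_acc)
        return float(lce.mean())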

ICLR Conference 2021 Conference Paper

Adaptive Procedural Task Generation for Hard-Exploration Problems

  • Kuan Fang
  • Yuke Zhu
  • Silvio Savarese
  • Li Fei-Fei 0001

We introduce Adaptive Procedural Task Generation (APT-Gen), an approach to progressively generate a sequence of tasks as curricula to facilitate reinforcement learning in hard-exploration problems. At the heart of our approach, a task generator learns to create tasks from a parameterized task space via a black-box procedural generation module. To enable curriculum learning in the absence of a direct indicator of learning progress, we propose to train the task generator by balancing the agent's performance in the generated tasks and the similarity to the target tasks. Through adversarial training, the task similarity is adaptively estimated by a task discriminator defined on the agent's experiences, allowing the generated tasks to approximate target tasks of unknown parameterization or outside of the predefined task space. Our experiments on the grid world and robotic manipulation task domains show that APT-Gen achieves substantially better performance than various existing baselines by generating suitable tasks of rich variations.

ICRA Conference 2021 Conference Paper

Deep Affordance Foresight: Planning Through What Can Be Done in the Future

  • Danfei Xu
  • Ajay Mandlekar
  • Roberto Martín-Martín
  • Yuke Zhu
  • Silvio Savarese
  • Li Fei-Fei 0001

Planning in realistic environments requires searching in large planning spaces. Affordances are a powerful concept to simplify this search, because they model what actions can be successful in a given situation. However, the classical notion of affordance is not suitable for long-horizon planning because it only informs the robot about the immediate outcome of actions instead of what actions are best for achieving a long-term goal. In this paper, we introduce a new affordance representation that enables the robot to reason about the long-term effects of actions through modeling what actions are afforded in the future. Based on the new representation, we develop a learning-to-plan method, Deep Affordance Foresight (DAF), that learns partial environment models of affordances of parameterized motor skills through trial-and-error. We evaluate DAF on two challenging manipulation domains and show that it can effectively learn to carry out multi-step tasks, share learned affordance representations among different tasks, and learn to plan with high-dimensional image inputs.

IROS Conference 2021 Conference Paper

Generalization Through Hand-Eye Coordination: An Action Space for Learning Spatially-Invariant Visuomotor Control

  • Chen Wang 0053
  • Rui Wang
  • Ajay Mandlekar
  • Li Fei-Fei 0001
  • Silvio Savarese
  • Danfei Xu

Imitation Learning (IL) is an effective framework to learn visuomotor skills from offline demonstration data. However, IL methods often fail to generalize to new scene configurations not covered by training data. On the other hand, humans can manipulate objects in varying conditions. Key to such capability is hand-eye coordination, a cognitive ability that enables humans to adaptively direct their movements at task-relevant objects and be invariant to the objects’ absolute spatial location. In this work, we present a learnable action space, Hand-eye Action Networks (HAN), that learns coordinated hand-eye movements from human teleoperated demonstrations. Through a set of challenging multi-stage manipulation tasks, we show that a visuomotor policy equipped with HAN is able to inherit the key spatial invariance property of hand-eye coordination and achieve generalization to new scene configurations. Additional materials available at https://sites.google.com/stanford.edu/han

IROS Conference 2021 Conference Paper

iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes

  • Bokui Shen
  • Fei Xia 0002
  • Chengshu Li 0002
  • Roberto Martín-Martín
  • Linxi Fan
  • Guanzhi Wang
  • Claudia Pérez-D'Arpino
  • Shyamal Buch

We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains 15 fully interactive home-sized scenes with 108 rooms populated with rigid and articulated objects. The scenes are replicas of real-world homes, with the distribution and layout of objects aligned to those of the real world. iGibson 1.0 integrates several key features to facilitate the study of interactive tasks: i) generation of high-quality virtual sensor signals (RGB, depth, segmentation, LiDAR, flow and so on), ii) domain randomization to change the materials of the objects (both visual and physical) and/or their shapes, iii) integrated sampling-based motion planners to generate collision-free trajectories for robot bases and arms, and iv) an intuitive human-iGibson interface that enables efficient collection of human demonstrations. Through experiments, we show that the full interactivity of the scenes enables agents to learn useful visual representations that accelerate the training of downstream manipulation tasks. We also show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of human demonstrated (mobile) manipulation behaviors. iGibson 1.0 is open-source, equipped with comprehensive examples and documentation. For more information, visit our project website: http://svl.stanford.edu/igibson/.

ICRA Conference 2021 Conference Paper

LASER: Learning a Latent Action Space for Efficient Reinforcement Learning

  • Arthur Allshire
  • Roberto Martín-Martín
  • Charles Lin
  • Shawn Manuel
  • Silvio Savarese
  • Animesh Garg

The process of learning a manipulation task depends strongly on the action space used for exploration: posed in the incorrect action space, solving a task with reinforcement learning can be drastically inefficient. Additionally, similar tasks or instances of the same task family impose latent manifold constraints on the most effective action space: the task family can be best solved with actions in a manifold of the entire action space of the robot. Combining these insights we present LASER, a method to learn latent action spaces for efficient reinforcement learning. LASER factorizes the learning problem into two sub-problems, namely action space learning and policy learning in the new action space. It leverages data from similar manipulation task instances, either from an offline expert or online during policy learning, and learns from these trajectories a mapping from the original to a latent action space. LASER is trained as a variational encoder-decoder model to map raw actions into a disentangled latent action space while maintaining action reconstruction and latent space dynamic consistency. We evaluate LASER on two contact-rich robotic tasks in simulation, and analyze the benefit of policy learning in the generated latent action space. We show improved sample efficiency compared to the original action space from better alignment of the action space to the task space, as we observe with visualizations of the learned action space manifold. Additional details: pair.toronto.edu/laser

ICRA Conference 2021 Conference Paper

Learning Multi-Arm Manipulation Through Collaborative Teleoperation

  • Albert Tung
  • Josiah Wong
  • Ajay Mandlekar
  • Roberto Martín-Martín
  • Yuke Zhu
  • Li Fei-Fei 0001
  • Silvio Savarese

Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by allowing them to learn from human demonstrations collected via teleoperation, but has mostly been limited to single-arm manipulation. However, many real-world tasks require multiple arms, such as lifting a heavy object or assembling a desk. Unfortunately, applying IL to multi-arm manipulation tasks has been challenging: asking a human to control more than one robotic arm can impose a significant cognitive burden and is often only possible for a maximum of two robot arms. To address these challenges, we present MULTI-ARM ROBOTURK (MART), a multi-user data collection platform that allows multiple remote users to simultaneously teleoperate a set of robotic arms and collect demonstrations for multi-arm tasks. Using MART, we collected demonstrations for five novel two- and three-arm tasks from several geographically separated users. From our data we arrived at a critical insight: most multi-arm tasks do not require global coordination throughout their full duration, but only during specific moments. We show that learning from such data consequently presents challenges for centralized agents that directly attempt to model all robot actions simultaneously, and perform a comprehensive study of different policy architectures with varying levels of centralization on our tasks. Finally, we propose and evaluate a base-residual policy framework that allows trained policies to better adapt to the mixed coordination setting common in multi-arm manipulation, and show that a centralized policy augmented with a decentralized residual model outperforms all other models on our set of benchmark tasks. Additional results and videos at https://roboturk.stanford.edu/multiarm

IROS Conference 2021 Conference Paper

Probabilistic Visual Navigation with Bidirectional Image Prediction

  • Noriaki Hirose
  • Shun Taguchi
  • Fei Xia 0002
  • Roberto Martín-Martín
  • Kosuke Tahara
  • Masanori Ishigaki
  • Silvio Savarese

Humans can robustly follow a visual trajectory defined by a sequence of images (i.e., a video) regardless of substantial changes in the environment or the presence of obstacles. We aim at endowing similar visual navigation capabilities to mobile robots solely equipped with an RGB fisheye camera. We propose a novel probabilistic visual navigation system that learns to follow a sequence of images with bidirectional visual predictions conditioned on possible navigation velocities. By predicting bidirectionally (from start towards goal and vice versa), our method extends its predictive horizon, enabling the robot to go around unseen large obstacles that are not visible in the video trajectory. Learning how to react to obstacles and potential risks in the visual field is achieved by imitating human teleoperators. Since the human teleoperation commands are diverse, we propose a probabilistic representation of trajectories that we can sample to find the safest path. We evaluate our navigation system quantitatively and qualitatively in multiple simulated and real environments and compare to state-of-the-art baselines. Our approach outperforms the most recent visual navigation methods by a large margin with regard to goal arrival rate, subgoal coverage rate, and success weighted by path length (SPL). Our method also generalizes to new robot embodiments never used during training.

ICRA Conference 2021 Conference Paper

ReLMoGen: Integrating Motion Generation in Reinforcement Learning for Mobile Manipulation

  • Fei Xia 0002
  • Chengshu Li 0002
  • Roberto Martín-Martín
  • Or Litany
  • Alexander Toshev
  • Silvio Savarese

Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners, we can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space. We propose ReLMoGen – a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. To validate our method, we apply ReLMoGen to two types of tasks: 1) Interactive Navigation tasks, navigation problems where interactions with the environment are required to reach the destination, and 2) Mobile Manipulation tasks, manipulation tasks that require moving the robot base. These problems are challenging because they are usually long-horizon, hard to explore during training, and comprise alternating phases of navigation and interaction. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. In all settings, ReLMoGen outperforms state-of-the-art RL and Hierarchical RL baselines. ReLMoGen also shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots. For more information, please visit project website: http://svl.stanford.edu/projects/relmogen.

ICRA Conference 2021 Conference Paper

Robot Navigation in Constrained Pedestrian Environments using Reinforcement Learning

  • Claudia Pérez-D'Arpino
  • Can Liu
  • Patrick Goebel
  • Roberto Martín-Martín
  • Silvio Savarese

Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in human environments, such as buildings and homes. While research on social navigation has focused mainly on the scalability with the number of pedestrians in open spaces, typical indoor environments present the additional challenge of constrained spaces such as corridors and doorways that limit maneuverability and influence patterns of pedestrian interaction. We present an approach based on reinforcement learning (RL) to learn policies capable of dynamic adaptation to the presence of moving pedestrians while navigating between desired locations in constrained environments. The policy network receives guidance from a motion planner that provides waypoints to follow a globally planned trajectory, whereas RL handles the local interactions. We explore a compositional principle for multi-layout training and find that policies trained in a small set of geometrically simple layouts successfully generalize to more complex unseen layouts that exhibit composition of the structural elements available during training. Going beyond walls-world like domains, we show transfer of the learned policy to unseen 3D reconstructions of two real environments. These results support the applicability of the compositional principle to navigation in real-world buildings and indicate promising usage of multi-agent simulation within reconstructed environments for tasks that involve interaction. https://ai.stanford.edu/~cdarpino/socialnavconstrained/

ICRA Conference 2021 Conference Paper

Semantic and Geometric Modeling with Neural Message Passing in 3D Scene Graphs for Hierarchical Mechanical Search

  • Andrey Kurenkov
  • Roberto Martín-Martín
  • Jeffrey Ichnowski
  • Ken Goldberg
  • Silvio Savarese

Searching for objects in indoor organized environments such as homes or offices is part of our everyday activities. When looking for a desired object, we reason about the rooms and containers the object is likely to be in; the same type of container will have a different probability of containing the target depending on which room it is in. We also combine geometric and semantic information to infer what container is best to search, or what other objects are best to move, if the target object is hidden from view. We use a 3D scene graph representation to capture the hierarchical, semantic, and geometric aspects of this problem. To exploit this representation in a search process, we introduce Hierarchical Mechanical Search (HMS), a method that guides an agent’s actions towards finding a target object specified with a natural language description. HMS is based on a novel neural network architecture that uses neural message passing of vectors with visual, geometric, and linguistic information to allow HMS to process data across layers of the graph while combining semantic and geometric cues. HMS is trained on 1000 3D scene graphs and evaluated on a novel dataset of 500 3D scene graphs with dense placements of semantically related objects in storage locations, and is shown to be significantly better than several baselines at finding objects. It is also close to the oracle policy in terms of the median number of actions required. Additional qualitative results can be found at https://ai.stanford.edu/mech-search/hms

ICRA Conference 2020 Conference Paper

6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints

  • Chen Wang 0053
  • Roberto Martín-Martín
  • Danfei Xu
  • Jun Lv
  • Cewu Lu
  • Li Fei-Fei 0001
  • Silvio Savarese
  • Yuke Zhu

We present 6-PACK, a deep learning approach to category-level 6D object pose tracking on RGB-D data. Our method tracks in real time novel object instances of known object categories such as bowls, laptops, and mugs. 6-PACK learns to compactly represent an object by a handful of 3D keypoints, based on which the interframe motion of an object instance can be estimated through keypoint matching. These keypoints are learned end-to-end without manual supervision in order to be most effective for tracking. Our experiments show that our method substantially outperforms existing methods on the NOCS category-level 6D pose estimation benchmark and supports a physical robot to perform simple vision-based closed-loop manipulation tasks. Our code and video are available at https://sites.google.com/view/6packtracking.

ICML Conference 2020 Conference Paper

Goal-Aware Prediction: Learning to Model What Matters

  • Suraj Nair 0003
  • Silvio Savarese
  • Chelsea Finn

Learned dynamics models combined with both planning and policy learning algorithms have shown promise in enabling artificial agents to learn to perform many diverse tasks with limited supervision. However, one of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model (future state reconstruction), and that of the downstream planner or policy (completing a specified task). This issue is exacerbated by vision-based control tasks in diverse real-world environments, where the complexity of the real world dwarfs model capacity. In this paper, we propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space, resulting in a learning objective that more closely matches the downstream task. Further, we do so in an entirely self-supervised manner, without the need for a reward function or image labels. We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.

ICRA Conference 2020 Conference Paper

IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data

  • Ajay Mandlekar
  • Fabio Ramos 0001
  • Byron Boots
  • Silvio Savarese
  • Li Fei-Fei 0001
  • Animesh Garg
  • Dieter Fox

Learning from offline task demonstrations is a problem of great interest in robotics. For simple short-horizon manipulation tasks with modest variation in task instances, offline learning from a small set of demonstrations can produce controllers that successfully solve the task. However, leveraging a fixed batch of data can be problematic for larger datasets and longer-horizon tasks with greater variations. The data can exhibit substantial diversity and consist of suboptimal solution approaches. In this paper, we propose Implicit Reinforcement without Interaction at Scale (IRIS), a novel framework for learning from large-scale demonstration datasets. IRIS factorizes the control problem into a goal-conditioned low-level controller that imitates short demonstration sequences and a high-level goal selection mechanism that sets goals for the low-level and selectively combines parts of suboptimal solutions leading to more successful task completions. We evaluate IRIS across three datasets, including the RoboTurk Cans dataset collected by humans via crowdsourcing, and show that performant policies can be learned from purely offline learning. Additional results at https://sites.google.com/stanford.edu/iris/.

IROS Conference 2020 Conference Paper

JRMOT: A Real-Time 3D Multi-Object Tracker and a New Large-Scale Dataset

  • Abhijeet Shenoi
  • Mihir Patel
  • JunYoung Gwak
  • Patrick Goebel
  • Amir Sadeghian
  • Hamid Rezatofighi
  • Roberto Martín-Martín
  • Silvio Savarese

Robots navigating autonomously need to perceive and track the motion of objects and other agents in their surroundings. This information enables planning and executing robust and safe trajectories. To facilitate these processes, the motion should be perceived in 3D Cartesian space. However, most recent multi-object tracking (MOT) research has focused on tracking people and moving objects in 2D RGB video sequences. In this work we present JRMOT, a novel 3D MOT system that integrates information from RGB images and 3D point clouds to achieve real-time, state-of-the-art tracking performance. Our system is built with recent neural networks for re-identification, 2D and 3D detection and track description, combined into a joint probabilistic data-association framework within a multi-modal recursive Kalman architecture. As part of our work, we release the JRDB dataset, a novel large scale 2D+3D dataset and benchmark, annotated with over 2 million boxes and 3,500 time-consistent 2D+3D trajectories across 54 indoor and outdoor scenes. JRDB contains over 60 minutes of data including 360° cylindrical RGB video and 3D point clouds in social settings that we use to develop, train and evaluate JRMOT. The presented 3D MOT system demonstrates state-of-the-art performance against competing methods on the popular 2D tracking KITTI benchmark and serves as the first 3D tracking solution for our benchmark. Real-robot tests on our social robot JackRabbot indicate that the system is capable of tracking multiple pedestrians quickly and reliably. We provide the ROS code of our tracker at https://sites.google.com/view/jrmot.

ICRA Conference 2020 Conference Paper

KETO: Learning Keypoint Representations for Tool Manipulation

  • Zengyi Qin
  • Kuan Fang
  • Yuke Zhu
  • Li Fei-Fei 0001
  • Silvio Savarese

We aim to develop an algorithm for robots to manipulate novel objects as tools for completing different task goals. An efficient and informative representation would facilitate the effectiveness and generalization of such algorithms. For this purpose, we present KETO, a framework of learning keypoint representations of tool-based manipulation. For each task, a set of task-specific keypoints is jointly predicted from 3D point clouds of the tool object by a deep neural network. These keypoints offer a concise and informative description of the object to determine grasps and subsequent manipulation actions. The model is learned from self-supervised robot interactions in the task environment without the need for explicit human annotations. We evaluate our framework in three manipulation tasks with tool use. Our model consistently outperforms state-of-the-art methods in terms of task success rates. Qualitative results of keypoint prediction and tool generation are shown to visualize the learned representations.

IROS Conference 2020 Conference Paper

Localizing Against Drawn Maps via Spline-Based Registration

  • Kevin Chen 0001
  • Marynel Vázquez
  • Silvio Savarese

We propose a method to facilitate robot navigation relative to sketched maps of human environments. Our main contribution centers around using thin plate splines for registering the robot's LIDAR observation with the hand-drawn maps. Thin plate splines are particularly effective for this task because they are able to handle many of the nonrigid deformations commonly seen in sketches of maps, which render traditional rigid transformations inappropriate. Our proposed approach uses a convolutional neural network to efficiently predict the control points which define the spline transform, from which we then compute the pose of the robot on the hand drawn map for navigation purposes. Our systematic evaluations in simulation using a synthetic dataset and real, hand-drawn sketches show that the proposed spline-based registration approach outperforms baseline methods.
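
To make the spline step concrete, here is a small sketch of how predicted control-point correspondences could define a thin plate spline warp from LIDAR coordinates to sketch-map coordinates, using SciPy's thin-plate-spline RBF interpolator for illustration; the control points would come from the paper's CNN, which is not reproduced here.

```python
# Hedged sketch: a 2-D thin plate spline warp defined by control-point pairs.
import numpy as np
from scipy.interpolate import RBFInterpolator

def tps_warp(src_ctrl, dst_ctrl, points):
    """src_ctrl, dst_ctrl: (K, 2) control points; points: (N, 2) LIDAR points."""
    warp = RBFInterpolator(src_ctrl, dst_ctrl, kernel="thin_plate_spline")
    return warp(points)  # points expressed in sketch-map coordinates

# Identical source and destination control points give (approximately) the identity warp.
ctrl = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
scan = np.random.rand(100, 2)
warped = tps_warp(ctrl, ctrl, scan)
```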

IROS Conference 2020 Conference Paper

Multimodal Sensor Fusion with Differentiable Filters

  • Michelle A. Lee
  • Brent Yi
  • Roberto Martín-Martín
  • Silvio Savarese
  • Jeannette Bohg

Leveraging multimodal information with recursive Bayesian filters improves performance and robustness of state estimation, as recursive filters can combine different modalities according to their uncertainties. Prior work has studied how to optimally fuse different sensor modalities with analytical state estimation algorithms. However, deriving the dynamics and measurement models along with their noise profile can be difficult or lead to intractable models. Differentiable filters provide a way to learn these models end-to-end while retaining the algorithmic structure of recursive filters. This can be especially helpful when working with sensor modalities that are high dimensional and have very different characteristics. In contact-rich manipulation, we want to combine visual sensing (which gives us global information) with tactile sensing (which gives us local information). In this paper, we study new differentiable filtering architectures to fuse heterogeneous sensor information. As case studies, we evaluate three tasks: two in planar pushing (simulated and real) and one in manipulating a kinematically constrained door (simulated). In extensive evaluations, we find that differentiable filters that leverage crossmodal sensor information reach comparable accuracies to unstructured LSTM models, while presenting interpretability benefits that may be important for safety-critical systems. We also release an open-source library for creating and training differentiable Bayesian filters in PyTorch, which can be found on our project website: https://sites.google.com/view/multimodalfilter.
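
The core fusion mechanism these filters rely on can be illustrated in a few lines: each modality contributes a state estimate with an (often learned) uncertainty, and the estimates are combined by inverse-variance weighting, which is differentiable end-to-end. This is a deliberately simplified sketch, not the architectures studied in the paper.

```python
# Hedged sketch: differentiable precision-weighted fusion of two modalities.
import torch

def precision_weighted_fusion(mu_vision, var_vision, mu_tactile, var_tactile):
    """mu_*: (batch, state_dim) means; var_*: (batch, state_dim) positive variances."""
    prec_v, prec_t = 1.0 / var_vision, 1.0 / var_tactile
    fused_var = 1.0 / (prec_v + prec_t)
    fused_mu = fused_var * (prec_v * mu_vision + prec_t * mu_tactile)
    return fused_mu, fused_var

# The modality reporting lower variance pulls the fused estimate toward itself.
mu_f, var_f = precision_weighted_fusion(
    torch.tensor([[0.0]]), torch.tensor([[1.0]]),
    torch.tensor([[1.0]]), torch.tensor([[0.1]]))
```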

IROS Conference 2020 Conference Paper

Visuomotor Mechanical Search: Learning to Retrieve Target Objects in Clutter

  • Andrey Kurenkov
  • Joseph Taglic
  • Rohun Kulkarni
  • Marcus Dominguez-Kuhne
  • Animesh Garg
  • Roberto Martín-Martín
  • Silvio Savarese

When searching for objects in cluttered environments, it is often necessary to perform complex interactions in order to move occluding objects out of the way, fully reveal the object of interest, and make it graspable. Due to the complexity of the physics involved and the lack of accurate models of the clutter, planning and controlling precise predefined interactions with accurate outcomes is extremely hard, if not impossible. In problems where accurate (forward) models are lacking, Deep Reinforcement Learning (RL) has been shown to be a viable solution for mapping observations (e.g., images) to good interactions in the form of closed-loop visuomotor policies. However, Deep RL is sample inefficient and fails when applied directly to the problem of unoccluding objects based on images. In this work we present a novel Deep RL procedure that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample-efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects. Our experiments show that our approach trains faster and converges to more efficient uncovering solutions than baselines and ablations, and that our uncovering policies lead to an average improvement in the graspability of the target object, facilitating downstream retrieval applications.

ICML Conference 2020 Conference Paper

Which Tasks Should Be Learned Together in Multi-task Learning?

  • Trevor Standley
  • Amir Zamir
  • Dawn Chen
  • Leonidas J. Guibas
  • Jitendra Malik
  • Silvio Savarese

Many computer vision applications require solving multiple tasks in real-time. A neural network can be trained to solve multiple tasks simultaneously using multi-task learning. This can save computation at inference time as only a single network needs to be evaluated. Unfortunately, this often leads to inferior overall performance as task objectives can compete, which consequently poses the question: which tasks should and should not be learned together in one network when employing multi-task learning? We study task cooperation and competition in several different learning settings and propose a framework for assigning tasks to a few neural networks such that cooperating tasks are computed by the same neural network, while competing tasks are computed by different networks. Our framework offers a time-accuracy trade-off and can produce better accuracy using less inference time than not only a single large multi-task neural network but also many single-task networks.
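
The assignment step can be phrased as a small search problem: given an estimate of how well each group of tasks performs when trained together, enumerate partitions of the task set into at most a budgeted number of networks and keep the best one. The sketch below assumes such a `group_score` function is available (in practice it would be estimated from trained networks) and is illustrative only; exhaustive enumeration is feasible only for small task sets.

```python
# Hedged sketch: brute-force task-to-network assignment under a network budget.
def partitions(tasks):
    """Yield all set partitions of a list of tasks."""
    if not tasks:
        yield []
        return
    first, rest = tasks[0], tasks[1:]
    for sub in partitions(rest):
        for i in range(len(sub)):
            yield sub[:i] + [[first] + sub[i]] + sub[i + 1:]
        yield [[first]] + sub

def best_assignment(tasks, group_score, budget):
    """group_score(frozenset_of_tasks) -> estimated performance of co-training them."""
    best_score, best_parts = float("-inf"), None
    for parts in partitions(list(tasks)):
        if len(parts) > budget:            # would need more networks than allowed
            continue
        score = sum(group_score(frozenset(g)) for g in parts)
        if score > best_score:
            best_score, best_parts = score, parts
    return best_parts, best_score
```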

IROS Conference 2019 Conference Paper

Continuous Relaxation of Symbolic Planner for One-Shot Imitation Learning

  • De-An Huang
  • Danfei Xu
  • Yuke Zhu
  • Animesh Garg
  • Silvio Savarese
  • Li Fei-Fei 0001
  • Juan Carlos Niebles

We address one-shot imitation learning, where the goal is to execute a previously unseen task based on a single demonstration. While there has been exciting progress in this direction, most of the approaches still require a few hundred tasks for meta-training, which limits the scalability of the approaches. Our main contribution is to formulate one-shot imitation learning as a symbolic planning problem along with the symbol grounding problem. This formulation disentangles the policy execution from the inter-task generalization and leads to better data efficiency. The key technical challenge is that the symbol grounding is prone to error with limited training data and leads to subsequent symbolic planning failures. We address this challenge by proposing a continuous relaxation of the discrete symbolic planner that directly plans on the probabilistic outputs of the symbol grounding model. Our continuous relaxation of the planner can still leverage the information contained in the probabilistic symbol grounding and significantly improve over the baseline planner for the one-shot imitation learning tasks without using large training data.

ICRA Conference 2019 Conference Paper

Deep Local Trajectory Replanning and Control for Robot Navigation

  • Ashwini Pokle
  • Roberto Martín-Martín
  • Patrick Goebel
  • Vincent Chow
  • Hans M. Ewald
  • Junwei Yang
  • Zhenkai Wang
  • Amir Sadeghian

We present a navigation system that combines ideas from hierarchical planning and machine learning. The system uses a traditional global planner to compute optimal paths towards a goal, and a deep local trajectory planner and velocity controller to compute motion commands. The latter components of the system adjust the behavior of the robot through attention mechanisms such that it moves towards the goal, avoids obstacles, and respects the space of nearby pedestrians. Both the structure of the proposed deep models and the use of attention mechanisms make the system's execution interpretable. Our simulation experiments suggest that the proposed architecture outperforms baselines that try to map global plan information and sensor data directly to velocity commands. In comparison to a hand-designed traditional navigation system, the proposed approach showed more consistent performance.

ICRA Conference 2019 Conference Paper

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

  • Michelle A. Lee
  • Yuke Zhu
  • Krishnan Srinivasan
  • Parth Shah 0003
  • Silvio Savarese
  • Li Fei-Fei 0001
  • Animesh Garg
  • Jeannette Bohg

Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. We use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. We evaluate our method on a peg insertion task, generalizing over different geometry, configurations, and clearances, while being robust to external perturbations. We present results in simulation and on a real robot.

ICRA Conference 2019 Conference Paper

Mechanical Search: Multi-Step Retrieval of a Target Object Occluded by Clutter

  • Michael Danielczuk
  • Andrey Kurenkov
  • Ashwin Balakrishna
  • Matthew Matl
  • David Wang
  • Roberto Martín-Martín
  • Animesh Garg
  • Silvio Savarese

When operating in unstructured environments such as warehouses, homes, and retail centers, robots are frequently required to interactively search for and retrieve specific objects from cluttered bins, shelves, or tables. Mechanical Search describes the class of tasks where the goal is to locate and extract a known target object. In this paper, we formalize Mechanical Search and study a version where distractor objects are heaped over the target object in a bin. The robot uses an RGBD perception system and control policies to iteratively select, parameterize, and perform one of 3 actions - push, suction, grasp - until the target object is extracted, a time limit is exceeded, or no high-confidence push or grasp is available. We present a study of 5 algorithmic policies for mechanical search, with 15,000 simulated trials and 300 physical trials for heaps ranging from 10 to 20 objects. Results suggest that success can be achieved in this long-horizon task with algorithmic policies in over 95% of instances and that the number of actions required scales approximately linearly with the size of the heap. Code and supplementary material can be found at http://ai.stanford.edu/mech-search.
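
The outer decision loop described above can be summarized as repeatedly choosing the highest-confidence primitive until the target is free; the sketch below is purely schematic, and every perception and planning call (`target_graspable`, `plan_grasp`, `plan_suction`, `plan_push`) is a hypothetical placeholder rather than an API from the paper.

```python
# Hedged sketch of the select-parameterize-execute loop for mechanical search.
def mechanical_search(env, policy, max_steps=20, min_confidence=0.5):
    for _ in range(max_steps):
        rgbd = env.observe()
        if policy.target_graspable(rgbd):
            env.execute(policy.plan_grasp(rgbd, target=True))   # extract the target
            return True
        # Otherwise act on occluding objects with the most confident primitive.
        candidates = [policy.plan_push(rgbd), policy.plan_suction(rgbd), policy.plan_grasp(rgbd)]
        candidates = [a for a in candidates if a is not None and a.confidence >= min_confidence]
        if not candidates:
            return False   # no high-confidence push or grasp available
        env.execute(max(candidates, key=lambda a: a.confidence))
    return False           # time limit exceeded
```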

NeurIPS Conference 2019 Conference Paper

Regression Planning Networks

  • Danfei Xu
  • Roberto Martín-Martín
  • De-An Huang
  • Yuke Zhu
  • Silvio Savarese
  • Li Fei-Fei

Recent learning-to-plan methods have shown promising results on planning directly from observation space. Yet, their ability to plan for long-horizon tasks is limited by the accuracy of the prediction model. On the other hand, classical symbolic planners show remarkable capabilities in solving long-horizon tasks, but they require predefined symbolic rules and symbolic states, restricting their real-world applicability. In this work, we combine the benefits of these two paradigms and propose a learning-to-plan method that can directly generate a long-term symbolic plan conditioned on high-dimensional observations. We borrow the idea of regression (backward) planning from classical planning literature and introduce Regression Planning Networks (RPN), a neural network architecture that plans backward starting at a task goal and generates a sequence of intermediate goals that reaches the current observation. We show that our model not only inherits many favorable traits from symbolic planning --including the ability to solve previously unseen tasks-- but also can learn from visual inputs in an end-to-end manner. We evaluate the capabilities of RPN in a grid world environment and a simulated 3D kitchen environment featuring complex visual scenes and long task horizon, and show that it achieves near-optimal performance in completely new task instances.

IROS Conference 2019 Conference Paper

Scaling Robot Supervision to Hundreds of Hours with RoboTurk: Robotic Manipulation Dataset through Human Reasoning and Dexterity

  • Ajay Mandlekar
  • Jonathan Booher
  • Max Spero
  • Albert Tung
  • Anchit Gupta
  • Yuke Zhu
  • Animesh Garg
  • Silvio Savarese

Large, richly annotated datasets have accelerated progress in fields such as computer vision and natural language processing, but replicating these successes in robotics has been challenging. While prior data collection methodologies such as self-supervision have resulted in large datasets, the data can have poor signal-to-noise ratio. By contrast, previous efforts to collect task demonstrations with humans provide better quality data, but they cannot reach the same data magnitude. Furthermore, neither approach places guarantees on the diversity of the data collected, in terms of solution strategies. In this work, we leverage and extend the RoboTurk platform to scale up data collection for robotic manipulation using remote teleoperation. The primary motivation for our platform is two-fold: (1) to address the shortcomings of prior work and increase the total quantity of manipulation data collected through human supervision by an order of magnitude without sacrificing the quality of the data and (2) to collect data on challenging manipulation tasks across several operators and observe a diverse set of emergent behaviors and solutions. We collected over 111 hours of robot manipulation data across 54 users and 3 challenging manipulation tasks in 1 week, resulting in the largest robot dataset collected via remote teleoperation. We evaluate the quality of our platform, the diversity of demonstrations in our dataset, and the utility of our dataset via quantitative and qualitative analysis. For additional results, supplementary videos, and to download our dataset, visit http://roboturk.stanford.edu/realrobotdataset

NeurIPS Conference 2019 Conference Paper

Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks

  • Vineet Kosaraju
  • Amir Sadeghian
  • Roberto Martín-Martín
  • Ian Reid
  • Hamid Rezatofighi
  • Silvio Savarese

Predicting the future trajectories of multiple interacting pedestrians in a scene has become an increasingly important problem for many different applications ranging from control of autonomous vehicles and social robots to security and surveillance. This problem is compounded by the presence of social interactions between humans and their physical interactions with the scene. While the existing literature has explored some of these cues, they mainly ignored the multimodal nature of each human's future trajectory which is noticeably influenced by the intricate social interactions. In this paper, we present Social-BiGAT, a graph-based generative adversarial network that generates realistic, multimodal trajectory predictions for multiple pedestrians in a scene. Our method is based on a graph attention network (GAT) that learns feature representations that encode the social interactions between humans in the scene, and a recurrent encoder-decoder architecture that is trained adversarially to predict, based on the features, the humans' paths. We explicitly account for the multimodal nature of the prediction problem by forming a reversible transformation between each scene and its latent noise vector, as in Bicycle-GAN. We show that our framework achieves state-of-the-art performance comparing it to several baselines on existing trajectory forecasting benchmarks.

IJCAI Conference 2019 Conference Paper

Taskonomy: Disentangling Task Transfer Learning

  • Amir Zamir
  • Alexander Sax
  • William Shen
  • Leonidas Guibas
  • Jitendra Malik
  • Silvio Savarese

Do visual tasks have relationships, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a certain structure among visual tasks. Knowing this structure has notable values; it provides a principled way for identifying relationships across tasks, for instance, in order to reuse supervision among tasks with redundancies or solve many tasks in one system without piling up the complexity. We propose a fully computational approach for modeling the transfer learning structure of the space of visual tasks. This is done via finding transfer learning dependencies across tasks in a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks. The product is a computational taxonomic map among tasks for transfer learning, and we exploit it to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and visualizing this taxonomical structure at http://taskonomy.vision.

IROS Conference 2019 Conference Paper

Variable Impedance Control in End-Effector Space: An Action Space for Reinforcement Learning in Contact-Rich Tasks

  • Roberto Martín-Martín
  • Michelle A. Lee
  • Rachel Gardner
  • Silvio Savarese
  • Jeannette Bohg
  • Animesh Garg

Reinforcement Learning (RL) of contact-rich manipulation tasks has yielded impressive results in recent years. While many studies in RL focus on varying the observation space or reward model, few efforts have focused on the choice of action space (e.g., joint or end-effector space, position, velocity, etc.). However, studies in robot motion control indicate that choosing an action space that conforms to the characteristics of the task can simplify exploration and improve robustness to disturbances. This paper studies the effect of different action spaces in deep RL and advocates for variable impedance control in end-effector space (VICES) as an advantageous action space for constrained and contact-rich tasks. We evaluate multiple action spaces on three prototypical manipulation tasks: Path Following (task with no contact), Door Opening (task with kinematic constraints), and Surface Wiping (task with continuous contact). We show that VICES improves sample efficiency, maintains low energy consumption, and ensures safety across all three experimental setups. Further, RL policies learned with VICES can transfer across different robot models in simulation, and from simulation to real for the same robot. Further information is available at https://stanfordvl.github.io/vices.
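
A minimal sketch of what a variable-impedance end-effector action space can look like: the policy action carries both a desired end-effector displacement and the stiffness and damping gains, and a spring-damper law turns them into a wrench (mapping the wrench to joint torques via the Jacobian is omitted). The decomposition of the action vector here is an assumption for illustration, not the paper's exact parameterization.

```python
# Hedged sketch: spring-damper wrench from a variable-impedance action.
import numpy as np

def variable_impedance_wrench(action, x, x_dot):
    """action: concatenation of [delta_x (6,), kp (6,), kd (6,)];
    x, x_dot: current end-effector pose and twist, (6,) each."""
    delta_x, kp, kd = np.split(np.asarray(action, dtype=float), 3)
    x_desired = x + delta_x
    return kp * (x_desired - x) - kd * x_dot   # wrench in end-effector space
```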

NeurIPS Conference 2018 Conference Paper

Generalizing to Unseen Domains via Adversarial Data Augmentation

  • Riccardo Volpi
  • Hongseok Namkoong
  • Ozan Sener
  • John Duchi
  • Vittorio Murino
  • Silvio Savarese

We are concerned with learning models that generalize well to different unseen domains. We consider a worst-case formulation over data distributions that are near the source domain in the feature space. Only using training data from a single source distribution, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers that regularize towards zero (e.g., ridge or lasso). On digit recognition and semantic segmentation tasks, our method learns models that improve performance across a range of a priori unknown target domains.
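
The augmentation loop lends itself to a short PyTorch sketch: starting from source examples, take gradient-ascent steps on the task loss while penalizing the distance to the source example in feature space, then append the resulting "fictitious" examples to the training set. The `features`/`classifier` split of the model and all hyperparameters below are assumptions made for illustration.

```python
# Hedged sketch: generating adversarial examples that are hard yet near the source domain.
import torch
import torch.nn.functional as F

def adversarial_augment(model, x, y, gamma=1.0, step_size=1.0, steps=15):
    x_adv = x.clone().detach().requires_grad_(True)
    feat_src = model.features(x).detach()            # anchor in feature space
    for _ in range(steps):
        feat = model.features(x_adv)
        logits = model.classifier(feat)
        dist = ((feat - feat_src) ** 2).sum()
        loss = F.cross_entropy(logits, y) - gamma * dist   # hard, but close to the source
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv += step_size * grad                # gradient-ascent step
    return x_adv.detach()                            # appended to the training set
```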

IROS Conference 2018 Conference Paper

GONet: A Semi-Supervised Deep Learning Approach For Traversability Estimation

  • Noriaki Hirose
  • Amir Sadeghian
  • Marynel Vázquez
  • Patrick Goebel
  • Silvio Savarese

We present semi-supervised deep learning approaches for traversability estimation from fisheye images. Our method, GONet, and the proposed extensions leverage Generative Adversarial Networks (GANs) to effectively predict whether the area seen in the input image(s) is safe for a robot to traverse. These methods are trained with many positive images of traversable places, but just a small set of negative images depicting blocked and unsafe areas. This makes the proposed methods practical. Positive examples can be collected easily by simply operating a robot through traversable spaces, while obtaining negative examples is time consuming, costly, and potentially dangerous. Through extensive experiments and several demonstrations, we show that the proposed traversability estimation approaches are robust and can generalize to unseen scenarios. Further, we demonstrate that our methods are memory efficient and fast, allowing for real-time operation on a mobile robot with single or stereo fisheye cameras. As part of our contributions, we open-source two new datasets for traversability estimation. These datasets are composed of approximately 24h of videos from more than 25 indoor environments. Our methods outperform baseline approaches for traversability estimation on these new datasets.

ICRA Conference 2018 Conference Paper

Multi-Task Domain Adaptation for Deep Learning of Instance Grasping from Simulation

  • Kuan Fang
  • Yunfei Bai
  • Stefan Hinterstoisser
  • Silvio Savarese
  • Mrinal Kalakrishnan

Learning-based approaches to robotic manipulation are limited by the scalability of data collection and accessibility of labels. In this paper, we present a multi-task domain adaptation framework for instance grasping in cluttered scenes by utilizing simulated robot experiments. Our neural network takes monocular RGB images and the instance segmentation mask of a specified target object as inputs, and predicts the probability of successfully grasping the specified object for each candidate motor command. The proposed transfer learning framework trains a model for instance grasping in simulation and uses a domain-adversarial loss to transfer the trained model to real robots using indiscriminate grasping data, which is available both in simulation and the real world. We evaluate our model in real-world robot experiments, comparing it with alternative model architectures as well as an indiscriminate grasping baseline.
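
One standard way to realize a domain-adversarial loss of the kind mentioned above is a gradient reversal layer in front of a domain classifier; the sketch below shows that generic mechanism and is not the paper's specific network.

```python
# Hedged sketch: domain-adversarial training via a gradient reversal layer (GRL).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing into the feature extractor.
        return -ctx.lam * grad_output, None

def domain_adversarial_loss(features, domain_labels, domain_head, lam=0.1):
    """features: (batch, d); domain_labels: long tensor, 0 = simulation, 1 = real."""
    reversed_feat = GradReverse.apply(features, lam)
    logits = domain_head(reversed_feat)
    return nn.functional.cross_entropy(logits, domain_labels)
```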

ICRA Conference 2018 Conference Paper

Neural Task Programming: Learning to Generalize Across Hierarchical Tasks

  • Danfei Xu
  • Suraj Nair 0003
  • Yuke Zhu
  • Julian Gao
  • Animesh Garg
  • Li Fei-Fei 0001
  • Silvio Savarese

In this work, we propose a novel robot learning framework called Neural Task Programming (NTP), which bridges the idea of few-shot learning from demonstration and neural program induction. NTP takes as input a task specification (e.g., a video demonstration of a task) and recursively decomposes it into finer sub-task specifications. These specifications are fed to a hierarchical neural program, where bottom-level programs are callable subroutines that interact with the environment. We validate our method in three robot manipulation tasks. NTP achieves strong generalization across sequential tasks that exhibit hierarchical and compositional structures. The experimental results show that NTP learns to generalize well towards unseen tasks with increasing lengths, variable topologies, and changing objectives. Project page: stanfordvl.github.io/ntp/.

IROS Conference 2017 Conference Paper

Adversarially Robust Policy Learning: Active construction of physically-plausible perturbations

  • Ajay Mandlekar
  • Yuke Zhu
  • Animesh Garg
  • Li Fei-Fei 0001
  • Silvio Savarese

Policy search methods in reinforcement learning have demonstrated success in scaling up to larger problems beyond toy examples. However, deploying these methods on real robots remains challenging due to the large sample complexity required during learning and their vulnerability to malicious intervention. We introduce Adversarially Robust Policy Learning (ARPL), an algorithm that leverages active computation of physically-plausible adversarial examples during training to enable robust policy learning in the source domain and robust performance under both random and adversarial input perturbations. We evaluate ARPL on four continuous control tasks and show superior resilience to changes in physical environment dynamics parameters and environment state as compared to state-of-the-art robust policy learning methods. Code, data, and additional experimental results are available at: stanfordvl.github.io/ARPL.

ICRA Conference 2017 Conference Paper

Unsupervised camera localization in crowded spaces

  • Alexandre Alahi
  • Judson Wilson
  • Li Fei-Fei 0001
  • Silvio Savarese

Existing camera networks in public spaces such as train terminals or malls can help social robots to navigate crowded scenes. However, the localization of the cameras is required, i.e., the positions and poses of all cameras in a single reference frame. In this work, we estimate the relative location of any pair of cameras by solely using noisy trajectories observed from each camera. We propose a fully unsupervised learning technique using unlabelled pedestrian motion patterns captured in crowded scenes. We first estimate the pairwise camera parameters by optimally matching single-view pedestrian tracks using social awareness. Then, we show the impact of jointly estimating the network parameters. This is done by formulating a nonlinear least square optimization problem, leveraging a continuous approximation of the matching function. We evaluate our approach in real-world environments such as train terminals, where several hundreds of individuals need to be tracked across dozens of cameras every second.

NeurIPS Conference 2016 Conference Paper

Learning Transferrable Representations for Unsupervised Domain Adaptation

  • Ozan Sener
  • Hyun Oh Song
  • Ashutosh Saxena
  • Silvio Savarese

Supervised learning with large scale labelled datasets and deep layered models has caused a paradigm shift in diverse areas in learning and recognition. However, this approach still suffers from generalization issues under the presence of a domain shift between the training and the test data distribution. Since unsupervised domain adaptation algorithms directly address this domain shift problem between a labelled source dataset and an unlabelled target dataset, recent papers have shown promising results by fine-tuning the networks with domain adaptation loss functions which try to align the mismatch between the training and testing data distributions. Nevertheless, these recent deep learning based domain adaptation approaches still suffer from issues such as high sensitivity to the gradient reversal hyperparameters and overfitting during the fine-tuning stage. In this paper, we propose a unified deep learning framework where the representation, cross domain transformation, and target label inference are all jointly optimized in an end-to-end fashion for unsupervised domain adaptation. Our experiments show that the proposed method significantly outperforms state-of-the-art algorithms in both object recognition and digit classification experiments by a large margin. We will make our learned models as well as the source code available immediately upon acceptance.

ICRA Conference 2016 Conference Paper

Robust single-view instance recognition

  • David Held
  • Sebastian Thrun
  • Silvio Savarese

Some robots must repeatedly interact with a fixed set of objects in their environment. To operate correctly, it is helpful for the robot to be able to recognize the object instances that it repeatedly encounters. However, current methods for recognizing object instances require that, during training, many pictures are taken of each object from a large number of viewing angles. This procedure is slow and requires much manual effort before the robot can begin to operate in a new environment. We have developed a novel procedure for training a neural network to recognize a set of objects from just a single training image per object. To obtain robustness to changes in viewpoint, we take advantage of a supplementary dataset in which we observe a separate (non-overlapping) set of objects from multiple viewpoints. After pre-training the network in a novel multi-stage fashion, the network can robustly recognize new object instances given just a single training image of each object. If more images of each object are available, the performance improves. We perform a thorough analysis comparing our novel training procedure to traditional neural network pre-training techniques as well as previous state-of-the-art approaches including keypoint-matching, template-matching, and sparse coding, and we demonstrate that our method significantly outperforms these previous approaches. Our method can thus be used to easily teach a robot to recognize a novel set of object instances from unknown viewpoints.

NeurIPS Conference 2016 Conference Paper

Universal Correspondence Network

  • Christopher Choy
  • JunYoung Gwak
  • Silvio Savarese
  • Manmohan Chandraker

We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning across rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity. Our fully convolutional architecture, along with a novel correspondence contrastive loss allows faster training by effective reuse of computations, accurate gradient computation through the use of thousands of examples per image pair and faster testing with $O(n)$ feedforward passes for n keypoints, instead of $O(n^2)$ for typical patch similarity methods. We propose a convolutional spatial transformer to mimic patch normalization in traditional features like SIFT, which is shown to dramatically boost accuracy for semantic correspondences across intra-class shape variations. Extensive experiments on KITTI, PASCAL and CUB-2011 datasets demonstrate the significant advantages of our features over prior works that use either hand-constructed or learned features.
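
The correspondence contrastive loss mentioned above follows the familiar contrastive form over pairs of keypoint features: matching pairs are pulled together and non-matching pairs are pushed beyond a margin. A minimal sketch, with the margin value chosen arbitrarily and details differing from the paper:

```python
# Hedged sketch of a correspondence contrastive loss over keypoint features.
import torch

def correspondence_contrastive_loss(f1, f2, is_match, margin=1.0):
    """f1, f2: (N, d) features at sampled keypoints of an image pair;
    is_match: (N,) float tensor, 1 for true correspondences, 0 otherwise."""
    dist = torch.norm(f1 - f2, dim=1)
    pos = is_match * dist.pow(2)                                     # pull matches together
    neg = (1 - is_match) * torch.clamp(margin - dist, min=0).pow(2)  # push non-matches apart
    return 0.5 * (pos + neg).mean()
```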

ICRA Conference 2016 Conference Paper

Watch-Bot: Unsupervised learning for reminding humans of forgotten actions

  • Chenxia Wu
  • Jiemi Zhang
  • Bart Selman
  • Silvio Savarese
  • Ashutosh Saxena

We present a robotic system that watches a human using a Kinect v2 RGB-D sensor, detects what he forgot to do while performing an activity, and if necessary reminds the person using a laser pointer to point out the related object. Our simple setup can be easily deployed on any assistive robot. Our approach is based on a learning algorithm trained in a purely unsupervised setting, which does not require any human annotations. This makes our approach scalable and applicable to a variety of scenarios. Our model learns the action/object co-occurrence and action temporal relations in the activity, and uses the learned rich relationships to infer the forgotten action and the related object. We show that our approach not only improves the unsupervised action segmentation and action cluster assignment performance, but also effectively detects the forgotten actions on a challenging human activity RGB-D video dataset. In robotic experiments, we show that our robot is able to remind people of forgotten actions successfully.

ICML Conference 2014 Conference Paper

Structured Recurrent Temporal Restricted Boltzmann Machines

  • Roni Mittelman
  • Benjamin Kuipers
  • Silvio Savarese
  • Honglak Lee

The recurrent temporal restricted Boltzmann machine (RTRBM) is a probabilistic model for temporal data that has been shown to effectively capture both short- and long-term dependencies in time series. The topology of the RTRBM graphical model, however, assumes full connectivity between all the pairs of visible and hidden units, therefore ignoring the dependency structure between the different observations. Learning this structure has the potential to not only improve the prediction performance, but it can also reveal important patterns in the data. For example, given an econometric dataset, we could identify interesting dependencies between different market sectors; given a meteorological dataset, we could identify regional weather patterns. In this work we propose a new class of RTRBM, which explicitly uses a dependency graph to model the structure in the problem and to define the energy function. We refer to the new model as the structured RTRBM (SRTRBM). Our technique is related to methods such as graphical lasso, which are used to learn the topology of Gaussian graphical models. We also develop a spike-and-slab version of the RTRBM, and combine it with our method to learn structure in datasets with real valued observations. Our experimental results using synthetic and real datasets demonstrate that the SRTRBM can improve the prediction performance of the RTRBM, particularly when the number of visible units is large and the size of the training set is small. It also reveals the structure underlying our benchmark datasets.

ICRA Conference 2014 Conference Paper

Toward mutual information based place recognition

  • Gaurav Pandey 0004
  • James R. McBride
  • Silvio Savarese
  • Ryan M. Eustice

This paper reports on a novel mutual information (MI) based algorithm for robust place recognition. The proposed method provides a principled framework for fusing the complementary information obtained from 3D lidar and camera imagery for recognizing places within an a priori map of a dynamic environment. The visual appearance of the locations in the map can be significantly different due to changing weather, lighting conditions and dynamical objects present in the environment. Various 3D/2D features are extracted from the textured point clouds (scans) and each scan is represented as a collection of these features. For two scans acquired from the same location, the high value of MI between the features present in the scans indicates that the scans are captured from the same location. We use a non-parametric entropy estimator to estimate the true MI from the sparse marginal and joint histograms of the features extracted from the scans. Experimental results using seasonal datasets collected over several years are used to validate the robustness of the proposed algorithm.

AAAI Conference 2012 Conference Paper

Automatic Targetless Extrinsic Calibration of a 3D Lidar and Camera by Maximizing Mutual Information

  • Gaurav Pandey
  • James McBride
  • Silvio Savarese
  • Ryan Eustice

This paper reports on a mutual information (MI) based algorithm for automatic extrinsic calibration of a 3D laser scanner and optical camera system. By using MI as the registration criterion, our method is able to work in situ without the need for any specific calibration targets, which makes it practical for in-field calibration. The calibration parameters are estimated by maximizing the mutual information obtained between the sensor-measured surface intensities. We calculate the Cramér-Rao lower bound (CRLB) and show that the sample variance of the estimated parameters empirically approaches the CRLB for a sufficient number of views. Furthermore, we compare the calibration results to independent ground-truth and observe that the mean error also empirically approaches zero as the number of views is increased. This indicates that the proposed algorithm, in the limiting case, calculates a minimum variance unbiased (MVUB) estimate of the calibration parameters. Experimental results are presented for data collected by a vehicle mounted with a 3D laser scanner and an omnidirectional camera system.
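
The objective itself is compact: project the scan into the image under candidate extrinsics and measure the mutual information between lidar reflectivities and the grayscale intensities they land on, estimated from a joint histogram. The sketch below shows only that objective; the projection function and the optimizer over the six extrinsic parameters are placeholders, not the paper's implementation.

```python
# Hedged sketch: histogram-based MI objective for lidar-camera extrinsic calibration.
import numpy as np

def mutual_information(lidar_refl, image_intens, bins=32):
    joint, _, _ = np.histogram2d(lidar_refl, image_intens, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def calibration_objective(extrinsics, scans, images, project):
    """`project(scan, image, extrinsics)` returns the image intensities at the
    pixels the scan points project to; MI is summed over all views."""
    return sum(mutual_information(scan.reflectivity, project(scan, image, extrinsics))
               for scan, image in zip(scans, images))
```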

IROS Conference 2012 Conference Paper

Toward mutual information based automatic registration of 3D point clouds

  • Gaurav Pandey 0004
  • James R. McBride
  • Silvio Savarese
  • Ryan M. Eustice

This paper reports a novel mutual information (MI) based algorithm for automatic registration of unstructured 3D point clouds comprised of co-registered 3D lidar and camera imagery. The proposed method provides a robust and principled framework for fusing the complementary information obtained from these two different sensing modalities. High-dimensional features are extracted from a training set of textured point clouds (scans) and hierarchical k-means clustering is used to quantize these features into a set of codewords. Using this codebook, any new scan can be represented as a collection of codewords. Under the correct rigid-body transformation aligning two overlapping scans, the MI between the codewords present in the scans is maximized. We apply a James-Stein-type shrinkage estimator to estimate the true MI from the marginal and joint histograms of the codewords extracted from the scans. Experimental results using scans obtained by a vehicle equipped with a 3D laser scanner and an omnidirectional camera are used to validate the robustness of the proposed algorithm over a wide range of initial conditions. We also show that the proposed method works well with 3D data alone.

ICRA Conference 2011 Conference Paper

Visually bootstrapped generalized ICP

  • Gaurav Pandey 0004
  • Silvio Savarese
  • James R. McBride
  • Ryan M. Eustice

This paper reports a novel algorithm for bootstrapping the automatic registration of unstructured 3D point clouds collected using co-registered 3D lidar and omnidirectional camera imagery. Here, we exploit the co-registration of the 3D point cloud with the available camera imagery to associate high dimensional feature descriptors such as scale invariant feature transform (SIFT) or speeded up robust features (SURF) to the 3D points. We first establish putative point correspondence in the high dimensional feature space and then use these correspondences in a random sample consensus (RANSAC) framework to obtain an initial rigid body transformation that aligns the two scans. This initial transformation is then refined in a generalized iterative closest point (ICP) framework. The proposed method is completely data driven and does not require any initial guess on the transformation. We present results from a real world dataset collected by a vehicle equipped with a 3D laser scanner and an omnidirectional camera.
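
The bootstrapping step amounts to robustly estimating a rigid-body transform from putative 3D-3D correspondences. A minimal sketch of that step, with the SIFT/SURF extraction, descriptor matching, and final generalized-ICP refinement omitted (thresholds and iteration counts are arbitrary):

```python
# Hedged sketch: RANSAC around a least-squares (Kabsch) rigid-body fit.
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t minimizing sum ||R @ P_i + t - Q_i||^2."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def ransac_rigid(P, Q, iters=500, thresh=0.1, seed=0):
    """P, Q: (N, 3) putative correspondences from descriptor matching."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = kabsch(P[idx], Q[idx])
        inliers = np.linalg.norm(P @ R.T + t - Q, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        return None                                # no consensus found
    return kabsch(P[best_inliers], Q[best_inliers])  # refit on inliers; refine with ICP afterwards
```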