
Author name cluster

Trevor Darrell

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

128 papers
2 author rows

Possible papers

128

ICLR Conference 2025 Conference Paper

A Coefficient Makes SVRG Effective

  • Yida Yin
  • Zhiqiu Xu
  • ZhiYuan Li
  • Trevor Darrell
  • Zhuang Liu 0003

Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlight, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our empirical analysis finds that, for deeper neural networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes models, consistently reducing training loss compared to the baseline and standard SVRG across various model architectures and multiple image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at github.com/davidyyd/alpha-SVRG.
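
A minimal NumPy sketch of the update described above; grad is any mini-batch gradient function, and the variable names, linear schedule, and loop structure are illustrative assumptions rather than the paper's released code.

    import numpy as np

    def alpha_svrg_epoch(w, batches, grad, lr, alpha):
        """One outer loop of alpha-SVRG on a parameter vector w (sketch)."""
        w_snap = w.copy()
        # Full gradient at the snapshot, averaged over all mini-batches.
        mu = np.mean([grad(w_snap, b) for b in batches], axis=0)
        for b in batches:
            g = grad(w, b)
            g_snap = grad(w_snap, b)
            # alpha controls the strength of the variance-reduction term:
            # alpha = 1 recovers standard SVRG, alpha = 0 recovers plain SGD.
            w = w - lr * (g - alpha * (g_snap - mu))
        return w

    def linear_decay(alpha0, step, total_steps):
        """Linear schedule decaying the coefficient from alpha0 toward 0."""
        return alpha0 * max(0.0, 1.0 - step / total_steps)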

ICRA Conference 2025 Conference Paper

Decentralized Vehicle Coordination: The Berkeley DeepDrive Drone Dataset and Consensus-Based Models

  • Fangyu Wu 0003
  • Dequan Wang
  • Minjune Hwang
  • Chenhui Hao
  • Jiawei Lu
  • Jiamu Zhang
  • Christopher Chou
  • Trevor Darrell

A significant portion of roads, particularly in densely populated developing countries, lacks explicitly defined right-of-way rules. These understructured roads pose substantial challenges for autonomous vehicle motion planning, where efficient and safe navigation relies on understanding decentralized human coordination for collision avoidance. This coordination, often termed “social driving etiquette,” remains underexplored due to limited open-source empirical data and suitable modeling frameworks. In this paper, we present a novel dataset and modeling framework designed to study motion planning in these understructured environments. The dataset includes 20 aerial videos of representative scenarios, an image dataset for training vehicle detection models, and a development kit for vehicle trajectory estimation. We demonstrate that a consensus-based modeling approach can effectively explain the emergence of priority orders observed in our dataset, and is therefore a viable framework for decentralized collision avoidance planning.

NeurIPS Conference 2025 Conference Paper

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

  • Tsung-Han (Patrick) Wu
  • Heekyung Lee
  • Jiaxin Ge
  • Joseph Gonzalez
  • Trevor Darrell
  • David Chan

Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. While effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 34% on HaloQuest.

ICRA Conference 2025 Conference Paper

In-Context Learning Enables Robot Action Prediction in LLMs

  • Yida Yin
  • Zekai Wang
  • Yuvan Sharma
  • Dantong Niu
  • Trevor Darrell
  • Roei Herzig

Recently, Large Language Models (LLMs) have achieved remarkable success using in-context learning (ICL) in the language domain. However, leveraging the ICL capabilities within LLMs to directly predict robot actions remains largely unexplored. In this paper, we introduce RoboPrompt, a framework that enables off-the-shelf text-only LLMs to directly predict robot actions through ICL without training. Our approach first heuristically identifies keyframes that capture important moments from an episode. Next, we extract end-effector actions from these keyframes as well as the estimated initial object poses, and both are converted into textual descriptions. Finally, we construct a structured template to form ICL demonstrations from these textual descriptions and a task instruction. This enables an LLM to directly predict robot actions at test time. Through extensive experiments and analysis, RoboPrompt shows stronger performance over zero-shot and ICL baselines in simulated and real-world settings. Our project page is available at https://davidyyd.github.io/roboprompt.
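
A rough sketch of the prompt-construction step the abstract describes, turning keyframe actions and estimated object poses into textual ICL demonstrations; the template wording and helper names are hypothetical, not RoboPrompt's actual format.

    def format_pose(pose):
        """Render a pose (e.g. xyz + quaternion values) as a compact text tuple."""
        return "(" + ", ".join(f"{v:.3f}" for v in pose) + ")"

    def build_icl_prompt(demonstrations, task_instruction, test_object_poses):
        """Assemble an in-context prompt from textualized demonstrations.

        Each demonstration is a dict with 'object_poses' (name -> pose) and
        'actions' (end-effector poses at keyframes).
        """
        lines = [f"Task: {task_instruction}", ""]
        for i, demo in enumerate(demonstrations, 1):
            lines.append(f"Example {i}:")
            for name, pose in demo["object_poses"].items():
                lines.append(f"  {name}: {format_pose(pose)}")
            for t, action in enumerate(demo["actions"]):
                lines.append(f"  action {t}: {format_pose(action)}")
            lines.append("")
        lines.append("Now predict the actions for:")
        for name, pose in test_object_poses.items():
            lines.append(f"  {name}: {format_pose(pose)}")
        return "\n".join(lines)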

NeurIPS Conference 2025 Conference Paper

LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery

  • Jerome Quenum
  • Wen-Han Hsieh
  • Tsung-Han (Patrick) Wu
  • Ritwik Gupta
  • Trevor Darrell
  • David Chan

Segmentation models can recognize a pre-defined set of objects in images. However, segmentation models capable of "reasoning" over complex user queries that implicitly refer to multiple objects of interest remain underexplored, especially in the geospatial domain. Recent advances in "reasoning segmentation"---generating segmentation masks from complex, implicit query text---demonstrate the potential of vision-language models (VLMs) to reason across an open domain of objects. Yet, our experiments reveal that these models struggle when applied to the unique challenges of remote-sensing imagery. To address this gap, we introduce a new dataset which consists of: GRES, a curated geospatial reasoning-segmentation dataset with 27,615 annotations across 9,205 images, and PreGRES, a collection of existing datasets to make up a large-scale multimodal pretraining corpus with over 1M question-answer pairs across 119,279 images. We propose an initial benchmark model, LISAt, a VLM for geospatial analysis that can describe complex remote-sensing scenes, answer detailed queries, and segment objects based on natural-language prompts. LISAt establishes a strong initial geospatial benchmark, outperforming prior foundation models such as RS-GPT4V by 10.04% (BLEU-4) on visual description tasks and surpassing open-domain models on geospatial reasoning segmentation by 143.36% (gIoU). Our model, dataset, and code are available on our project page: https://lisat-bair.github.io/LISAt/.

ICLR Conference 2025 Conference Paper

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

  • Junyi Zhang 0004
  • Charles Herrmann
  • Junhwa Hur
  • Varun Jampani
  • Trevor Darrell
  • Forrester Cole
  • Deqing Sun
  • Ming-Hsuan Yang 0001

Estimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision. Current approaches often rely on multi-stage pipelines or global optimizations that decompose the problem into subtasks, like depth and flow, leading to complex systems prone to errors. In this paper, we present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. Our key insight is that by simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R’s representation, previously only used for static scenes, to dynamic scenes. However, this approach presents a significant challenge: the scarcity of suitable training data, namely dynamic, posed videos with depth labels. Despite this, we show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation. Based on this, we introduce new optimizations for several downstream video-specific tasks and demonstrate strong performance on video depth and camera pose estimation, outperforming prior work in terms of robustness and efficiency. Moreover, MonST3R shows promising results for primarily feed-forward 4D reconstruction. Interactive 4D results, source code, and trained models are available at: https://monst3r-project.github.io/.

ICML Conference 2025 Conference Paper

Pre-training Auto-regressive Robotic Models with 4D Representations

  • Dantong Niu
  • Yuvan Sharma
  • Haoru Xue
  • Giscard Biamby
  • Junyi Zhang 0004
  • Ziteng Ji
  • Trevor Darrell
  • Roei Herzig

Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
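
A small NumPy sketch of the lifting step mentioned above, turning 2D point tracks plus monocular depth into 3D points over time under a pinhole camera model; the intrinsics and array shapes are assumptions for illustration, not the paper's code.

    import numpy as np

    def lift_tracks_to_3d(tracks_2d, depth_maps, fx, fy, cx, cy):
        """Lift 2D point tracks into per-frame 3D camera coordinates (sketch).

        tracks_2d: (T, N, 2) pixel coordinates of N tracked points over T frames.
        depth_maps: (T, H, W) monocular depth estimates for each frame.
        Returns (T, N, 3) points, i.e. a 3D-plus-time ("4D") track representation.
        """
        T, N, _ = tracks_2d.shape
        points_3d = np.zeros((T, N, 3))
        for t in range(T):
            u = tracks_2d[t, :, 0]
            v = tracks_2d[t, :, 1]
            z = depth_maps[t, v.astype(int), u.astype(int)]
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points_3d[t] = np.stack([x, y, z], axis=-1)
        return points_3d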

NeurIPS Conference 2025 Conference Paper

REOrdering Patches Improves Vision Models

  • Declan Kutscher
  • David Chan
  • Yutong Bai
  • Trevor Darrell
  • Ritwik Gupta

Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
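
A hedged PyTorch sketch of the policy component described above: sampling a patch permutation from a Plackett-Luce distribution via the Gumbel trick and scoring it for REINFORCE. This is a generic recipe consistent with the abstract, not REOrder's released code.

    import torch

    def sample_plackett_luce(scores):
        """Sample a permutation and its log-probability under Plackett-Luce.

        scores: (N,) learnable per-patch logits. Adding Gumbel noise and
        argsorting yields an exact Plackett-Luce sample.
        """
        u = torch.rand_like(scores).clamp_(1e-9, 1 - 1e-9)
        gumbel = -torch.log(-torch.log(u))
        perm = torch.argsort(scores + gumbel, descending=True)
        ordered = scores[perm]
        # log P(perm) = sum_i [ s_{perm_i} - logsumexp(s_{perm_i}, ..., s_{perm_N}) ]
        denom = torch.logcumsumexp(ordered.flip([0]), dim=0).flip([0])
        return perm, (ordered - denom).sum()

    def reinforce_loss(log_prob, reward, baseline=0.0):
        """REINFORCE: increase the probability of above-baseline permutations."""
        return -(reward - baseline) * log_prob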

TMLR Journal 2025 Journal Article

Rethinking Patch Dependence for Masked Autoencoders

  • Letian Fu
  • Long Lian
  • Renhao Wang
  • Baifeng Shi
  • Xudong Wang
  • Adam Yala
  • Trevor Darrell
  • Alexei A Efros

In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io/
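
A compact PyTorch sketch of the decoder idea described above: mask-token queries read out reconstructions from visible-patch encoder features through cross-attention only, with no self-attention among mask tokens. The dimensions and module layout are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CrossAttentionReadout(nn.Module):
        """Decode a subset of masked patches purely via cross-attention (sketch)."""

        def __init__(self, dim=256, num_heads=8, patch_pixels=16 * 16 * 3):
            super().__init__()
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            self.predict = nn.Linear(dim, patch_pixels)

        def forward(self, encoder_feats, masked_pos_embed):
            # encoder_feats: (B, N_visible, dim) features of visible patches.
            # masked_pos_embed: (B, N_masked, dim) positions of patches to decode.
            queries = self.mask_token + masked_pos_embed
            attended, _ = self.cross_attn(queries, encoder_feats, encoder_feats)
            return self.predict(self.norm(attended))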

ICLR Conference 2025 Conference Paper

Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)

  • Leander Girrbach
  • Stephan Alaniz
  • Yiran Huang
  • Trevor Darrell
  • Zeynep Akata

Pre-trained large language models (LLMs) have been reliably integrated with visual input for multimodal tasks. The widespread adoption of instruction-tuned image-to-text vision-language assistants (VLAs) like LLaVA and InternVL necessitates evaluating gender biases. We study gender bias in 22 popular open-source VLAs with respect to personality traits, skills, and occupations. Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. Similarly, they tend to attribute more skills and positive personality traits to women than to men, and we see a consistent tendency to associate negative personality traits with men. To eliminate the gender bias in these models, we find that fine-tuning-based debiasing methods achieve the best trade-off between debiasing and retaining performance on downstream tasks. We argue for pre-deploying gender bias assessment in VLAs and motivate further development of debiasing strategies to ensure equitable societal outcomes. Code is available at https://github.com/ExplainableML/vla-gender-bias.

ICLR Conference 2025 Conference Paper

SegLLM: Multi-round Reasoning Segmentation with Large Language Models

  • Xudong Wang 0007
  • Shaolun Zhang
  • Shufan Li
  • Kehan Li
  • Konstantinos Kallidromitis
  • Yusuke Kato
  • Kazuki Kozuka
  • Trevor Darrell

We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and segment objects in relation to previously identified entities, including positional, interactional, and hierarchical relationships, across multiple interactions. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.

ICLR Conference 2025 Conference Paper

VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

  • Lisa Dunlap
  • Krishna Mandal
  • Trevor Darrell
  • Jacob Steinhardt
  • Joseph E. Gonzalez

Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize, but struggle to quantify. These "vibes" -- such as tone, formatting, or writing style -- influence user preferences, yet traditional evaluations focus primarily on the singular vibe of correctness. We introduce $\textbf{VibeCheck}$, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model ("vibes") that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs and then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery and run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks, including summarization, math, and captioning to provide insight into differences in model behavior. VibeCheck discovers vibes like Command X prefers to add concrete intros and conclusions when summarizing in comparison to TNGL, Llama-405b often overexplains its thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash.

ICLR Conference 2025 Conference Paper

Video Action Differencing

  • James Burgess
  • Xiaohan Wang
  • Yuhui Zhang
  • Anita Rau
  • Alejandro Lozano
  • Lisa Dunlap
  • Trevor Darrell
  • Serena Yeung

How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has numerous applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing the failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark and code.

ICML Conference 2025 Conference Paper

Vision-Language Models Create Cross-Modal Task Representations

  • Grace Luo
  • Trevor Darrell
  • Amir Bar

Autoregressive vision-language models (VLMs) can handle many tasks within a single model, yet the representations that enable this capability remain opaque. We find that VLMs align conceptually equivalent inputs into a shared task vector, which is invariant to modality (text, image) and format (examples, instruction), and may simplify VLM processing. We measure this alignment via cross-modal transfer–the ability of a task vector derived in one modality to trigger the correct generation in another–on a range of tasks and model architectures. Although the task vector is highly compressed, we find that this single vector outperforms prompting the model with the full task information, unique to this cross-modal case. Furthermore, we show that task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart, and that they can be derived solely from instructions without the need for examples. Taken together, our findings shed light on how VLMs internally process task information, and how they map different modalities into common semantic representations.

ICLR Conference 2025 Conference Paper

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

  • Tsung-Han Wu
  • Giscard Biamby
  • Jerome Quenum
  • Ritwik Gupta
  • Joseph E. Gonzalez
  • Trevor Darrell
  • David M. Chan

Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU—far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to 13% performance improvement over existing open-source LMMs on VHs, sets a new state-of-the-art on the RetVQA multi-image QA benchmark, and achieves competitive performance on single-image QA with state-of-the-art LMMs. Our dataset, model, and code are available at: https://visual-haystacks.github.io.

NeurIPS Conference 2025 Conference Paper

Whole-Body Conditioned Egocentric Video Prediction

  • Yutong Bai
  • Danny Tran
  • Amir Bar
  • Yann LeCun
  • Trevor Darrell
  • Jitendra Malik

We train models to predict ego-centric video from human actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model’s embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.

TMLR Journal 2025 Journal Article

Wolf: Dense Video Captioning with a World Summarization Framework

  • Boyi Li
  • Ligeng Zhu
  • Ran Tian
  • Shuhan Tan
  • Yuxiao Chen
  • Yao Lu
  • Yin Cui
  • Sushant Veer

We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore (caption quality) by 55.6% and CapScore (caption similarity) by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment.

NeurIPS Conference 2024 Conference Paper

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

  • Irene Huang
  • Wei Lin
  • M. J. Mirza
  • Jacob A. Hansen
  • Sivan Doveh
  • Victor I. Butoi
  • Roei Herzig
  • Assaf Arbelle

Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe (an abbreviation for Confuse Me) -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce 'hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy decrease in CR performance of up to 33% compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.

NeurIPS Conference 2024 Conference Paper

Humanoid Locomotion as Next Token Prediction

  • Ilija Radosavovic
  • Bike Zhang
  • Baifeng Shi
  • Jathushan Rajasegaran
  • Sarthak Kamat
  • Trevor Darrell
  • Koushil Sreenath
  • Jitendra Malik

We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor sequences. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, such as videos without actions. We train our model on a dataset of sequences from a prior neural network policy, a model-based controller, motion capture, and YouTube videos of humans. We show that our model enables a real humanoid robot to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor sequences.

ICML Conference 2024 Conference Paper

Hyperbolic Active Learning for Semantic Segmentation under Domain Shift

  • Luca Franco
  • Paolo Mandica
  • Konstantinos Kallidromitis
  • Devin Guillory
  • Yu-Teng Li
  • Trevor Darrell
  • Fabio Galasso

We introduce a hyperbolic neural network approach to pixel-level active learning for semantic segmentation. Analysis of the data statistics leads to a novel interpretation of the hyperbolic radius as an indicator of data scarcity. In HALO (Hyperbolic Active Learning Optimization), for the first time, we propose the use of epistemic uncertainty as a data acquisition strategy, following the intuition of selecting data points that are the least known. The hyperbolic radius, complemented by the widely-adopted prediction entropy, effectively approximates epistemic uncertainty. We perform extensive experimental analysis based on two established synthetic-to-real benchmarks, i.e., GTAV $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes. Additionally, we test HALO on Cityscapes $\rightarrow$ ACDC for domain adaptation under adverse weather conditions, and we benchmark both convolutional and attention-based backbones. HALO sets a new state-of-the-art in active learning for semantic segmentation under domain shift and it is the first active learning approach that surpasses the performance of supervised domain adaptation while using only a small portion of labels (i.e., 1%).

TMLR Journal 2024 Journal Article

IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

  • Jiarui Xu
  • Yossi Gandelsman
  • Amir Bar
  • Jianwei Yang
  • Jianfeng Gao
  • Trevor Darrell
  • Xiaolong Wang

In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv - a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. “Left: input image, Right: foreground segmentation”), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over $+10\%$ AP for Foreground Segmentation, over $+5\%$ gains in AP for Single Object Detection, and almost $20\%$ lower LPIPS in Colorization. Our empirical results suggest that vision and language prompts are complementary and it is advantageous to use both to achieve better in-context learning performance.

ICLR Conference 2024 Conference Paper

Initializing Models with Larger Ones

  • Zhiqiu Xu
  • Yanjie Chen
  • Kirill Vishniakov
  • Yida Yin
  • Zhiqiang Shen
  • Trevor Darrell
  • Lingjie Liu
  • Zhuang Liu 0003

Weight initialization plays an important role in neural network training. Widely used initialization methods are proposed and evaluated for networks that are trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem of weight initialization. In this work, we introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model. This enables the transfer of knowledge from pretrained weights to smaller models. Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time. Notably, it can also be used together with knowledge distillation. Weight selection offers a new approach to leverage the power of pretrained models in resource-constrained settings, and we hope it can be a useful tool for training small models in the large-model era.
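
One simple way to realize the weight-selection idea in PyTorch, assuming matching parameter names between the two models; the leading-slice selection rule below is an illustrative choice, not necessarily the paper's exact criterion.

    import torch

    def select_weights(large_state_dict, small_model):
        """Initialize a small model from a larger pretrained one (sketch)."""
        small_state = small_model.state_dict()
        for name, small_param in small_state.items():
            if name not in large_state_dict:
                continue
            large_param = large_state_dict[name]
            if large_param.dim() != small_param.dim():
                continue
            # Copy the leading slice of the large tensor along every dimension.
            slices = tuple(slice(0, s) for s in small_param.shape)
            small_state[name] = large_param[slices].clone()
        small_model.load_state_dict(small_state)
        return small_model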

TMLR Journal 2024 Journal Article

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

  • Long Lian
  • Boyi Li
  • Adam Yala
  • Trevor Darrell

Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for grounded generation in a novel two-stage process. In the first stage, the LLM generates a scene layout that comprises captioned bounding boxes from a given prompt describing the desired image. In the second stage, a novel controller guides an off-the-shelf diffusion model for layout-grounded image generation. Both stages utilize existing pretrained models without additional model parameter optimization. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images according to prompts that require various capabilities, doubling the generation accuracy across four tasks on average. Furthermore, our method enables instruction-based multi-round scene specification and can handle prompts in languages not supported by the underlying diffusion model. We anticipate that our method will unleash users' creativity by accurately following more complex prompts. Our code, demo, and benchmark are available at: https://llm-grounded-diffusion.github.io

ICLR Conference 2024 Conference Paper

LLM-grounded Video Diffusion Models

  • Long Lian
  • Baifeng Shi
  • Adam Yala
  • Trevor Darrell
  • Boyi Li 0001

Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.

ICLR Conference 2024 Conference Paper

Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

  • Sheng Shen 0001
  • Le Hou
  • Yanqi Zhou
  • Nan Du 0002
  • Shayne Longpre
  • Jason Wei
  • Hyung Won Chung
  • Barret Zoph

Sparse Mixture-of-Experts (MoE) is a neural architecture design that adds learnable parameters to Large Language Models (LLMs) without increasing computational complexity (FLOPs). Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (in the second and third scenarios), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MoE-32B, surpasses the performance of Flan-PaLM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MoE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.

NeurIPS Conference 2024 Conference Paper

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

  • Brandon Huang
  • Chancharik Mitra
  • Assaf Arbelle
  • Leonid Karlinsky
  • Trevor Darrell
  • Roei Herzig

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)---compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference. Code: https://github.com/Brandon3964/MultiModal-Task-Vector

ICRA Conference 2024 Conference Paper

Open X-Embodiment: Robotic Learning Datasets and RT-X Models: Open X-Embodiment Collaboration

  • Abby O'Neill
  • Abdul Rehman
  • Abhiram Maddukuri
  • Abhishek Gupta 0004
  • Abhishek Padalkar
  • Abraham Lee
  • Acorn Pooley
  • Agrim Gupta

Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is robotics-transformer-x.github.io.

ICML Conference 2024 Conference Paper

Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI

  • Francisco Eiras
  • Aleksandar Petrov
  • Bertie Vidgen
  • Christian Schröder de Witt
  • Fabio Pizzati
  • Katherine Elkins
  • Supratik Mukhopadhyay
  • Adel Bibi

In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. While regulation is important, it is key that it does not put at risk the budding field of open-source Generative AI. We argue for the responsible open sourcing of generative AI models in the near and medium term. To set the stage, we first introduce an AI openness taxonomy system and apply it to 40 current large language models. We then outline differential benefits and risks of open versus closed source AI and present potential risk mitigation, ranging from best practices to calls for technical and scientific contributions. We hope that this report will add a much needed missing voice to the current public discourse on near to mid-term AI safety and other societal impact.

NeurIPS Conference 2024 Conference Paper

Segment Anything without Supervision

  • Xudong Wang
  • Jingfeng Yang
  • Trevor Darrell

The Segment Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to “discover” the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves competitive results with the supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo masks into SA-1B’s ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM’s AR by over 6.7% and AP by 3.9% on SA-1B.

ICML Conference 2024 Conference Paper

Stochastic positional embeddings improve masked image modeling

  • Amir Bar
  • Florian Bordes
  • Assaf Shocher
  • Mahmoud Assran
  • Pascal Vincent
  • Nicolas Ballas
  • Trevor Darrell
  • Amir Globerson

Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. We show that using StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, using StoP improves downstream MIM performance on a variety of tasks. For example, linear probing on ImageNet using ViT-B is improved by +1.7%, and by 2.5% for ViT-H using 1% of the data.
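
A minimal PyTorch sketch of what stochastic positional embeddings could look like: masked-token positions receive Gaussian noise while visible tokens keep their exact positions. The noise scale and parameterization are assumptions, not the paper's exact formulation.

    import torch

    def stochastic_positions(pos_embed, mask, sigma=1.0):
        """Make masked-token positional embeddings stochastic (sketch).

        pos_embed: (B, N, D) deterministic positional embeddings.
        mask: (B, N) boolean tensor, True where the patch is masked.
        """
        noise = sigma * torch.randn_like(pos_embed)
        # Only masked tokens become stochastic; visible tokens stay exact.
        return torch.where(mask.unsqueeze(-1), pos_embed + noise, pos_embed)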

ICLR Conference 2024 Conference Paper

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

  • Sam Toyer
  • Olivia Watkins
  • Ethan Adrian Mendes
  • Justin Svegliato
  • Luke Bailey
  • Tiffany Wang
  • Isaac Ong
  • Karim Elmaaroufi

While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to *prompt injection attacks*: malicious third party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 563,000 prompt injection attacks and 118,000 prompt-based "defenses" against prompt injection, all created by players of an online game called Tensor Trust. To the best of our knowledge, this is the first dataset that includes both human-generated attacks and defenses for instruction-following LLMs. The attacks in our dataset have easily interpretable structure, and shed light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as *prompt extraction* and *prompt hijacking*. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. Furthermore, we show that some attack strategies from the dataset generalize to deployed LLM-based applications, even though they have a very different set of constraints to the game. We release data and code at [tensortrust.ai/paper](https://tensortrust.ai/paper)

NeurIPS Conference 2024 Conference Paper

When does perceptual alignment benefit vision representations?

  • Shobhita Sundaram
  • Stephanie Fu
  • Lukas Muttenthaler
  • Netanel Tamir
  • Lucy Chai
  • Simon Kornblith
  • Trevor Darrell
  • Phillip Isola

Humans judge perceptual similarity according to diverse visual attributes, including scene layout, subject location, and camera pose. Existing vision models understand a wide range of semantic abstractions but improperly weigh these attributes and thus make inferences misaligned with human perception. While vision representations have previously benefited from human preference alignment in contexts like image generation, the utility of perceptually aligned representations in more general-purpose settings remains unclear. Here, we investigate how aligning vision model representations to human perceptual judgments impacts their usability in standard computer vision tasks. We finetune state-of-the-art models on a dataset of human similarity judgments for synthetic image triplets and evaluate them across diverse computer vision tasks. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks, including counting, semantic segmentation, depth estimation, instance retrieval, and retrieval-augmented generation. In addition, we find that performance is widely preserved on other tasks, including specialized out-of-distribution domains such as in medical imaging and 3D environment frames. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can make them better representation learners.
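
A short PyTorch sketch of triplet-based perceptual alignment consistent with the setup above; the hinge-on-cosine objective and margin are illustrative choices, not necessarily the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def perceptual_alignment_loss(backbone, ref, img_a, img_b, label, margin=0.05):
        """Finetune a vision backbone on human similarity triplets (sketch).

        ref, img_a, img_b: image batches; label is 1 where humans judged
        img_a more similar to ref than img_b, else 0.
        """
        f_ref, f_a, f_b = backbone(ref), backbone(img_a), backbone(img_b)
        sim_a = F.cosine_similarity(f_ref, f_a)
        sim_b = F.cosine_similarity(f_ref, f_b)
        # Hinge: the human-preferred image should be closer by at least the margin.
        violation = torch.where(label.bool(), sim_b - sim_a, sim_a - sim_b)
        return F.relu(violation + margin).mean()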

ICML Conference 2024 Conference Paper

xT: Nested Tokenization for Larger Context in Large Images

  • Ritwik Gupta
  • Shufan Li
  • Tyler Zhu
  • Jitendra Malik
  • Trevor Darrell
  • Karttikeya Mangalam

Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. There are many downstream applications in which global context matters as much as high frequency details, such as in real-world satellite imagery; in such cases researchers have to make the uncomfortable choice of which information to discard. We introduce xT, a simple framework for vision transformers which effectively aggregates global context with local details and can model large images end-to-end on contemporary GPUs. We select a set of benchmark datasets across classic vision tasks which accurately reflect a vision model’s ability to understand truly large images and incorporate fine details over large scales and assess our method’s improvement on them. xT is a streaming, two-stage architecture that adapts existing vision backbones and long sequence language models to effectively model large images without quadratic memory growth. We are able to increase accuracy by up to 8.6% on challenging classification tasks and F1 score by 11.6 on context-dependent segmentation on images as large as 29,000 x 29,000 pixels.

NeurIPS Conference 2023 Conference Paper

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence

  • Grace Luo
  • Lisa Dunlap
  • Dong Huk Park
  • Aleksander Holynski
  • Trevor Darrell

Diffusion models have been shown to be capable of generating high-quality images, suggesting that they could contain meaningful internal representations. Unfortunately, the feature maps that encode a diffusion model's internal information are spread not only over layers of the network, but also over diffusion timesteps, making it challenging to extract useful descriptors. We propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and multi-timestep feature maps into per-pixel feature descriptors that can be used for downstream tasks. These descriptors can be extracted for both synthetic and real images using the generation and inversion processes. We evaluate the utility of our Diffusion Hyperfeatures on the task of semantic keypoint correspondence: our method achieves superior performance on the SPair-71k real image benchmark. We also demonstrate that our method is flexible and transferable: our feature aggregation network trained on the inversion features of real image pairs can be used on the generation features of synthetic image pairs with unseen objects and compositions. Our code is available at https://diffusion-hyperfeatures.github.io.

NeurIPS Conference 2023 Conference Paper

Diversify Your Vision Datasets with Automatic Diffusion-based Augmentation

  • Lisa Dunlap
  • Alyssa Umino
  • Han Zhang
  • Jiezhi Yang
  • Joseph E. Gonzalez
  • Trevor Darrell

Many fine-grained classification tasks, like rare animal identification, have limited training data and consequently classifiers trained on these datasets often fail to generalize to variations in the domain like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. We show that ALIA is able to surpass traditional data augmentation and text-to-image generated data on fine-grained classification tasks, including cases of domain generalization and contextual bias. Code is available at https://github.com/lisadunlap/ALIA.

ICML Conference 2023 Conference Paper

Dropout Reduces Underfitting

  • Zhuang Liu 0003
  • Zhiqiu Xu
  • Joseph Jin
  • Zhiqiang Shen
  • Trevor Darrell

Introduced by Hinton et al. in 2012, dropout has stood the test of time as a regularizer for preventing overfitting in neural networks. In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training. During the early phase, we find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset’s gradient. This helps counteract the stochasticity of SGD and limit the influence of individual batches on model training. Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards. Models equipped with early dropout achieve lower final training loss compared to their counterparts without dropout. Additionally, we explore a symmetric technique for regularizing overfitting models - late dropout, where dropout is not used in the early iterations and is only activated later in training. Experiments on ImageNet and various vision tasks demonstrate that our methods consistently improve generalization accuracy. Our results encourage more research on understanding regularization in deep learning and our methods can be useful tools for future neural network training, especially in the era of large data. Code is available at https://github.com/facebookresearch/dropout.
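
A tiny PyTorch helper showing how early or late dropout could be scheduled inside a training loop; this is a sketch of the idea, not the released facebookresearch/dropout code.

    import torch.nn as nn

    def set_dropout(model, p):
        """Set the drop probability of every nn.Dropout module in the model."""
        for module in model.modules():
            if isinstance(module, nn.Dropout):
                module.p = p

    def dropout_schedule(model, step, cutoff, p=0.1, mode="early"):
        """mode="early": dropout on for step < cutoff, then off.
        mode="late": dropout off for step < cutoff, then on."""
        if mode == "early":
            set_dropout(model, p if step < cutoff else 0.0)
        else:
            set_dropout(model, 0.0 if step < cutoff else p)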

ICML Conference 2023 Conference Paper

Guiding Pretraining in Reinforcement Learning with Large Language Models

  • Yuqing Du
  • Olivia Watkins
  • Zihan Wang
  • Cédric Colas
  • Trevor Darrell
  • Pieter Abbeel
  • Abhishek Gupta 0004
  • Jacob Andreas

Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped reward function. Intrinsically motivated exploration methods address this limitation by rewarding agents for visiting novel states or transitions, but these methods offer limited benefits in large environments where most discovered novelty is irrelevant for downstream tasks. We describe a method that uses background knowledge from text corpora to shape exploration. This method, called ELLM (Exploring with LLMs) rewards an agent for achieving goals suggested by a language model prompted with a description of the agent’s current state. By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop. We evaluate ELLM in the Crafter game environment and the Housekeep robotic simulator, showing that ELLM-trained agents have better coverage of common-sense behaviors during pretraining and usually match or improve performance on a range of downstream tasks.
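
A minimal sketch of an ELLM-style reward: score the agent's last transition by its similarity to goals suggested by a language model. The embed function, threshold, and captioning of the transition are assumptions for illustration.

    import numpy as np

    def ellm_reward(transition_caption, suggested_goals, embed, threshold=0.5):
        """Reward achieving goals suggested by an LLM (sketch).

        transition_caption: text describing what the agent just did.
        suggested_goals: goal strings from an LLM prompted with the agent's state.
        embed: any sentence-embedding function mapping text to a 1-D numpy array.
        """
        c = embed(transition_caption)
        best = 0.0
        for goal in suggested_goals:
            g = embed(goal)
            sim = float(np.dot(c, g) / (np.linalg.norm(c) * np.linalg.norm(g) + 1e-8))
            best = max(best, sim)
        # Only reward sufficiently close matches to a suggested goal.
        return best if best >= threshold else 0.0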

NeurIPS Conference 2023 Conference Paper

Hierarchical Open-vocabulary Universal Image Segmentation

  • Xudong Wang
  • Shufan Li
  • Konstantinos Kallidromitis
  • Yusuke Kato
  • Kazuki Kozuka
  • Trevor Darrell

Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. However, complex visual scenes can be naturally decomposed into simpler parts and abstracted at multiple levels of granularity, introducing inherent segmentation ambiguity. Unlike existing methods that typically sidestep this ambiguity and treat it as an external factor, our approach actively incorporates a hierarchical representation encompassing different semantic-levels into the learning process. We propose a decoupled text-image fusion mechanism and representation learning modules for both “things” and “stuff”. Additionally, we systematically examine the differences that exist in the textual and visual features between these types of categories. Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework. Benchmarked on diverse datasets, e.g., ADE20K, COCO, Pascal-VOC Part, and RefCOCO/RefCOCOg, HIPIE achieves state-of-the-art results at various levels of image comprehension, including semantic-level (e.g., semantic segmentation), instance-level (e.g., panoptic/referring segmentation and object detection), as well as part-level (e.g., part/subpart segmentation) tasks.

NeurIPS Conference 2023 Conference Paper

Large Language Models are Visual Reasoning Coordinators

  • Liangyu Chen
  • Bo Li
  • Sheng Shen
  • Jingkang Yang
  • Chunyuan Li
  • Kurt Keutzer
  • Trevor Darrell
  • Ziwei Liu

Visual reasoning requires multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods like ensemble still struggle to aggregate these models with the desired higher-order communications. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a large language model (LLM) can efficiently coordinate multiple VLMs by facilitating natural language communication that leverages their distinct and complementary capabilities. Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities.

ICLR Conference 2023 Conference Paper

Using Language to Extend to Unseen Domains

  • Lisa Dunlap
  • Clara Mohri
  • Devin Guillory
  • Han Zhang
  • Trevor Darrell
  • Joseph E. Gonzalez
  • Aditi Raghunathan
  • Anna Rohrbach

It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply $\textit{verbalizing}$ the training domain (e.g.``photos of birds'') as well as domains we want to extend to but do not have data for (e.g.``paintings of birds'') can improve robustness. Using a multimodal model with a joint image and language embedding space, our method $\textit{LADS}$ learns a transformation of the image embeddings from the source domain to each target domain, while preserving task relevant information. Without using any images from the target domain, we show that over the $\textit{extended}$ domain containing both source and target, $\textit{LADS}$ outperforms standard fine-tuning and ensemble approaches over a suite of 4 benchmarks targeting domain adaptation and dataset bias.

ICLR Conference 2022 Conference Paper

Anytime Dense Prediction with Confidence Adaptivity

  • Zhuang Liu 0003
  • Zhiqiu Xu
  • Hung-Ju Wang
  • Trevor Darrell
  • Evan Shelhamer

Anytime inference requires a model to make a progression of predictions which might be halted at any time. Prior research on anytime visual recognition has mostly focused on image classification. We propose the first unified and end-to-end approach for anytime dense prediction. A cascade of "exits" is attached to the model to make multiple predictions. We redesign the exits to account for the depth and spatial resolution of the features for each exit. To reduce total computation, and make full use of prior predictions, we develop a novel spatially adaptive approach to avoid further computation on regions where early predictions are already sufficiently confident. Our full method, named anytime dense prediction with confidence (ADP-C), achieves the same level of final accuracy, and meanwhile significantly reduces total computation. We evaluate our method on Cityscapes semantic segmentation and MPII human pose estimation: ADP-C enables anytime inference without sacrificing accuracy while also reducing the total FLOPs of its base models by 44.4% and 59.1%. We compare with anytime inference by deep equilibrium networks and feature-based stochastic sampling, showing that ADP-C dominates both across the accuracy-computation curve. Our code is available at https://github.com/liuzhuang13/anytime.
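
A rough sketch of the confidence-adaptivity idea in this abstract, under assumed names: predictions from an earlier exit are kept wherever they are already confident, and later exits only refine the remaining pixels. The full method additionally skips computation on the masked regions; this sketch shows only the merging step.

```python
# Minimal sketch (assumed names, not the authors' implementation) of confidence
# adaptivity: pixels whose early-exit prediction is already confident are frozen,
# and later exits only refine the rest.
import torch

def merge_confident(prev_logits, new_logits, threshold=0.998):
    """Keep earlier predictions where confidence exceeds `threshold`; refine the rest."""
    prev_conf = prev_logits.softmax(dim=1).max(dim=1, keepdim=True).values  # (B,1,H,W)
    keep_mask = prev_conf >= threshold
    return torch.where(keep_mask, prev_logits, new_logits), keep_mask

# Toy usage with random "exit" outputs for a 19-class segmentation map.
exit1 = torch.randn(1, 19, 64, 128)
exit2 = torch.randn(1, 19, 64, 128)
merged, mask = merge_confident(exit1, exit2, threshold=0.5)
print(f"{mask.float().mean().item():.1%} of pixels reused from the earlier exit")
```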

NeurIPS Conference 2022 Conference Paper

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

  • Elad Ben Avraham
  • Roei Herzig
  • Karttikeya Mangalam
  • Amir Bar
  • Anna Rohrbach
  • Leonid Karlinsky
  • Trevor Darrell
  • Amir Globerson

Recent action recognition models have achieved impressive results by integrating objects, their locations and interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how could we leverage these for a video downstream task? We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights. First, as both images and videos contain structured information, we enrich a transformer model with a set of object tokens that can be used across images and videos. Second, the scene representations of individual frames in video should ``align'' with those of still images. This is achieved via a Frame-Clip Consistency loss, which ensures the flow of structured information between images and videos. We explore a particular instantiation of scene structure, namely a Hand-Object Graph, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges. SViT shows strong performance improvements on multiple video understanding tasks and datasets, including the first place in the Ego4D CVPR'22 Point of No Return Temporal Localization Challenge. For code and pretrained models, visit the project page at https://eladb3.github.io/SViT/.

ICLR Conference 2022 Conference Paper

Differentiable Gradient Sampling for Learning Implicit 3D Scene Reconstructions from a Single Image

  • Shizhan Zhu
  • Sayna Ebrahimi
  • Angjoo Kanazawa
  • Trevor Darrell

Implicit shape models are promising 3D representations for modeling arbitrary locations, with Signed Distance Functions (SDFs) particularly suitable for clear mesh surface reconstruction. Existing approaches for single object reconstruction impose supervision signals based on the loss of the signed distance value from all locations in a scene, posing difficulties when extending to real-world scenarios. The spatial gradient of the signed distance field, rather than the SDF value itself, has not been typically employed as a source of supervision for single-view reconstruction, in part due to the difficulty of differentiably sampling a spatial gradient from the feature map. In this study, we derive a novel closed-form gradient sampling solution for Differentiable Gradient Sampling (DGS) that enables backpropagation of the loss on the spatial gradient back to the feature map pixels, thus allowing the loss to be imposed efficiently on the spatial gradient. As a result, we achieve high-quality single view indoor scene reconstruction results learning directly from a real-world scanned dataset (e.g. ScannetV2). Our model also performs well when generalizing to unseen images downloaded directly from the internet (Fig. 1). We comfortably advanced the state-of-the-art results on several established datasets including ShapeNet and ScannetV2; extensive quantitative analysis confirmed that our proposed DGS module plays an essential role in achieving this performance improvement. Full codes are available in MaskedURL.

NeurIPS Conference 2022 Conference Paper

K-LITE: Learning Transferable Visual Models with External Knowledge

  • Sheng Shen
  • Chunyuan Li
  • Xiaowei Hu
  • Yujia Xie
  • Jianwei Yang
  • Pengchuan Zhang
  • Zhe Gan
  • Lijuan Wang

The new generation of state-of-the-art computer vision systems is trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, based on the broad concept coverage achieved through a large-scale data collection process. Alternatively, we argue that learning with external knowledge about images is a promising way which leverages a much more structured source of supervision and offers sample efficiency. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is released at https://github.com/microsoft/klite.

IROS Conference 2022 Conference Paper

Towards Learning to Play Piano with Dexterous Hands and Touch

  • Huazhe Xu
  • Yuping Luo
  • Shaoxiong Wang
  • Trevor Darrell
  • Roberto Calandra

As Liszt once said “(a virtuoso) must call up scent and blossom, and breathe the breath of life”, a virtuoso plays the piano with passion, poetry, and extraordinary technical ability. Hence, piano playing, being a task that is quintessentially human, becomes a hallmark for roboticists and artificial intelligence researchers to pursue. In this paper, we advocate an end-to-end reinforcement learning (RL) paradigm to demonstrate how an agent can learn directly from a machine-readable music score to play the piano with touch-augmented dexterous hands on a simulated piano. To achieve the desired tasks, we design useful touch- and audio-based reward functions and a series of tasks. Empirical results show that the RL agent can not only find the correct key position but also deal with various rhythmic, volume, and fingering requirements. As a result, the agent demonstrates its effectiveness in playing simple pieces with different musical requirements, which shows the potential of leveraging reinforcement learning for piano-playing tasks.

AAAI Conference 2022 Conference Paper

Un-mix: Rethinking Image Mixtures for Unsupervised Visual Representation Learning

  • Zhiqiang Shen
  • Zechun Liu
  • Zhuang Liu
  • Marios Savvides
  • Trevor Darrell
  • Eric Xing

The recently advanced unsupervised learning approaches use the siamese-like framework to compare two "views" from the same image for learning representations. Making the two views distinctive is core to guaranteeing that unsupervised methods can learn meaningful information. However, such frameworks can be prone to overfitting if the augmentations used to generate the two views are not strong enough, leading to overconfidence on the training data. This drawback hinders the model from learning subtle variance and fine-grained information. To address this, in this work we introduce a soft distance concept on the label space for contrastive-based unsupervised learning: by mixing the input data space, the model becomes aware of the soft degree of similarity between positive or negative pairs, so that the input and loss spaces work collaboratively. Despite its conceptual simplicity, we show empirically that with the solution -- Unsupervised image mixtures (Un-Mix), we can learn subtler, more robust and generalized representations from the transformed input and corresponding new label space. Extensive experiments are conducted on CIFAR-10, CIFAR-100, STL-10, Tiny ImageNet and standard ImageNet-1K with popular unsupervised methods SimCLR, BYOL, MoCo V1&V2, SwAV, etc. Our proposed image mixture and label assignment strategy obtains a consistent improvement of 1~3% following exactly the same hyperparameters and training procedures of the base methods. Code is publicly available at https://github.com/szq0214/Un-Mix.
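
An illustrative sketch, assuming a generic siamese contrastive setup, of the image-mixture idea in this abstract: one view is mixed with the reversed batch and the contrastive loss is softly weighted by the mixing coefficient. `contrastive_loss_fn` is a placeholder for whichever pairwise objective the base method (SimCLR, MoCo, etc.) uses; this is not the official Un-Mix code.

```python
# Illustrative sketch of in-batch image mixing with soft loss weighting; names and
# the global-mixture variant are assumptions, not the official implementation.
import torch

def unmix_batch(view2: torch.Tensor, alpha: float = 1.0):
    """Mix each image in `view2` with the reversed batch; return the mixed view and lambda."""
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    mixed = lam * view2 + (1.0 - lam) * view2.flip(0)  # global mixture across the batch
    return mixed, lam

def unmix_loss(contrastive_loss_fn, q, k_mixed, lam):
    """Soft targets: the mixed view is `lam` of the original image and (1 - lam) of its
    reversed-batch partner, so the contrastive loss is weighted accordingly."""
    return (lam * contrastive_loss_fn(q, k_mixed)
            + (1.0 - lam) * contrastive_loss_fn(q.flip(0), k_mixed))
```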

ICML Conference 2022 Conference Paper

Visual Attention Emerges from Recurrent Sparse Reconstruction

  • Baifeng Shi
  • Yale Song
  • Neel Joshi
  • Trevor Darrell
  • Xin Wang 0066

Visual attention helps achieve robust perception under noise, corruption, and distribution shifts in human vision, which are areas where modern neural networks still fall short. We present VARS, Visual Attention from Recurrent Sparse reconstruction, a new attention formulation built on two prominent features of the human visual attention mechanism: recurrency and sparsity. Related features are grouped together via recurrent connections between neurons, with salient objects emerging via sparse regularization. VARS adopts an attractor network with recurrent connections that converges toward a stable pattern over time. Network layers are represented as ordinary differential equations (ODEs), formulating attention as a recurrent attractor network that equivalently optimizes the sparse reconstruction of input using a dictionary of “templates” encoding underlying patterns of data. We show that self-attention is a special case of VARS with a single-step optimization and no sparsity constraint. VARS can be readily used as a replacement for self-attention in popular vision transformers, consistently improving their robustness across various benchmarks.

NeurIPS Conference 2022 Conference Paper

Visual Prompting via Image Inpainting

  • Amir Bar
  • Yossi Gandelsman
  • Trevor Darrell
  • Amir Globerson
  • Alexei Efros

How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting -- literally just filling in a hole in a concatenated visual prompt image -- turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked auto-encoders on a new dataset that we curated -- 88k unlabeled figures sourced from academic papers on Arxiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc. Project page: https://yossigandelsman.github.io/visual_prompt

ICML Conference 2022 Conference Paper

Zero-Shot Reward Specification via Grounded Natural Language

  • Parsa Mahmoudieh
  • Deepak Pathak
  • Trevor Darrell

Reward signals in reinforcement learning are expensive to design and often require access to the true state which is not available in the real world. Common alternatives are usually demonstrations or goal images which can be labor-intensive to collect. On the other hand, text descriptions provide a general, natural, and low-effort way of communicating the desired task. However, prior works in learning text-conditioned policies still rely on rewards that are defined using either true state or labeled expert demonstrations. We use recent developments in building large-scale visuolanguage models like CLIP to devise a framework that generates the task reward signal just from goal text description and raw pixel observations which is then used to learn the task policy. We evaluate the proposed framework on control and robotic manipulation tasks. Finally, we distill the individual task policies into a single goal text conditioned policy that can generalize in a zero-shot manner to new tasks with unseen objects and unseen goal text descriptions.
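
A hedged sketch of the kind of reward this abstract describes: score raw pixel observations by their CLIP similarity to the goal text. The model variant and the use of raw cosine similarity as the reward are assumptions made for illustration, not the paper's exact recipe.

```python
# Hedged sketch of a CLIP-style reward: score each observation frame by its similarity
# to the goal text. Model choice and scaling below are illustrative assumptions.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def text_goal_reward(frame: Image.Image, goal_text: str) -> float:
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([goal_text]).to(device)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).item())  # cosine similarity used as the reward

# reward = text_goal_reward(frame, "the robot arm pushes the red block to the left")
```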

ICRA Conference 2021 Conference Paper

Auto-Tuned Sim-to-Real Transfer

  • Yuqing Du
  • Olivia Watkins
  • Trevor Darrell
  • Pieter Abbeel
  • Deepak Pathak

Policies trained in simulation often fail when transferred to the real world due to the ‘reality gap’ where the simulator is unable to accurately capture the dynamics and visual properties of the real world. Current approaches to tackle this problem, such as domain randomization, require prior knowledge and engineering to determine how much to randomize system parameters in order to learn a policy that is robust to sim-to-real transfer while also not being too conservative. We propose a method for automatically tuning simulator system parameters to match the real world using only raw RGB images of the real world without the need to define rewards or estimate state. Our key insight is to reframe the auto-tuning of parameters as a search problem where we iteratively shift the simulation system parameters to approach the real world system parameters. We propose a Search Param Model (SPM) that, given a sequence of observations and actions and a set of system parameters, predicts whether the given parameters are higher or lower than the true parameters used to generate the observations. We evaluate our method on multiple robotic control tasks in both sim-to-sim and sim-to-real transfer, demonstrating significant improvement over naive domain randomization. Project videos at https://yuqingd.github.io/autotuned-sim2real/.
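
A schematic sketch, with assumed names and dimensions, of a Search Param Model as described above: given an encoding of an observation-action sequence and candidate simulator parameters, it predicts per parameter whether the candidate is higher or lower than the true value, and the tuner nudges each parameter in the opposite direction.

```python
# Assumed-name sketch of a Search Param Model (SPM). The paper encodes raw image/action
# sequences; here a pre-computed trajectory feature stands in for that encoder.
import torch
import torch.nn as nn

class SearchParamModel(nn.Module):
    def __init__(self, traj_dim: int, n_params: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + n_params, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_params),  # one logit per parameter: P(candidate > true)
        )

    def forward(self, traj_feat, candidate_params):
        return self.net(torch.cat([traj_feat, candidate_params], dim=-1))

def update_params(spm, traj_feat, params, step=0.05):
    """Iterative tuning step (illustrative): move each parameter opposite the prediction."""
    with torch.no_grad():
        p_higher = torch.sigmoid(spm(traj_feat, params))
    return params - step * torch.sign(p_higher - 0.5)
```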

NeurIPS Conference 2021 Conference Paper

Benchmark for Compositional Text-to-Image Synthesis

  • Dong Huk Park
  • Samaneh Azadi
  • Xihui Liu
  • Trevor Darrell
  • Anna Rohrbach

Rapid progress in text-to-image generation has often been measured by Frechet Inception Distance (FID) to capture how realistic the generated images are, or by R-Precision to assess if they are well conditioned on the given textual descriptions. However, a systematic study on how well the text-to-image synthesis models generalize to novel word compositions is missing. In this work, we focus on assessing how true the generated images are to the input texts in this particularly challenging scenario of novel compositions. We present the first systematic study of text-to-image generation on zero-shot compositional splits targeting two scenarios, unseen object-color (e.g. "blue petal") and object-shape (e.g. "long beak") phrases. We create new benchmarks building on the existing CUB and Oxford Flowers datasets. We also propose a new metric, based on a powerful vision-and-language CLIP model, which we leverage to compute R-Precision. This is in contrast to the common approach where the same retrieval model is used during training and evaluation, potentially leading to biased behavior. We experiment with several recent text-to-image generation methods. Our automatic and human evaluation confirm that there is indeed a gap in performance when encountering previously unseen phrases. We show that the image correctness rather than purely perceptual quality is especially impacted. Finally, our CLIP-R-Precision metric demonstrates better correlation with human judgments than the commonly used metric.

NeurIPS Conference 2021 Conference Paper

CLIP-It! Language-Guided Video Summarization

  • Medhini Narasimhan
  • Anna Rohrbach
  • Trevor Darrell

A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited available language models, which can serve as an effective prior for saliency. This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization, typically approached separately in the literature. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another and their correlation with a user-defined query (for query-focused summarization) or an automatically generated dense video caption (for generic video summarization). Our model can be extended to the unsupervised setting by training without ground-truth supervision. We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities.

ICML Conference 2021 Conference Paper

Compositional Video Synthesis with Action Graphs

  • Amir Bar
  • Roei Herzig
  • Xiaolong Wang 0004
  • Anna Rohrbach
  • Gal Chechik
  • Trevor Darrell
  • Amir Globerson

Videos of actions are complex signals containing rich compositional structure in space and time. Current video generation methods lack the ability to condition the generation on multiple coordinated and potentially simultaneous timed actions. To address this challenge, we propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task. Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation. We train and evaluate AG2Vid on CATER and Something-Something V2 datasets, which results in videos that have better visual quality and semantic consistency compared to baselines. Finally, our model demonstrates zero-shot abilities by synthesizing novel compositions of the learned actions.

ICLR Conference 2021 Conference Paper

Discovering Non-monotonic Autoregressive Orderings with Variational Inference

  • Xuanlin Li
  • Brandon Trabucco
  • Dong Huk Park
  • Michael Luo
  • Sheng Shen 0001
  • Trevor Darrell
  • Yang Gao 0029

The predominant approach for language modeling is to encode a sequence of tokens from left to right, but this eliminates a source of information: the order by which the sequence was naturally generated. One strategy to recover this information is to decode both the content and ordering of tokens. Some prior work supervises content and ordering with hand-designed loss functions to encourage specific orders or bootstraps from a predefined ordering. These approaches require domain-specific insight. Other prior work searches over valid insertion operations that lead to ground truth sequences during training, which has high time complexity and cannot be efficiently parallelized. We address these limitations with an unsupervised learner that can be trained in a fully-parallelizable manner to discover high-quality autoregressive orders in a data driven way without a domain-specific prior. The learner is a neural network that performs variational inference with the autoregressive ordering as a latent variable. Since the corresponding variational lower bound is not differentiable, we develop a practical algorithm for end-to-end optimization using policy gradients. Strong empirical results with our solution on sequence modeling tasks suggest that our algorithm is capable of discovering various autoregressive orders for different sequences that are competitive with or even better than fixed orders.

NeurIPS Conference 2021 Conference Paper

Early Convolutions Help Transformers See Better

  • Tete Xiao
  • Mannat Singh
  • Eric Mintun
  • Trevor Darrell
  • Piotr Dollar
  • Ross Girshick

Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ∼1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models in this regime as a more robust architectural choice compared to the original ViT model design.
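
For concreteness, a sketch of the kind of convolutional stem discussed here: a stack of stride-two 3×3 convolutions whose overall downsampling matches the usual 16×16 patchify stem, followed by a 1×1 projection to the token width. Channel widths below are illustrative, not the paper's exact configuration.

```python
# Sketch of a small convolutional stem for a ViT: four stride-2 3x3 convolutions
# (total downsampling 16x, matching a 16x16 patchify stem) plus a 1x1 projection.
import torch
import torch.nn as nn

def conv_stem(embed_dim: int = 384) -> nn.Sequential:
    chans = [3, 48, 96, 192, 384]  # illustrative widths
    layers = []
    for c_in, c_out in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1, bias=False),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(chans[-1], embed_dim, kernel_size=1))  # to token dimension
    return nn.Sequential(*layers)

x = torch.randn(2, 3, 224, 224)
tokens = conv_stem()(x).flatten(2).transpose(1, 2)  # (B, 196, embed_dim), as with 16x16 patches
print(tokens.shape)
```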

AAAI Conference 2021 Conference Paper

ePointDA: An End-to-End Simulation-to-Real Domain Adaptation Framework for LiDAR Point Cloud Segmentation

  • Sicheng Zhao
  • Yezhen Wang
  • Bo Li
  • Bichen Wu
  • Yang Gao
  • Pengfei Xu
  • Trevor Darrell
  • Kurt Keutzer

Due to its robust and precise distance measurements, LiDAR plays an important role in scene understanding for autonomous driving. Training deep neural networks (DNNs) on LiDAR data requires large-scale point-wise annotations, which are time-consuming and expensive to obtain. Instead, simulation-to-real domain adaptation (SRDA) trains a DNN using unlimited synthetic data with automatically generated labels and transfers the learned model to real scenarios. Existing SRDA methods for LiDAR point cloud segmentation mainly employ a multi-stage pipeline and focus on feature-level alignment. They require prior knowledge of real-world statistics and ignore the pixel-level dropout noise gap and the spatial feature gap between different domains. In this paper, we propose a novel end-to-end framework, named ePointDA, to address the above issues. Specifically, ePointDA consists of three modules: self-supervised dropout noise rendering, statistics-invariant and spatially-adaptive feature alignment, and transferable segmentation learning. The joint optimization enables ePointDA to bridge the domain shift at the pixel-level by explicitly rendering dropout noise for synthetic LiDAR and at the feature-level by spatially aligning the features between different domains, without requiring the real-world statistics. Extensive experiments adapting from synthetic GTA-LiDAR to real KITTI and SemanticKITTI demonstrate the superiority of ePointDA for LiDAR point cloud segmentation.

ICRA Conference 2021 Conference Paper

Instance-Aware Predictive Navigation in Multi-Agent Environments

  • Jinkun Cao
  • Xin Wang 0066
  • Trevor Darrell
  • Fisher Yu 0001

In this work, we aim to achieve efficient end-to-end learning of driving policies in dynamic multi-agent environments. Predicting and anticipating future events at the object level are critical for making informed driving decisions. We propose an Instance-Aware Predictive Control (IPC) approach, which forecasts interactions between agents as well as future scene structures. We adopt a novel multi-instance event prediction module to estimate the possible interaction among agents in the ego-centric view, conditioned on the selected action sequence of the ego-vehicle. To decide the action at each step, we seek the action sequence that can lead to safe future states based on the prediction module outputs by repeatedly sampling likely action sequences. We design a sequential action sampling strategy to better leverage predicted states on both scene-level and instance-level. Our method establishes a new state of the art in the challenging CARLA multi-agent driving simulation environments without expert demonstration, giving better explainability and sample efficiency.

ICRA Conference 2021 Conference Paper

PyTouch: A Machine Learning Library for Touch Processing

  • Mike Lambeta
  • Huazhe Xu
  • Jingwei Xu 0005
  • Po-Wei Chou
  • Shaoxiong Wang
  • Trevor Darrell
  • Roberto Calandra

With the increased availability of rich tactile sensors, there is an equally proportional need for open-source and integrated software capable of efficiently and effectively processing raw touch measurements into high-level signals that can be used for control and decision-making. In this paper, we present PyTouch – the first machine learning library dedicated to the processing of touch sensing signals. PyTouch is designed to be modular and easy-to-use, and provides state-of-the-art touch processing capabilities as a service, with the goal of unifying the tactile sensing community by providing a library of scalable, proven, and performance-validated modules on which applications and research can be built. We evaluate PyTouch on real-world data from several tactile sensors on touch processing tasks such as touch detection, slip and object pose estimations. PyTouch is open-sourced at https://github.com/facebookresearch/pytouch.

ICLR Conference 2021 Conference Paper

Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control

  • Zhuang Liu 0003
  • Xuanlin Li
  • Bingyi Kang
  • Trevor Darrell

Deep Reinforcement Learning (Deep RL) has been receiving increasingly more attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques in training neural networks (e.g., $L_2$ regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment, and because the deep RL community focuses more on high-level algorithm designs. In this work, we present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks. Interestingly, we find conventional regularization techniques on the policy networks can often bring large improvement, especially on harder tasks. Our findings are shown to be robust against training hyperparameter variations. We also compare these techniques with the more widely used entropy regularization. In addition, we study regularizing different components and find that only regularizing the policy network is typically the best. We further analyze why regularization may help generalization in RL from four perspectives - sample complexity, reward distribution, weight norm, and noise robustness. We hope our study provides guidance for future practices in regularizing policy optimization algorithms. Our code is available at https://github.com/xuanlinli17/iclr2021_rlreg.
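
A small sketch of the practical takeaway, under an assumed actor-critic setup: apply L2 regularization (weight decay) to the policy network only, leaving the value network unregularized, since the study finds regularizing only the policy is typically best.

```python
# Illustrative actor-critic setup: weight decay on the policy parameters only.
# Network sizes and hyperparameters are placeholders, not the paper's settings.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(17, 64), nn.Tanh(), nn.Linear(64, 6))
value = nn.Sequential(nn.Linear(17, 64), nn.Tanh(), nn.Linear(64, 1))

optimizer = torch.optim.Adam([
    {"params": policy.parameters(), "weight_decay": 1e-4},  # L2 regularize the policy
    {"params": value.parameters(), "weight_decay": 0.0},    # leave the critic alone
], lr=3e-4)
```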

ICLR Conference 2021 Conference Paper

Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting

  • Sayna Ebrahimi
  • Suzanne Petryk
  • Akash Gokul
  • William Gan
  • Joseph E. Gonzalez
  • Marcus Rohrbach
  • Trevor Darrell

The goal of continual learning (CL) is to learn a sequence of tasks without suffering from the phenomenon of catastrophic forgetting. Previous work has shown that leveraging memory in the form of a replay buffer can reduce performance degradation on prior tasks. We hypothesize that forgetting can be further reduced when the model is encouraged to remember the \textit{evidence} for previously made decisions. As a first step towards exploring this hypothesis, we propose a simple novel training paradigm, called Remembering for the Right Reasons (RRR), that additionally stores visual model explanations for each example in the buffer and ensures the model has ``the right reasons'' for its predictions by encouraging its explanations to remain consistent with those used to make decisions at training time. Without this constraint, there is a drift in explanations and increase in forgetting as conventional continual learning algorithms learn new tasks. We demonstrate how RRR can be easily added to any memory or regularization-based approach and results in reduced forgetting, and more importantly, improved model explanations. We have evaluated our approach in the standard and few-shot settings and observed a consistent improvement across various CL approaches using different architectures and techniques to generate model explanations, demonstrating a promising connection between explainability and continual learning. Our code is available at \url{https://github.com/SaynaEbrahimi/Remembering-for-the-Right-Reasons}.

NeurIPS Conference 2021 Conference Paper

Teachable Reinforcement Learning via Advice Distillation

  • Olivia Watkins
  • Abhishek Gupta
  • Trevor Darrell
  • Pieter Abbeel
  • Jacob Andreas

Training automated agents to complete complex tasks in interactive environments is challenging: reinforcement learning requires careful hand-engineering of reward functions, imitation learning requires specialized infrastructure and access to a human expert, and learning from intermediate forms of supervision (like binary preferences) is time-consuming and extracts little information from each human intervention. Can we overcome these challenges by building agents that learn from rich, interactive feedback instead? We propose a new supervision paradigm for interactive learning based on "teachable" decision-making systems that learn from structured advice provided by an external teacher. We begin by formalizing a class of human-in-the-loop decision making problems in which multiple forms of teacher-provided advice are available to a learner. We then describe a simple learning algorithm for these problems that first learns to interpret advice, then learns from advice to complete tasks even in the absence of human supervision. In puzzle-solving, navigation, and locomotion domains, we show that agents that learn from advice can acquire new skills with significantly less human supervision than standard reinforcement learning algorithms and often less than imitation learning.

ICLR Conference 2021 Conference Paper

Tent: Fully Test-Time Adaptation by Entropy Minimization

  • Dequan Wang
  • Evan Shelhamer
  • Shaoteng Liu
  • Bruno A. Olshausen
  • Trevor Darrell

A model must adapt itself to generalize to new and different data during testing. In this setting of fully test-time adaptation the model has only the test data and its own parameters. We propose to adapt by test entropy minimization (tent): we optimize the model for confidence as measured by the entropy of its predictions. Our method estimates normalization statistics and optimizes channel-wise affine transformations to update online on each batch. Tent reduces generalization error for image classification on corrupted ImageNet and CIFAR-10/100 and reaches a new state-of-the-art error on ImageNet-C. Tent handles source-free domain adaptation on digit recognition from SVHN to MNIST/MNIST-M/USPS, on semantic segmentation from GTA to Cityscapes, and on the VisDA-C benchmark. These results are achieved in one epoch of test-time optimization without altering training.
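
A condensed sketch in the spirit of this abstract: put the model in train mode so batch norm uses test-batch statistics, free only the batch-norm affine parameters, and take gradient steps on the entropy of the predictions. Optimizer settings and episodic resetting are omitted; see the paper's code for the full recipe.

```python
# Condensed test-entropy-minimization sketch: adapt only batch-norm affine parameters
# to minimize prediction entropy on each test batch. Hyperparameters are illustrative.
import torch
import torch.nn as nn

def configure_tent(model: nn.Module):
    model.train()                    # batch norm normalizes with test-batch statistics
    model.requires_grad_(False)      # freeze everything...
    params = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.requires_grad_(True)   # ...except BN affine parameters
            m.bias.requires_grad_(True)
            m.track_running_stats = False
            m.running_mean, m.running_var = None, None
            params += [m.weight, m.bias]
    return params

def entropy(logits):
    logp = logits.log_softmax(dim=1)
    return -(logp.exp() * logp).sum(dim=1).mean()

def adapt_step(model, x, optimizer):
    loss = entropy(model(x))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Usage: params = configure_tent(model); optimizer = torch.optim.SGD(params, lr=1e-3)
```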

ICLR Conference 2021 Conference Paper

What Should Not Be Contrastive in Contrastive Learning

  • Tete Xiao
  • Xiaolong Wang 0004
  • Alexei A. Efros
  • Trevor Darrell

Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars). We introduce a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances. Our model learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation. We use a multi-head network with a shared backbone which captures information across each augmentation and alone outperforms all baselines on downstream tasks. We further find that the concatenation of the invariant and varying spaces performs best across all tasks we investigate, including coarse-grained, fine-grained, and few-shot downstream classification tasks, and various data corruptions.

ICRA Conference 2021 Conference Paper

Zero-shot Policy Learning with Spatial Temporal Reward Decomposition on Contingency-aware Observation

  • Huazhe Xu
  • Boyuan Chen 0003
  • Yang Gao 0029
  • Trevor Darrell

It is a long-standing challenge to enable an intelligent agent to learn in one environment and generalize to an unseen environment without further data collection and finetuning. In this paper, we consider a zero-shot generalization problem setup that complies with biological intelligent agents’ learning and generalization processes. The agent is first presented with previous experiences in the training environment, along with task description in the form of trajectory-level sparse rewards. Later when it is placed in the new testing environment, it is asked to perform the task without any interaction with the testing environment. We find this setting natural for biological creatures and at the same time, challenging for previous methods. Behavior cloning and state-of-the-art RL, along with other zero-shot learning methods, perform poorly on this benchmark. Given a set of experiences in the training environment, our method learns a neural function that decomposes the sparse reward into particular regions in a contingency-aware observation as a per step reward. Based on such decomposed rewards, we further learn a dynamics model and use Model Predictive Control (MPC) to obtain a policy. Since the rewards are decomposed to finer-granularity observations, they are naturally generalizable to new environments that are composed of similar basic elements. We demonstrate our method on a wide range of environments, including a classic video game – Super Mario Bros, as well as a robotic continuous control task. Please refer to the project page for more visualized results.

NeurIPS Conference 2020 Conference Paper

Auxiliary Task Reweighting for Minimum-data Learning

  • Baifeng Shi
  • Judy Hoffman
  • Kate Saenko
  • Trevor Darrell
  • Huijuan Xu

Supervised learning requires a large amount of training data, limiting its application where labeled data is scarce. To compensate for data scarcity, one possible method is to utilize auxiliary tasks to provide additional supervision for the main task. Assigning and optimizing the importance weights for different auxiliary tasks remains a crucial and largely understudied research question. In this work, we propose a method to automatically reweight auxiliary tasks in order to reduce the data requirement on the main task. Specifically, we formulate the weighted likelihood function of auxiliary tasks as a surrogate prior for the main task. By adjusting the auxiliary task weights to minimize the divergence between the surrogate prior and the true prior of the main task, we obtain a more accurate prior estimation, achieving the goal of minimizing the required amount of training data for the main task and avoiding a costly grid search. In multiple experimental settings (e.g. semi-supervised learning, multi-label classification), we demonstrate that our algorithm can effectively utilize limited labeled data of the main task with the benefit of auxiliary tasks compared with previous task reweighting methods. We also show that under extreme cases with only a few extra examples (e.g. few-shot domain adaptation), our algorithm results in significant improvement over the baseline. Our code and video are available at https://sites.google.com/view/auxiliary-task-reweighting.

NeurIPS Conference 2020 Conference Paper

Fighting Copycat Agents in Behavioral Cloning from Observation Histories

  • Chuan Wen
  • Jierui Lin
  • Trevor Darrell
  • Dinesh Jayaraman
  • Yang Gao

Imitation learning trains policies to map from input observations to the actions that an expert would choose. In this setting, distribution shift frequently exacerbates the effect of misattributing expert actions to nuisance correlates among the observed variables. We observe that a common instance of this causal confusion occurs in partially observed settings when expert actions are strongly correlated over time: the imitator learns to cheat by predicting the expert's previous action, rather than the next action. To combat this "copycat problem", we propose an adversarial approach to learn a feature representation that removes excess information about the previous expert action nuisance correlate, while retaining the information necessary to predict the next action. In our experiments, our approach improves performance significantly across a variety of partially observed imitation learning tasks.

ICML Conference 2020 Conference Paper

Frustratingly Simple Few-Shot Object Detection

  • Xin Wang 0066
  • Thomas E. Huang
  • Joseph E. Gonzalez
  • Trevor Darrell
  • Fisher Yu 0001

Detecting rare objects from a few examples is an emerging problem. Prior works show meta-learning is a promising approach. But, fine-tuning techniques have drawn scant attention. We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. Such a simple approach outperforms the meta-learning methods by roughly 2-20 points on current benchmarks and sometimes even doubles the accuracy of the prior methods. However, the high variance in the few samples often leads to the unreliability of existing benchmarks. We revise the evaluation protocols by sampling multiple groups of training examples to obtain stable comparisons and build new benchmarks based on three datasets: PASCAL VOC, COCO and LVIS. Again, our fine-tuning approach establishes a new state of the art on the revised benchmarks. The code as well as the pretrained models are available at https://github.com/ucbdrive/few-shot-object-detection.
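
A hedged sketch of the fine-tuning recipe described above, using torchvision's Faster R-CNN as a stand-in detector: replace the box predictor for the new class set, freeze everything else, and train only that last layer on the few-shot data. The class count and hyperparameters are placeholders.

```python
# Illustrative last-layer fine-tuning with torchvision's Faster R-CNN as the detector;
# the actual paper uses its own codebase and splits, so treat this as a sketch.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 21  # placeholder: background + 20 classes including the rare ones
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes)

for p in model.parameters():
    p.requires_grad = False
for p in model.roi_heads.box_predictor.parameters():
    p.requires_grad = True  # only the final classification/regression layer is fine-tuned

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.001, momentum=0.9)
```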

ICRA Conference 2020 Conference Paper

Towards Practical Multi-Object Manipulation using Relational Reinforcement Learning

  • Richard Li
  • Allan Jabri
  • Trevor Darrell
  • Pulkit Agrawal 0001

Learning robotic manipulation tasks using reinforcement learning with sparse rewards is currently impractical due to the outrageous data requirements. Many practical tasks require manipulation of multiple objects, and the complexity of such tasks increases with the number of objects. Learning from a curriculum of increasingly complex tasks appears to be a natural solution, but unfortunately, does not work for many scenarios. We hypothesize that the inability of the state-of-the-art algorithms to effectively utilize a task curriculum stems from the absence of inductive biases for transferring knowledge from simpler to complex tasks. We show that graph-based relational architectures overcome this limitation and enable learning of complex tasks when provided with a simple curriculum of tasks with increasing numbers of objects. We demonstrate the utility of our framework on a simulated block stacking task. Starting from scratch, our agent learns to stack six blocks into a tower. Despite using step-wise sparse rewards, our method is orders of magnitude more data-efficient and outperforms the existing state-of-the-art method that utilizes human demonstrations. Furthermore, the learned policy exhibits zero-shot generalization, successfully stacking blocks into taller towers and previously unseen configurations such as pyramids, without any further training.

ICLR Conference 2020 Conference Paper

Uncertainty-guided Continual Learning with Bayesian Neural Networks

  • Sayna Ebrahimi
  • Mohamed Elhoseiny
  • Trevor Darrell
  • Marcus Rohrbach

Continual learning aims to learn new tasks without forgetting previously learned ones. This is especially challenging when one cannot access data from previous tasks and when the model has a fixed capacity. Current regularization-based continual learning algorithms need an external representation and extra computation to measure the parameters' \textit{importance}. In contrast, we propose Uncertainty-guided Continual Bayesian Neural Networks (UCB), where the learning rate adapts according to the uncertainty defined in the probability distribution of the weights in networks. Uncertainty is a natural way to identify \textit{what to remember} and \textit{what to change} as we continually learn, and thus mitigate catastrophic forgetting. We also show a variant of our model, which uses uncertainty for weight pruning and retains task performance after pruning by saving binary masks per task. We evaluate our UCB approach extensively on diverse object classification datasets with short and long sequences of tasks and report superior or on-par performance compared to existing approaches. Additionally, we show that our model does not necessarily need task information at test time, i.e. it does not presume knowledge of which task a sample belongs to.
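
A toy sketch of the uncertainty-guided idea, assuming a mean-field Bayesian layer parameterized by a mean `mu` and a pre-softplus scale `rho`: each weight mean gets a learning rate proportional to its posterior standard deviation, so confident (low-uncertainty) weights change slowly on new tasks. The names and exact scaling are assumptions, not the paper's formulation.

```python
# Illustrative uncertainty-scaled update for a mean-field Bayesian layer; assumes
# gradients on `mu` have already been computed via loss.backward().
import torch

def uncertainty_scaled_update(mu: torch.Tensor, rho: torch.Tensor, base_lr: float = 1e-2):
    """One SGD step on `mu` with a per-weight learning rate proportional to softplus(rho)."""
    sigma = torch.nn.functional.softplus(rho)   # posterior std per weight
    lr = base_lr * sigma / sigma.max()          # uncertain weights move faster
    with torch.no_grad():
        mu -= lr * mu.grad
    return lr

# Toy usage with a dummy loss.
mu = torch.randn(10, requires_grad=True)
rho = torch.full((10,), -3.0)
(mu ** 2).sum().backward()
uncertainty_scaled_update(mu, rho)
```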

ICML Conference 2020 Conference Paper

Video Prediction via Example Guidance

  • Jingwei Xu 0005
  • Huazhe Xu
  • Bingbing Ni
  • Xiaokang Yang 0001
  • Trevor Darrell

In video prediction tasks, one major challenge is to capture the multi-modal nature of future contents and dynamics. In this work, we propose a simple yet effective framework that can efficiently predict plausible future states, where the key insight is that the potential distribution of a sequence could be approximated with analogous ones in a repertoire drawn from the training pool, namely, expert examples. By further incorporating a novel optimization scheme into the training procedure, plausible predictions can be sampled efficiently from the distribution constructed from the retrieved examples. Meanwhile, our method can be seamlessly integrated with existing stochastic predictive models; significant enhancement is observed with comprehensive experiments in both quantitative and qualitative aspects. We also demonstrate the ability to generalize to predicting the motion of unseen classes, i.e., without access to the corresponding data during the training phase. Project Page: https://sites.google.com/view/vpeg-supp/home

NeurIPS Conference 2019 Conference Paper

Compositional Plan Vectors

  • Coline Devin
  • Daniel Geng
  • Pieter Abbeel
  • Trevor Darrell
  • Sergey Levine

Autonomous agents situated in real-world environments must be able to master large repertoires of skills. While a single short skill can be learned quickly, it would be impractical to learn every task independently. Instead, the agent should share knowledge across behaviors such that each task can be learned efficiently, and such that the resulting model can generalize to new tasks, especially ones that are compositions or subsets of tasks seen previously. A policy conditioned on a goal or demonstration has the potential to share knowledge between tasks if it sees enough diversity of inputs. However, these methods may not generalize to a more complex task at test time. We introduce compositional plan vectors (CPVs) to enable a policy to perform compositions of tasks without additional supervision. CPVs represent trajectories as the sum of the subtasks within them. We show that CPVs can be learned within a one-shot imitation learning framework without any additional supervision or information about task hierarchy, and enable a demonstration-conditioned policy to generalize to tasks that sequence twice as many skills as the tasks seen during training. Analogously to embeddings such as word2vec in NLP, CPVs can also support simple arithmetic operations -- for example, we can add the CPVs for two different tasks to command an agent to compose both tasks, without any additional training.

UAI Conference 2019 Conference Paper

Deep Mixture of Experts via Shallow Embedding

  • Xin Wang 0066
  • Fisher Yu 0001
  • Lisa Dunlap
  • Yian Ma
  • Ruth Wang
  • Azalia Mirhoseini
  • Trevor Darrell
  • Joseph E. Gonzalez

Larger networks generally have greater representational power at the cost of increased computational complexity. Sparsifying such networks has been an active area of research but has been generally limited to static regularization or dynamic approaches using reinforcement learning. We explore a mixture of experts (MoE) approach to deep dynamic routing, which activates certain experts in the network on a per-example basis. Our novel DeepMoE architecture increases the representational power of standard convolutional networks by adaptively sparsifying and recalibrating channel-wise features in each convolutional layer. We employ a multi-headed sparse gating network to determine the selection and scaling of channels for each input, leveraging exponential combinations of experts within a single convolutional network. Our proposed architecture is evaluated on four benchmark datasets and tasks, and we show that Deep-MoEs are able to achieve higher accuracy with lower computation than standard convolutional networks.

ICRA Conference 2019 Conference Paper

Deep Object-Centric Policies for Autonomous Driving

  • Dequan Wang
  • Coline Devin
  • Qi-Zhi Cai
  • Fisher Yu 0001
  • Trevor Darrell

While learning visuomotor skills in an end-to-end manner is appealing, deep neural networks are often uninterpretable and fail in surprising ways. For robotics tasks, such as autonomous driving, models that explicitly represent objects may be more robust to new scenes and provide intuitive visualizations. We describe a taxonomy of “object-centric” models which leverage both object instances and end-to-end learning. In the Grand Theft Auto V simulator, we show that object-centric models outperform object-agnostic methods in scenes with other vehicles and pedestrians, even with an imperfect detector. We also demonstrate that our architectures perform well on real-world environments by evaluating on the Berkeley DeepDrive Video dataset, where an object-centric model outperforms object-agnostic models in the low-data regimes.

NeurIPS Conference 2019 Conference Paper

Learning to Control Self-Assembling Morphologies: A Study of Generalization via Modularity

  • Deepak Pathak
  • Christopher Lu
  • Trevor Darrell
  • Phillip Isola
  • Alexei Efros

Contemporary sensorimotor learning approaches typically start with an existing complex agent (e.g., a robotic arm), which they learn to control. In contrast, this paper investigates a modular co-evolution strategy: a collection of primitive agents learns to dynamically self-assemble into composite bodies while also learning to coordinate their behavior to control these bodies. Each primitive agent consists of a limb with a motor attached at one end. Limbs may choose to link up to form collectives. When a limb initiates a link-up action and there is another limb nearby, the latter is magnetically connected to the 'parent' limb's motor. This forms a new single agent, which may further link with other agents. In this way, complex morphologies can emerge, controlled by a policy whose architecture is in explicit correspondence with the morphology. We evaluate the performance of these dynamic and modular agents in simulated environments. We demonstrate better generalization to test-time changes both in the environment, as well as in the structure of the agent, compared to static and monolithic baselines. Project videos and source code are provided in the supplementary material.

IROS Conference 2019 Conference Paper

Monocular Plan View Networks for Autonomous Driving

  • Dequan Wang
  • Coline Devin
  • Qi-Zhi Cai
  • Philipp Krähenbühl
  • Trevor Darrell

Convolutions on monocular dash cam videos capture spatial invariances in the image plane but do not explicitly reason about distances and depth. We propose a simple transformation of observations into a bird’s eye view, also known as plan view, for end-to-end control. We detect vehicles and pedestrians in the first person view and project them into an overhead plan view. This representation provides an abstraction of the environment from which a deep network can easily deduce the positions and directions of entities. Additionally, the plan view enables us to leverage advances in 3D object detection in conjunction with deep policy learning. We evaluate our monocular plan view network on the photo-realistic Grand Theft Auto V simulator. A network using both a plan view and front view causes less than half as many collisions as previous detection-based methods and an order of magnitude fewer collisions than pure pixel-based policies.

ICML Conference 2018 Conference Paper

CyCADA: Cycle-Consistent Adversarial Domain Adaptation

  • Judy Hoffman
  • Eric Tzeng
  • Taesung Park
  • Jun-Yan Zhu
  • Phillip Isola
  • Kate Saenko
  • Alexei A. Efros
  • Trevor Darrell

Domain adaptation is critical for success in new, unseen environments. Adversarial adaptation models have shown tremendous progress towards adapting to new environments by focusing either on discovering domain invariant representations or by mapping between unpaired image domains. While feature space methods are difficult to interpret and sometimes fail to capture pixel-level and low-level domain shifts, image space methods sometimes fail to incorporate high level semantic knowledge relevant for the end task. We propose a model which adapts between domains using both generative image space alignment and latent representation space alignment. Our approach, Cycle-Consistent Adversarial Domain Adaptation (CyCADA), guides transfer between domains according to a specific discriminatively trained task and avoids divergence by enforcing consistency of the relevant semantics before and after adaptation. We evaluate our method on a variety of visual recognition and prediction settings, including digit classification and semantic segmentation of road scenes, advancing state-of-the-art performance for unsupervised adaptation from synthetic to real world driving domains.

ICRA Conference 2018 Conference Paper

Deep Object-Centric Representations for Generalizable Robot Learning

  • Coline Devin
  • Pieter Abbeel
  • Trevor Darrell
  • Sergey Levine

Robotic manipulation in complex open-world scenarios requires both reliable physical manipulation skills and effective and generalizable perception. In this paper, we propose using an object-centric prior and a semantic feature space for the perception system of a learned policy. We devise an object-level attentional mechanism that can be used to determine relevant objects from a few trajectories or demonstrations, and then immediately incorporate those objects into a learned policy. A task-independent attention locates possible objects in the scene, and a task-specific attention identifies which objects are predictive of the trajectories. The scope of the task-specific attention is easily adjusted by showing demonstrations with distractor objects or with diverse relevant objects. Our results indicate that this approach exhibits good generalization across object instances using very few samples, and can be used to learn a variety of manipulation tasks using reinforcement learning.

NeurIPS Conference 2018 Conference Paper

Speaker-Follower Models for Vision-and-Language Navigation

  • Daniel Fried
  • Ronghang Hu
  • Volkan Cirik
  • Anna Rohrbach
  • Jacob Andreas
  • Louis-Philippe Morency
  • Taylor Berg-Kirkpatrick
  • Kate Saenko

Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.

ICLR Conference 2018 Conference Paper

Zero-Shot Visual Imitation

  • Deepak Pathak
  • Parsa Mahmoudieh
  • Guanghao Luo
  • Pulkit Agrawal 0001
  • Dian Chen 0001
  • Yide Shentu
  • Evan Shelhamer
  • Jitendra Malik

The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance. Videos, models, and more details are available at https://pathak22.github.io/zeroshot-imitation/.

ICML Conference 2017 Conference Paper

Curiosity-driven Exploration by Self-supervised Prediction

  • Deepak Pathak
  • Pulkit Agrawal 0001
  • Alexei A. Efros
  • Trevor Darrell

In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent’s ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g., new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch.
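A minimal sketch of this kind of curiosity signal (assuming continuous actions and toy dimensions, which differ from the paper's setup): the intrinsic reward is the forward-prediction error in a feature space shaped by an inverse-dynamics objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, feat_dim, action_dim = 128, 32, 4
encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())   # phi(s)
forward_model = nn.Linear(feat_dim + action_dim, feat_dim)          # predicts phi(s')
inverse_model = nn.Linear(2 * feat_dim, action_dim)                 # predicts a from (phi(s), phi(s'))

def curiosity_reward(s, a, s_next, eta=0.5):
    phi_s, phi_next = encoder(s), encoder(s_next)
    phi_next_hat = forward_model(torch.cat([phi_s, a], dim=-1))
    # Intrinsic reward: squared forward-prediction error in feature space.
    return 0.5 * eta * (phi_next_hat - phi_next).pow(2).sum(dim=-1)

def icm_losses(s, a, s_next):
    phi_s, phi_next = encoder(s), encoder(s_next)
    fwd = F.mse_loss(forward_model(torch.cat([phi_s, a], dim=-1)), phi_next.detach())
    inv = F.mse_loss(inverse_model(torch.cat([phi_s, phi_next], dim=-1)), a)
    return fwd, inv   # typically combined as a weighted sum during training
```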

ICRA Conference 2017 Conference Paper

Learning modular neural network policies for multi-task and multi-robot transfer

  • Coline Devin
  • Abhishek Gupta 0004
  • Trevor Darrell
  • Pieter Abbeel
  • Sergey Levine

Reinforcement learning (RL) can automate a wide variety of robotic skills, but learning each new skill requires considerable real-world data collection and manual representation engineering to design policy classes or features. Using deep reinforcement learning to train general purpose neural network policies alleviates some of the burden of manual representation engineering by using expressive policy classes, but exacerbates the challenge of data collection, since such methods tend to be less efficient than RL with low-dimensional, hand-designed representations. Transfer learning can mitigate this problem by enabling us to transfer information from one skill to another and even from one robot to another. We show that neural network policies can be decomposed into “task-specific” and “robot-specific” modules, where the task-specific modules are shared across robots, and the robot-specific modules are shared across all tasks on that robot. This allows for sharing task information, such as perception, between robots and sharing robot information, such as dynamics and kinematics, between tasks. We exploit this decomposition to train mix-and-match modules that can solve new robot-task combinations that were not seen during training. Using a novel approach to train modular neural networks, we demonstrate the effectiveness of our transfer method for enabling zero-shot generalization with a variety of robots and tasks in simulation for both visual and non-visual tasks.
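A hedged sketch of the modular decomposition (interface sizes and network bodies are assumptions, not the paper's architecture): a task module maps observations to an intermediate interface, a robot module maps that interface to actions, and new robot-task combinations are assembled by recombining trained modules.

```python
import torch.nn as nn

class TaskModule(nn.Module):       # shared across robots for a given task (e.g., perception)
    def __init__(self, obs_dim=64, interface_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, interface_dim))
    def forward(self, obs):
        return self.net(obs)

class RobotModule(nn.Module):      # shared across tasks for a given robot (dynamics/kinematics)
    def __init__(self, interface_dim=32, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(interface_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
    def forward(self, z):
        return self.net(z)

def make_policy(task_module, robot_module):
    # A new robot-task combination is assembled by composing two previously trained modules.
    return nn.Sequential(task_module, robot_module)
```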

NeurIPS Conference 2017 Conference Paper

Toward Multimodal Image-to-Image Translation

  • Jun-Yan Zhu
  • Richard Zhang
  • Deepak Pathak
  • Trevor Darrell
  • Alexei Efros
  • Oliver Wang
  • Eli Shechtman

Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a distribution of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.
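For illustration, a minimal latent-regression sketch of the invertibility idea (the stand-in networks and dimensions here are assumptions, not the paper's generator): an encoder is trained to recover the sampled latent code from the generated output, discouraging many-to-one collapse from latent code to output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 8
G = nn.Sequential(nn.Linear(3 * 64 * 64 + latent_dim, 1024), nn.ReLU(),
                  nn.Linear(1024, 3 * 64 * 64))   # stand-in for a conditional generator
E = nn.Sequential(nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
                  nn.Linear(256, latent_dim))      # stand-in for the latent encoder

def latent_regression_loss(x_flat):
    z = torch.randn(x_flat.size(0), latent_dim)
    y_hat = G(torch.cat([x_flat, z], dim=-1))
    z_hat = E(y_hat)
    # Making z recoverable from the output keeps the latent-to-output map close to invertible.
    return F.l1_loss(z_hat, z)
```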

ICRA Conference 2016 Conference Paper

Cross-modal adaptation for RGB-D detection

  • Judy Hoffman
  • Saurabh Gupta 0001
  • Jian Leong
  • Sergio Guadarrama
  • Trevor Darrell

In this paper we propose a technique to adapt convolutional neural network (CNN) based object detectors trained on RGB images to effectively leverage depth images at test time to boost detection performance. Given labeled depth images for a handful of categories we adapt an RGB object detector for a new category such that it can now use depth images in addition to RGB images at test time to produce more accurate detections. Our approach is built upon the observation that lower layers of a CNN are largely task and category agnostic and domain specific while higher layers are largely task and category specific while being domain agnostic. We operationalize this observation by proposing a mid-level fusion of RGB and depth CNNs. Experimental evaluation on the challenging NYUD2 dataset shows that our proposed adaptation technique results in an average 21% relative improvement in detection performance over an RGB-only baseline even when no depth training data is available for the particular category evaluated. We believe our proposed technique will extend advances made in computer vision to RGB-D data leading to improvements in performance at little additional annotation effort.
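A rough sketch of mid-level fusion under that observation (layer shapes are illustrative assumptions, not the paper's network): modality-specific lower layers feed a shared, more task-specific upper stack.

```python
import torch
import torch.nn as nn

class MidFusionHead(nn.Module):
    def __init__(self, mid_dim=256, num_classes=20):
        super().__init__()
        self.rgb_lower = nn.Sequential(nn.Conv2d(3, mid_dim, 3, padding=1), nn.ReLU())
        self.depth_lower = nn.Sequential(nn.Conv2d(1, mid_dim, 3, padding=1), nn.ReLU())
        # Higher, task- and category-specific layers operate on the fused representation.
        self.upper = nn.Sequential(nn.Conv2d(2 * mid_dim, mid_dim, 3, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(mid_dim, num_classes))

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_lower(rgb), self.depth_lower(depth)], dim=1)
        return self.upper(fused)
```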

ICRA Conference 2016 Conference Paper

Deep learning for tactile understanding from visual and haptic data

  • Yang Gao 0029
  • Lisa Anne Hendricks
  • Katherine J. Kuchenbecker
  • Trevor Darrell

Robots which interact with the physical world will benefit from a fine-grained tactile understanding of objects and surfaces. Additionally, for certain tasks, robots may need to know the haptic properties of an object before touching it. To enable better tactile understanding for robots, we propose a method of classifying surfaces with haptic adjectives (e.g., compressible or smooth) from both visual and physical interaction data. Humans typically combine visual predictions and feedback from physical interactions to accurately predict haptic properties and interact with the world. Inspired by this cognitive pattern, we propose and explore a purely visual haptic prediction model. Purely visual models enable a robot to “feel” without physical interaction. Furthermore, we demonstrate that using both visual and physical interaction signals together yields more accurate haptic classification. Our models take advantage of recent advances in deep neural networks by employing a unified approach to learning features for physical interaction and visual observations. Even though we employ little domain specific knowledge, our model still achieves better results than methods based on hand-designed features.

ICRA Conference 2016 Conference Paper

Deep spatial autoencoders for visuomotor learning

  • Chelsea Finn
  • Xin Yu Tan
  • Yan Duan
  • Trevor Darrell
  • Sergey Levine
  • Pieter Abbeel

Reinforcement learning provides a powerful and flexible framework for automated acquisition of robotic motion skills. However, applying reinforcement learning requires a sufficiently detailed representation of the state, including the configuration of task-relevant objects. We present an approach that automates state-space construction by learning a state representation directly from camera images. Our method uses a deep spatial autoencoder to acquire a set of feature points that describe the environment for the current task, such as the positions of objects, and then learns a motion skill with these feature points using an efficient reinforcement learning method based on local linear models. The resulting controller reacts continuously to the learned feature points, allowing the robot to dynamically manipulate objects in the world with closed-loop control. We demonstrate our method with a PR2 robot on tasks that include pushing a free-standing toy block, picking up a bag of rice using a spatula, and hanging a loop of rope on a hook at various positions. In each task, our method automatically learns to track task-relevant objects and manipulate their configuration with the robot's arm.
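A minimal sketch of a spatial-softmax layer of the kind described above, which converts a convolutional activation map into 2-D feature points (the expected image coordinates of each channel's activation mass). The tensor layout is an assumption.

```python
import torch

def spatial_softmax(features):
    # features: (batch, channels, H, W) activations from a convolutional encoder.
    b, c, h, w = features.shape
    probs = torch.softmax(features.view(b, c, h * w), dim=-1).view(b, c, h, w)
    ys = torch.linspace(-1.0, 1.0, h).view(1, 1, h, 1)
    xs = torch.linspace(-1.0, 1.0, w).view(1, 1, 1, w)
    # Expected (x, y) position per channel: one learned feature point per channel.
    expected_x = (probs * xs).sum(dim=(2, 3))
    expected_y = (probs * ys).sum(dim=(2, 3))
    return torch.stack([expected_x, expected_y], dim=-1)   # (batch, channels, 2)
```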

JMLR Journal 2016 Journal Article

End-to-End Training of Deep Visuomotor Policies

  • Sergey Levine
  • Chelsea Finn
  • Trevor Darrell
  • Pieter Abbeel

Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods.

JMLR Journal 2016 Journal Article

Large Scale Visual Recognition through Adaptation using Joint Representation and Multiple Instance Learning

  • Judy Hoffman
  • Deepak Pathak
  • Eric Tzeng
  • Jonathan Long
  • Sergio Guadarrama
  • Trevor Darrell
  • Kate Saenko

A major barrier towards scaling visual recognition systems is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) trained using 1.2M+ labeled images have emerged as clear winners on object classification benchmarks. Unfortunately, only a small fraction of those labels are available with bounding box localization for training the detection task and even fewer pixel level annotations are available for semantic segmentation. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect scene-centric images with precisely localized labels. We develop methods for learning large scale recognition models which exploit joint training over both weak (image-level) and strong (bounding box) labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks. We provide a novel formulation of a joint multiple instance learning method that includes examples from object-centric data with image-level labels when available, and also performs domain transfer learning to improve the underlying detector representation. We then show how to use our large scale detectors to produce pixel level annotations. Using our method, we produce a >7.6K category detector and release code and models at lsda.berkeleyvision.org.

ICRA Conference 2016 Conference Paper

TSC-DL: Unsupervised trajectory segmentation of multi-modal surgical demonstrations with Deep Learning

  • Adithyavairavan Murali
  • Animesh Garg
  • Sanjay Krishnan
  • Florian T. Pokorny
  • Pieter Abbeel
  • Trevor Darrell
  • Ken Goldberg

The growth of robot-assisted minimally invasive surgery has led to sizable datasets of fixed-camera video and kinematic recordings of surgical subtasks. Segmentation of these trajectories into locally-similar contiguous sections can facilitate learning from demonstrations, skill assessment, and salvaging good segments from otherwise inconsistent demonstrations. Manual, or supervised, segmentation can be prone to error and impractical for large datasets. We present Transition State Clustering with Deep Learning (TSC-DL), a new unsupervised algorithm that leverages video and kinematic data for task-level segmentation, and finds regions of the visual feature space that correlate with transition events using features constructed from layers of pre-trained image classification Deep Convolutional Neural Networks (CNNs). We report results on three datasets comparing Deep Learning architectures (AlexNet and VGG), choice of convolutional layer, dimensionality reduction techniques, visual encoding, and the use of Scale Invariant Feature Transforms (SIFT). We find that the deep architectures extract features that result in up to a 30.4% improvement in Silhouette Score (a measure of cluster tightness) over the traditional “shallow” features from SIFT. We also present cases where TSC-DL discovers human annotator omissions. Supplementary material, data, and code are available at: http://berkeleyautomation.github.io/tsc-dl/

ICML Conference 2014 Conference Paper

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

  • Jeff Donahue
  • Yangqing Jia
  • Oriol Vinyals
  • Judy Hoffman
  • Ning Zhang 0014
  • Eric Tzeng
  • Trevor Darrell

We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
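A hedged sketch of the transfer recipe this abstract describes: freeze a network trained on the original supervised task, read off an intermediate activation as a fixed feature, and fit a simple classifier for the new task. The stand-in backbone, layer choice, and dimensions are assumptions, not the released DeCAF pipeline.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # stand-in for a pretrained convnet
    nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())     # layer whose activations serve as the fixed feature

linear_head = nn.Linear(64, 10)                # new-task classifier trained on frozen features

def extract_and_classify(images):
    with torch.no_grad():                      # backbone stays fixed ("generic" features)
        feats = backbone(images)
    return linear_head(feats)
```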

NeurIPS Conference 2014 Conference Paper

Do Convnets Learn Correspondence?

  • Jonathan Long
  • Ning Zhang
  • Trevor Darrell

Convolutional neural nets (convnets) trained from massive labeled datasets have substantially improved the state-of-the-art in image classification and object detection. However, visual understanding requires establishing correspondence on a finer level than object category. Given their large pooling regions and training from whole-image labels, it is not clear that convnets derive their success from an accurate correspondence model which could be used for precise localization. In this paper, we study the effectiveness of convnet activation features for tasks requiring correspondence. We present evidence that convnet features localize at a much finer scale than their receptive field sizes, that they can be used to perform intraclass alignment as well as conventional hand-engineered features, and that they outperform conventional features in keypoint prediction on objects from PASCAL VOC 2011.

ICRA Conference 2014 Conference Paper

Interactive adaptation of real-time object detectors

  • Daniel Goehring
  • Judy Hoffman
  • Erik Rodner
  • Kate Saenko
  • Trevor Darrell

In the following paper, we present a framework for quickly training 2D object detectors for robotic perception. Our method can be used by robotics practitioners to quickly (under 30 seconds per object) build a large-scale real-time perception system. In particular, we show how to create new detectors on the fly using large-scale internet image databases, thus allowing a user to choose among thousands of available categories to build a detection system suitable for the particular robotic application. Furthermore, we show how to adapt these models to the current environment with just a few in-situ images. Experiments on existing 2D benchmarks evaluate the speed, accuracy, and flexibility of our system.

NeurIPS Conference 2014 Conference Paper

LSDA: Large Scale Detection through Adaptation

  • Judy Hoffman
  • Sergio Guadarrama
  • Eric Tzeng
  • Ronghang Hu
  • Jeff Donahue
  • Ross Girshick
  • Trevor Darrell
  • Kate Saenko

A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach. This algorithm enables us to produce a >7.6K detector by using available classification data from leaf nodes in the ImageNet tree. We additionally demonstrate how to modify our architecture to produce a fast detector (running at 2 fps for the 7.6K detector). Models and software are available at

ICML Conference 2014 Conference Paper

On learning to localize objects with minimal supervision

  • Hyun Oh Song
  • Ross B. Girshick
  • Stefanie Jegelka
  • Julien Mairal
  • Zaïd Harchaoui
  • Trevor Darrell

Learning to localize objects with minimal supervision is an important problem in computer vision, since large fully annotated datasets are extremely costly to obtain. In this paper, we propose a new method that achieves this goal with only image-level labels of whether the objects are present or not. Our approach combines a discriminative submodular cover problem for automatically discovering a set of positive object windows with a smoothed latent SVM formulation. The latter allows us to leverage efficient quasi-Newton optimization techniques. Our experiments demonstrate that the proposed approach provides a 50% relative improvement in mean average precision over the current state-of-the-art on PASCAL VOC 2007 detection.

NeurIPS Conference 2014 Conference Paper

Weakly-supervised Discovery of Visual Pattern Configurations

  • Hyun Oh Song
  • Yong Jae Lee
  • Stefanie Jegelka
  • Trevor Darrell

The prominence of weakly labeled data gives rise to a growing demand for object detection methods that can cope with minimal supervision. We propose an approach that automatically identifies discriminative configurations of visual patterns that are characteristic of a given object class. We formulate the problem as a constrained submodular optimization problem and demonstrate the benefits of the discovered configurations in remedying mislocalizations and finding informative positive and negative training examples. Together, these lead to state-of-the-art weakly-supervised detection results on the challenging PASCAL VOC dataset.

ICML Conference 2013 Conference Paper

Discriminatively Activated Sparselets

  • Ross B. Girshick
  • Hyun Oh Song
  • Trevor Darrell

Shared representations are highly appealing due to their potential for gains in computational and statistical efficiency. Compressing a shared representation leads to greater computational savings, but at the same time can severely decrease performance on a target task. Recently, sparselets (Song et al., 2012) were introduced as a new shared intermediate representation for multiclass object detection with deformable part models (Felzenszwalb et al., 2010a), showing significant speedup factors, but with a large decrease in task performance. In this paper we describe a new training framework that learns which sparselets to activate in order to optimize a discriminative objective, leading to larger speedup factors with no decrease in task performance. We first reformulate sparselets in a general structured output prediction framework, then analyze when sparselets lead to computational efficiency gains, and lastly show experimental results on object detection and image classification tasks. Our experimental results demonstrate that discriminative activation substantially outperforms the previous reconstructive approach which, together with our structured output prediction formulation, make sparselets broadly applicable and significantly more effective.

IROS Conference 2013 Conference Paper

Grounding spatial relations for human-robot interaction

  • Sergio Guadarrama
  • Lorenzo Riano
  • Dave Golland
  • Daniel Goehring
  • Yangqing Jia
  • Dan Klein 0001
  • Pieter Abbeel
  • Trevor Darrell

We propose a system for human-robot interaction that learns both models for spatial prepositions and for object recognition. Our system grounds the meaning of an input sentence in terms of visual percepts coming from the robot's sensors in order to send an appropriate command to the PR2 or respond to spatial queries. To perform this grounding, the system recognizes the objects in the scene, determines which spatial relations hold between those objects, and semantically parses the input sentence. The proposed system uses the visual and spatial information in conjunction with the semantic parse to interpret statements that refer to objects (nouns), their spatial relationships (prepositions), and to execute commands (actions). The semantic parse is inherently compositional, allowing the robot to understand complex commands that refer to multiple objects and relations such as: “Move the cup close to the robot to the area in front of the plate and behind the tea box”. Our system correctly parses 94% of the 210 online test sentences, correctly interprets 91% of the correctly parsed sentences, and correctly executes 89% of the correctly interpreted sentences.

ICML Conference 2013 Conference Paper

On Compact Codes for Spatially Pooled Features

  • Yangqing Jia
  • Oriol Vinyals
  • Trevor Darrell

Feature encoding with an overcomplete dictionary has demonstrated good performance in many applications, especially computer vision. In this paper we analyze the classification accuracy with respect to dictionary size by linking the encoding stage to kernel methods and Nyström sampling, and obtain useful bounds on accuracy as a function of size. The Nyström method also inspires us to revisit dictionary learning from local patches, and we propose to learn the dictionary in an end-to-end fashion taking into account pooling, a common computational layer in vision. We validate our contribution by showing how the derived bounds are able to explain the observed behavior of multiple datasets, and show that the pooling-aware method efficiently reduces the dictionary size by a factor of two for a given accuracy.
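A small, self-contained illustration of the Nyström approximation that the analysis leans on: a full kernel matrix is approximated from a randomly sampled subset of columns, playing the role of a reduced "dictionary". The RBF kernel and sample size here are arbitrary choices for the demo.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_approx(X, m, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)   # sampled "dictionary" entries
    C = rbf_kernel(X, X[idx], gamma)                  # n x m cross-kernel
    W = rbf_kernel(X[idx], X[idx], gamma)             # m x m sub-kernel
    return C @ np.linalg.pinv(W) @ C.T                # rank-m approximation of K

X = np.random.randn(200, 16)
K = rbf_kernel(X, X)
K_hat = nystrom_approx(X, m=40)
print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))  # error shrinks as m grows
```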

ICRA Conference 2013 Conference Paper

Using robotic exploratory procedures to learn the meaning of haptic adjectives

  • Vivian Chu
  • Ian McMahon
  • Lorenzo Riano
  • Craig G. McDonald
  • Qin He
  • Jorge Martinez Perez-Tejada
  • Michael Arrigo
  • Naomi T. Fitter
  • Trevor Darrell
  • Katherine J. Kuchenbecker

Delivering on the promise of real-world robotics will require robots that can communicate with humans through natural language by learning new words and concepts through their daily experiences. Our research strives to create a robot that can learn the meaning of haptic adjectives by directly touching objects. By equipping the PR2 humanoid robot with state-of-the-art biomimetic tactile sensors that measure temperature, pressure, and fingertip deformations, we created a platform uniquely capable of feeling the physical properties of everyday objects. The robot used five exploratory procedures to touch 51 objects that were annotated by human participants with 34 binary adjective labels. We present both static and dynamic learning methods to discover the meaning of these adjectives from the labeled objects, achieving average F1 scores of 0.57 and 0.79 on a set of eight previously unfelt items.

NeurIPS Conference 2013 Conference Paper

Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

  • Yangqing Jia
  • Joshua Abbott
  • Joseph Austerweil
  • Tom Griffiths
  • Trevor Darrell

Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization.

UAI Conference 2012 Conference Paper

Factorized Multi-Modal Topic Model

  • Seppo Virtanen
  • Yangqing Jia
  • Arto Klami
  • Trevor Darrell

Multi-modal data collections, such as corpora of paired images and text snippets, require analysis methods beyond single-view component and topic models. For continuous observations the current dominant approach is based on extensions of canonical correlation analysis, factorizing the variation into components shared by the different modalities and those private to each of them. For count data, multiple variants of topic models attempting to tie the modalities together have been presented. All of these, however, lack the ability to learn components private to one modality, and consequently will try to force dependencies even between minimally correlating modalities. In this work we combine the two approaches by presenting a novel HDP-based topic model that automatically learns both shared and private topics. The model is shown to be especially useful for querying the contents of one domain given samples of the other.

NeurIPS Conference 2012 Conference Paper

Learning with Recursive Perceptual Representations

  • Oriol Vinyals
  • Yangqing Jia
  • Li Deng
  • Trevor Darrell

Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous --often more complicated-- methods on several vision and speech benchmarks.
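A hedged sketch of the stacking recipe described above (layer count, scaling constant, and solver settings are assumptions): each layer trains a linear SVM, randomly projects its decision values, and adds them back to the original data before the next layer.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_recursive_svm(X, y, n_layers=3, beta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    layers, Z = [], X.copy()
    for _ in range(n_layers):
        svm = LinearSVC(C=1.0, max_iter=5000).fit(Z, y)
        scores = svm.decision_function(Z)
        if scores.ndim == 1:                       # binary case: treat scores as one column
            scores = scores[:, None]
        W = rng.standard_normal((scores.shape[1], X.shape[1]))
        layers.append((svm, W))
        Z = X + beta * (scores @ W)                # random projection of the layer's predictions
    return layers

def predict_recursive_svm(layers, X, beta=0.5):
    Z = X.copy()
    for svm, W in layers[:-1]:                     # replay the same transformations at test time
        scores = svm.decision_function(Z)
        if scores.ndim == 1:
            scores = scores[:, None]
        Z = X + beta * (scores @ W)
    return layers[-1][0].predict(Z)
```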

NeurIPS Conference 2012 Conference Paper

Timely Object Recognition

  • Sergey Karayev
  • Tobias Baumgartner
  • Mario Fritz
  • Trevor Darrell

In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the eminent PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains $66\%$ better AP than a random ordering, and $14\%$ better performance than an intelligent baseline. On the timeliness measure, our method obtains at least $11\%$ better performance. Our code, to be made available upon publication, is easily extensible as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning.

NeurIPS Conference 2011 Conference Paper

Heavy-tailed Distances for Gradient Based Image Descriptors

  • Yangqing Jia
  • Trevor Darrell

Many applications in computer vision measure the similarity between images or image patches based on some statistics such as oriented gradients. These are often modeled implicitly or explicitly with a Gaussian noise assumption, leading to the use of the Euclidean distance when comparing image descriptors. In this paper, we show that the statistics of gradient based image descriptors often follow a heavy-tailed distribution, which undermines any principled motivation for the use of Euclidean distances. We advocate for the use of a distance measure based on the likelihood ratio test with appropriate probabilistic models that fit the empirical data distribution. We instantiate this similarity measure with the Gamma-compound-Laplace distribution, and show significant improvement over existing distance measures in the application of SIFT feature matching, at relatively low computational cost.

ICRA Conference 2011 Conference Paper

Parametrized shape models for clothing

  • Stephen Miller
  • Mario Fritz
  • Trevor Darrell
  • Pieter Abbeel

We consider the problem of recognizing the configuration of clothing articles when crudely spread out on a flat surface, prior to and during folding. At the core of our approach are parametrized shape models for clothing articles. Each clothing category has its own shape model, and the variety in shapes for a given category is achieved through variation of the parameters. We present an efficient algorithm to find the parameters that provide the best fit when given an image of a clothing article. The models are such that, once the parameters have been fit, they provide a basic parse of the clothing article, allowing it to be followed by autonomous folding from category level specifications of fold sequences. Our approach is also able to recover the configuration of a clothing article when folds are being introduced, an important feature towards closing the perception-action loop. Additionally, our approach provides a reliable method of shape-based classification, simply by examining which model yields the best fit. Our experiments illustrate the effectiveness of our approach on a large set of clothing articles. Furthermore, we present an end-to-end system, which starts from an unknown spread-out clothing article, performs a parametrized model fit, then follows a category-level (rather than article specific) set of folding instructions, closing the loop with perceptual feedback by re-fitting between folds.

IROS Conference 2011 Conference Paper

Perception for the manipulation of socks

  • Ping Chuan Wang
  • Stephen Miller
  • Mario Fritz
  • Trevor Darrell
  • Pieter Abbeel

We consider the perceptual challenges inherent in the robotic manipulation of previously unseen socks, with the end goal of manipulation by a household robot for laundry. The task poses challenging problems in modeling the appearance, shape and configuration of these textile items that tend to exhibit high variability in texture, design, and style while being highly articulated objects.

IROS Conference 2011 Conference Paper

Practical 3-D object detection using category and instance-level appearance models

  • Kate Saenko
  • Sergey Karayev
  • Yangqing Jia
  • Alex Shyr
  • Allison Janoch
  • Jonathan Long
  • Mario Fritz
  • Trevor Darrell


NeurIPS Conference 2010 Conference Paper

Factorized Latent Spaces with Structured Sparsity

  • Yangqing Jia
  • Mathieu Salzmann
  • Trevor Darrell

Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multi-view learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize over existing approaches in that they allow: having latent dimensions shared between any subset of the views instead of between all the views only. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation.

NeurIPS Conference 2010 Conference Paper

Size Matters: Metric Visual Search Constraints from Monocular Metadata

  • Mario Fritz
  • Kate Saenko
  • Trevor Darrell

Metric constraints are known to be highly discriminative for many objects, but if training is limited to data captured from a particular 3-D sensor the quantity of training data may be severely limited. In this paper, we show how a crucial aspect of 3-D information, object and feature absolute size, can be added to models learned from commonly available online imagery, without use of any 3-D sensing or reconstruction at training time. Such models can be utilized at test time together with explicit 3-D sensing to perform robust search. Our model uses a “2.1D” local feature, which combines traditional appearance gradient statistics with an estimate of average absolute depth within the local window. We show how category size information can be obtained from online images by exploiting relatively ubiquitous metadata fields specifying camera intrinsics. We develop an efficient metric branch-and-bound algorithm for our search task, imposing 3-D size constraints as part of an optimal search for a set of features which indicate the presence of a category. Experiments on test scenes captured with a traditional stereo rig are shown, exploiting training data from purely monocular sources with associated EXIF metadata.

NeurIPS Conference 2009 Conference Paper

An Additive Latent Feature Model for Transparent Object Recognition

  • Mario Fritz
  • Gary Bradski
  • Sergey Karayev
  • Trevor Darrell
  • Michael Black

Existing methods for recognition of object instances and categories based on quantized local features can perform poorly when local features exist on transparent surfaces, such as glass or plastic objects. There are characteristic patterns to the local appearance of transparent objects, but they may not be well captured by distances to individual examples or by a local pattern codebook obtained by vector quantization. The appearance of a transparent patch is determined in part by the refraction of a background pattern through a transparent medium: the energy from the background usually dominates the patch appearance. We model transparent local patch appearance using an additive model of latent factors: background factors due to scene content, and factors which capture a local edge energy distribution characteristic of the refraction. We implement our method using a novel LDA-SIFT formulation which performs LDA prior to any vector quantization step; we discover latent topics which are characteristic of particular transparent patches and quantize the SIFT space into transparent visual words according to the latent topic dimensions. No knowledge of the background scene is required at test time; we show examples recognizing transparent glasses in a domestic environment.

NeurIPS Conference 2009 Conference Paper

Filtering Abstract Senses From Image Search Results

  • Kate Saenko
  • Trevor Darrell

We propose an unsupervised method that, given a word, automatically selects non-abstract senses of that word from an online ontology and generates images depicting the corresponding entities. When faced with the task of learning a visual model based only on the name of an object, a common approach is to find images on the web that are associated with the object name, and then train a visual classifier from the search result. As words are generally polysemous, this approach can lead to relatively noisy models if many examples due to outlier senses are added to the model. We argue that images associated with an abstract word sense should be excluded when training a visual classifier to learn a model of a physical object. While image clustering can group together visually coherent sets of returned images, it can be difficult to distinguish whether an image cluster relates to a desired object or to an abstract sense of the word. We propose a method that uses both image features and the text associated with the images to relate latent topics to particular senses. Our model does not require any human supervision, and takes as input only the name of an object category. We show results of retrieving concrete-sense images in two available multimodal, multi-sense databases, as well as experiment with object classifiers trained on concrete-sense images returned by our method for a set of ten common office objects.

NeurIPS Conference 2009 Conference Paper

Learning to Hash with Binary Reconstructive Embeddings

  • Brian Kulis
  • Trevor Darrell

Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques.
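A brute-force illustration of the reconstruction objective described above: choose hash functions so that normalized Hamming distances between binary codes match the original input-space distances. The linear hash parameterization, scaling, and the exhaustive pairwise loop are illustrative assumptions; the paper optimizes this kind of objective with a scalable coordinate-descent procedure rather than by enumeration.

```python
import numpy as np

def hash_codes(X, W):
    return (X @ W > 0).astype(float)                      # b-bit binary codes

def reconstruction_error(X, W):
    B = hash_codes(X, W)
    b = W.shape[1]
    n = len(X)
    err = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_orig = 0.5 * np.sum((X[i] - X[j]) ** 2)     # input-space distance
            d_hamm = np.sum(B[i] != B[j]) / b             # normalized Hamming distance
            err += (d_orig - d_hamm) ** 2
    return err
```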

NeurIPS Conference 2008 Conference Paper

Unsupervised Learning of Visual Sense Models for Polysemous Words

  • Kate Saenko
  • Trevor Darrell

Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionary-based approach outperforms baseline methods.

ICML Conference 2007 Conference Paper

Discriminative Gaussian process latent variable model for classification

  • Raquel Urtasun
  • Trevor Darrell

Supervised learning is difficult with high dimensional input spaces and very small training sets, but accurate classification may be possible if the data lie on a low-dimensional manifold. Gaussian Process Latent Variable Models can discover low dimensional manifolds given only a small number of examples, but learn a latent space without regard for class labels. Existing methods for discriminative manifold learning (e.g., LDA, GDA) do constrain the class distribution in the latent space, but are generally deterministic and may not generalize well with limited training data. We introduce a method for Gaussian Process Classification using latent variable models trained with discriminative priors over the latent space, which can learn a discriminative latent space from a small training set.

AIJ Journal 2007 Journal Article

Head gestures for perceptual interfaces: The role of context in improving recognition

  • Louis-Philippe Morency
  • Candace Sidner
  • Christopher Lee
  • Trevor Darrell

Head pose and gesture offer several conversational grounding cues and are used extensively in face-to-face interaction among people. To accurately recognize visual feedback, humans often use contextual knowledge from previous and current events to anticipate when feedback is most likely to occur. In this paper we describe how contextual information can be used to predict visual feedback and improve recognition of head gestures in human–computer interfaces. Lexical, prosodic, timing, and gesture features can be used to predict a user's visual feedback during conversational dialog with a robotic or virtual agent. In non-conversational interfaces, context features based on user–interface system events can improve detection of head gestures for dialog box confirmation or document browsing. Our user study with prototype gesture-based components indicates quantitative and qualitative benefits of gesture-based confirmation over conventional alternatives. Using a discriminative approach to contextual prediction and multi-modal integration, performance of head gesture detection was improved with context features even when the topic of the test set was significantly different than the training set.

JMLR Journal 2007 Journal Article

The Pyramid Match Kernel: Efficient Learning with Sets of Features

  • Kristen Grauman
  • Trevor Darrell

In numerous domains it is useful to represent a single example by the set of the local features or parts that comprise it. However, this representation poses a challenge to many conventional machine learning techniques, since sets may vary in cardinality and elements lack a meaningful ordering. Kernel methods can learn complex functions, but a kernel over unordered set inputs must somehow solve for correspondences---generally a computationally expensive task that becomes impractical for large set sizes. We present a new fast kernel function called the pyramid match that measures partial match similarity in time linear in the number of features. The pyramid match maps unordered feature sets to multi-resolution histograms and computes a weighted histogram intersection in order to find implicit correspondences based on the finest resolution histogram cell where a matched pair first appears. We show the pyramid match yields a Mercer kernel, and we prove bounds on its error relative to the optimal partial matching cost. We demonstrate our algorithm on both classification and regression tasks, including object recognition, 3-D human pose inference, and time of publication estimation for documents, and we show that the proposed method is accurate and significantly more efficient than current approaches.
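A toy illustration of the pyramid match on one-dimensional feature sets, assuming features scaled to [0, 1]: build histograms at successively coarser resolutions and count only the new matches found at each level, weighted by bin fineness. Real feature sets are multi-dimensional; the bin structure here is a deliberate simplification.

```python
import numpy as np

def pyramid_match(x, y, levels=4, lo=0.0, hi=1.0):
    prev = 0.0
    K = 0.0
    for level in range(levels):
        bins = 2 ** (levels - level)                 # finest resolution first
        hx, _ = np.histogram(x, bins=bins, range=(lo, hi))
        hy, _ = np.histogram(y, bins=bins, range=(lo, hi))
        inter = np.minimum(hx, hy).sum()             # matches found at this resolution
        K += (1.0 / 2 ** level) * (inter - prev)     # weight new matches by bin fineness
        prev = inter
    return K

print(pyramid_match(np.random.rand(30), np.random.rand(40)))
```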

NeurIPS Conference 2006 Conference Paper

Approximate Correspondences in High Dimensions

  • Kristen Grauman
  • Trevor Darrell

Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel.

NeurIPS Conference 2004 Conference Paper

Conditional Random Fields for Object Recognition

  • Ariadna Quattoni
  • Michael Collins
  • Trevor Darrell

We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e., local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

IROS Conference 2002 Conference Paper

Bayesian network for online global pose estimation

  • Ali Rahimi
  • Trevor Darrell

We cast the location estimation problem in vision-based robotic navigation in a Bayesian framework. We derive an efficient online algorithm for updating the trajectory of a robot as new frames of data become available. For each new frame, the algorithm computes the pose of the robot relative to past frames and combines these relative pose changes to obtain a robust estimate of its trajectory. The complexity of this algorithm grows linearly with the number of frames so far processed. Since it is effectively tracking against an appearance-based map, our algorithm provides consistent results in circular environments, where the robot returns to places already visited.

NeurIPS Conference 2002 Conference Paper

Location Estimation with a Differential Update Network

  • Ali Rahimi
  • Trevor Darrell

Given a set of hidden variables with an a-priori Markov structure, we derive an online algorithm which approximately updates the posterior as pairwise measurements between the hidden variables become available. The update is performed using Assumed Density Filtering: to incorporate each pairwise measurement, we compute the optimal Markov structure which represents the true posterior and use it as a prior for incorporating the next measurement. We demonstrate the resulting algorithm by calculating globally consistent trajectories of a robot as it navigates along a 2D trajectory. To update a trajectory of length t, the update takes O(t). When all conditional distributions are linear-Gaussian, the algorithm can be thought of as a Kalman Filter which simplifies the state covariance matrix after incorporating each measurement.

NeurIPS Conference 2002 Conference Paper

Recovering Articulated Model Topology from Observed Rigid Motion

  • Leonid Taycher
  • John Fisher III
  • Trevor Darrell

Accurate representation of articulated motion is a challenging problem for machine perception. Several successful tracking algorithms have been developed that model the human body as an articulated tree. We propose a learning-based method for creating such articulated models from observations of multiple rigid motions. This paper is concerned with recovering the topology of the articulated model when the rigid motion of constituent segments is known. Our approach is based on finding the Maximum Likelihood tree-shaped factorization of the joint probability density function (PDF) of rigid segment motions. The topology of the graphical model formed from this factorization corresponds to the topology of the underlying articulated body. We demonstrate the performance of our algorithm on both synthetic and real motion capture data.

NeurIPS Conference 2000 Conference Paper

Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

  • John Fisher III
  • Trevor Darrell
  • William Freeman
  • Paul Viola

People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low level, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.

NeurIPS Conference 1998 Conference Paper

Example-Based Image Synthesis of Articulated Figures

  • Trevor Darrell

We present a method for learning complex appearance mappings, such as occur with images of articulated objects. Traditional interpolation networks fail on this case since appearance is not necessarily a smooth function nor a linear manifold for articulated objects. We define an appearance mapping from examples by constructing a set of independently smooth interpolation networks; these networks can cover overlapping regions of parameter space. A set growing procedure is used to find example clusters which are well-approximated within their convex hull; interpolation then proceeds only within these sets of examples. With this method physically valid images are produced even in regions of parameter space where nearby examples have different appearances. We show results generating both simulated and real arm images.

NeurIPS Conference 1995 Conference Paper

Active Gesture Recognition using Learned Visual Attention

  • Trevor Darrell
  • Alex Pentland

We have developed a foveated gesture recognition system that runs in an unconstrained office environment with an active camera. Using vision routines previously implemented for an interactive environment, we determine the spatial location of salient body parts of a user and guide an active camera to obtain images of gestures or expressions. A hidden-state reinforcement learning paradigm is used to implement visual attention. The attention module selects targets to foveate based on the goal of successful recognition, and uses a new multiple-model Q-learning formulation. Given a set of target and distractor gestures, our system can learn where to foveate to maximally discriminate a particular gesture.

NeurIPS Conference 1994 Conference Paper

Correlation and Interpolation Networks for Real-time Expression Analysis/Synthesis

  • Trevor Darrell
  • Irfan Essa
  • Alex Pentland

We describe a framework for real-time tracking of facial expressions that uses neurally-inspired correlation and interpolation methods. A distributed view-based representation is used to characterize facial state, and is computed using a replicated correlation network. The ensemble response of the set of view correlation scores is input to a network based interpolation method, which maps perceptual state to motor control states for a simulated 3-D face model. Activation levels of the motor state correspond to muscle activations in an anatomically derived model. By integrating fast and robust 2-D processing with 3-D models, we obtain a system that is able to quickly track and interpret complex facial motions in real-time.

NeurIPS Conference 1993 Conference Paper

Classifying Hand Gestures with a View-Based Distributed Representation

  • Trevor Darrell
  • Alex Pentland

We present a method for learning, tracking, and recognizing human hand gestures recorded by a conventional CCD camera without any special gloves or other sensors. A view-based representation is used to model aspects of the hand relevant to the trained gestures, and is found using an unsupervised clustering technique. We use normalized correlation networks, with dynamic time warping in the temporal domain, as a distance function for unsupervised clustering. Views are computed separably for space and time dimensions; the distributed response of the combination of these units characterizes the input data with a low dimensional representation. A supervised classification stage uses labeled outputs of the spatio-temporal units as training data. Our system can correctly classify gestures in real time with a low-cost image processing accelerator.

NeurIPS Conference 1991 Conference Paper

Against Edges: Function Approximation with Multiple Support Maps

  • Trevor Darrell
  • Alex Pentland

Networks for reconstructing a sparse or noisy function often use an edge field to segment the function into homogeneous regions. This approach assumes that these regions do not overlap or have disjoint parts, which is often false. For example, images which contain regions split by an occluding object can't be properly reconstructed using this type of network. We have developed a network that overcomes these limitations, using support maps to represent the segmentation of a signal. In our approach, the support of each region in the signal is explicitly represented. Results from an initial implementation demonstrate that this method can reconstruct images and motion sequences which contain complicated occlusion.