Arrow Research search

Author name cluster

Antonio Torralba

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

49 papers
1 author row

Possible papers

49

AAAI Conference 2026 Conference Paper

VirtualEnv: A Platform for Embodied AI Research

  • Kabir Swain
  • Sijie Han
  • Ayush Raina
  • Jin Zhang
  • Shuang Li
  • Michael Stopa
  • Antonio Torralba

As large language models (LLMs) continue to improve in reasoning and decision-making, there is a growing need for realistic and interactive environments where their abilities can be rigorously evaluated. We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5 that enables fine-grained benchmarking of LLMs in embodied and interactive scenarios. VirtualEnv supports rich agent–environment interactions, including object manipulation, navigation, and adaptive multi-agent collaboration, as well as game-inspired mechanics like escape rooms and procedurally generated environments. We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents using natural language instructions. We integrate large-scale LLMs and vision-language models (VLMs), such as GPT-based models, to generate novel environments and structured tasks from multimodal inputs. Our experiments benchmark the performance of several popular LLMs across tasks of increasing complexity, analyzing differences in adaptability, planning, and multi-agent coordination. We also describe our methodology for procedural task generation, task validation, and real-time environment control. By releasing VirtualEnv as an open-source platform, we aim to advance research at the intersection of AI and gaming, enable standardized evaluation of LLMs in embodied AI settings, and pave the way for future developments in immersive simulations and interactive entertainment.
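
The snippet below is a purely hypothetical usage sketch of what driving an agent through such an API could look like. VirtualEnv's real interface is not documented here, so every name (the environment class, the reset/step methods, the canned policy) is an assumption for illustration only.

```python
# Hypothetical usage sketch only: VirtualEnv's actual API may differ.
# Every name below is an assumption, not the platform's documented interface.
from dataclasses import dataclass

@dataclass
class Observation:
    description: str              # natural-language scene summary
    image_path: str | None = None

class FakeVirtualEnv:
    """Stand-in for an Unreal-Engine-backed environment server."""
    def __init__(self, task: str):
        self.task, self.t = task, 0

    def reset(self) -> Observation:
        self.t = 0
        return Observation(f"Start of task: {self.task}")

    def step(self, action: str) -> tuple[Observation, bool]:
        self.t += 1
        return Observation(f"After '{action}' (step {self.t})"), self.t >= 3

def llm_policy(obs: Observation) -> str:
    # A real agent would query an LLM with the observation; we return a
    # canned natural-language command instead.
    return "open the door"

env = FakeVirtualEnv(task="escape the room")
obs, done = env.reset(), False
while not done:
    obs, done = env.step(llm_policy(obs))
```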

NeurIPS Conference 2025 Conference Paper

Ambient Diffusion Omni: Training Good Models with Bad Data

  • Giannis Daras
  • Adrian Rodriguez-Munoz
  • Adam Klivans
  • Antonio Torralba
  • Constantinos Daskalakis

We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from arbitrarily corrupted images during training. Our framework exploits two properties of natural images -- spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We use our framework to achieve state-of-the-art ImageNet FID and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.
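
A minimal sketch of the core idea as we read it, with placeholder tensors and a toy linear forward process (not the authors' code): low-quality images enter the training batch only at large diffusion times, where the added noise masks their corruption.

```python
import torch

def make_batch(clean, lowq, t, t_min_lowq=0.4):
    """Admit low-quality images only when diffusion time t is large enough."""
    parts = [clean] if t < t_min_lowq else [clean, lowq]
    return torch.cat(parts, dim=0)

clean = torch.randn(8, 3, 32, 32)    # curated images (placeholder data)
lowq = torch.randn(8, 3, 32, 32)     # blurred / compressed images (placeholder)

t = torch.rand(()).item()            # diffusion time in [0, 1]
x0 = make_batch(clean, lowq, t)
noise = torch.randn_like(x0)
x_t = (1 - t) * x0 + t * noise       # toy linear-interpolant forward process
```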

NeurIPS Conference 2025 Conference Paper

Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

  • Christy Li
  • Josep Lopez Camuñas
  • Jake Touchet
  • Jacob Andreas
  • Agata Lapedriza
  • Antonio Torralba
  • Tamar Rott Shaham

When a vision model performs image recognition, which visual attributes drive its predictions? Detecting unintended reliance on specific visual features is critical for ensuring model robustness, preventing overfitting, and avoiding spurious correlations. We introduce an automated framework for detecting such dependencies in trained vision models. At the core of our method is a self-reflective agent that systematically generates and tests hypotheses about visual attributes that a model may rely on. This process is iterative: the agent refines its hypotheses based on experimental outcomes and uses a self-evaluation protocol to assess whether its findings accurately explain model behavior. When inconsistencies arise, the agent self-reflects over its findings and triggers a new cycle of experimentation. We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent's performance consistently improves with self-reflection, with a significant performance increase over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP's vision encoder and the YOLOv8 object detector.
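
A schematic of the hypothesize-test-reflect loop the abstract describes; the helper functions below are placeholders standing in for the LLM agent's components, not the paper's implementation.

```python
# Placeholders standing in for the agent's LLM-backed components.
def propose_hypotheses(history):
    return ["the model relies on background color"]

def run_experiment(model, hypothesis):
    # e.g., edit the candidate attribute in probe images and measure how
    # much the model's predictions change
    return {"hypothesis": hypothesis, "effect": 0.42}

def self_evaluate(findings):
    # check whether the findings explain held-out model behavior
    return all(f["effect"] > 0.3 for f in findings)

def detect_reliance(model, max_rounds=5):
    history, findings = [], []
    for _ in range(max_rounds):
        for h in propose_hypotheses(history):
            findings.append(run_experiment(model, h))
        if self_evaluate(findings):      # consistent: report and stop
            return findings
        history.append(findings[-1])     # self-reflect, trigger a new cycle
    return findings

report = detect_reliance(model=None)     # any trained vision model would go here
```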

NeurIPS Conference 2025 Conference Paper

Dataset Distillation for Pre-Trained Self-Supervised Vision Models

  • George Cazenavette
  • Antonio Torralba
  • Vincent Sitzmann

The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that distilled datasets provide a valuable tool for model interpretability, predicting, among other things, how similar two models' representation spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
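
A minimal PyTorch sketch of the gradient-matching objective under stated assumptions (a toy frozen backbone and random data stand in for a real pre-trained model and dataset): the synthetic images are optimized so that the gradient they induce in a linear probe aligns with the gradient from real data.

```python
import torch
import torch.nn.functional as F

backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
for p in backbone.parameters():
    p.requires_grad_(False)              # frozen stand-in for a pre-trained extractor

W = torch.zeros(10, 128, requires_grad=True)              # linear probe weights
real_x, real_y = torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))
syn_x = torch.randn(10, 3, 32, 32, requires_grad=True)    # the distilled images
syn_y = torch.arange(10)

# gradient the real data induces in the probe (fixed matching target)
g_real = torch.autograd.grad(F.cross_entropy(backbone(real_x) @ W.T, real_y), W)[0]

opt = torch.optim.Adam([syn_x], lr=0.1)
for _ in range(100):
    g_syn = torch.autograd.grad(
        F.cross_entropy(backbone(syn_x) @ W.T, syn_y), W, create_graph=True)[0]
    loss = 1 - F.cosine_similarity(g_real.flatten(), g_syn.flatten(), dim=0)
    opt.zero_grad()
    loss.backward()
    opt.step()
```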

NeurIPS Conference 2025 Conference Paper

Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos

  • hanxue liang
  • Jiawei Ren
  • Ashkan Mirzaei
  • Antonio Torralba
  • Ziwei Liu
  • Igor Gilitschenski
  • Sanja Fidler
  • Cengiz Oztireli

Recent advancements in static feed-forward scene reconstruction have demonstrated significant progress in high-quality novel view synthesis. However, these models often struggle with generalizability across diverse environments and fail to effectively handle dynamic content. We present BTimer (short for Bullet Timer), the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes. Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target (‘bullet’) timestamp by aggregating information from all the context frames. Such a formulation allows BTimer to gain scalability and generalization by leveraging both static and dynamic scene datasets. Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150ms while reaching state-of-the-art performance on both static and dynamic scene datasets, even compared with optimization-based approaches.

NeurIPS Conference 2025 Conference Paper

LoRA vs Full Fine-tuning: An Illusion of Equivalence

  • Reece Shuttleworth
  • Jacob Andreas
  • Antonio Torralba
  • Pratyusha Sharma

Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to effectively fine-tune LLMs with an extreme reduction in trainable parameters. But are their learned solutions really equivalent? We study how LoRA and full fine-tuning change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure: weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions, while those trained with full fine-tuning do not. Further, we extend the finding that LoRA forgets less than full fine-tuning and find that its forgetting is largely localized to the intruder dimensions: by causally intervening on the intruder dimensions, changing their associated singular values post-fine-tuning, we show that they cause forgetting. Moreover, scaling them down significantly improves modeling of the pre-training distribution with a minimal drop in downstream task performance. Given this, we should expect accumulating intruder dimensions to be harmful and to lead to more forgetting. This effect is amplified during continual learning, where models are fine-tuned sequentially, and we show that LoRA models that accumulate intruder dimensions in this setting tend to perform worse, emphasizing the practicality of our findings.
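
A small sketch of one way to surface such intruder dimensions, with synthetic weight matrices and an illustrative similarity threshold (both assumptions): compare each singular vector of the fine-tuned matrix against its best match among the pre-trained singular vectors.

```python
import torch

W_pre = torch.randn(512, 512)                                        # stand-in weights
W_ft = W_pre + 0.1 * (torch.randn(512, 16) @ torch.randn(16, 512))   # LoRA-style update

U_pre, _, _ = torch.linalg.svd(W_pre)
U_ft, _, _ = torch.linalg.svd(W_ft)

# similarity of each fine-tuned left singular vector to its best match
# among the pre-trained singular vectors
sims = (U_ft.T @ U_pre).abs().max(dim=1).values
intruders = (sims < 0.6).nonzero().squeeze(-1)   # indices are singular-value ranks
print(len(intruders), "candidate intruder dimensions")
```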

NeurIPS Conference 2025 Conference Paper

Single-pass Adaptive Image Tokenization for Minimum Program Search

  • Shivam Duggal
  • Sanghyun Byun
  • Bill Freeman
  • Antonio Torralba
  • Phillip Isola

According to Algorithmic Information Theory (AIT), intelligent representations compress data into the shortest possible program while remaining predictive of its content—exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems assign fixed-length representations to all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple hypotheses to identify the most predictive one. Inspired by KC principles, we propose a one-shot adaptive tokenizer, KARL, that predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL performs comparably to recent adaptive tokenizers while operating in a one-pass manner. Additionally, we present a conceptual study showing a correlation between adaptive tokenization and core ideas from AIT. We demonstrate that adaptive tokenization not only aligns with KC but also reveals empirical signals approximating AIT concepts such as sophistication and logical depth. Finally, we analyze predicted image complexity and interestingness across axes such as structure vs. noise and in-distribution vs. out-of-distribution familiarity, highlighting alignment with human annotations.

NeurIPS Conference 2024 Conference Paper

L4GM: Large 4D Gaussian Reconstruction Model

  • Jiawei Ren
  • Kevin Xie
  • Ashkan Mirzaei
  • hanxue liang
  • Xiaohui Zeng
  • Karsten Kreis
  • Ziwei Liu
  • Antonio Torralba

We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input -- in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep L4GM simple for scalability and build directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian splat representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and utilize a per-timestep multiview rendering loss to train the model. The representation is upsampled to a higher framerate by training an interpolation model which produces intermediate 3D Gaussian representations. We showcase that L4GM, although trained only on synthetic data, generalizes well to in-the-wild videos, producing high-quality animated 3D assets.

NeurIPS Conference 2023 Conference Paper

3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes

  • Haotian Xue
  • Antonio Torralba
  • Josh Tenenbaum
  • Dan Yamins
  • Yunzhu Li
  • Hsiao-Yu Tung

Given a visual scene, humans have strong intuitions about how a scene can evolve over time under given actions. The intuition, often termed visual intuitive physics, is a critical ability that allows us to make effective plans to manipulate the scene to achieve desired outcomes without relying on extensive trial and error. In this paper, we present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids. Our method is composed of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, which allows us to impose strong relational and structural inductive bias to capture the structure of the underlying environment. Unlike existing point-based intuitive dynamics works that rely on supervision of dense point trajectories from simulators, we relax the requirements and only assume access to multi-view RGB images and (imperfect) instance masks acquired using a color prior. This enables the proposed model to handle scenarios where accurate point estimation and tracking are hard or impossible. We generate datasets including three challenging scenarios involving fluid, granular materials, and rigid objects in simulation. The datasets include no dense particle information, so most previous 3D-based intuitive physics pipelines cannot be applied to them. We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space. We also show that, once trained, our model achieves strong generalization in complex scenarios under extrapolation settings.

NeurIPS Conference 2023 Conference Paper

FIND: A Function Description Benchmark for Evaluating Interpretability Methods

  • Sarah Schwettmann
  • Tamar Shaham
  • Joanna Materzynska
  • Neil Chowdhury
  • Shuang Li
  • Jacob Andreas
  • David Bau
  • Antonio Torralba

Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate methods that use pretrained language models (LMs) to produce code-based and natural language descriptions of function behavior. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built with an off-the-shelf LM augmented with black-box access to functions, can sometimes infer function structure—acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, FIND also reveals that LM-based descriptions capture global function behavior while missing local details. These results suggest that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.

NeurIPS Conference 2023 Conference Paper

Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects

  • Tianhang Cheng
  • Wei-Chiu Ma
  • Kaiyu Guan
  • Antonio Torralba
  • Shenlong Wang

Our world is full of identical objects (e.g., cans of Coke, cars of the same model). These duplicates, when seen together, provide additional and strong cues for us to effectively reason about 3D. Inspired by this observation, we introduce Structure from Duplicates (SfD), a novel inverse graphics framework that reconstructs geometry, material, and illumination from a single image containing multiple identical objects. SfD begins by identifying multiple instances of an object within an image, and then jointly estimates the 6-DoF pose for all instances. An inverse graphics pipeline is subsequently employed to jointly reason about the shape and material of the object and the environment light, while adhering to the shared geometry and material constraint across instances. Our primary contributions involve utilizing object duplicates as a robust prior for single-image inverse graphics and proposing an in-plane rotation-robust Structure from Motion (SfM) formulation for joint 6-DoF object pose estimation. By leveraging multi-view cues from a single image, SfD generates more realistic and detailed 3D reconstructions, significantly outperforming existing single-image reconstruction models and multi-view reconstruction approaches with a similar or greater number of observations.

NeurIPS Conference 2022 Conference Paper

ActionSense: A Multimodal Dataset and Recording Framework for Human Activities Using Wearable Sensors in a Kitchen Environment

  • Joseph DelPreto
  • Chao Liu
  • Yiyue Luo
  • Michael Foshey
  • Yunzhu Li
  • Antonio Torralba
  • Wojciech Matusik
  • Daniela Rus

This paper introduces ActionSense, a multimodal dataset and recording framework with an emphasis on wearable sensing in a kitchen environment. It provides rich, synchronized data streams along with ground truth data to facilitate learning pipelines that could extract insights about how humans interact with the physical world during activities of daily living, and help lead to more capable and collaborative robot assistants. The wearable sensing suite captures motion, force, and attention information; it includes eye tracking with a first-person camera, forearm muscle activity sensors, a body-tracking system using 17 inertial sensors, finger-tracking gloves, and custom tactile sensors on the hands that use a matrix of conductive threads. This is coupled with activity labels and with externally-captured data from multiple RGB cameras, a depth camera, and microphones. The specific tasks recorded in ActionSense are designed to highlight lower-level physical skills and higher-level scene reasoning or action planning. They include simple object manipulations (e.g., stacking plates), dexterous actions (e.g., peeling or cutting vegetables), and complex action sequences (e.g., setting a table or loading a dishwasher). The resulting dataset and underlying experiment framework are available at https://action-sense.csail.mit.edu. Preliminary networks and analyses explore modality subsets and cross-modal correlations. ActionSense aims to support applications including learning from demonstrations, dexterous robot control, cross-modal predictions, and fine-grained action segmentation. It could also help inform the next generation of smart textiles that may one day unobtrusively send rich data streams to in-home collaborative or autonomous robot assistants.

NeurIPS Conference 2022 Conference Paper

Learning Neural Acoustic Fields

  • Andrew Luo
  • Yilun Du
  • Michael Tarr
  • Josh Tenenbaum
  • Antonio Torralba
  • Chuang Gan

Our environment is filled with rich and dynamic acoustic information. When we walk into a cathedral, the reverberations as much as appearance inform us of the sanctuary's wide open space. Similarly, as an object moves around us, we expect the sound emitted to also exhibit this movement. While recent advances in learned implicit functions have led to increasingly higher quality representations of the visual world, there have not been commensurate advances in learning spatial auditory representations. To address this gap, we introduce Neural Acoustic Fields (NAFs), an implicit representation that captures how sounds propagate in a physical scene. By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs to a neural impulse response function that can then be applied to arbitrary sounds. We demonstrate NAFs on both synthetic and real data, and show that the continuous nature of NAFs enables us to render spatial acoustics for a listener at arbitrary locations. We further show that the representation learned by NAFs can help improve visual learning with sparse views. Finally we show that a representation informative of scene structure emerges during the learning of NAFs.
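
A toy sketch of the interface such an implicit acoustic field exposes, with an assumed MLP architecture and impulse-response length (both placeholders): map an (emitter, listener) location pair to an impulse response, then convolve it with an arbitrary dry signal.

```python
import torch
import torch.nn as nn

class AcousticField(nn.Module):
    """Maps an (emitter, listener) location pair to an impulse response."""
    def __init__(self, ir_len=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, ir_len))

    def forward(self, emitter_xyz, listener_xyz):
        return self.mlp(torch.cat([emitter_xyz, listener_xyz], dim=-1))

naf = AcousticField()
ir = naf(torch.tensor([[0.0, 0.0, 1.0]]), torch.tensor([[2.0, 1.0, 1.0]]))  # (1, 4096)

dry = torch.randn(1, 1, 16000)   # one second of anechoic source audio (placeholder)
wet = nn.functional.conv1d(dry, ir.flip(-1).unsqueeze(0), padding=ir.shape[-1] - 1)
```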

NeurIPS Conference 2022 Conference Paper

Pre-Trained Language Models for Interactive Decision-Making

  • Shuang Li
  • Xavier Puig
  • Chris Paxton
  • Yilun Du
  • Clinton Wang
  • Linxi Fan
  • Tao Chen
  • De-An Huang

Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. Next, we integrate an active data gathering procedure in which agents iteratively interact with the environment, relabel past "failed" experiences with new goals, and update their policies in a self-supervised loop. Active data gathering further improves combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and LM-based weight initialization are both important for generalization. Surprisingly, however, the format of the policy's input encoding (e.g., as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.

NeurIPS Conference 2022 Conference Paper

Procedural Image Programs for Representation Learning

  • Manel Baradad
  • Richard Chen
  • Jonas Wulff
  • Tongzhou Wang
  • Rogerio Feris
  • Antonio Torralba
  • Phillip Isola

Learning image representations using synthetic data allows training neural networks without some of the concerns associated with real images, such as privacy and bias. Existing work focuses on a handful of curated generative processes which require expert knowledge to design, making it hard to scale up. To overcome this, we propose training with a large dataset of twenty-one thousand programs, each one generating a diverse set of synthetic images. These programs are short code snippets, which are easy to modify and fast to execute using OpenGL. The proposed dataset can be used for both supervised and unsupervised representation learning, and reduces the gap between pre-training with real and procedurally generated images by 38%.
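
An illustrative procedural image program in the spirit described above; the released programs render with OpenGL, whereas this toy NumPy version (an assumption, not one of the actual 21,000 programs) just sums random sinusoids per channel.

```python
import numpy as np

def sine_field(seed, size=128):
    """One short program; varying the seed yields a diverse image family."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:size, 0:size] / size
    img = np.zeros((size, size, 3))
    for c in range(3):                          # a random sum of waves per channel
        for _ in range(rng.integers(2, 6)):
            fx, fy = rng.uniform(-20, 20, size=2)
            phase = rng.uniform(0, 2 * np.pi)
            img[..., c] += np.sin(2 * np.pi * (fx * xx + fy * yy) + phase)
    img -= img.min()
    return img / (img.max() + 1e-8)             # values in [0, 1]

dataset = [sine_field(s) for s in range(4)]     # a tiny synthetic "dataset"
```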

NeurIPS Conference 2021 Conference Paper

EditGAN: High-Precision Semantic Image Editing

  • Huan Ling
  • Karsten Kreis
  • Daiqing Li
  • Seung Wook Kim
  • Antonio Torralba
  • Sanja Fidler

Generative adversarial networks (GANs) have recently found applications in image editing. However, most GAN-based image editing methods often require large-scale datasets with semantic segmentation annotations for training, only provide high-level control, or merely interpolate between different images. Here, we propose EditGAN, a novel method for high-quality, high-precision semantic image editing, allowing users to edit images by modifying their highly detailed part segmentation masks, e.g., drawing a new mask for the headlight of a car. EditGAN builds on a GAN framework that jointly models images and their semantic segmentation, requiring only a handful of labeled examples – making it a scalable tool for editing. Specifically, we embed an image into the GAN’s latent space and perform conditional latent code optimization according to the segmentation edit, which effectively also modifies the image. To amortize optimization, we find “editing vectors” in latent space that realize the edits. The framework allows us to learn an arbitrary number of editing vectors, which can then be directly applied on other images at interactive rates. We experimentally show that EditGAN can manipulate images with an unprecedented level of detail and freedom while preserving full image quality. We can also easily combine multiple edits and perform plausible edits beyond EditGAN’s training data. We demonstrate EditGAN on a wide variety of image types and quantitatively outperform several previous editing methods on standard editing benchmark tasks.

NeurIPS Conference 2021 Conference Paper

Editing a classifier by rewriting its prediction rules

  • Shibani Santurkar
  • Dimitris Tsipras
  • Mahalaxmi Elango
  • David Bau
  • Antonio Torralba
  • Aleksander Madry

We propose a methodology for modifying the behavior of a classifier by directly rewriting its prediction rules. Our method requires virtually no additional data collection and can be applied to a variety of settings, including adapting a model to new environments, and modifying it to ignore spurious features.

NeurIPS Conference 2021 Conference Paper

Learning to Compose Visual Relations

  • Nan Liu
  • Shuang Li
  • Yilun Du
  • Josh Tenenbaum
  • Antonio Torralba

The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure.
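
A generic sketch of factorized composition via Langevin dynamics; the two relation energies below are toy stand-ins for learned EBMs, and the sampler hyperparameters are illustrative, not the paper's.

```python
import torch

def e_left_of(x):     # placeholder energy for "A left of B"
    return (x ** 2).sum(dim=(1, 2, 3))

def e_above(x):       # placeholder energy for "A above B"
    return ((x - 0.5) ** 2).sum(dim=(1, 2, 3))

def compose_sample(energies, steps=60, step_size=0.01, noise=0.005):
    x = torch.randn(1, 3, 64, 64, requires_grad=True)
    for _ in range(steps):
        total = sum(e(x) for e in energies).sum()   # summed energies compose relations
        grad, = torch.autograd.grad(total, x)
        with torch.no_grad():
            x = x - step_size * grad + noise * torch.randn_like(x)
        x.requires_grad_(True)
    return x.detach()

img = compose_sample([e_left_of, e_above])   # a sample satisfying both relations
```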

NeurIPS Conference 2021 Conference Paper

Learning to See by Looking at Noise

  • Manel Baradad Jurjo
  • Jonas Wulff
  • Tongzhou Wang
  • Phillip Isola
  • Antonio Torralba

Current vision systems are trained on huge datasets, and these datasets come with costs: curation is expensive, they inherit human biases, and there are concerns over privacy and usage rights. To counter these costs, interest has surged in learning from cheaper data sources, such as unlabeled images. In this paper we go a step further and ask if we can do away with real image datasets entirely, instead learning from procedural noise processes. We investigate a suite of image generation models that produce images from simple random processes. These are then used as training data for a visual representation learner with a contrastive loss. In particular, we study statistical image models, randomly initialized deep generative models, and procedural graphics models. Our findings show that it is important for the noise to capture certain structural properties of real data but that good performance can be achieved even with processes that are far from realistic. We also find that diversity is a key property to learn good representations.

NeurIPS Conference 2021 Conference Paper

Measuring Generalization with Optimal Transport

  • Ching-Yao Chuang
  • Youssef Mroueh
  • Kristjan Greenewald
  • Antonio Torralba
  • Stefanie Jegelka

Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave disparately from empirical observations. In this work, we develop margin-based generalization bounds, where the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution. In particular, the optimal transport cost can be interpreted as a generalization of variance which captures the structural properties of the learned feature space. Our bounds robustly predict the generalization error, given training data and network parameters, on large scale datasets. Theoretically, we demonstrate that the concentration and separation of features play crucial roles in generalization, supporting empirical results in the literature.
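
A sketch of the normalizing quantity under stated assumptions, using the POT library on placeholder features: the optimal transport cost between two independent random subsets drawn from the training set.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

feats = np.random.randn(1000, 64)              # placeholder learned features
idx = np.random.permutation(len(feats))
A, B = feats[idx[:400]], feats[idx[400:800]]   # two independent random subsets

M = ot.dist(A, B)                              # pairwise squared Euclidean costs
a = np.full(len(A), 1 / len(A))                # uniform weights
b = np.full(len(B), 1 / len(B))
ot_cost = ot.emd2(a, b, M)                     # exact OT cost, used to normalize margins
```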

NeurIPS Conference 2021 Conference Paper

PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning

  • Yining Hong
  • Li Yi
  • Josh Tenenbaum
  • Antonio Torralba
  • Chuang Gan

A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies. Such composite structures could induce a rich set of semantic concepts and relations, thus playing an important role in the interpretation and organization of visual signals as well as for the generalization of visual perception and reasoning. However, existing visual reasoning benchmarks mostly focus on objects rather than parts. Visual reasoning based on the full part-whole hierarchy is much more challenging than object-centric reasoning due to finer-grained concepts, richer geometry relations, and more complex physics. Therefore, to better support part-based conceptual, relational, and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR contains around 80k RGBD synthetic images with ground truth object and part level annotations regarding semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 800k machine-generated questions covering various reasoning types, making them a good testbed for visual reasoning models. We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We believe this dataset will open up new opportunities for part-based reasoning. PTR dataset and baseline models are publicly available.

NeurIPS Conference 2021 Conference Paper

ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation

  • Chuang Gan
  • Jeremy Schwartz
  • Seth Alter
  • Damian Mrowca
  • Martin Schrimpf
  • James Traer
  • Julian De Freitas
  • Jonas Kubilius

We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. TDW enables the simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments. Unique properties include real-time near-photo-realistic image rendering; a library of objects and environments, and routines for their customization; generative procedures for efficiently building classes of new environments; high-fidelity audio rendering; realistic physical interactions for a variety of material types, including cloths, liquid, and deformable objects; customizable “avatars” that embody AI agents; and support for human interactions with VR devices. TDW’s API enables multiple agents to interact within a simulation and returns a range of sensor and physics data representing the state of the world. We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, physical dynamics predictions, multi-agent interactions, models that ‘learn like a child’, and attention studies in humans and neural networks.

NeurIPS Conference 2020 Conference Paper

Causal Discovery in Physical Systems from Videos

  • Yunzhu Li
  • Antonio Torralba
  • Anima Anandkumar
  • Dieter Fox
  • Animesh Garg

Causal discovery is at the core of human cognition. It enables us to reason about the environment and make counterfactual predictions about unseen scenarios that can vastly differ from our previous experiences. We consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure. In particular, our goal is to discover the structural dependencies among environmental and object variables: inferring the type and strength of interactions that have a causal effect on the behavior of the dynamical system. Our model consists of (a) a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, (b) an inference module for determining the graph distribution induced by the detected keypoints, and (c) a dynamics module that can predict the future by conditioning on the inferred graph. We assume access to different configurations and environmental conditions, i.e., data from unknown interventions on the underlying system; thus, we can hope to discover the correct underlying causal graph without explicit interventions. We evaluate our method in a planar multi-body interaction environment and scenarios involving fabrics of different shapes like shirts and pants. Experiments demonstrate that our model can correctly identify the interactions from a short sequence of images and make long-term future predictions. The causal structure assumed by the model also allows it to make counterfactual predictions and extrapolate to systems of unseen interaction graphs or graphs of various sizes.

NeurIPS Conference 2020 Conference Paper

Debiased Contrastive Learning

  • Ching-Yao Chuang
  • Joshua Robinson
  • Yen-Chen Lin
  • Antonio Torralba
  • Stefanie Jegelka

A prominent technique for self-supervised representation learning has been to contrast semantically similar and dissimilar pairs of samples. Without access to labels, dissimilar (negative) points are typically taken to be randomly sampled datapoints, implicitly accepting that these points may, in reality, actually have the same label. Perhaps unsurprisingly, we observe that sampling negative examples from truly different labels improves performance, in a synthetic setting where labels are available. Motivated by this observation, we develop a debiased contrastive objective that corrects for the sampling of same-label datapoints, even without knowledge of the true labels. Empirically, the proposed objective consistently outperforms the state-of-the-art for representation learning in vision, language, and reinforcement learning benchmarks. Theoretically, we establish generalization bounds for the downstream classification task.
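
A sketch of the debiased estimator as we read it (shapes and constants are illustrative): the negative term is corrected for the class prior tau_plus, i.e., the probability that a randomly drawn "negative" actually shares the anchor's label, and clamped from below.

```python
import math
import torch

def debiased_contrastive_loss(pos_sim, neg_sim, tau_plus=0.1, t=0.5):
    """pos_sim: (B,) anchor-positive similarities; neg_sim: (B, N) negatives."""
    pos = torch.exp(pos_sim / t)
    neg = torch.exp(neg_sim / t)
    N = neg.shape[1]
    # corrected estimate of the true-negative term, clamped from below
    g = ((neg.mean(dim=1) - tau_plus * pos) / (1 - tau_plus)).clamp(min=math.exp(-1 / t))
    return -torch.log(pos / (pos + N * g)).mean()

loss = debiased_contrastive_loss(torch.rand(32), torch.rand(32, 64))
```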

NeurIPS Conference 2018 Conference Paper

3D-Aware Scene Manipulation via Inverse Graphics

  • Shunyu Yao
  • Tzu Ming Hsu
  • Jun-Yan Zhu
  • Jiajun Wu
  • Antonio Torralba
  • Bill Freeman
  • Josh Tenenbaum

We aim to obtain an interpretable, expressive, and disentangled scene representation that contains comprehensive structural and textural information for each object. Previous scene representations learned by neural networks are often uninterpretable, limited to a single object, or lacking 3D knowledge. In this work, we propose 3D scene de-rendering networks (3D-SDN) to address the above issues by integrating disentangled representations for semantics, geometry, and appearance into a deep generative model. Our scene encoder performs inverse graphics, translating a scene into a structured object-wise representation. Our decoder has two components: a differentiable shape renderer and a neural texture generator. The disentanglement of semantics, geometry, and appearance supports 3D-aware scene manipulation, e.g., rotating and moving objects freely while keeping shape and texture consistent, and changing the object appearance without affecting its shape. Experiments demonstrate that our editing scheme based on 3D-SDN is superior to its 2D counterpart.

NeurIPS Conference 2018 Conference Paper

Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

  • Kexin Yi
  • Jiajun Wu
  • Chuang Gan
  • Antonio Torralba
  • Pushmeet Kohli
  • Josh Tenenbaum

We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs on a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it performs well after learning from a small amount of training data; it can also encode an image into a compact representation, requiring less storage than existing methods for offline question answering. Third, symbolic program execution offers full transparency to the reasoning process; we are thus able to interpret and diagnose each execution step.
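
A toy symbolic executor in this spirit; the scene representation and program trace below are made up for illustration, not the output of the paper's CLEVR parsers.

```python
scene = [
    {"shape": "cube",   "color": "red",  "size": "large"},
    {"shape": "sphere", "color": "red",  "size": "small"},
    {"shape": "cube",   "color": "blue", "size": "small"},
]

def filter_attr(objs, attr, value):
    return [o for o in objs if o[attr] == value]

# program trace for "How many red objects are there?"
program = [("filter", "color", "red"), ("count",)]

state = scene
for op, *args in program:
    state = filter_attr(state, *args) if op == "filter" else len(state)
print(state)   # -> 2
```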

NeurIPS Conference 2018 Conference Paper

Visual Object Networks: Image Generation with Disentangled 3D Representations

  • Jun-Yan Zhu
  • Zhoutong Zhang
  • Chengkai Zhang
  • Jiajun Wu
  • Antonio Torralba
  • Josh Tenenbaum
  • Bill Freeman

Recent progress in deep generative models has led to tremendous breakthroughs in image generation. While being able to synthesize photorealistic images, existing models lack an understanding of our underlying 3D world. Different from previous works built on 2D datasets and models, we present a new generative model, Visual Object Networks (VONs), synthesizing natural images of objects with a disentangled 3D representation. Inspired by classic graphics rendering pipelines, we unravel the image formation process into three conditionally independent factors---shape, viewpoint, and texture---and present an end-to-end adversarial learning framework that jointly models 3D shape and 2D texture. Our model first learns to synthesize 3D shapes that are indistinguishable from real shapes. It then renders the object's 2.5D sketches (i.e., silhouette and depth map) from its shape under a sampled viewpoint. Finally, it learns to add realistic textures to these 2.5D sketches to generate realistic images. The VON not only generates images that are more realistic than the state-of-the-art 2D image synthesis methods but also enables many 3D operations such as changing the viewpoint of a generated image, shape and texture editing, linear interpolation in texture and shape space, and transferring appearance across different objects and viewpoints.

NeurIPS Conference 2016 Conference Paper

Generating Videos with Scene Dynamics

  • Carl Vondrick
  • Hamed Pirsiavash
  • Antonio Torralba

We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g., action classification) and video generation tasks (e.g., future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation.

NeurIPS Conference 2016 Conference Paper

SoundNet: Learning Sound Representations from Unlabeled Video

  • Yusuf Aytar
  • Carl Vondrick
  • Antonio Torralba

We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
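
A sketch of the student-teacher transfer under stated assumptions, with toy stand-in networks: a vision teacher produces soft labels on video frames, and a 1-D sound CNN is trained to match them on the synchronized audio with a KL loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000))  # vision stand-in
student = nn.Sequential(                                             # 1-D sound CNN
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1000))

frames = torch.randn(4, 3, 64, 64)   # frames from an unlabeled video (placeholder)
audio = torch.randn(4, 1, 22050)     # the synchronized sound track (placeholder)

with torch.no_grad():
    soft_targets = F.softmax(teacher(frames), dim=1)   # the teacher's soft labels

log_probs = F.log_softmax(student(audio), dim=1)
loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
loss.backward()
```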

NeurIPS Conference 2016 Conference Paper

Unsupervised Learning of Spoken Language with Visual Context

  • David Harwath
  • Antonio Torralba
  • James Glass

Humans learn to speak before they can read or write, so why can’t computers do the same? In this paper, we present a deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images. We describe the collection of our data comprised of over 120,000 spoken audio captions for the Places image dataset and evaluate our model on an image search and annotation task. We also provide some visualizations which suggest that our model is learning to recognize meaningful words within the caption spectrograms.

NeurIPS Conference 2015 Conference Paper

Learning visual biases from human imagination

  • Carl Vondrick
  • Hamed Pirsiavash
  • Aude Oliva
  • Antonio Torralba

Although the human visual system can recognize many concepts under challenging conditions, it still has some biases. In this paper, we investigate whether we can extract these biases and transfer them into a machine recognition system. We introduce a novel method that, inspired by well-known tools in human psychophysics, estimates the biases that the human visual system might use for recognition, but in computer vision feature spaces. Our experiments are surprising, and suggest that classifiers from the human visual system can be transferred into a machine with some success. Since these classifiers seem to capture favorable biases in the human visual system, we further present an SVM formulation that constrains the orientation of the SVM hyperplane to agree with the bias from the human visual system. Our results suggest that transferring this human bias into machines may help object recognition systems generalize across datasets and perform better when very little training data is available.

NeurIPS Conference 2015 Conference Paper

Skip-Thought Vectors

  • Ryan Kiros
  • Yukun Zhu
  • Russ Salakhutdinov
  • Richard Zemel
  • Raquel Urtasun
  • Antonio Torralba
  • Sanja Fidler

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.

NeurIPS Conference 2015 Conference Paper

Where are they looking?

  • Adria Recasens
  • Aditya Khosla
  • Carl Vondrick
  • Antonio Torralba

Humans have the remarkable ability to follow the gaze of other people to identify what they are looking at. Following eye gaze, or gaze-following, is an important ability that allows us to understand what other people are thinking, the actions they are performing, and even predict what they might do next. Despite the importance of this topic, this problem has only been studied in limited scenarios within the computer vision community. In this paper, we propose a deep neural network-based approach for gaze-following and a new benchmark dataset for thorough evaluation. Given an image and the location of a head, our approach follows the gaze of the person and identifies the object being looked at. After training, the network is able to discover how to extract head pose and gaze orientation, and to select objects in the scene that are in the predicted line of sight and likely to be looked at (such as televisions, balls and food). The quantitative evaluation shows that our approach produces reliable results, even when viewing only the back of the head. While our method outperforms several baseline approaches, we are still far from reaching human performance at this task. Overall, we believe that this is a challenging and important task that deserves more attention from the community.

NeurIPS Conference 2014 Conference Paper

Learning Deep Features for Scene Recognition using Places Database

  • Bolei Zhou
  • Agata Lapedriza
  • Jianxiong Xiao
  • Antonio Torralba
  • Aude Oliva

Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.

NeurIPS Conference 2012 Conference Paper

Localizing 3D cuboids in single-view images

  • Jianxiong Xiao
  • Bryan Russell
  • Antonio Torralba

In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes. In contrast to recent approaches that rely on detecting vanishing points of the scene and grouping line segments to form cuboids, we build a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency to a 3D cuboid model. Our model is invariant to the different 3D viewpoints and aspect ratios and is able to detect cuboids across many different object categories. We introduce a database of images with cuboid annotations that spans a variety of indoor and outdoor scenes and show qualitative and quantitative results on our collected database. Our model outperforms baseline detectors that use 2D constraints alone on the task of localizing cuboid corners.

NeurIPS Conference 2012 Conference Paper

Memorability of Image Regions

  • Aditya Khosla
  • Jianxiong Xiao
  • Antonio Torralba
  • Aude Oliva

While long term human visual memory can store a remarkable amount of visual information, it tends to degrade over time. Recent works have shown that image memorability is an intrinsic property of an image that can be reliably estimated using state-of-the-art image features and machine learning algorithms. However, the class of features and image information that is forgotten has not been explored yet. In this work, we propose a probabilistic framework that models how and which local regions from an image may be forgotten using a data-driven approach that combines local and global images features. The model automatically discovers memorability maps of individual images without any human annotation. We incorporate multiple image region attributes in our algorithm, leading to improved memorability prediction of images as compared to previous works.

NeurIPS Conference 2011 Conference Paper

Learning to Learn with Compound HD Models

  • Antonio Torralba
  • Joshua Tenenbaum
  • Russ Salakhutdinov

We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.

NeurIPS Conference 2011 Conference Paper

Transfer Learning by Borrowing Examples for Multiclass Object Detection

  • Joseph Lim
  • Russ Salakhutdinov
  • Antonio Torralba

Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset.

NeurIPS Conference 2011 Conference Paper

Understanding the Intrinsic Memorability of Images

  • Phillip Isola
  • Devi Parikh
  • Antonio Torralba
  • Aude Oliva

Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of a subject's context and biases. In this paper, we used the publicly available memorability dataset of Isola et al., and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision.

NeurIPS Conference 2009 Conference Paper

Nonparametric Bayesian Texture Learning and Synthesis

  • Long Zhu
  • Yuanahao Chen
  • Bill Freeman
  • Antonio Torralba

We present a nonparametric Bayesian method for texture learning and synthesis. A texture image is represented by a 2D-Hidden Markov Model (2D-HMM) where the hidden states correspond to the cluster labeling of textons and the transition matrix encodes their spatial layout (the compatibility between adjacent textons). The 2D-HMM is coupled with the Hierarchical Dirichlet process (HDP), which allows the number of textons and the complexity of the transition matrix to grow as the input texture becomes irregular. The HDP makes use of a Dirichlet process prior which favors regular textures by penalizing model complexity. This framework (HDP-2D-HMM) learns the texton vocabulary and their spatial layout jointly and automatically. The HDP-2D-HMM results in a compact representation of textures which allows fast texture synthesis with comparable rendering quality over the state-of-the-art image-based rendering methods. We also show that HDP-2D-HMM can be applied to perform image segmentation and synthesis.

NeurIPS Conference 2009 Conference Paper

Semi-Supervised Learning in Gigantic Image Collections

  • Rob Fergus
  • Yair Weiss
  • Antonio Torralba

With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. Clean labels can be manually obtained on a small fraction, noisy labels may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. We combine this with a label sharing framework obtained from WordNet to propagate label information to classes lacking manual annotations. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images with 74 thousand classes.

NeurIPS Conference 2009 Conference Paper

Unsupervised Detection of Regions of Interest Using Iterative Link Analysis

  • Gunhee Kim
  • Antonio Torralba

This paper proposes a fast and scalable alternating optimization technique to detect regions of interest (ROIs) in cluttered Web images without labels. The proposed approach discovers highly probable regions of object instances by iteratively repeating two steps: (1) choose the exemplar set (i.e., a small number of highly ranked reference ROIs) across the dataset, and (2) refine the ROIs of each image with respect to the exemplar set. These two subproblems are formulated as ranking in two different similarity networks of ROI hypotheses by link analysis. Experiments with the PASCAL 06 dataset show that our unsupervised localization performance is better than that of state-of-the-art techniques and comparable to supervised methods. We also test the scalability of our approach with five object classes in a Flickr dataset consisting of more than 200,000 images.
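
The alternating structure can be sketched in a few lines. Below, ROI hypotheses are ranked by a PageRank-style power iteration on a similarity network, the top-ranked ones become exemplars, and all hypotheses are then re-scored against the exemplars. The similarity function and thresholds are placeholder assumptions, not the paper's exact formulation.

```python
# Alternating loop: rank ROI hypotheses by link analysis, pick exemplars,
# refine the active hypothesis set, repeat. All names are hypothetical.
import numpy as np

def pagerank(W, damping=0.85, iters=50):
    P = W / W.sum(1, keepdims=True)          # row-stochastic transition matrix
    r = np.full(len(W), 1.0 / len(W))
    for _ in range(iters):
        r = (1 - damping) / len(W) + damping * P.T @ r
    return r

def refine_rois(roi_feats, sim, n_exemplars=20, rounds=5):
    """roi_feats: (n, d) ROI descriptors; sim: nonnegative similarity function."""
    active = np.arange(len(roi_feats))
    for _ in range(rounds):
        W = sim(roi_feats[active], roi_feats[active])  # similarity network
        rank = pagerank(W)
        exemplars = active[np.argsort(-rank)[:n_exemplars]]
        # Keep only hypotheses sufficiently similar to some exemplar.
        score = sim(roi_feats, roi_feats[exemplars]).max(1)
        active = np.where(score > np.median(score))[0]
    return exemplars, active
```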

NeurIPS Conference 2008 Conference Paper

Spectral Hashing

  • Yair Weiss
  • Antonio Torralba
  • Rob Fergus

Semantic hashing seeks compact binary codes of datapoints so that the Hamming distance between codewords correlates with semantic similarity. Hinton et al. used a clever implementation of autoencoders to find such codes. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP-hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes significantly outperform the state-of-the-art.
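
The resulting recipe is simple enough to sketch. Assuming data that is roughly uniform along its principal directions (the model case analyzed in the paper), the code below rotates by PCA, evaluates one sinusoidal Laplace-Beltrami eigenfunction per direction, and thresholds at zero; the full method instead ranks modes across all directions by eigenvalue. A simplified illustration, not the authors' released code.

```python
# Simplified spectral hashing: PCA rotation, one analytical eigenfunction per
# principal direction, threshold at zero to get bits.
import numpy as np

def spectral_hash(X, n_bits=8):
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA directions
    proj = Xc @ Vt[:n_bits].T
    lo, hi = proj.min(0), proj.max(0)
    t = (proj - lo) / (hi - lo + 1e-9)      # normalize each direction to [0, 1]
    # Eigenfunction sin(pi/2 + k*pi*t) thresholded at zero; k = 1 gives a
    # median-style split per direction.
    return (np.sin(np.pi / 2 + np.pi * t) > 0).astype(np.uint8)
```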

NeurIPS Conference 2007 Conference Paper

Object Recognition by Scene Alignment

  • Bryan Russell
  • Antonio Torralba
  • Ce Liu
  • Rob Fergus
  • William Freeman

Current object recognition systems can only recognize a limited number of object categories; scaling up to many categories is the next challenge. We seek to build a system to recognize and localize many different object categories in complex scenes. We achieve this through a simple approach: by matching the input image, in an appropriate representation, to images in a large training set of labeled images. Due to regularities in object identities across similar scenes, the retrieved matches provide hypotheses for object identities and locations. We build a probabilistic model to transfer the labels from the retrieval set to the input image. We demonstrate the effectiveness of this approach and study algorithm component contributions using held-out test sets from the LabelMe database.
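
A stripped-down version of the transfer step might look like the following: retrieve the k most similar labeled scenes by a global descriptor and take a per-pixel majority vote over their label maps. The voting rule is a simplified stand-in for the paper's probabilistic transfer model, and all array shapes are assumptions.

```python
# Label transfer by scene matching: retrieve similar labeled images, then
# vote object labels into the query pixel by pixel. Illustrative only.
import numpy as np

def transfer_labels(query_desc, train_descs, train_label_maps, k=10):
    """train_label_maps: (n, H, W) integer object-label maps, 0 = background."""
    d = ((train_descs - query_desc) ** 2).sum(1)
    nearest = np.argsort(d)[:k]              # retrieval set of similar scenes
    votes = np.stack([train_label_maps[i] for i in nearest])
    H, W = votes.shape[1:]
    out = np.zeros((H, W), dtype=int)
    for y in range(H):                       # per-pixel majority vote
        for x in range(W):
            out[y, x] = np.bincount(votes[:, y, x]).argmax()
    return out
```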

NeurIPS Conference 2005 Conference Paper

Describing Visual Scenes using Transformed Dirichlet Processes

  • Antonio Torralba
  • Alan Willsky
  • Erik Sudderth
  • William Freeman

Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object-centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP's inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images.
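
The generative idea can be sketched without the full machinery. Below, the number of object instances is itself random, each instance receives a mixture component (class) and a stochastic transformation (a 2D translation), and features are emitted in object-centered coordinates and then shifted. All distributions and parameters are invented for illustration; the real model shares components through a hierarchical DP rather than fixing the class count.

```python
# Toy generative model in the spirit of the transformed DP: uncertain instance
# count, shared components, stochastic per-instance transformations.
import numpy as np

rng = np.random.default_rng(0)

def sample_scene(n_classes=3, mean_instances=3, feats_per_instance=20):
    n_inst = rng.poisson(mean_instances)     # uncertainty in instance count
    scene = []
    for _ in range(n_inst):
        cls = rng.integers(n_classes)        # shared mixture component
        tau = rng.uniform(0, 100, size=2)    # stochastic transformation
        # Object-centered spatial model: class-specific spread around origin.
        local = rng.normal(scale=1.0 + cls, size=(feats_per_instance, 2))
        scene.append((cls, tau, local + tau))  # transformed features
    return scene
```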

NeurIPS Conference 2004 Conference Paper

Contextual Models for Object Detection Using Boosted Random Fields

  • Antonio Torralba
  • Kevin Murphy
  • William Freeman

We seek to both detect and segment objects in images. To exploit both local image data as well as contextual information, we introduce Boosted Random Fields (BRFs), which use boosting to learn the graph structure and local evidence of a conditional random field (CRF). The graph structure is learned by assembling graph fragments in an additive model. The connections between individual pixels are not very informative, but by using dense graphs, we can pool information from large regions of the image; dense models also support efficient inference. We show how contextual information from other objects can improve detection performance, both in terms of accuracy and speed, by using a computational cascade. We apply our system to detect stuff and things in office and street scenes.

1 Introduction

Our long-term goal is to build a vision system that can examine an image and describe what objects are in it, and where. In many images, such as Fig. 5(a), objects of interest, such as the keyboard or mouse, are so small that they are impossible to detect just by using local features. Seeing a blob next to a keyboard, humans can infer it is likely to be a mouse; we want to give a computer the same abilities.

There are several pieces of related work. Murphy et al. [9] used global scene context to help object recognition, but did not model relationships between objects. Fink and Perona [4] exploited local dependencies in a boosting framework, but did not allow for multiple rounds of communication between correlated objects. He et al. [6] do not model connections between objects directly, but rather induce such correlations indirectly, via a bank of hidden variables, using a "restricted Boltzmann machine" architecture.

In this paper, we exploit contextual correlations between object classes by introducing Boosted Random Fields (BRFs). Boosted Random Fields build on both boosting [5, 10] and conditional random fields (CRFs) [8, 7, 6]. Boosting is a simple way of sequentially constructing "strong" classifiers from "weak" components, and has been used for single-class object detection with great success [12]. Dietterich et al. [3] combine boosting and 1D CRFs, but they only consider the problem of learning the local evidence potentials; we consider the much harder problem of learning the structure of a 2D CRF.

Standard applications of MRFs/CRFs to images [7] assume a 4-nearest-neighbor grid structure. While successful in low-level vision, this structure fails to capture important long-distance dependencies between whole regions and across classes. We propose a method for learning densely connected random fields with long-range connections. The topology of these connections is chosen by a weak learner which has access to a library of graph fragments, derived from patches of labeled training images, which reflect typical spatial arrangements of objects (similar to the segmentation fragments in [2]). At each round of the learning algorithm, we add more connections from other locations in the image and from other classes (detectors). The connections are assumed to be spatially invariant, which means this update can be performed using convolution followed by a sigmoid nonlinearity. The resulting architecture is similar to a convolutional neural network, although we use a stagewise training procedure, which is much faster than backpropagation.

In addition to recognizing things, such as cars and people, we are also interested in recognizing spatially extended "stuff" [1], such as roads and buildings. The traditional sliding-window approach to object detection does not work well for detecting "stuff". Instead, we combine object detection and image segmentation (cf. [2]) by labeling every pixel in the image. We do not rely on a bottom-up image segmentation algorithm, which can be fragile without top-down guidance.

2 Learning potentials and graph structure

A conditional random field (CRF) is a distribution of the form

P(S \mid x) = \frac{1}{Z} \prod_i \phi_i(S_i) \prod_i \prod_{j \in N_i} \psi_{i,j}(S_i, S_j)

where x is the input (e.g., an image), N_i are the neighbors of node i, and the S_i are labels. We have assumed pairwise potentials for notational simplicity. Our goal is to learn the local evidence potentials \phi_i, the compatibility potentials \psi, and the set of neighbors N_i. We propose the following simple approximation: use belief propagation (BP) to estimate the marginals P(S_i \mid x), and then use boosting to maximize the likelihood of each node's training data with respect to \phi_i and \psi.

In more detail, the algorithm is as follows. At iteration t, the goal is to minimize the negative log-likelihood of the training data. As in [11], we consider the per-label loss (i.e., we use marginal probabilities), as opposed to requiring that the joint labeling be correct (as in Viterbi decoding). Hence the cost function to be minimized is

J^t = \sum_i J_i^t = -\sum_i \sum_m \log b^t_{i,m}(S_{i,m}) = -\sum_i \sum_m \log \left[ b^t_{i,m}(+1)^{\hat{S}_{i,m}} \, b^t_{i,m}(-1)^{1-\hat{S}_{i,m}} \right]   (1)

where S_{i,m} \in \{-1,+1\} is the true label for pixel i in training case m, \hat{S}_{i,m} = (S_{i,m}+1)/2 \in \{0,1\} is just a relabeling, and b^t_{i,m} = [P(S_i = -1 \mid x_m, t), P(S_i = +1 \mid x_m, t)] is the belief state at node i given input image x_m after t iterations of the algorithm. The belief at node i is given by b^t_i(1) \propto \phi^t_i(1) \, M^t_i(1) (dropping the dependence on case m), where M^t_i is the product of all the messages coming into i from its neighbors at time t, and where the message that node k sends to node i is

M^{t+1}_{ki}(s_i) = \sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k, s_i) \, \frac{b^t_k(s_k)}{M^t_{ik}(s_k)}   (2)

where \psi_{k,i} is the compatibility between nodes k and i. If we assume that the local potentials have the form \phi_i(s_i) = [e^{F^t_i/2}; e^{-F^t_i/2}], where F^t_i is some function of the input data, then

b^t_i(+1) = \sigma(F^t_i + G^t_i), \qquad G^t_i = \log M^t_i(+1) - \log M^t_i(-1)   (3)

where \sigma(u) = 1/(1+e^{-u}) is the sigmoid function. Hence each term in Eq. 1 simplifies to a cost function similar to that used in boosting:

J^t_i = \sum_m \log \left( 1 + e^{-S_{i,m}(F^t_{i,m} + G^t_{i,m})} \right).   (4)

[Figure 1: BRF training algorithm. Input: a set of labeled pairs {x_{i,m}, S_{i,m}} and a bound T. Output: local evidence functions f^t_i(x) and message update functions g^t_i(b_{N_i}). Initialize b^0_{i,m} = 0, F^0_{i,m} = 0, G^0_{i,m} = 0. For t = 1..T: (a) fit the local potential f^t_i(x_{i,m}) by weighted least squares to Y^t_{i,m} = S_{i,m}(1 + e^{-S_{i,m}(F^t_{i,m} + G^t_{i,m})}); (b) fit the compatibilities g^t_i(b^{t-1}_{N_i,m}) by weighted least squares; (c) update the local potential F^t_{i,m} = F^{t-1}_{i,m} + f^t_i(x_{i,m}); (d) update the compatibilities G^t_{i,m} = \sum_{n=1}^{t} g^n_i(b^{t-1}_{N_i,m}); (e) update the beliefs b^t_{i,m} = \sigma(F^t_{i,m} + G^t_{i,m}); (f) update the weights w^{t+1}_{i,m} = b^t_{i,m}(-1) \, b^t_{i,m}(+1).]

We assume that the graph is very densely connected, so that the information any single node sends to another is so small that we can make the approximation M^{t+1}_{ki}(+1)/M^{t+1}_{ki}(-1) \simeq 1. (This is a reasonable approximation in the case of images, where each node represents a single pixel; only when the influence of many pixels is taken into account do the messages become informative.) Hence

G^{t+1}_i = \log \frac{M^{t+1}_i(+1)}{M^{t+1}_i(-1)} = \sum_k \log \frac{\sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k, +1) \, b^t_{k,m}(s_k)/M^t_{ik}(s_k)}{\sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k, -1) \, b^t_{k,m}(s_k)/M^t_{ik}(s_k)}   (5)

\simeq \sum_k \log \frac{\sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k, +1) \, b^t_{k,m}(s_k)}{\sum_{s_k \in \{-1,+1\}} \psi_{k,i}(s_k, -1) \, b^t_{k,m}(s_k)}   (6)

With this simplification, G^{t+1}_i is a non-linear function G^{t+1}_i(b^t_m) of the beliefs at iteration t. We can therefore write the beliefs at iteration t as a function of the local evidence and the beliefs at time t-1: b^t_i(+1) = \sigma(F^t_i(x_{i,m}) + G^t_i(b^{t-1}_m)). The key idea behind BRFs is to use boosting to learn the G functions, which approximately implement message passing in densely connected graphs. We explain this in more detail below.

2.1 Learning local evidence potentials

Defining F^t_i(x_{i,m}) = F^{t-1}_i(x_{i,m}) + f^t_i(x_{i,m}) as an additive model, where x_{i,m} are the features of training sample m at node i, we can learn this function in a stagewise fashion by optimizing the second-order Taylor expansion of Eq. 4 with respect to f^t_i, as in LogitBoost [5]:

\arg\min_{f^t_i} J^t_i \simeq \arg\min_{f^t_i} \sum_m w^t_{i,m} \left( Y^t_{i,m} - f^t_i(x_{i,m}) \right)^2   (7)

where Y^t_{i,m} = S_{i,m}(1 + e^{-S_{i,m}(F^t_{i,m} + G^t_{i,m})}). In the case that the weak learner is a "regression stump", f_i(x) = a h(x) + b, we can find the optimal a, b by solving a weighted least squares problem with weights w^t_{i,m} = b^t_{i,m}(-1) \, b^t_{i,m}(+1); we can find the best basis function h(x) by searching over all elements of a dictionary.

2.2 Learning compatibility potentials and graph structure

In this section, we discuss how to learn the compatibility functions \psi_{ij}, and hence the structure of the graph. Instead of learning the compatibility functions \psi_{ij} themselves, we propose to learn directly the function G^{t+1}_i, using an additive model as we did for learning F: G^{t+1}_{i,m} = \sum_{n=1}^{t} g^n_i(b^t_m), where b^t_m is a vector with the beliefs of all nodes in the graph at iteration t for training sample m. The weak learners g^n_i(b^t_m) can be regression stumps of the form g^n_i(b^t_m) = a \, (w^\top b^t_m > \theta) + b, where a, b, \theta are the parameters of the regression stump and w is a set of weights selected from a dictionary.

In the case of a graph with weak and almost symmetrical connections (which holds if \psi(s_1, s_2) \simeq 1 for all (s_1, s_2), implying the messages are not very informative), we can further simplify the function G^{t+1}_i by approximating it as a linear function of the beliefs:

G^{t+1}_{i,m} = \sum_{k \in N_i} \alpha_{k,i} \, b^t_{k,m}(+1) + \beta_{k,i}   (8)

This step reduces the computational cost, and the weak learners g^n_i(b^t_m) also become linear functions. The belief update then simplifies to b^{t+1}_{i,m}(+1) = \sigma(\alpha_i^\top b^t_m + \beta_i + F^t_{i,m}), which is similar to the mean-field update equations. The neighborhood N_i over which we sum incoming messages is determined by the graph structure, which is encoded in the non-zero values of \alpha_i.

Each weak learner g^n_i computes a weighted combination of the beliefs of some subset of the nodes; this subset may change from iteration to iteration, and can be quite large. At iteration t, we choose the weak learner g^t_i so as to minimize

J^t_i(b^{t-1}) = \sum_m \log \left( 1 + e^{-S_{i,m} \left( F^t_{i,m} + g^t_i(b^{t-1}_m) + \sum_{n=1}^{t-1} g^n_i(b^{t-1}_m) \right)} \right)

which reduces to a weighted least squares problem similar to Eq. 7. See Fig. 1 for the pseudo-code of the complete learning algorithm, and Fig. 2 for the pseudo-code of run-time inference.

[Figure 2: BRF run-time inference algorithm. Input: a set of inputs {x_{i,m}} and the learned functions f^t_i, g^t_i. Output: beliefs b_{i,m} and MAP estimates \hat{S}_{i,m}. Initialize b^0_{i,m} = 0, F^0_{i,m} = 0, G^0_{i,m} = 0. For t = 1..T: (a) update the local evidence F^t_{i,m} = F^{t-1}_{i,m} + f^t_i(x_{i,m}); (b) update the compatibilities G^t_{i,m} = \sum_{n=1}^{t} g^n_i(b^{t-1}_{N_i,m}); (c) compute the current beliefs b^t_{i,m} = \sigma(F^t_{i,m} + G^t_{i,m}). The output classification is \hat{S}_{i,m} = b^T_{i,m} > 0.5.]

3 BRFs for multiclass object detection and segmentation

With the BRF training algorithm in hand, we describe our approach for multiclass object detection and region labeling using densely connected BRFs.

3.1 Weak learners for detecting stuff and things

The square sliding-window approach does not provide a natural way of working with irregular objects. Using region labeling as an image representation allows dealing with irregular and extended objects (buildings, bookshelves, roads, ...). Extended stuff [1] can be a very important source of contextual information for other objects. The weak learners we use for the local evidence potentials are based on the segmentation fragments proposed in [2]. Specifically, we create a dictionary of about 2000 image patches U, chosen at random (but overlapping each object), plus a corresponding set of binary (in-class/out-of-class) image masks V; see Fig. 3(a). At each round t, for each class c, and for each dictionary entry, we construct the following weak learner, whose output is a binary matrix of the same size as the image I:

v(I) = \left[ \left( (I \otimes U) > \theta \right) * V \right] > 0   (9)

where \otimes represents normalized cross-correlation and * represents convolution. The intuition behind this is that I \otimes U produces peaks at image locations that contain the patch/template, and convolving with V superimposes the segmentation mask on top of those peaks. As a function of the threshold \theta, the feature behaves more like a template detector (\theta \to 1) or more like a texture descriptor.

[Figure 3: (a) Examples from the dictionary of about 2000 patches and masks, U_{x,y}, V_{x,y}. (b) Examples from the dictionary of 30 graphs, W_{x,y,c}. (c) Example feedforward segmentation for screens.]

[Figure 4: Street scene. The BRF is trained to detect cars, buildings, and the road. (a) Incoming messages to a car node. (b) Compatibilities (W'). (c) A car out of context (outside third-floor windows) is less of a car. (d) Evolution of the beliefs for the car nodes, b(car), and the labeling S for road, building, and car at t = 1, 2, 4, 20, 40.]

In Fig. 4(a-b), we show the structure of the graph and the weights W defined by G^T for a BRF trained to detect cars, buildings, and roads in street scenes.

3.2 Learning and inference

For training we used a labeled dataset of office and street scenes with about 100 images in each set. During the first 5 rounds of training we update only the local potentials, to allow local evidence to accrue; after the 5th iteration we also start updating the compatibility functions. At each round, we update only the local potential and compatibility function of the single object class that most reduces the multiclass cost. This allows objects that need many features to have more complicated local potentials. The algorithm learns to detect easy (and large) objects first, since these reduce the error of all classes the fastest; the easy-to-detect objects can then pass information to the harder ones. For instance, in office scenes, the system first detects screens, then keyboards, and finally computer mice. Fig. 5 illustrates this behavior on the test set. A similar behavior is obtained for the car detector (Fig. 4(d)): the detection of buildings and road provides strong constraints for the locations of cars.

3.3 Cascade of classifiers with BRFs

The BRF can be turned into a cascade [12] by thresholding the beliefs. Computations can then be reduced by performing the convolutions (required for computing f and g) only at pixels that are still candidates for the presence of the target. At each round we update a binary rejection mask for each object class, R^t_{x,y,c}, by thresholding the beliefs at round t: R^t_{x,y,c} = R^{t-1}_{x,y,c} \cdot (b^t_{x,y,c} > \theta^t_c). A pixel in the rejection mask is set to zero when we can decide that the object is not present (when b^t_{x,y,c} falls below the threshold \theta^t_c), and it is set to 1 when more processing is required. The threshold \theta^t_c is chosen so that the percentage of missed detections stays below a predefined level (we use 1%). Similarly, we can define a detection mask that indicates pixels at which we decide the object is present. The masks are then used when computing the features v(I) and the messages G, by applying the convolutions only to the pixels not yet classified. This results in a more efficient classifier with only a slight decrease in performance.

[Figure 5: Desk scene. Top: input image, ground truth, and output labeling for screen, keyboard, and mouse; local information alone is often insufficient. Middle: evolution of the beliefs (b, F, and G) during detection on a test image at t = 5, 10, 15, 25, 50. Bottom: average evolution of the area under the ROC curve for the three objects over 120 test images, comparing boosting alone with the BRF.]

In Fig. 6 we compare the reduction of the search space when implementing a cascade using independent boosting (which reduces to Viola and Jones [12]) and when using BRFs. For objects for which context is the main source of information, like the mouse, the reduction in search space is much more dramatic using BRFs than using boosting alone.

4 Conclusion

The proposed BRF algorithm combines boosting and CRFs, providing an algorithm that is easy to use for both training and inference. We have demonstrated object detection in cluttered scenes by exploiting contextual relationships between objects. The BRF algorithm is computationally efficient and provides a natural extension of the cascade of classifiers by integrating evidence from other objects in order to quickly reject certain image regions. The BRFs' densely connected graphs, which efficiently collect information over large image regions, provide an alternative framework to nearest-neighbor grids for vision problems.
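
The run-time loop of Fig. 2 is compact enough to sketch directly. The version below assumes the per-round local-evidence maps f_t(I) have been precomputed and that the message functions take the linear, spatially invariant form of Eq. 8, so each G update is a convolution of the current beliefs with a learned kernel; names and shapes are illustrative, not the authors' code.

```python
# BRF run-time inference sketch: iterate b = sigmoid(F + G), accumulating
# local evidence and convolutional context messages per round.
import numpy as np
from scipy.ndimage import convolve

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def brf_inference(local_evidence, alphas, betas, T):
    """local_evidence: list of T (H, W) maps f_t(I), one per round.
    alphas: list of T kernels pooling beliefs from neighbors (Eq. 8)."""
    H, W = local_evidence[0].shape
    F = np.zeros((H, W))   # accumulated local evidence
    G = np.zeros((H, W))   # accumulated contextual messages
    b = sigmoid(F + G)     # beliefs start at 0.5
    for t in range(T):
        F += local_evidence[t]                       # (a) update local evidence
        # (b) spatially invariant message update: convolve current beliefs
        # with the learned kernel, matching Eq. 8's linear form.
        G += convolve(b, alphas[t], mode="nearest") + betas[t]
        b = sigmoid(F + G)                           # (c) update beliefs
    return b > 0.5                                   # MAP labels
```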

NeurIPS Conference 2003 Conference Paper

Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

  • Kevin Murphy
  • Antonio Torralba
  • William Freeman

Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification.

NeurIPS Conference 2002 Conference Paper

Shape Recipes: Scene Representations that Refer to the Image

  • William Freeman
  • Antonio Torralba

The goal of low-level vision is to estimate an underlying scene, given an observed image. Real-world scenes (e.g., albedos or shapes) can be very complex, conventionally requiring high-dimensional representations which are hard to estimate and store. We propose a low-dimensional representation, called a scene recipe, that relies on the image itself to describe the complex scene configurations. Shape recipes are an example: these are the regression coefficients that predict the bandpassed shape from image data. We describe the benefits of this representation, and show two uses illustrating their properties: (1) we improve stereo shape estimates by learning shape recipes at low resolution and applying them at full resolution; (2) shape recipes implicitly contain information about lighting and materials, and we use them for material segmentation.
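
A toy linear shape recipe for a single subband might look like this: fit regression coefficients mapping a local window of the bandpassed image to the bandpassed shape, then slide those coefficients over a (possibly higher-resolution) band, in the spirit of use (1) above. Window size and helper names are assumptions.

```python
# One-subband shape recipe: least-squares coefficients from image windows to
# shape values, applied back by sliding the learned window over the band.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def learn_recipe(image_band, shape_band, win=5):
    """Fit w so that shape(x) ~ w . image_patch(x) over all pixels."""
    H, W = image_band.shape
    r = win // 2
    patches, targets = [], []
    for y in range(r, H - r):
        for x in range(r, W - r):
            patches.append(image_band[y-r:y+r+1, x-r:x+r+1].ravel())
            targets.append(shape_band[y, x])
    A, b = np.array(patches), np.array(targets)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)   # the "recipe" coefficients
    return w

def apply_recipe(image_band, w, win=5):
    """Apply the learned coefficients across a band, e.g. at full resolution."""
    windows = sliding_window_view(image_band, (win, win))
    return windows.reshape(*windows.shape[:2], -1) @ w
```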

NeurIPS Conference 2001 Conference Paper

Contextual Modulation of Target Saliency

  • Antonio Torralba

The most popular algorithms for object detection require exhaustive spatial and scale search procedures. In such approaches, an object is defined by means of local features. In this paper we show that including contextual information in object detection procedures provides an efficient way of cutting down the need for exhaustive search. We present results with real images showing that the proposed scheme is able to accurately predict likely object classes, locations, and sizes.
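
One way to picture the saving: let a global context descriptor predict a prior over likely object positions, and run the local detector only where the prior is high. The sketch below uses a hypothetical linear map from the descriptor to a vertical-position Gaussian; the specific prior model is an assumption, not the paper's.

```python
# Context priming sketch: restrict the sliding-window search to locations
# a scene-level prior deems plausible, instead of scanning exhaustively.
import numpy as np

def primed_search(window_centers_y, gist, w_mu, w_sigma, keep_frac=0.2):
    """window_centers_y: normalized vertical centers in [0, 1] of candidates."""
    mu = float(gist @ w_mu)                      # expected position from context
    sigma = abs(float(gist @ w_sigma)) + 1e-2    # context-dependent uncertainty
    prior = np.exp(-0.5 * ((window_centers_y - mu) / sigma) ** 2)
    keep = np.argsort(-prior)[: max(1, int(len(prior) * keep_frac))]
    return keep   # only these windows get the (expensive) local detector
```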