Arrow Research

Author name cluster

Li Fei-Fei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
1 author row

Possible papers

22

NeurIPS Conference 2024 Conference Paper

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

  • Manling Li
  • Shiyu Zhao
  • Qineng Wang
  • Kangrui Wang
  • Yu Zhou
  • Sanjana Srivastava
  • Cem Gokmen
  • Tony Lee

We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics that break down evaluation into error types, such as hallucination errors, affordance errors, and various types of planning errors. Overall, our benchmark offers a comprehensive assessment of LLMs’ performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems and providing insights into the effective and selective use of LLMs in embodied decision making.
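
A minimal sketch of what a generalized interface over the four LLM modules named above (goal interpretation, subgoal decomposition, action sequencing, transition modeling) could look like; every class and function name here is hypothetical, not the paper's actual API.

```python
# Hypothetical interface sketch; names are illustrative, not the paper's API.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TaskSpec:
    instruction: str                                         # natural-language task
    state_goals: list[str] = field(default_factory=list)     # e.g. "cup on table"
    temporal_goals: list[str] = field(default_factory=list)  # temporally extended goals


class GoalInterpreter(Protocol):
    def __call__(self, instruction: str) -> TaskSpec: ...

class SubgoalDecomposer(Protocol):
    def __call__(self, task: TaskSpec) -> list[str]: ...

class ActionSequencer(Protocol):
    def __call__(self, subgoal: str, state: dict) -> list[str]: ...

class TransitionModel(Protocol):
    def __call__(self, state: dict, action: str) -> dict: ...


def classify_error(predicted: dict, simulated: dict, affordances: set[str], action: str) -> str:
    """Toy fine-grained error typing in the spirit of the abstract's metrics."""
    if action not in affordances:
        return "affordance error"
    if predicted != simulated:
        return "transition-modeling error"
    return "ok"
```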

NeurIPS Conference 2024 Conference Paper

HourVideo: 1-Hour Video-Language Understanding

  • Keshigeyan Chandrasegaran
  • Agrim Gupta
  • Lea M. Hadzic
  • Taran Kota
  • Jimming He
  • Cristobal Eyzaguirre
  • Zane Durante
  • Manling Li

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at https://hourvideo.stanford.edu.
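
The headline comparison above is against random chance on five-way questions; a small self-contained sketch of that baseline follows (the data layout is an assumption, not taken from the released toolkit).

```python
# Five-way multiple choice: random guessing lands near 20% accuracy.
import random

def accuracy(preds: list[str], answers: list[str]) -> float:
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

questions = [{"options": list("ABCDE"), "answer": "C"} for _ in range(10_000)]
random_preds = [random.choice(q["options"]) for q in questions]
print(f"random chance ~ {accuracy(random_preds, [q['answer'] for q in questions]):.3f}")
```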

NeurIPS Conference 2024 Conference Paper

OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

  • Adam Sun
  • Tiange Xiang
  • Scott Delp
  • Li Fei-Fei
  • Ehsan Adeli

Existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient, high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision from Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is designed to further improve rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in rendering occluded humans.
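
The Optimization stage leans on Score-Distillation Sampling; below is a toy SDS step for intuition, with a made-up noise schedule and a stand-in `diffusion_eps` denoiser, not the authors' implementation.

```python
# Toy Score-Distillation Sampling (SDS) gradient; the schedule and weighting
# are simplifications, and `diffusion_eps` stands in for a pretrained denoiser.
import torch

def sds_grad(rendered: torch.Tensor, diffusion_eps, t: int, num_steps: int = 1000):
    """SDS gradient for a rendered image batch (B, C, H, W)."""
    alpha_bar = torch.cos(torch.tensor(t / num_steps) * torch.pi / 2) ** 2  # toy schedule
    noise = torch.randn_like(rendered)
    noisy = alpha_bar.sqrt() * rendered + (1 - alpha_bar).sqrt() * noise
    eps_pred = diffusion_eps(noisy, t)           # pretrained 2D diffusion prior
    return (1 - alpha_bar) * (eps_pred - noise)  # pushed back onto the 3D Gaussians
```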

AAMAS Conference 2023 Conference Paper

Modeling Dynamic Environments with Scene Graph Memory

  • Andrey Kurenkov
  • Michael Lingelbach
  • Tanmay Agarwal
  • Chengshu Li
  • Emily Jin
  • Ruohan Zhang
  • Li Fei-Fei
  • Jiajun Wu

Embodied AI agents operating in dynamic environments often need to predict object locations to make informed decisions. We propose a method for doing this via link prediction on partially observable dynamic graphs. We represent the agent’s accumulated set of observations in a data structure called a Scene Graph Memory (SGM), combine this data structure with a neural net architecture we call Node Edge Predictor (NEP), and show that it can be trained to predict the locations of objects in a variety of environments with diverse object movement dynamics. To evaluate our method, we implement the Dynamic Household Simulator, a novel benchmark that enables sampling of diverse dynamic scene graphs that follow the semantic patterns typically seen in people's homes. We demonstrate that our method outperforms baselines both in terms of quickly adapting to the dynamics of a new scene and in terms of its overall accuracy.
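
A deliberately simple stand-in for the Scene Graph Memory idea: accumulate (object, location) observations and predict by frequency. The paper's Node Edge Predictor is a learned model; this baseline only illustrates the link-prediction framing.

```python
# Frequency baseline over an accumulated scene graph; the learned NEP replaces
# the Counter lookup in the actual method.
from collections import Counter, defaultdict

class SceneGraphMemory:
    def __init__(self):
        self.edges = defaultdict(Counter)   # object -> observed-location counts

    def observe(self, obj: str, location: str) -> None:
        self.edges[obj][location] += 1

    def predict_location(self, obj: str) -> str | None:
        counts = self.edges.get(obj)
        return counts.most_common(1)[0][0] if counts else None

sgm = SceneGraphMemory()
for loc in ["kitchen_counter", "dishwasher", "kitchen_counter"]:
    sgm.observe("mug", loc)
print(sgm.predict_location("mug"))  # kitchen_counter
```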

IJCAI Conference 2020 Conference Paper

DualSMC: Tunneling Differentiable Filtering and Planning under Continuous POMDPs

  • Yunbo Wang
  • Bo Liu
  • Jiajun Wu
  • Yuke Zhu
  • Simon S. Du
  • Li Fei-Fei
  • Joshua B. Tenenbaum

A major difficulty of solving continuous POMDPs is to infer the multi-modal distribution of the unobserved true states and to make the planning algorithm dependent on the perceived uncertainty. We cast POMDP filtering and planning problems as two closely related Sequential Monte Carlo (SMC) processes, one over the real states and the other over the future optimal trajectories, and combine the merits of these two parts in a new model named the DualSMC network. In particular, we first introduce an adversarial particle filter that leverages the adversarial relationship between its internal components. Based on the filtering results, we then propose a planning algorithm that extends the previous SMC planning approach [Piche et al., 2018] to continuous POMDPs with an uncertainty-dependent policy. Crucially, DualSMC not only handles complex observations such as image input but also remains highly interpretable. It is shown to be effective in three continuous POMDP domains: the floor positioning domain, the 3D light-dark navigation domain, and a modified Reacher domain.
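
For orientation, here is the generic bootstrap SMC update that both of DualSMC's processes build on; the proposal and weighting below are textbook SMC, not the paper's adversarial particle filter.

```python
# One bootstrap particle-filter step: propagate, reweight, resample.
import numpy as np

def smc_update(particles, weights, transition, likelihood, observation, rng):
    particles = transition(particles)                        # propagate particles
    weights = weights * likelihood(observation, particles)   # reweight by evidence
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)  # resample
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```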

NeurIPS Conference 2019 Conference Paper

HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models

  • Sharon Zhou
  • Mitchell Gordon
  • Ranjay Krishna
  • Austin Narcomey
  • Li Fei-Fei
  • Michael Bernstein

Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g., 250 ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.
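
The time-unconstrained HYPE variant reduces to a human error rate over real and fake images; the sketch below computes that rate plus a bootstrap interval of the kind used to check that rankings are replicable. The judgment format is an assumption.

```python
# Human error rate plus a percentile bootstrap confidence interval.
import random

def hype_error_rate(judgments: list[tuple[bool, bool]]) -> float:
    """judgments: (said_real, is_real) pairs from human raters."""
    return sum(said != truth for said, truth in judgments) / len(judgments)

def bootstrap_ci(judgments, n_boot: int = 1000, seed: int = 0):
    rng = random.Random(seed)
    rates = sorted(
        hype_error_rate(rng.choices(judgments, k=len(judgments)))
        for _ in range(n_boot)
    )
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]
```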

NeurIPS Conference 2019 Conference Paper

Regression Planning Networks

  • Danfei Xu
  • Roberto Martín-Martín
  • De-An Huang
  • Yuke Zhu
  • Silvio Savarese
  • Li Fei-Fei

Recent learning-to-plan methods have shown promising results on planning directly from observation space. Yet, their ability to plan for long-horizon tasks is limited by the accuracy of the prediction model. On the other hand, classical symbolic planners show remarkable capabilities in solving long-horizon tasks, but they require predefined symbolic rules and symbolic states, restricting their real-world applicability. In this work, we combine the benefits of these two paradigms and propose a learning-to-plan method that can directly generate a long-term symbolic plan conditioned on high-dimensional observations. We borrow the idea of regression (backward) planning from the classical planning literature and introduce Regression Planning Networks (RPN), a neural network architecture that plans backward starting at a task goal and generates a sequence of intermediate goals that reaches the current observation. We show that our model not only inherits many favorable traits from symbolic planning, including the ability to solve previously unseen tasks, but also can learn from visual inputs in an end-to-end manner. We evaluate the capabilities of RPN in a grid world environment and a simulated 3D kitchen environment featuring complex visual scenes and long task horizons, and show that it achieves near-optimal performance in completely new task instances.
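
Regression planning in miniature, for readers unfamiliar with the backward-chaining idea: start from the goal, repeatedly ask what must hold one step earlier, stop at the current observation, then reverse. RPN's learned components are replaced by a hand-written predecessor function here.

```python
# Backward (regression) planning over a toy dependency chain.
def regression_plan(goal, current, predecessor, max_depth: int = 20):
    subgoals = [goal]
    while subgoals[-1] != current and len(subgoals) < max_depth:
        subgoals.append(predecessor(subgoals[-1]))  # intermediate goal, one step earlier
    return list(reversed(subgoals))                 # execute from current toward goal

# "tea_served" needs "water_boiled", which needs "kettle_on", reachable from "start".
deps = {"tea_served": "water_boiled", "water_boiled": "kettle_on", "kettle_on": "start"}
print(regression_plan("tea_served", "start", deps.__getitem__))
# ['start', 'kettle_on', 'water_boiled', 'tea_served']
```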

NeurIPS Conference 2018 Conference Paper

Flexible neural representation for physics prediction

  • Damian Mrowca
  • Chengxu Zhuang
  • Elias Wang
  • Nick Haber
  • Li Fei-Fei
  • Josh Tenenbaum
  • Daniel Yamins

Humans have a remarkable capacity to understand the physical dynamics of objects in their environment, flexibly capturing complex structures and interactions at multiple levels of detail. Inspired by this ability, we propose a hierarchical particle-based object representation that covers a wide variety of types of three-dimensional objects, including both arbitrary rigid geometrical shapes and deformable materials. We then describe the Hierarchical Relation Network (HRN), an end-to-end differentiable neural network based on hierarchical graph convolution, that learns to predict physical dynamics in this representation. Compared to other neural network baselines, the HRN accurately handles complex collisions and nonrigid deformations, generating plausible dynamics predictions at long time scales in novel settings, and scaling to large scene configurations. These results demonstrate an architecture with the potential to form the basis of next-generation physics predictors for use in computer vision, robotics, and quantitative cognitive science.
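
The core operation behind hierarchical graph convolution is message passing over a particle graph whose edges include parent-child (hierarchy) links; a bare-bones mean-aggregation round is shown below, a simplification of HRN's learned updates.

```python
# One mean-aggregation message-passing round over particle states (N, D).
import numpy as np

def propagate(states: np.ndarray, edges: list[tuple[int, int]]) -> np.ndarray:
    out = states.copy()                 # each particle keeps its own state...
    counts = np.ones(len(states))
    for dst, src in edges:              # ...and averages in messages from neighbors,
        out[dst] += states[src]         # including hierarchy (parent-child) edges
        counts[dst] += 1
    return out / counts[:, None]
```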

NeurIPS Conference 2018 Conference Paper

Learning to Decompose and Disentangle Representations for Video Prediction

  • Jun-Ting Hsieh
  • Bingbin Liu
  • De-An Huang
  • Li Fei-Fei
  • Juan Carlos Niebles

Our goal is to predict future video frames given a sequence of input frames. Despite large amounts of video data, this remains a challenging task because of the high-dimensionality of video frames. We address this challenge by proposing the Decompositional Disentangled Predictive Auto-Encoder (DDPAE), a framework that combines structured probabilistic models and deep networks to automatically (i) decompose the high-dimensional video that we aim to predict into components, and (ii) disentangle each component to have low-dimensional temporal dynamics that are easier to predict. Crucially, with an appropriately specified generative model of video frames, our DDPAE is able to learn both the latent decomposition and disentanglement without explicit supervision. For the Moving MNIST dataset, we show that DDPAE is able to recover the underlying components (individual digits) and disentanglement (appearance and location) as we would intuitively do. We further demonstrate that DDPAE can be applied to the Bouncing Balls dataset involving complex interactions between multiple objects to predict the video frame directly from the pixels and recover physical states without explicit supervision.
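
A shape-level sketch of the decompose-then-disentangle factorization: per-component latents, each split into a static appearance code and a low-dimensional pose trajectory. All dimensions and the decoder are placeholders, not DDPAE's actual architecture.

```python
# Factorized video latent: K components, appearance (static) vs. pose (dynamic).
import torch

B, T, K = 8, 10, 2                  # batch, timesteps, components (e.g. two digits)
appearance = torch.randn(B, K, 64)  # time-invariant content per component
pose = torch.randn(B, T, K, 3)      # low-dimensional dynamics, easy to predict

def compose_frame(appearance: torch.Tensor, pose_t: torch.Tensor, decoder):
    """Render each component from (appearance, pose) and sum the canvases."""
    return sum(decoder(appearance[:, k], pose_t[:, k]) for k in range(appearance.shape[1]))
```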

NeurIPS Conference 2018 Conference Paper

Learning to Play With Intrinsically-Motivated, Self-Aware Agents

  • Nick Haber
  • Damian Mrowca
  • Stephanie Wang
  • Li Fei-Fei
  • Daniel Yamins

Infants are experts at playing, with an amazing ability to generate novel structured behaviors in unstructured environments that lack clear extrinsic reward signals. We seek to mathematically formalize these abilities using a neural network that implements curiosity-driven intrinsic motivation. Using a simple but ecologically naturalistic simulated environment in which an agent can move and interact with objects it sees, we propose a "world-model" network that learns to predict the dynamic consequences of the agent's actions. Simultaneously, we train a separate explicit "self-model" that allows the agent to track the error map of its world-model. It then uses the self-model to adversarially challenge the developing world-model. We demonstrate that this policy causes the agent to explore novel and informative interactions with its environment, leading to the generation of a spectrum of complex behaviors, including ego-motion prediction, object attention, and object gathering. Moreover, the world-model that the agent learns supports improved performance on object dynamics prediction, detection, localization and recognition tasks. Taken together, our results are initial steps toward creating flexible autonomous agents that self-supervise in realistic physical environments.
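
The curiosity loop described above, reduced to its signal flow: the self-model predicts where the world-model errs, and actions chase that predicted error. All model objects below are placeholders with assumed `predict`/`fit` methods.

```python
# World-model / self-model curiosity loop, schematically.
import numpy as np

def choose_action(state, candidate_actions, self_model):
    predicted_error = [self_model.predict(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(predicted_error))]  # seek the informative

def learn_step(state, action, next_state, world_model, self_model):
    pred = world_model.predict(state, action)
    err = float(np.linalg.norm(pred - next_state))  # world-model surprise
    world_model.fit(state, action, next_state)      # improve predictions
    self_model.fit(state, action, err)              # track the shifting error map
```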

AAAI Conference 2017 Conference Paper

Fine-Grained Car Detection for Visual Census Estimation

  • Timnit Gebru
  • Jonathan Krause
  • Yilun Wang
  • Duyun Chen
  • Jia Deng
  • Li Fei-Fei

Targeted socio-economic policies require an accurate understanding of a country’s demographic makeup. To that end, the United States spends more than 1 billion dollars a year gathering census data such as race, gender, education, occupation and unemployment rates. Compared to the traditional method of collecting surveys across many years, which is costly and labor intensive, data-driven, machine learning-driven approaches are cheaper and faster, with the potential to detect trends in close to real time. In this work, we leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data. We first detect cars in 50 million images across 200 of the largest US cities and train a model to predict demographic attributes using the detected cars. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date, consisting of over 2600 classes of cars comprising images from Google Street View and other web sources, classified by car experts to account for even the most subtle visual differences. We use this data to construct the largest-scale fine-grained detection system reported to date. Our prediction results correlate well with ground truth income data (r = 0.82), Massachusetts department of vehicle registration, and sources investigating crime rates, income segregation, per capita carbon emission, and other market research. Finally, we learn interesting relationships between cars and neighbourhoods, allowing us to perform the first large-scale sociological analysis of cities using computer vision techniques.
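
The prediction stage of the pipeline, in caricature: per-city counts over fine-grained car classes regressed against a ground-truth attribute. Everything below is synthetic stand-in data, not the paper's features or its r = 0.82 result.

```python
# Regressing a city attribute from fine-grained car-class counts (synthetic data).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
car_counts = rng.poisson(5.0, size=(200, 2600)).astype(float)  # cities x car classes
income = car_counts @ rng.normal(size=2600) + rng.normal(size=200)

model = Ridge(alpha=1.0).fit(car_counts[:150], income[:150])
r = np.corrcoef(model.predict(car_counts[150:]), income[150:])[0, 1]
print(f"held-out correlation r = {r:.2f}")
```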

NeurIPS Conference 2017 Conference Paper

Label Efficient Learning of Transferable Representations across Domains and Tasks

  • Zelun Luo
  • Yuliang Zou
  • Judy Hoffman
  • Li Fei-Fei

We propose a framework that learns a representation transferable across different domains and tasks in a data-efficient manner. Our approach battles domain shift with a domain adversarial loss, and generalizes the embedding to novel tasks using a metric learning-based approach. Our model is simultaneously optimized on labeled source data and unlabeled or sparsely labeled data in the target domain. Our method shows compelling results on novel classes within a new domain even when only a few labeled examples per class are available, outperforming the prevalent fine-tuning approach. In addition, we demonstrate the effectiveness of our framework on the transfer learning task from image object recognition to video action recognition.
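
The loss composition implied by the abstract, as one hedged sketch: a supervised term on labeled source data, a domain-adversarial term on source-versus-target features, and a metric term separating classes in the embedding. Weights, models, and shapes are assumptions.

```python
# Source classification + domain-adversarial + metric-learning terms.
import torch
import torch.nn.functional as F

def total_loss(src_feat, src_logits, src_y, tgt_feat, domain_disc, margin: float = 1.0):
    cls = F.cross_entropy(src_logits, src_y)            # labeled source data
    d_src, d_tgt = domain_disc(src_feat), domain_disc(tgt_feat)
    adv = (F.binary_cross_entropy_with_logits(d_src, torch.zeros_like(d_src))
           + F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt)))
    same = src_y.unsqueeze(0) == src_y.unsqueeze(1)     # same-class mask
    dist = torch.cdist(src_feat, src_feat)
    metric = dist[same].mean() + F.relu(margin - dist[~same]).mean()
    return cls + adv + metric
```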

NeurIPS Conference 2014 Conference Paper

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

  • Andrej Karpathy
  • Armand Joulin
  • Li Fei-Fei

We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.
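
Fragment-level scoring made concrete: both fragment types live in one space, an image-sentence score aggregates fragment dot products, and a margin separates true pairs from mismatches. The aggregation rule here is a simplification of the paper's structured objective.

```python
# Fragment embeddings: score a pair, then a max-margin ranking loss.
import torch

def pair_score(img_frags: torch.Tensor, sent_frags: torch.Tensor) -> torch.Tensor:
    # (Ni, D) x (Ns, D): each sentence fragment takes its best-matching image fragment.
    return (img_frags @ sent_frags.T).max(dim=0).values.sum()

def ranking_loss(img_frags, sent_pos, sent_neg, margin: float = 1.0):
    return torch.relu(margin + pair_score(img_frags, sent_neg)
                      - pair_score(img_frags, sent_pos))
```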

NeurIPS Conference 2012 Conference Paper

Shifting Weights: Adapting Object Detectors from Image to Video

  • Kevin Tang
  • Vignesh Ramanathan
  • Li Fei-Fei
  • Daphne Koller

Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection and LabelMe Video datasets that illustrate the benefit of our approach to adapt object detectors to video.
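
The self-paced loop in outline: each round retrains on source data plus the easiest automatically labeled target examples, admitting more targets and fewer sources over time. The detector object, its scoring, and the schedule below are placeholders, not the paper's exact procedure.

```python
# Self-paced domain adaptation, schematically.
def self_paced_adapt(detector, source_examples, video_tracks, rounds: int = 5):
    for r in range(rounds):
        ranked = sorted(video_tracks, key=detector.confidence, reverse=True)
        k = int(len(ranked) * (r + 1) / rounds)                # admit more targets
        s = int(len(source_examples) * (rounds - r) / rounds)  # and fewer sources
        pseudo = [(t, detector.label(t)) for t in ranked[:k]]  # easiest-first targets
        detector.train(source_examples[:s] + pseudo)
    return detector
```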

NeurIPS Conference 2010 Conference Paper

Large Margin Learning of Upstream Scene Understanding Models

  • Jun Zhu
  • Li-Jia Li
  • Li Fei-Fei
  • Eric Xing

Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.
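
Schematically, the joint objective couples the topic model's likelihood with a loss-augmented multi-class hinge term on the latent-topic representation; the formulation below is a generic max-margin-plus-MLE template, not the paper's exact equations.

```latex
% Generic joint max-likelihood / max-margin template (illustrative only):
% \bar{z}_d is a latent-topic representation of document/image d, f a joint
% feature map, \Delta a label loss, and C the usual trade-off constant.
\min_{\Theta,\, w}\;
  -\sum_{d} \log p(x_d \mid \Theta)
  \;+\; C \sum_{d} \max_{y}\,
  \bigl[\, \Delta(y, y_d) + w^{\top} f(\bar{z}_d, y) - w^{\top} f(\bar{z}_d, y_d) \,\bigr]_{+}
```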

NeurIPS Conference 2010 Conference Paper

Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification

  • Li-Jia Li
  • Hao Su
  • Li Fei-Fei
  • Eric Xing

Robust low-level image features have been proven to be effective representations for a variety of visual recognition tasks such as object recognition and scene classification; but pixels, or even local image patches, carry little semantic meaning. For high-level visual tasks, such low-level image representations are potentially not enough. In this paper, we propose a high-level image representation, called the Object Bank, where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors, blind to the testing dataset or visual task. Leveraging the Object Bank representation, superior performance on high-level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM. Sparsity algorithms make our representation more efficient and scalable for large scene datasets, and reveal semantically meaningful feature patterns.
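
Object Bank in caricature: run a bank of pretrained detectors at several scales and pool each response map into one feature dimension, then hand the vector to a linear classifier. The detector callables and the pooling choice are assumptions.

```python
# Pooled detector responses as a high-level image feature vector.
import numpy as np

def object_bank_features(image, detectors, scales=(1.0, 0.5, 0.25)) -> np.ndarray:
    feats = []
    for det in detectors:                      # one pretrained generic detector each
        for s in scales:                       # scale-invariant response maps
            response_map = det(image, scale=s)   # 2D array of detection responses
            feats.append(response_map.max())     # max-pool per (detector, scale)
    return np.array(feats)                     # input to logistic regression / linear SVM
```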

NeurIPS Conference 2009 Conference Paper

Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis

  • Barry Chai
  • Dirk Walther
  • Diane Beck
  • Li Fei-Fei

In this study, we present a method for estimating the mutual information for a localized pattern of fMRI data. We show that taking a multivariate information approach to voxel selection leads to a decoding accuracy that surpasses a univariate information approach and other standard voxel selection methods. Furthermore, we extend the multivariate mutual information theory to measure the functional connectivity between distributed brain regions. By jointly estimating the information shared by two sets of voxels we can reliably map out the connectivities in the human brain during experimental conditions. We validated our approach on a 6-way scene categorization fMRI experiment. The multivariate information analysis is able to find strong information flow between PPA and RSC, which confirms existing neuroscience studies on scenes. Furthermore, by exploring over the whole brain, our method identifies other interesting ROIs that share information with the PPA-RSC scene network, suggesting interesting future work for neuroscientists.
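
One common route to a multivariate information estimate for a voxel pattern: decode the condition from the pattern with cross-validation and read mutual information off the confusion matrix. This is a generic estimator for illustration, not necessarily the paper's.

```python
# Multivariate (pattern-level) information via a cross-validated decoder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

def pattern_information(voxels: np.ndarray, labels: np.ndarray) -> float:
    pred = cross_val_predict(LogisticRegression(max_iter=1000), voxels, labels, cv=5)
    joint = confusion_matrix(labels, pred).astype(float)
    joint /= joint.sum()                                  # empirical joint p(y, y_hat)
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())  # bits
```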

NeurIPS Conference 2009 Conference Paper

Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions

  • Bangpeng Yao
  • Dirk Walther
  • Diane Beck
  • Li Fei-Fei

The human brain can be described as containing a number of functional regions. For a given task, these regions, as well as the connections between them, play a key role in information processing in the brain. However, most existing multi-voxel pattern analysis approaches either treat multiple functional regions as one large uniform region or as several independent regions, ignoring the connections between regions. In this paper, we propose to model such connections in a Hidden Conditional Random Field (HCRF) framework, where the classifier of one region of interest (ROI) makes predictions based on not only its voxels but also the classifier predictions from ROIs that it connects to. Furthermore, we propose a structural learning method in the HCRF framework to automatically uncover the connections between ROIs. Experiments on fMRI data acquired while human subjects viewed images of natural scenes show that our model can improve the top-level (the classifier combining information from all ROIs) and ROI-level prediction accuracy, as well as uncover meaningful connections between ROIs.
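
The coupling described above, flattened to inference time: each ROI's prediction mixes its own voxel evidence with the predictions of connected ROIs. Fixed weights stand in for the learned HCRF parameters and structure.

```python
# ROI predictions conditioned on connected ROIs (fixed weights for illustration).
def roi_predict(voxel_score: float, neighbor_preds: list[float],
                w_self: float = 1.0, w_nbr: float = 0.5) -> float:
    return w_self * voxel_score + w_nbr * sum(neighbor_preds)

def top_level(roi_scores: dict[str, float], connections: dict[str, list[str]]) -> str:
    combined = {
        roi: roi_predict(score, [roi_scores[n] for n in connections.get(roi, [])])
        for roi, score in roi_scores.items()
    }
    return max(combined, key=combined.get)   # classifier combining all ROIs
```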