Arrow Research search

Author name cluster

Stefan Lee

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers


NeurIPS Conference 2025 Conference Paper

Graph Neural Network Based Action Ranking for Planning

  • Rajesh Mangannavar
  • Stefan Lee
  • Alan Fern
  • Prasad Tadepalli

We propose a novel approach to learning relational policies for classical planning based on learning to rank actions. We introduce a new graph representation that explicitly captures action information and propose a Graph Neural Network (GNN) architecture augmented with Gated Recurrent Units (GRUs) to learn action rankings. Unlike value-function-based approaches that must learn a globally consistent function, our action-ranking method only needs to learn locally consistent rankings. Our model is trained on data generated from small problem instances that are easily solved by planners and is applied to significantly larger instances where planning is computationally prohibitive. Experimental results across standard planning benchmarks demonstrate that our action-ranking approach not only generalizes better to problems larger than those used in training but also outperforms multiple baseline methods (both value-function and action-ranking based) in terms of success rate and plan quality.
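
For illustration only (not code from the paper), the sketch below shows one common way an action scorer can be trained to rank planner-chosen actions above sampled alternatives with a pairwise hinge loss; the linear scorer, batch shapes, and margin are placeholder assumptions standing in for the GNN+GRU architecture described above.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(good_scores, other_scores, margin=1.0):
    """Hinge loss pushing each planner-chosen action above its alternatives.

    good_scores:  (B,)   scores of actions taken on reference plans.
    other_scores: (B, K) scores of K other applicable actions in the same states.
    """
    diff = other_scores - good_scores.unsqueeze(1) + margin
    return F.relu(diff).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    scorer = torch.nn.Linear(8, 1)                 # stand-in for a GNN+GRU scorer
    good = scorer(torch.randn(32, 8)).squeeze(-1)
    others = scorer(torch.randn(32, 5, 8)).squeeze(-1)
    loss = pairwise_ranking_loss(good, others)
    loss.backward()
    print(float(loss))
```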

ICRA Conference 2025 Conference Paper

Learning to Prune Branches in Modern Tree-Fruit Orchards

  • Abhinav Jain
  • Cindy M. Grimm
  • Stefan Lee

Dormant tree pruning is labor-intensive but essential to maintaining modern, highly productive fruit orchards. In this work, we present a closed-loop visuomotor controller for robotic pruning. The controller guides the cutter through a cluttered tree environment to reach a specified cut point and ensures the cutters are perpendicular to the branch. We train the controller using a novel orchard simulation that captures the geometric distribution of branches in a target apple orchard configuration. Unlike traditional methods requiring full 3D reconstruction, our controller uses only optical flow images from a wrist-mounted camera. We deploy our learned policy in simulation and in the real world on an example V-Trellis Envy tree with zero-shot transfer, achieving a ~30% success rate, approximately half the performance of an oracle planner.

IROS Conference 2024 Conference Paper

Interruptive Language Control of Bipedal Locomotion

  • Ashish Malik
  • Stefan Lee
  • Alan Fern

We study the problem of natural language-based control of dynamic bipedal locomotion from the perspective of operational robustness and hardware safety. Existing work on natural language-based robot control has focused on episodic command execution for stable robot platforms, such as fixed-base manipulators in table-top scenarios. These scenarios feature non-overlapping phases of instruction and execution, with execution mishaps usually posing no threat to robot safety, so non-trivial failure rates are acceptable. In contrast, our work involves indistinguishable instruction and execution stages for a dynamically unstable robot where execution failures can harm the robot. For example, interrupting a bipedal robot with a new instruction in certain states may cause it to fall. Our first contribution is to design and train a natural language-based controller for the bipedal robot Cassie that can take in new language commands at any time. Our second contribution is to introduce a protocol for evaluating the robustness of such controllers to interruptions and to evaluate the learned controller in simulation under different interruption distributions. Our third contribution is to learn a detector for interruptions that are likely to lead to failure and to integrate that detector into a failure mitigation strategy. Overall, our results show that interruptions can lead to non-trivial failure rates for the original controller and that the proposed mitigation strategy can significantly reduce that rate.

ICRA Conference 2024 Conference Paper

Point Cloud Models Improve Visual Robustness in Robotic Learners

  • Skand Peri
  • Iain Lee
  • Chanho Kim
  • Fuxin Li
  • Tucker Hermans
  • Stefan Lee

Visual control policies can encounter significant performance degradation when visual conditions like lighting or camera position differ from those seen during training – often exhibiting sharp declines in capability even for minor differences. In this work, we examine robustness to a suite of these types of visual changes for RGB-D and point cloud based visual control policies. To perform these experiments on both model-free and model-based reinforcement learners, we introduce a novel Point Cloud World Model (PCWM) and point cloud based control policies. Our experiments show that policies that explicitly encode point clouds are significantly more robust than their RGB-D counterparts. Further, we find our proposed PCWM significantly outperforms prior works in terms of sample efficiency during training. Taken together, these results suggest reasoning about the 3D scene through point clouds can improve performance, reduce learning time, and increase robustness for robotic learners. Code: https://github.com/pvskand/pcwm
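
As a hedged illustration of what "explicitly encoding point clouds" can look like in practice, the sketch below implements a tiny PointNet-style permutation-invariant encoder; the layer sizes and the name PointCloudEncoder are assumptions, not the PCWM architecture.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Tiny PointNet-style encoder: shared per-point MLP + symmetric max-pool."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) xyz coordinates; output: (B, feat_dim)
        per_point = self.mlp(points)
        return per_point.max(dim=1).values

if __name__ == "__main__":
    enc = PointCloudEncoder()
    cloud = torch.randn(4, 1024, 3)
    print(enc(cloud).shape)  # torch.Size([4, 128])
```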

ICLR Conference 2023 Conference Paper

Emergence of Maps in the Memories of Blind Navigation Agents

  • Erik Wijmans
  • Manolis Savva
  • Irfan Essa
  • Stefan Lee
  • Ari S. Morcos
  • Dhruv Batra

Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines – specifically, artificial intelligence (AI) navigation agents – also build implicit (or ‘mental’) maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent’s perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train ‘blind’ agents – with sensing limited to only egomotion and no other sensing of any kind – to perform PointGoal navigation (‘go to $\Delta$x, $\Delta$y’) via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (∼95% success); (2) they utilize memory over long horizons (remembering ∼1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent ‘forgets’ exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.

ICLR Conference 2021 Conference Paper

DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs

  • Aayam Shrestha
  • Stefan Lee
  • Prasad Tadepalli
  • Alan Fern

We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience. This approach can be applied on top of any learned representation and has the potential to easily support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. The DAC-MDP is a non-parametric model that can leverage deep representations and accounts for limited data by introducing costs for exploiting under-represented parts of the model. In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate the empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large, complex offline RL problems.
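
A minimal tabular sketch of the derived-MDP idea, assuming states have already been discretized: transition and reward models are estimated from counts, a cost is subtracted for under-represented state-action pairs, and the result is solved by value iteration. The function name, thresholds, and discretization are placeholders; the actual DAC-MDP operates over learned continuous representations with nearest-neighbor averaging.

```python
import numpy as np

def build_and_solve_dac_mdp(transitions, n_states, n_actions,
                            under_rep_cost=1.0, min_count=5,
                            gamma=0.99, iters=200):
    """Toy derived MDP with costs for under-represented (s, a) pairs.

    transitions: list of (s, a, r, s_next) tuples with discrete state indices.
    """
    counts = np.zeros((n_states, n_actions))
    rewards = np.zeros((n_states, n_actions))
    next_counts = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s2 in transitions:
        counts[s, a] += 1
        rewards[s, a] += r
        next_counts[s, a, s2] += 1

    safe = np.maximum(counts, 1)
    R = rewards / safe - under_rep_cost * (counts < min_count)  # penalize rare pairs
    P = next_counts / safe[..., None]

    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * (P @ V)      # (S, A) Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [(int(rng.integers(4)), int(rng.integers(2)),
             float(rng.random()), int(rng.integers(4))) for _ in range(500)]
    policy, values = build_and_solve_dac_mdp(data, n_states=4, n_actions=2)
    print(policy, values)
```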

AAAI Conference 2021 Conference Paper

Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views

  • Vincent Cartillier
  • Zhile Ren
  • Neha Jain
  • Stefan Lee
  • Irfan Essa
  • Dhruv Batra

We study the task of semantic mapping – specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map (‘what is where?’) from egocentric observations of an RGB-D camera with known pose (via localization sensors). Towards this goal, we present Semantic MapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length × width × feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01–16.81% (absolute) on mean-IoU and 3.81–19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the neural episodic memories and spatio-semantic allocentric representations built by SMNet for subsequent tasks in the same space – navigating to objects seen during the tour (‘Find chair’) or answering questions about the space (‘How many chairs did you see in the house?’). Project page: https://vincentcartillier.github.io/smnet.html
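
As a rough, non-learned stand-in for steps (2) and (3), the sketch below scatters per-pixel features into a top-down grid by binning their world-frame coordinates and mean-pooling; SMNet instead learns the accumulation, and the grid size and cell resolution here are arbitrary assumptions.

```python
import numpy as np

def project_to_topdown(world_xyz, feats, grid_size=100, cell_m=0.1):
    """Scatter per-pixel features into an allocentric top-down grid (mean pooled).

    world_xyz: (N, 3) points already transformed into the world frame via the
               known camera pose; feats: (N, D) features for the same pixels.
    """
    gx = np.clip((world_xyz[:, 0] / cell_m).astype(int) + grid_size // 2, 0, grid_size - 1)
    gz = np.clip((world_xyz[:, 2] / cell_m).astype(int) + grid_size // 2, 0, grid_size - 1)
    grid = np.zeros((grid_size, grid_size, feats.shape[1]))
    count = np.zeros((grid_size, grid_size, 1))
    np.add.at(grid, (gz, gx), feats)      # accumulate features per cell
    np.add.at(count, (gz, gx), 1.0)       # count hits per cell
    return grid / np.maximum(count, 1.0)

if __name__ == "__main__":
    pts = np.random.randn(5000, 3) * 2.0
    feats = np.random.rand(5000, 16)
    print(project_to_topdown(pts, feats).shape)   # (100, 100, 16)
```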

NeurIPS Conference 2021 Conference Paper

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

  • Abhinav Moudgil
  • Arjun Majumdar
  • Harsh Agrawal
  • Stefan Lee
  • Dhruv Batra

Natural language instructions for visual navigation often use scene descriptions (e.g., bedroom) and object references (e.g., green chairs) to provide a breadcrumb trail to a goal location. This work presents a transformer-based vision-and-language navigation (VLN) agent that uses two different visual encoders -- a scene classification network and an object detector -- which produce features that match these two distinct types of visual cues. In our method, scene features contribute high-level contextual information that supports object-level processing. With this design, our model is able to use vision-and-language pretraining (i.e., learning the alignment between images and text from large-scale web data) to substantially improve performance on the Room-to-Room (R2R) and Room-Across-Room (RxR) benchmarks. Specifically, our approach leads to improvements of 1.8% absolute in SPL on R2R and 3.7% absolute in SR on RxR. Our analysis reveals even larger gains for navigation instructions that contain six or more object references, which further suggests that our approach is better able to use object features and align them to references in the instructions.

ICLR Conference 2020 Conference Paper

DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

  • Erik Wijmans
  • Abhishek Kadian
  • Ari S. Morcos
  • Stefan Lee
  • Irfan Essa
  • Devi Parikh
  • Manolis Savva
  • Dhruv Batra

We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever "stale"), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. This massive-scale training not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019, but essentially "solves" the task -- near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs. computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of "ImageNet pre-training + task-specific fine-tuning" for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).
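
A minimal sketch of the synchronous, serverless gradient-averaging pattern that makes DD-PPO decentralized, assuming torch.distributed has already been initialized across workers; this shows the general pattern rather than the authors' implementation.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers with a collective all-reduce.

    Assumes dist.init_process_group(...) has already been called on every
    worker. Each worker calls this after loss.backward() and before
    optimizer.step(); there is no parameter server, and every worker waits
    for the collective, so no gradient is ever stale.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```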

NeurIPS Conference 2020 Conference Paper

Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

  • Michael Cogswell
  • Jiasen Lu
  • Rishabh Jain
  • Stefan Lee
  • Devi Parikh
  • Dhruv Batra

Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to a new task, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog models that can adapt to new tasks without language-level supervision. By factorizing intention and language, our model minimizes linguistic drift after fine-tuning for new tasks. We present qualitative results, automated metrics, and human studies that all show our model can adapt to new tasks and maintain language quality. Baselines either fail to perform well at new tasks or experience language drift, becoming unintelligible to humans. Code has been made available at: https://github.com/mcogswell/dialog_without_dialog

NeurIPS Conference 2020 Conference Paper

Language-Conditioned Imitation Learning for Robot Manipulation Tasks

  • Simon Stepputtis
  • Joseph Campbell
  • Mariano Phielipp
  • Stefan Lee
  • Chitta Baral
  • Heni Ben Amor

Imitation learning is a popular approach for teaching motor skills to robots. However, most approaches focus on extracting policy parameters from execution traces alone (i.e., motion trajectories and perceptual data). No adequate communication channel exists between the human expert and the robot to describe critical aspects of the task, such as the properties of the target object or the intended shape of the motion. Motivated by insights into the human teaching process, we introduce a method for incorporating unstructured natural language into imitation learning. At training time, the expert can provide demonstrations along with verbal descriptions in order to describe the underlying intent (e.g., "go to the large green bowl"). The training process then interrelates these two modalities to encode the correlations between language, perception, and motion. The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity. We demonstrate in a set of simulation experiments how our approach can learn language-conditioned manipulation policies for a seven-degree-of-freedom robot arm and compare the results to a variety of alternative methods.

NeurIPS Conference 2019 Conference Paper

Chasing Ghosts: Instruction Following as Bayesian State Tracking

  • Peter Anderson
  • Ayush Shrivastava
  • Devi Parikh
  • Dhruv Batra
  • Stefan Lee

A visually-grounded navigation instruction can be interpreted as a sequence of expected observations and actions an agent following the correct trajectory would encounter and perform. Based on this intuition, we formulate the problem of finding the goal location in Vision-and-Language Navigation (VLN) within the framework of Bayesian state tracking - learning observation and motion models conditioned on these expectable events. Together with a mapper that constructs a semantic spatial map on-the-fly during navigation, we formulate an end-to-end differentiable Bayes filter and train it to identify the goal by predicting the most likely trajectory through the map according to the instructions. The resulting navigation policy constitutes a new approach to instruction following that explicitly models a probability distribution over states, encoding strong geometric and algorithmic priors while enabling greater explainability. Our experiments show that our approach outperforms a strong LingUNet baseline when predicting the goal location on the map. On the full VLN task, i.e., navigating to the goal location, our approach achieves promising results with less reliance on navigation constraints.
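
To make the Bayesian state tracking framing concrete, here is a generic discrete Bayes filter step over a 2D map grid (predict with a motion model, reweight by an observation likelihood, renormalize); the paper learns these models end-to-end and conditions them on the instruction, whereas the kernels below are hand-specified placeholders.

```python
import numpy as np
from scipy.signal import convolve2d

def bayes_filter_step(belief, motion_kernel, obs_likelihood):
    """One predict/update step of a discrete Bayes filter over a 2D map grid."""
    predicted = convolve2d(belief, motion_kernel, mode="same")   # motion model
    posterior = predicted * obs_likelihood                        # observation model
    return posterior / max(posterior.sum(), 1e-12)                # renormalize

if __name__ == "__main__":
    belief = np.full((10, 10), 1.0 / 100)                 # uniform prior over cells
    kernel = np.ones((3, 3)) / 9.0                        # diffuse one-cell motion
    obs = np.random.rand(10, 10)                          # stand-in likelihood map
    print(bayes_filter_step(belief, kernel, obs).sum())   # ~1.0
```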

ICML Conference 2019 Conference Paper

Counterfactual Visual Explanations

  • Yash Goyal
  • Ziyan Wu 0001
  • Jan Ernst
  • Dhruv Batra
  • Devi Parikh
  • Stefan Lee

In this work, we develop a technique to produce counterfactual visual explanations. Given a ‘query’ image $I$ for which a vision system predicts class $c$, a counterfactual visual explanation identifies how $I$ could change such that the system would output a different specified class $c’$. To do this, we select a ‘distractor’ image $I’$ that the system predicts as class $c’$ and identify spatial regions in $I$ and $I’$ such that replacing the identified region in $I$ with the identified region in $I’$ would push the system towards classifying $I$ as $c’$. We apply our approach to multiple image classification datasets generating qualitative results showcasing the interpretability and discriminativeness of our counterfactual explanations. To explore the effectiveness of our explanations in teaching humans, we present machine teaching experiments for the task of fine-grained bird classification. We find that users trained to distinguish bird species fare better when given access to counterfactual explanations in addition to training examples.
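
For intuition only, the sketch below runs the swap-and-score search described above as a brute-force pixel-space grid search; the grid size, the classifier interface, and the exhaustive strategy are simplifying assumptions rather than the paper's procedure.

```python
import torch

def best_region_swap(model, img_q, img_d, target_class, grid=4):
    """Try replacing each spatial cell of the query image with each cell of the
    distractor image; keep the swap that most raises p(target_class)."""
    _, H, W = img_q.shape
    ch, cw = H // grid, W // grid
    best = (None, -1.0)
    for qi in range(grid):
        for qj in range(grid):
            for di in range(grid):
                for dj in range(grid):
                    edited = img_q.clone()
                    edited[:, qi*ch:(qi+1)*ch, qj*cw:(qj+1)*cw] = \
                        img_d[:, di*ch:(di+1)*ch, dj*cw:(dj+1)*cw]
                    with torch.no_grad():
                        prob = model(edited.unsqueeze(0)).softmax(-1)[0, target_class]
                    if prob.item() > best[1]:
                        best = ((qi, qj, di, dj), prob.item())
    return best

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    img_q, img_d = torch.rand(3, 32, 32), torch.rand(3, 32, 32)
    print(best_region_swap(model, img_q, img_d, target_class=3))
```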

ICML Conference 2019 Conference Paper

Probabilistic Neural Symbolic Models for Interpretable Visual Question Answering

  • Ramakrishna Vedantam
  • Karan Desai
  • Stefan Lee
  • Marcus Rohrbach
  • Dhruv Batra
  • Devi Parikh

We propose a new class of probabilistic neural-symbolic models that have symbolic functional programs as a latent, stochastic variable. Instantiated in the context of visual question answering, our probabilistic formulation offers two key conceptual advantages over prior neural-symbolic models for VQA. First, the programs generated by our model are more understandable while requiring fewer teaching examples. Second, we show that one can pose counterfactual scenarios to the model to probe its beliefs about the programs that could lead to a specified answer given an image. Our results on the CLEVR and SHAPES datasets verify our hypotheses, showing that the model achieves better program (and answer) prediction accuracy even in the low-data regime and allows one to probe the coherence and consistency of the reasoning performed.

ICML Conference 2019 Conference Paper

Trainable Decoding of Sets of Sequences for Neural Sequence Models

  • Ashwin Kalyan
  • Peter Anderson
  • Stefan Lee
  • Dhruv Batra

Many sequence prediction tasks admit multiple correct outputs, so it is often useful to decode a set of outputs that maximizes some task-specific set-level metric. However, retooling standard sequence prediction procedures tailored towards predicting the single best output leads to decoding sets of very similar sequences, failing to capture the variation in the output space. To address this, we propose $\nabla$BS, a trainable decoding procedure that outputs a set of sequences highly valued according to the metric. Our method tightly integrates the training and decoding phases and further allows for optimization of the task-specific metric, addressing the shortcomings of standard sequence prediction. Further, we discuss the trade-offs of commonly used set-level metrics and motivate a new set-level metric that naturally evaluates the notion of “capturing the variation in the output space”. Finally, we show results on the image captioning task and find that our model outperforms standard techniques and natural ablations.

NeurIPS Conference 2019 Conference Paper

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

  • Jiasen Lu
  • Dhruv Batra
  • Devi Parikh
  • Stefan Lee

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
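
As a hedged sketch of the co-attentional transformer layer idea, the block below lets each stream attend over the other stream's keys and values using standard multi-head attention; the dimensions are arbitrary assumptions, and the real ViLBERT block also includes the usual feed-forward and normalization sublayers.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Queries from one stream attend to keys/values of the other stream."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (B, Nv, dim) image-region features; txt: (B, Nt, dim) token features
        vis_out, _ = self.vis_attends_txt(query=vis, key=txt, value=txt)
        txt_out, _ = self.txt_attends_vis(query=txt, key=vis, value=vis)
        return vis + vis_out, txt + txt_out

if __name__ == "__main__":
    blk = CoAttentionBlock()
    v, t = blk(torch.randn(2, 36, 256), torch.randn(2, 20, 256))
    print(v.shape, t.shape)
```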

AAAI Conference 2018 Conference Paper

Diverse Beam Search for Improved Description of Complex Scenes

  • Ashwin Vijayakumar
  • Michael Cogswell
  • Ramprasaath Selvaraju
  • Qing Sun
  • Stefan Lee
  • David Crandall
  • Dhruv Batra

A single image captures the appearance and position of multiple entities in a scene as well as their complex interactions. As a consequence, natural language grounded in visual contexts tends to be diverse – with utterances differing as focus shifts to specific objects, interactions, or levels of detail. Recently, neural sequence models such as RNNs and LSTMs have been employed to produce visually-grounded language. Beam Search, the standard work-horse for decoding sequences from these models, is an approximate inference algorithm that decodes the top-B sequences in a greedy left-to-right fashion. In practice, the resulting sequences are often minor rewordings of a common utterance, failing to capture the multimodal nature of source images. To address this shortcoming, we propose Diverse Beam Search (DBS), a diversity promoting alternative to BS for approximate inference. DBS produces sequences that are significantly different from each other by incorporating diversity constraints within groups of candidate sequences during decoding; moreover, it achieves this with minimal computational or memory overhead. We demonstrate that our method improves both diversity and quality of decoded sequences over existing techniques on two visually-grounded language generation tasks – image captioning and visual question generation – particularly on complex scenes containing diverse visual content. We also show similar improvements at language-only machine translation tasks, highlighting the generality of our approach.
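
As a toy illustration of the group-wise diversity penalty (not the full DBS algorithm), the sketch below decodes several groups greedily in sequence and subtracts a hamming-style penalty for tokens already chosen by earlier groups at the same time step; the single-beam-per-group simplification, penalty weight, and toy scoring table are assumptions.

```python
import torch

def diverse_group_greedy(step_logits_fn, n_groups=3, steps=10, diversity=0.5):
    """Groups decode one after another at each step; each group's logits are
    penalized by how often a token was already chosen by earlier groups."""
    seqs = [[] for _ in range(n_groups)]
    for _ in range(steps):
        chosen_counts = {}
        for g in range(n_groups):
            logits = step_logits_fn(seqs[g]).clone()          # (vocab,)
            for tok, cnt in chosen_counts.items():
                logits[tok] -= diversity * cnt                # diversity penalty
            tok = int(logits.argmax())
            seqs[g].append(tok)
            chosen_counts[tok] = chosen_counts.get(tok, 0) + 1
    return seqs

if __name__ == "__main__":
    torch.manual_seed(0)
    table = torch.randn(20, 50)                               # toy "model"
    step_fn = lambda prefix: table[len(prefix)]
    print(diverse_group_greedy(step_fn))
```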

ICML Conference 2018 Conference Paper

Learn from Your Neighbor: Learning Multi-modal Mappings from Sparse Annotations

  • Ashwin Kalyan
  • Stefan Lee
  • Anitha Kannan
  • Dhruv Batra

Many structured prediction problems (particularly in vision and language domains) are ambiguous, with multiple outputs being ‘correct’ for an input – e.g., there are many ways of describing an image or multiple ways of translating a sentence; however, exhaustively annotating the applicability of all possible outputs is intractable due to exponentially large output spaces (e.g., all English sentences). In practice, these problems are cast as multi-class prediction, with the likelihood of only a sparse set of annotations being maximized – unfortunately penalizing for placing beliefs on plausible but unannotated outputs. We make and test the following hypothesis – for a given input, the annotations of its neighbors may serve as an additional supervisory signal. Specifically, we propose an objective that transfers supervision from neighboring examples. We first study the properties of our developed method in a controlled toy setup before reporting results on multi-label classification and two image-grounded sequence modeling tasks – captioning and question generation. We evaluate using standard task-specific metrics and measures of output diversity, finding consistent improvements over standard maximum likelihood training and other baselines.

NeurIPS Conference 2018 Conference Paper

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

  • Sainandan Ramakrishnan
  • Aishwarya Agrawal
  • Stefan Lee

Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training -- e.g., overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary -- discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the mutual information between the image and answer given the question, which we maximize explicitly to encourage visual grounding. Our approach is a model-agnostic training procedure that is simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models -- achieving state-of-the-art on this task. Further, on standard VQA tasks, our approach shows significantly less drop in accuracy compared to existing bias-reducing VQA models.
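
One way to realize the adversarial game in code is a gradient-reversal layer: the question-only head trains normally to predict the answer from the question encoding, while the reversed gradient flowing back into the question encoder discourages it from making that prediction easy. This DANN-style sketch is an assumption about implementation details (the mutual-information term is omitted), and all module names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lam on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def question_only_bias_loss(q_encoding, answers, question_only_head, lam=1.0):
    # The head after the reversal trains normally to predict answers from the
    # question encoding; the encoder upstream receives negated gradients, so it
    # is pushed to make that prediction harder (i.e., to drop language biases).
    flipped = GradReverse.apply(q_encoding, lam)
    return F.cross_entropy(question_only_head(flipped), answers)

if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = nn.Linear(300, 128)                 # stand-in question encoder
    head = nn.Linear(128, 1000)                   # question-only answer head
    q = encoder(torch.randn(16, 300))
    loss = question_only_bias_loss(q, torch.randint(0, 1000, (16,)), head)
    loss.backward()                               # encoder now holds reversed grads
```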

NeurIPS Conference 2016 Conference Paper

Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles

  • Stefan Lee
  • Senthil Purushwalkam Shiva Prakash
  • Michael Cogswell
  • Viresh Ranjan
  • David Crandall
  • Dhruv Batra

Many practical perception systems exist within larger processes which often include interactions with users or additional components that are capable of evaluating the quality of predicted solutions. In these contexts, it is beneficial to provide these oracle mechanisms with multiple highly likely hypotheses rather than a single prediction. In this work, we pose the task of producing multiple outputs as a learning problem over an ensemble of deep networks -- introducing a novel stochastic gradient descent based approach to minimize the loss with respect to an oracle. Our method is simple to implement, agnostic to both architecture and loss function, and parameter-free. Our approach achieves lower oracle error compared to existing methods on a wide range of tasks and deep architectures. We also show qualitatively that solutions produced from our approach often provide interpretable representations of task ambiguity.
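
A compact sketch of the oracle ("winner-take-gradient") objective described above: per-example losses are computed for every ensemble member, and only the lowest-loss member on each example receives gradient. The tiny linear ensemble and batch setup are placeholders; the paper interleaves this selection with standard stochastic gradient descent training.

```python
import torch
import torch.nn.functional as F

def oracle_loss(ensemble, x, y):
    """sMCL-style oracle loss: only the best (lowest-loss) member per example
    receives gradient, encouraging ensemble members to specialize."""
    losses = torch.stack(
        [F.cross_entropy(m(x), y, reduction="none") for m in ensemble], dim=1
    )                                                      # (B, M)
    winner = losses.argmin(dim=1)                          # best member per example
    return losses.gather(1, winner.unsqueeze(1)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    ensemble = torch.nn.ModuleList([torch.nn.Linear(10, 5) for _ in range(3)])
    x, y = torch.randn(8, 10), torch.randint(0, 5, (8,))
    loss = oracle_loss(ensemble, x, y)
    loss.backward()
    print(float(loss))
```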