Arrow Research search

Author name cluster

Benjamin Sapp

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

TMLR Journal 2025 Journal Article

EMMA: End-to-End Multimodal Model for Autonomous Driving

  • Jyh-Jing Hwang
  • Runsheng Xu
  • Hubert Lin
  • Wei-Chih Hung
  • Jingwei Ji
  • Kristy Choi
  • Di Huang
  • Tong He

We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multimodal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA’s effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on an in-house large-scale benchmark. EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA’s potential as a generalist model for autonomous driving applications. We hope that our results will inspire research to further evolve the state of the art in autonomous driving model architectures.
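
As a rough illustration of the abstract's central idea of casting non-sensor inputs and outputs as natural language text, here is a minimal sketch; the prompt template, field names, and waypoint format below are hypothetical assumptions, not EMMA's actual schema.

```python
# Minimal sketch of representing driving I/O as text, in the spirit of the
# abstract above. The prompt template and waypoint format are hypothetical,
# not EMMA's actual schema.

def build_prompt(nav_instruction: str, ego_speed_mps: float, ego_heading_deg: float) -> str:
    """Serialize non-sensor inputs (navigation + ego status) as natural language."""
    return (
        f"Navigation: {nav_instruction}\n"
        f"Ego status: speed={ego_speed_mps:.1f} m/s, heading={ego_heading_deg:.0f} deg\n"
        "Task: predict the future trajectory as (x, y) waypoints in meters."
    )

def parse_trajectory(model_output: str) -> list:
    """Parse a text answer like '(1.0, 0.2); (2.1, 0.5)' back into waypoints."""
    waypoints = []
    for token in model_output.split(";"):
        token = token.strip().strip("()")
        if token:
            x, y = (float(v) for v in token.split(","))
            waypoints.append((x, y))
    return waypoints

print(build_prompt("turn right at the next intersection", 8.3, 92.0))
print(parse_trajectory("(1.0, 0.2); (2.1, 0.5); (3.3, 0.9)"))
```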

ICRA Conference 2024 Conference Paper

CausalAgents: A Robustness Benchmark for Motion Forecasting

  • Liting Sun
  • Rebecca Roelofs
  • Benjamin Caine
  • Khaled S. Refaat
  • Benjamin Sapp
  • Scott Ettinger
  • Wei Chai

As machine learning models become increasingly prevalent in motion forecasting for autonomous vehicles (AVs), it is critical to ensure that model predictions are safe and reliable. In this paper, we examine the robustness of motion forecasting to non-causal perturbations. We construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. Specifically, we conduct an extensive labeling effort to identify causal agents, or agents whose presence influences human drivers’ behavior, in the Waymo Open Motion Dataset (WOMD), and we use these labels to perturb the data by deleting non-causal agents from the scene. We evaluate a diverse set of state-of-the-art deep-learning models on our proposed benchmark and find that all evaluated models exhibit large shifts under non-causal perturbation: we observe a surprising 25-38% relative change in minADE as compared to the original. In addition, we investigate techniques to improve model robustness, including increasing the training dataset size and using targeted data augmentations that randomly drop non-causal agents throughout training. Finally, we release the causal agent labels as an extension to WOMD and the robustness benchmarks to aid the community in building more reliable and safe deep-learning models for motion forecasting.
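
A minimal sketch of the two ingredients named in this abstract, the minADE metric used to measure the shift and an augmentation that randomly drops agents not labeled as causal; array shapes and the drop probability are assumptions, not the paper's exact setup.

```python
import numpy as np

# Sketch, under assumed shapes: minADE over K candidate trajectories, and an
# augmentation that randomly removes non-causal agents from a scene.

def min_ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred: (K, T, 2) candidate trajectories; gt: (T, 2) ground truth."""
    ade_per_mode = np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)  # (K,)
    return float(ade_per_mode.min())

def drop_noncausal_agents(scene: dict, causal_ids: set, p_drop: float = 0.5, rng=None) -> dict:
    """Randomly remove non-causal agents from a scene dict {agent_id: track}."""
    rng = rng or np.random.default_rng()
    return {aid: trk for aid, trk in scene.items()
            if aid in causal_ids or rng.random() > p_drop}

rng = np.random.default_rng(0)
pred = rng.normal(size=(6, 80, 2))
gt = rng.normal(size=(80, 2))
print("minADE:", min_ade(pred, gt))

scene = {i: None for i in range(5)}
print("kept agents:", sorted(drop_noncausal_agents(scene, causal_ids={0, 1}, rng=rng)))
```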

IROS Conference 2023 Conference Paper

Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios

  • Yiren Lu 0001
  • Justin Fu
  • George Tucker
  • Xinlei Pan
  • Eli Bronstein
  • Rebecca Roelofs
  • Benjamin Sapp
  • Brandyn White

Imitation learning (IL) is a simple and powerful way to use high-quality human driving data, which can be collected at scale, to produce human-like behavior. However, policies based on imitation learning alone often fail to sufficiently account for safety and reliability concerns. In this paper, we show how imitation learning combined with reinforcement learning using simple rewards can substantially improve the safety and reliability of driving policies over those learned from imitation alone. In particular, we train a policy on over 100k miles of urban driving data, and measure its effectiveness in test scenarios grouped by different levels of collision likelihood. Our analysis shows that while imitation can perform well in low-difficulty scenarios that are well-covered by the demonstration data, our proposed approach significantly improves robustness on the most challenging scenarios (over 38% reduction in failures). To our knowledge, this is the first application of a combined imitation and reinforcement learning approach in autonomous driving that utilizes large amounts of real-world human driving data.
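
A generic sketch of combining a behavior-cloning term with a simple-reward RL term, as the abstract describes at a high level; the loss weighting and reward definition here are illustrative assumptions, not the paper's exact formulation.

```python
import torch

# Sketch: imitation (BC) loss plus a REINFORCE-style term driven by a simple
# scalar reward (e.g. -1 on a collision/off-road event). Weights are assumptions.

def combined_loss(policy_logprob_expert: torch.Tensor,
                  policy_logprob_rollout: torch.Tensor,
                  rollout_reward: torch.Tensor,
                  rl_weight: float = 0.1) -> torch.Tensor:
    bc_loss = -policy_logprob_expert.mean()                                # imitate logged actions
    rl_loss = -(policy_logprob_rollout * rollout_reward.detach()).mean()   # reinforce good rollouts
    return bc_loss + rl_weight * rl_loss

lp_expert = torch.randn(32)                       # log pi(expert action | state) on logged data
lp_rollout = torch.randn(32)                      # log pi(rollout action | state) in simulation
reward = -(torch.rand(32) < 0.1).float()          # simple reward: -1 on a failure event, else 0
print(combined_loss(lp_expert, lp_rollout, reward))
```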

TMLR Journal 2023 Journal Article

VN-Transformer: Rotation-Equivariant Attention for Vector Neurons

  • Serge Assaad
  • Carlton Downey
  • Rami Al-Rfou'
  • Nigamaa Nayakanti
  • Benjamin Sapp

Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: (i) we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; (ii) we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; (iii) we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; (iv) we show that small tradeoffs in equivariance ($\epsilon$-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.
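
The rotation-equivariance property described in this abstract can be checked numerically with a toy "vector neuron" linear layer, where each feature is a 3-vector and the layer only mixes channels; this is a minimal illustration, not the VN-Transformer implementation.

```python
import numpy as np

# Toy check that a channel-mixing layer over 3-vector features satisfies
# f(X @ R) == f(X) @ R for any rotation R.

rng = np.random.default_rng(0)
N, C_in, C_out = 5, 8, 16
X = rng.normal(size=(N, C_in, 3))          # N points, C_in vector channels
W = rng.normal(size=(C_out, C_in))         # channel-mixing weights (no bias)

def vn_linear(x: np.ndarray) -> np.ndarray:
    return np.einsum("oc,ncd->nod", W, x)  # (N, C_out, 3)

# A random rotation matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))          # ensure det(R) = +1

lhs = vn_linear(X @ R)    # rotate inputs, then apply the layer
rhs = vn_linear(X) @ R    # apply the layer, then rotate outputs
print("equivariant:", np.allclose(lhs, rhs))
```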

ICRA Conference 2023 Conference Paper

Wayformer: Motion Forecasting via Simple & Efficient Attention Networks

  • Nigamaa Nayakanti
  • Rami Al-Rfou
  • Aurick Zhou
  • Kratarth Goel
  • Khaled S. Refaat
  • Benjamin Sapp

Motion forecasting for autonomous driving is a challenging task because complex driving scenarios involve a heterogeneous mix of static and dynamic inputs. It is an open problem how best to represent and fuse information about road geometry, lane connectivity, time-varying traffic light state, and history of a dynamic set of agents and their interactions into an effective encoding. To model this diverse set of input features, many approaches proposed to design an equally complex system with a diverse set of modality specific modules. This results in systems that are difficult to scale, extend, or tune in rigorous ways to trade off quality and efficiency. In this paper, we present Wayformer, a family of simple and homogeneous attention based architectures for motion forecasting. Wayformer offers a compact model description consisting of an attention based scene encoder and a decoder. In the scene encoder we study the choice of early, late and hierarchical fusion of input modalities. For each fusion type we explore strategies to trade off efficiency and quality via factorized attention or latent query attention. We show that early fusion, despite its simplicity, is not only modality agnostic but also achieves state-of-the-art results on both Waymo Open Motion Dataset (WOMD) and Argoverse leaderboards, demonstrating the effectiveness of our design philosophy.
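
A minimal sketch of the "early fusion" design described in this abstract: project each input modality into a shared token space, concatenate, encode with one homogeneous self-attention stack, and decode with learned latent queries. Dimensions and module choices are assumptions, not Wayformer's actual configuration.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Sketch of early fusion: all modality tokens are concatenated before attention."""
    def __init__(self, dims: dict, d_model: int = 128, n_queries: int = 64):
        super().__init__()
        # One linear projection per modality (agents, roadgraph, traffic lights, ...).
        self.proj = nn.ModuleDict({k: nn.Linear(d, d_model) for k, d in dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))   # latent queries
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, inputs: dict) -> torch.Tensor:
        tokens = torch.cat([self.proj[k](v) for k, v in inputs.items()], dim=1)
        memory = self.encoder(tokens)                        # (B, S, d_model)
        q = self.queries.expand(tokens.shape[0], -1, -1)     # (B, n_queries, d_model)
        out, _ = self.cross_attn(q, memory, memory)
        return out

enc = EarlyFusionEncoder({"agents": 16, "roadgraph": 8, "traffic_lights": 4})
batch = {"agents": torch.randn(2, 32, 16),
         "roadgraph": torch.randn(2, 200, 8),
         "traffic_lights": torch.randn(2, 10, 4)}
print(enc(batch).shape)  # torch.Size([2, 64, 128])
```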

NeurIPS Conference 2023 Conference Paper

Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research

  • Cole Gulino
  • Justin Fu
  • Wenjie Luo
  • George Tucker
  • Eli Bronstein
  • Yiren Lu
  • Jean Harb
  • Xinlei Pan

Simulation is an essential tool to develop and benchmark autonomous vehicle planning software in a safe and cost-effective manner. However, realistic simulation requires accurate modeling of multi-agent interactive behaviors to be trustworthy, behaviors which can be highly nuanced and complex. To address these challenges, we introduce Waymax, a new data-driven simulator for autonomous driving in multi-agent scenes, designed for large-scale simulation and testing. Waymax uses publicly-released, real-world driving data (e.g., the Waymo Open Motion Dataset) to initialize or play back a diverse set of multi-agent simulated scenarios. It runs entirely on hardware accelerators such as TPUs/GPUs and supports in-graph simulation for training, making it suitable for modern large-scale, distributed machine learning workflows. To support online training and evaluation, Waymax includes several learned and hard-coded behavior models that allow for realistic interaction within simulation. To supplement Waymax, we benchmark a suite of popular imitation and reinforcement learning algorithms with ablation studies on different design decisions, where we highlight the effectiveness of routes as guidance for planning agents and the ability of RL to overfit against simulated agents.
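
A generic closed-loop rollout sketch of the workflow this abstract describes (initialize a sim state from a logged scenario, step a planning agent, score the rollout); the ToySim class and all names below are illustrative placeholders, not the actual Waymax API.

```python
import numpy as np

class ToySim:
    """Point-mass 'simulator': state is the ego (x, y); action is a velocity command."""
    def reset(self, scenario: np.ndarray) -> np.ndarray:
        return scenario[0].copy()                       # start from the logged first pose
    def step(self, state: np.ndarray, action: np.ndarray, goal: np.ndarray):
        nxt = state + 0.1 * action                      # 0.1 s integration step
        reward = -float(np.linalg.norm(nxt - goal))     # simple progress reward
        return nxt, reward

def rollout(sim: ToySim, scenario: np.ndarray, horizon: int = 20) -> float:
    goal = scenario[-1]
    state, ret = sim.reset(scenario), 0.0
    for _ in range(horizon):
        action = np.clip(goal - state, -1.0, 1.0)       # naive stand-in "planner"
        state, reward = sim.step(state, action, goal)
        ret += reward
    return ret

logged = np.stack([np.zeros(2), np.array([5.0, 2.0])])  # logged start and end pose
print("return:", rollout(ToySim(), logged))
```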

ICRA Conference 2022 Conference Paper

MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction

  • Balakrishnan Varadarajan
  • Ahmed Hefny
  • Avikalp Srivastava
  • Khaled S. Refaat
  • Nigamaa Nayakanti
  • Andre Cornman
  • Kan Chen
  • Bertrand Douillard

Predicting the future behavior of road users is one of the most challenging and important problems in autonomous driving. Applying deep learning to this problem requires fusing heterogeneous world state in the form of rich perception signals and map information, and inferring highly multi-modal distributions over possible futures. In this paper, we present MultiPath++, a future prediction model that achieves state-of-the-art performance on popular benchmarks. MultiPath++ improves the MultiPath architecture [34] by revisiting many design choices. The first key design difference is a departure from dense image-based encoding of the input world state in favor of a sparse encoding of heterogeneous scene elements: MultiPath++ consumes compact and efficient polylines to describe road features, and raw agent state information directly (e.g., position, velocity, acceleration). We propose a context-aware fusion of these elements and develop a reusable multi-context gating fusion component. Second, we reconsider the choice of pre-defined static anchors, and develop a way to learn latent anchor embeddings end-to-end in the model. Lastly, we explore ensembling and output aggregation techniques—common in other ML domains—and find effective variants for our probabilistic multimodal output representation. We perform an extensive ablation on these design choices, and show that our proposed model achieves state-of-the-art performance on the Argoverse Motion Forecasting Competition [10] and the Waymo Open Dataset Motion Prediction Challenge [13].
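
A minimal sketch of context-gated fusion in the spirit of the "multi-context gating" component named above: each scene-element embedding is modulated by a gate computed from a shared context vector and then pooled. The exact MultiPath++ formulation differs; treat this as an assumption-level illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_elements, d = 12, 32                      # e.g. 12 polyline / agent embeddings
S = rng.normal(size=(n_elements, d))        # scene-element embeddings
c = rng.normal(size=(d,))                   # shared context (e.g. target agent state)
W_s, W_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

gate = sigmoid(c @ W_c)                      # (d,) gate derived from the context
fused_elements = (S @ W_s) * gate            # element-wise modulation, (n_elements, d)
pooled_context = fused_elements.max(axis=0)  # permutation-invariant aggregation

print(fused_elements.shape, pooled_context.shape)
```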

ICRA Conference 2022 Conference Paper

Narrowing the coordinate-frame gap in behavior prediction models: Distillation for efficient and accurate scene-centric motion forecasting

  • DiJia Andy Su
  • Bertrand Douillard
  • Rami Al-Rfou
  • Cheol Park
  • Benjamin Sapp

Behavior prediction models have proliferated in recent years, especially in the popular real-world robotics application of autonomous driving, where representing the distribution over possible futures of moving agents is essential for safe and comfortable motion planning. In these models, the choice of coordinate frames to represent inputs and outputs has crucial trade-offs which broadly fall into one of two categories. Agent-centric models transform inputs and perform inference in agent-centric coordinates. These models are intrinsically invariant to translation and rotation between scene elements, are best-performing on public leaderboards, but scale quadratically with the number of agents and scene elements. Scene-centric models use a fixed coordinate system to process all agents. This gives them the advantage of sharing representations among all agents, offering efficient amortized inference computation which scales linearly with the number of agents. However, these models have to learn invariance to translation and rotation between scene elements, and typically underperform agent-centric models. In this work, we develop knowledge distillation techniques between probabilistic motion forecasting models, and apply these techniques to close the gap in performance between agent-centric and scene-centric models. This improves scene-centric model performance by 13.2% on the public Argoverse benchmark, 7.8% on Waymo Open Dataset and up to 9.4% on a large In-House dataset. These improved scene-centric models rank highly in public leaderboards and are up to 15 times more efficient than their agent-centric teacher counterparts in busy scenes.
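
A sketch of a distillation objective between two probabilistic trajectory predictors, in the spirit of the approach described above: the (scene-centric) student matches the (agent-centric) teacher's mode probabilities and mode trajectories. The loss terms and weighting are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_traj, student_logits, teacher_traj, teacher_probs,
                 traj_weight: float = 1.0):
    """student_traj/teacher_traj: (B, K, T, 2); student_logits/teacher_probs: (B, K)."""
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                  reduction="batchmean")                        # match mode weights
    traj = (teacher_probs.unsqueeze(-1).unsqueeze(-1) *
            (student_traj - teacher_traj).pow(2)).mean()        # probability-weighted L2
    return kl + traj_weight * traj

B, K, T = 4, 6, 80
teacher_probs = torch.softmax(torch.randn(B, K), dim=-1)
loss = distill_loss(torch.randn(B, K, T, 2), torch.randn(B, K),
                    torch.randn(B, K, T, 2), teacher_probs)
print(loss.item())
```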

ICLR Conference 2022 Conference Paper

Scene Transformer: A unified architecture for predicting future trajectories of multiple agents

  • Jiquan Ngiam
  • Vijay Vasudevan
  • Benjamin Caine
  • Zhengdong Zhang
  • Hao-Tien Lewis Chiang
  • Jeffrey Ling
  • Rebecca Roelofs
  • Alex Bewley

Predicting the motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g., vehicles and pedestrians) and their associated behaviors may be diverse and influence one another. Most prior work has focused on predicting independent futures for each agent based on all past motion, and planning against these independent predictions. However, planning against independent predictions can make it challenging to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly, producing consistent futures that account for interactions between agents. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture employs attention to combine features across road elements, agent interactions, and time steps. We evaluate our approach on autonomous driving datasets for both marginal and joint motion prediction, and achieve state of the art performance across two popular datasets. Through combining a scene-centric approach, agent permutation equivariant model, and a sequence masking strategy, we show that our model can unify a variety of motion prediction tasks from joint motion predictions to conditioned prediction.
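
A small sketch of the masking idea described above: a single [agents x time] visibility mask selects the prediction task. Revealing only history gives ordinary joint prediction; additionally revealing the AV's future gives prediction conditioned on the AV's plan. Shapes and the mask convention are assumptions.

```python
import numpy as np

n_agents, n_steps, t_now, av_index = 4, 20, 10, 0
mask = np.zeros((n_agents, n_steps), dtype=bool)

mask[:, :t_now] = True                 # all agents' history is visible
joint_prediction_mask = mask.copy()    # predict every agent's future jointly

conditional_mask = mask.copy()
conditional_mask[av_index, :] = True   # additionally reveal the AV's full future

print(joint_prediction_mask.sum(), conditional_mask.sum())
```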

ICRA Conference 2022 Conference Paper

StopNet: Scalable Trajectory and Occupancy Prediction for Urban Autonomous Driving

  • Jinkyu Kim 0001
  • Reza Mahjourian
  • Scott Ettinger
  • Mayank Bansal
  • Brandyn White
  • Benjamin Sapp
  • Dragomir Anguelov

We introduce a motion forecasting (behavior prediction) method that meets the latency requirements for autonomous driving in dense urban environments without sacrificing accuracy. A whole-scene sparse input representation allows StopNet to scale to predicting trajectories for hundreds of road agents with reliable latency. In addition to predicting trajectories, our scene encoder lends itself to predicting whole-scene probabilistic occupancy grids, a complementary output representation suitable for busy urban environments. Occupancy grids allow the AV to reason collectively about the behavior of groups of agents without processing their individual trajectories. We demonstrate the effectiveness of our sparse input representation and our model in terms of computation and accuracy over three datasets. We further show that co-training consistent trajectory and occupancy predictions improves upon state-of-the-art performance under standard metrics.
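
A sketch of the relationship between the two output representations mentioned above: per-agent waypoints can be rasterized into a per-timestep occupancy grid, which lets a planner reason about groups of agents at once. The grid extent and resolution here are arbitrary assumptions, not StopNet's settings.

```python
import numpy as np

def rasterize(waypoints: np.ndarray, grid_size: int = 64, extent_m: float = 100.0):
    """waypoints: (n_agents, T, 2) in meters, ego-centered. Returns (T, H, W) occupancy."""
    n_agents, T, _ = waypoints.shape
    grid = np.zeros((T, grid_size, grid_size), dtype=np.float32)
    cells = ((waypoints + extent_m / 2) / extent_m * grid_size).astype(int)
    cells = np.clip(cells, 0, grid_size - 1)
    for t in range(T):
        grid[t, cells[:, t, 1], cells[:, t, 0]] = 1.0   # mark occupied cells at step t
    return grid

occ = rasterize(np.random.uniform(-40, 40, size=(30, 10, 2)))
print(occ.shape, occ.sum())
```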

ICRA Conference 2021 Conference Paper

Identifying Driver Interactions via Conditional Behavior Prediction

  • Ekaterina I. Tolstaya
  • Reza Mahjourian
  • Carlton Downey
  • Balakrishnan Varadarajan
  • Benjamin Sapp
  • Dragomir Anguelov

Interactive driving scenarios, such as lane changes, merges and unprotected turns, are some of the most challenging situations for autonomous driving. Planning in interactive scenarios requires accurately modeling the reactions of other agents to different future actions of the ego agent. We develop end-to-end models for conditional behavior prediction (CBP) that take as input a query future trajectory for an ego-agent, and predict distributions over future trajectories for other agents conditioned on the query. Leveraging such a model, we develop a general-purpose agent interactivity score derived from probabilistic first principles. The interactivity score allows us to find interesting interactive scenarios for training and evaluating behavior prediction models. We further demonstrate that the proposed score is effective for agent prioritization under computational budget constraints.
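
One natural instantiation of an "interactivity score", sketched here as an assumption rather than the paper's exact definition: measure how much an agent's predicted distribution changes when conditioned on an ego query trajectory, using KL divergence over K discrete trajectory modes.

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """KL(p || q) for two discrete distributions, with light smoothing."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

marginal = np.array([0.5, 0.3, 0.15, 0.05])       # P(other agent's mode)
conditional = np.array([0.05, 0.10, 0.25, 0.60])  # P(mode | ego query trajectory)

print("interactivity:", kl(conditional, marginal))  # large value => strong interaction
```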

CSL Conference 2011 Conference Paper

Concurrency Semantics for the Geiger-Paz-Pearl Axioms of Independence

  • Sara Miner More
  • Pavel Naumov
  • Benjamin Sapp

Independence between two sets of random variables is a well-known relation in probability theory. Its origins trace back to Abraham de Moivre's work in the 18th century. The propositional theory of this relation was axiomatized by Geiger, Paz, and Pearl. Sutherland introduced a relation in information flow theory that later became known as "nondeducibility." Subsequently, the first two authors generalized this relation from a relation between two arguments to a relation between two sets of arguments and proved that it is completely described by essentially the same axioms as independence in probability theory. This paper considers a non-interference relation between two groups of concurrent processes sharing common resources. Two such groups are called non-interfering if, when executed concurrently, the only way for them to reach deadlock is for one of the groups to deadlock internally. The paper shows that a complete axiomatization of this relation is given by the same Geiger-Paz-Pearl axioms.

KR Conference 2010 Conference Paper

Independence and Functional Dependence Relations on Secrets

  • Robert Kelvey
  • Sara Miner More
  • Pavel Naumov
  • Benjamin Sapp

We study logical principles connecting two relations: independence, which is known as nondeducibility in the study of information flow, and functional dependence. Two different epistemic interpretations for these relations are discussed: semantics of secrets and probabilistic semantics. A logical system sound and complete with respect to both of these semantics is introduced and is shown to be decidable.

NeurIPS Conference 2010 Conference Paper

Sidestepping Intractable Inference with Structured Ensemble Cascades

  • David Weiss
  • Benjamin Sapp
  • Ben Taskar

For many structured prediction problems, complex models often require adopting approximate inference techniques such as variational methods or sampling, which generally provide no satisfactory accuracy guarantees. In this work, we propose sidestepping intractable inference altogether by learning ensembles of tractable sub-models as part of a structured prediction cascade. We focus in particular on problems with high-treewidth and large state-spaces, which occur in many computer vision tasks. Unlike other variational methods, our ensembles do not enforce agreement between sub-models, but filter the space of possible outputs by simply adding and thresholding the max-marginals of each constituent model. Our framework jointly estimates parameters for all models in the ensemble for each level of the cascade by minimizing a novel, convex loss function, yet requires only a linear increase in computation over learning or inference in a single tractable sub-model. We provide a generalization bound on the filtering loss of the ensemble as a theoretical justification of our approach, and we evaluate our method on both synthetic data and the task of estimating articulated human pose from challenging videos. We find that our approach significantly outperforms loopy belief propagation on the synthetic data and a state-of-the-art model on the pose estimation/tracking problem.
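
A sketch of the filtering step described above: each tractable sub-model produces max-marginal scores for every candidate state, the ensemble simply sums them, and states scoring below a threshold are pruned before the next cascade level. The convex max/mean threshold used here is a common cascade choice and an assumption, not necessarily the paper's exact rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_states = 3, 50
max_marginals = rng.normal(size=(n_models, n_states))   # one row of scores per sub-model

scores = max_marginals.sum(axis=0)                      # ensemble = simple sum, no agreement enforced
alpha = 0.5                                             # trade-off: prune more vs. keep more
threshold = alpha * scores.max() + (1 - alpha) * scores.mean()
kept_states = np.flatnonzero(scores >= threshold)       # states passed to the next level

print(f"kept {kept_states.size} of {n_states} states")
```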

AAAI Conference 2008 Conference Paper

A Fast Data Collection and Augmentation Procedure for Object Recognition

  • Benjamin Sapp

When building an application that requires object class recognition, having enough data to learn from is critical for good performance, and can easily determine the success or failure of the system. However, it is typically extremely labor-intensive to collect data, as the process usually involves acquiring the image, then manual cropping and hand-labeling. Preparing large training sets for object recognition has already become one of the main bottlenecks for such emerging applications as mobile robotics and object recognition on the web. This paper focuses on a novel and practical solution to the dataset collection problem. Our method is based on using a green screen to rapidly collect example images; we then use a probabilistic model to rapidly synthesize a much larger training set that attempts to capture desired invariants in the object’s foreground and background. We demonstrate this procedure on our own mobile robotics platform, where we achieve 135x savings in the time/effort needed to obtain a training set. Our data collection method is agnostic to the learning algorithm being used, and applies to any of a large class of standard object recognition methods. Given these results, we suggest that this method become a standard protocol for developing scalable object recognition systems. Further, we used our data to build reliable classifiers that enabled our robot to visually recognize an object in an office environment, and thereby fetch an object from an office in response to a verbal request.
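
A sketch of the core of the data-augmentation procedure described above: segment the object from a green-screen capture and composite it onto varied backgrounds. The color threshold and compositing are deliberately simple assumptions; the paper's probabilistic synthesis model is richer than this.

```python
import numpy as np

def green_screen_mask(img: np.ndarray) -> np.ndarray:
    """img: (H, W, 3) uint8 RGB. True where the pixel is NOT green screen."""
    r, g, b = img[..., 0].astype(int), img[..., 1].astype(int), img[..., 2].astype(int)
    return ~((g > 120) & (g > r + 40) & (g > b + 40))

def composite(foreground: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Replace green-screen pixels of the foreground capture with a new background."""
    mask = green_screen_mask(foreground)[..., None]
    return np.where(mask, foreground, background)

fg = np.full((64, 64, 3), (0, 200, 0), dtype=np.uint8)    # green screen capture...
fg[20:40, 20:40] = (180, 60, 60)                          # ...with a dummy object in it
bg = np.random.randint(0, 255, size=(64, 64, 3), dtype=np.uint8)
print(composite(fg, bg).shape)
```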

IJCAI Conference 2007 Conference Paper

Peripheral-Foveal Vision for Real-time Object Recognition and Tracking in Video

  • Stephen Gould
  • Joakim Arfvidsson
  • Adrian Kaehler
  • Benjamin Sapp
  • Marius Messner
  • Gary Bradski
  • Paul Baumstarck
  • Sukwon Chung

Human object recognition in a physical 3-d environment is still far superior to that of any robotic vision system. We believe that one reason (out of many) for this - one that has not heretofore been significantly exploited in the artificial vision literature - is that humans use a fovea to fixate on, or near an object, thus obtaining a very high resolution image of the object and rendering it easy to recognize. In this paper, we present a novel method for identifying and tracking objects in multi-resolution digital video of partially cluttered environments. Our method is motivated by biological vision systems and uses a learned "attentive" interest map on a low resolution data stream to direct a high resolution "fovea." Objects that are recognized in the fovea can then be tracked using peripheral vision. Because object recognition is run only on a small foveal image, our system achieves performance in real-time object recognition and tracking that is well beyond simpler systems.
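
A sketch of the attention mechanism described above: a low-resolution "interest map" selects where to place a high-resolution foveal crop in the full frame. Map and frame sizes are arbitrary assumptions, not the paper's system parameters.

```python
import numpy as np

def foveate(frame_hi: np.ndarray, interest_lo: np.ndarray, crop: int = 128) -> np.ndarray:
    """frame_hi: (H, W, 3) high-res frame; interest_lo: (h, w) saliency scores."""
    h, w = interest_lo.shape
    iy, ix = np.unravel_index(interest_lo.argmax(), interest_lo.shape)
    cy = int((iy + 0.5) / h * frame_hi.shape[0])      # scale the peak to hi-res coordinates
    cx = int((ix + 0.5) / w * frame_hi.shape[1])
    y0 = np.clip(cy - crop // 2, 0, frame_hi.shape[0] - crop)
    x0 = np.clip(cx - crop // 2, 0, frame_hi.shape[1] - crop)
    return frame_hi[y0:y0 + crop, x0:x0 + crop]       # the foveal region to recognize in

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
interest = np.random.default_rng(0).random((18, 32))
print(foveate(frame, interest).shape)   # (128, 128, 3)
```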