Arrow Research search

Author name cluster

Bolei Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

38 papers
2 author rows

Possible papers

38

ICLR Conference 2025 Conference Paper

3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting

  • Qihang Zhang
  • Yinghao Xu 0001
  • Chaoyang Wang 0001
  • Hsin-Ying Lee 0001
  • Gordon Wetzstein
  • Bolei Zhou
  • Ceyuan Yang

Scene image editing is crucial for entertainment, photography, and advertising design. Existing methods solely focus on either 2D individual object or 3D global scene editing. This results in a lack of a unified approach to effectively control and manipulate scenes at the 3D level with different levels of granularity. In this work, we propose 3DitScene, a novel and unified scene editing framework leveraging language-guided disentangled Gaussian Splatting that enables seamless editing from 2D to 3D, allowing precise control over scene composition and individual objects. We first incorporate 3D Gaussians that are refined through generative priors and optimization techniques. Language features from CLIP then introduce semantics into 3D geometry for object disentanglement. With the disentangled Gaussians, 3DitScene allows for manipulation at both the global and individual levels, revolutionizing creative expression and empowering control over scenes and objects. Experimental results demonstrate the effectiveness and versatility of 3DitScene in scene image editing.

NeurIPS Conference 2025 Conference Paper

Adv-BMT: Bidirectional Motion Transformer for Safety-Critical Traffic Scenario Generation

  • Yuxin Liu
  • Zhenghao (Mark) Peng
  • Xuanhao Cui
  • Bolei Zhou

Scenario-based testing is essential for validating the performance of autonomous driving (AD) systems. However, such testing is limited by the scarcity of long-tailed, safety-critical scenarios in existing datasets collected in the real world. To tackle the data issue, we propose the Adv-BMT framework, which augments real-world scenarios with diverse and realistic adversarial interactions. The core component of Adv-BMT is a bidirectional motion transformer (BMT) model to perform inverse traffic motion predictions, which takes the last frame of the scenario as input and reconstruct the traffic in the inverse of chronological order, till the initial time step. The Adv-BMT framework is a two-stage pipeline: it first conducts adversarial initializations and then inverse motion predictions. Different from previous work, we do not need any collision data for pretraining and are still able to generate realistic and diverse collision interactions. Our experimental results validate the quality of generated collision scenarios by Adv-BMT: training in our augmented dataset would reduce episode collision rates by 20\% compared to previous work. The code will be made available.

NeurIPS Conference 2025 Conference Paper

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

  • Zewei Zhou
  • Tianhui Cai
  • Seth Zhao
  • Yun Zhang
  • Zhiyu Huang
  • Bolei Zhou
  • Jiaqi Ma

Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.

IROS Conference 2025 Conference Paper

CooPre: Cooperative Pretraining for V2X Cooperative Perception

  • Seth Z. Zhao
  • Hao Xiang
  • Chenfeng Xu
  • Xin Xia 0007
  • Bolei Zhou
  • Jiaqi Ma 0003

Existing Vehicle-to-Everything (V2X) cooperative perception methods rely on accurate multi-agent 3D annotations. Nevertheless, it is time-consuming and expensive to collect and annotate real-world data, especially for V2X systems. In this paper, we present a self-supervised learning framwork for V2X cooperative perception, which utilizes the vast amount of unlabeled 3D V2X data to enhance the perception performance. Specifically, multi-agent sensing information is aggregated to form a holistic view and a novel proxy task is formulated to reconstruct the LiDAR point clouds across multiple connected agents to better reason multi-agent spatial correlations. Besides, we develop a V2X bird-eye-view (BEV) guided masking strategy which effectively allows the model to pay attention to 3D features across heterogeneous V2X agents (i. e. , vehicles and infrastructure) in the BEV space. Noticeably, such a masking strategy effectively pretrains the 3D encoder with a multi-agent LiDAR point cloud reconstruction objective and is compatible with mainstream cooperative perception backbones. Our approach, validated through extensive experiments on representative datasets (i. e. , V2X-Real, V2V4Real, and OPV2V) and multiple state-of-the-art cooperative perception methods (i. e. , AttFuse, F-Cooper, and V2X-ViT), leads to a performance boost across all V2X settings. Notably, CooPre achieves a 4% mAP improvement on V2X-Real dataset and surpasses baseline performance using only 50% of the training data, highlighting its data efficiency. Additionally, we demonstrate the framework’s powerful performance in cross-domain transferability and robustness under challenging scenarios. The code will be made publicly available at https://github.com/ucla-mobility/CooPre.

ICRA Conference 2025 Conference Paper

Data-Efficient Learning from Human Interventions for Mobile Robots

  • Zhenghao Peng
  • Zhizheng Liu
  • Bolei Zhou

Mobile robots are essential in applications such as autonomous delivery and hospitality services. Applying learning-based methods to address mobile robot tasks has gained popularity due to its robustness and generalizability. Traditional methods such as Imitation Learning (IL) and Reinforcement Learning (RL) offer adaptability but require large datasets, carefully crafted reward functions, and face sim-to-real gaps, making them challenging for efficient and safe real-world deployment. We propose an online human-in-the-loop learning method PVP4Real that combines IL and RL to address these issues. PVP4Real enables efficient real-time policy learning from online human intervention and demonstration, without reward or any pretraining, significantly improving data efficiency and training safety. We validate our method by training two different robots—a legged quadruped, and a wheeled delivery robot—in two mobile robot tasks, one of which even uses raw RGBD image as observation. The training finishes within 15 minutes. Our experiments show the promising future of human-in-the-loop learning in addressing the data efficiency issue in real-world robotic tasks. More information is available at https://metadriverse.github.io/pvp4real/.

ICML Conference 2025 Conference Paper

Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning

  • Yiran Wang
  • Chenshu Liu
  • Yunfan Li
  • Sanae Amani
  • Bolei Zhou
  • Lin F. Yang

The exploration & exploitation dilemma poses significant challenges in reinforcement learning (RL). Recently, curiosity-based exploration methods achieved great success in tackling hard-exploration problems. However, they necessitate extensive hyperparameter tuning on different environments, which heavily limits the applicability and accessibility of this line of methods. In this paper, we characterize this problem via analysis of the agent behavior, concluding the fundamental difficulty of choosing a proper hyperparameter. We then identify the difficulty and the instability of the optimization when the agent learns with curiosity. We propose our method, hyperparameter robust exploration ( Hyper ), which extensively mitigates the problem by effectively regularizing the visitation of the exploration and decoupling the exploitation to ensure stable training. We theoretically justify that Hyper is provably efficient under function approximation setting and empirically demonstrate its appealing performance and robustness in various environments.

ICLR Conference 2025 Conference Paper

Learning to Generate Diverse Pedestrian Movements from Web Videos with Noisy Labels

  • Zhizheng Liu
  • Joe Lin
  • Wayne Wu
  • Bolei Zhou

Understanding and modeling pedestrian movements in the real world is crucial for applications like motion forecasting and scene simulation. Many factors influence pedestrian movements, such as scene context, individual characteristics, and goals, which are often ignored by the existing human generation methods. Web videos contain natural pedestrian behavior and rich motion context, but annotating them with pre-trained predictors leads to noisy labels. In this work, we propose learning diverse pedestrian movements from web videos. We first curate a large-scale dataset called CityWalkers that captures diverse real-world pedestrian movements in urban scenes. Then, based on CityWalkers, we propose a generative model called PedGen for diverse pedestrian movement generation. PedGen introduces automatic label filtering to remove the low-quality labels and a mask embedding to train with partial labels. It also contains a novel context encoder that lifts the 2D scene context to 3D and can incorporate various context factors in generating realistic pedestrian movements in urban scenes. Experiments show that PedGen outperforms existing baseline methods for pedestrian movement generation by learning from noisy labels and incorporating the context factors. In addition, PedGen achieves zero-shot generalization in both real-world and simulated environments. The code, model, and data are available at https://genforce.github.io/PedGen/.

ICLR Conference 2025 Conference Paper

MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility

  • Wayne Wu
  • Honglin He
  • Jack He
  • Yiran Wang
  • Chenda Duan
  • Zhizheng Liu
  • Quanyi Li
  • Bolei Zhou

Public urban spaces such as streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in robotics and embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while robot dogs and humanoids have recently emerged in the street. **Micromobility** enabled by AI for short-distance travel in public urban spaces plays a crucial component in future transportation systems. It is essential to ensure the generalizability and safety of AI models used for maneuvering mobile machines. In this work, we present **MetaUrban**, a *compositional* simulation platform for the AI-driven urban micromobility research. MetaUrban can construct an *infinite* number of interactive urban scenes from compositional elements, covering a vast array of ground plans, object placements, pedestrians, vulnerable road users, and other mobile agents' appearances and dynamics. We design point navigation and social navigation tasks as the pilot study using MetaUrban for urban micromobility research and establish various baselines of Reinforcement Learning and Imitation Learning. We conduct extensive evaluation across mobile machines, demonstrating that heterogeneous mechanical structures significantly influence the learning and execution of AI policies. We perform a thorough ablation study, showing that the compositional nature of the simulated environments can substantially improve the generalizability and safety of the trained mobile agents. MetaUrban will be made publicly available to provide research opportunities and foster safe and trustworthy embodied AI and micromobility in cities. The code and data have been released.

NeurIPS Conference 2025 Conference Paper

Predictive Preference Learning from Human Interventions

  • Haoyuan Cai
  • Zhenghao (Mark) Peng
  • Bolei Zhou

Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent’s action at the current state, they do not adjust its actions in future states, which may be potentially more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, with the assumption that the agent follows the same action and the human makes the same intervention in the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions where the agent is expected to explore, significantly improving learning efficiency and reducing human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: https: //metadriverse. github. io/ppl.

ICML Conference 2025 Conference Paper

Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism

  • Haoyuan Cai
  • Zhenghao Peng
  • Bolei Zhou

Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions, but current methods impose high cognitive demands on human supervisors. We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations. AIM utilizes a proxy Q-function to mimic the human intervention rule and adjusts intervention requests based on the alignment between agent and human actions. By assigning high Q-values when the agent deviates from the expert and decreasing these values as the agent becomes proficient, the proxy Q-function enables the agent to assess the real-time alignment with the expert and request assistance when needed. Our expert-in-the-loop experiments reveal that AIM significantly reduces expert monitoring efforts in both continuous and discrete control tasks. Compared to the uncertainty-based baseline Thrifty-DAgger, our method achieves a 40% improvement in terms of human take-over cost and learning efficiency. Furthermore, AIM effectively identifies safety-critical states for expert assistance, thereby collecting higher-quality expert demonstrations and reducing overall expert data and environment interactions needed. Code and demo video are available at https: //github. com/metadriverse/AIM.

ICML Conference 2025 Conference Paper

WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving

  • Yiheng Li
  • Cunxin Fan
  • Chongjian Ge
  • Seth Z. Zhao
  • Chenran Li
  • Chenfeng Xu
  • Huaxiu Yao
  • Masayoshi Tomizuka

Language models uncover unprecedented abilities in analyzing driving scenarios, owing to their limitless knowledge accumulated from text-based pre-training. Naturally, they should particularly excel in analyzing rule-based interactions, such as those triggered by traffic laws, which are well documented in texts. However, such interaction analysis remains underexplored due to the lack of dedicated language datasets that address it. Therefore, we propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale Q&As dataset built on WOMD focusing on describing and reasoning traffic rule-induced interactions in driving scenarios. WOMD-Reasoning also presents by far the largest multi-modal Q&A dataset, with 3 million Q&As on real-world driving scenarios, covering a wide range of driving topics from map descriptions and motion status descriptions to narratives and analyses of agents’ interactions, behaviors, and intentions. To showcase the applications of WOMD-Reasoning, we design Motion-LLaVA, a motion-language model fine-tuned on WOMD-Reasoning. Quantitative and qualitative evaluations are performed on WOMD-Reasoning dataset as well as the outputs of Motion-LLaVA, supporting the data quality and wide applications of WOMD-Reasoning, in interaction predictions, traffic rule compliance plannings, etc. The dataset and its vision modal extension are available on https: //waymo. com/open/download/. The codes & prompts to build it are available on https: //github. com/yhli123/WOMD-Reasoning.

NeurIPS Conference 2024 Conference Paper

Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

  • Kuan Heng Lin
  • Sicheng Mo
  • Ben Klingher
  • Fangzhou Mu
  • Bolei Zhou

Recent controllable generation approaches such as FreeControl and Diffusion Self-Guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for the code and an overview of the results: https: //genforce. github. io/ctrl-x

NeurIPS Conference 2024 Conference Paper

Shared Autonomy with IDA: Interventional Diffusion Assistance

  • Brandon J. McMahan
  • Zhenghao Peng
  • Bolei Zhou
  • Jonathan C. Kao

The rapid development of artificial intelligence (AI) has unearthed the potential to assist humans in controlling advanced technologies. Shared autonomy (SA) facilitates control by combining inputs from a human pilot and an AI copilot. In prior SA studies, the copilot is constantly active in determining the action played at each time step. This limits human autonomy that may have deleterious effects on performance. In general, the amount of helpful copilot assistance varies greatly depending on the task dynamics. We therefore hypothesized that human autonomy and SA performance improves through dynamic and selective copilot intervention. To address this, we develop a goal-agnostic intervention assistance (IA) that dynamically shares control by having the copilot intervene only when the expected value of the copilot’s action exceeds that of the human’s action. We implement IA with a diffusion copilot (termed IDA) trained on expert demonstrations with goal masking. We prove that IDA performance is lower bounded by human performance, so that IDA does not negatively impact human control. In experiments with simulated human pilots, we show that IDA achieves higher performance than both pilot-only and traditional SA control in variants of the Reacher environment and Lunar Lander. We then demonstrate with human-in the-loop experiments that IDA achieves better control in Lunar Lander and that human participants experience greater autonomy and prefer IDA over pilot-only and traditional SA control. We attribute the success of IDA to preserving human autonomy while simultaneously offering assistance to prevent the human from entering universally bad states.

NeurIPS Conference 2024 Conference Paper

SimGen: Simulator-conditioned Driving Scene Generation

  • Yunsong Zhou
  • Michael Simon
  • Zhenghao Peng
  • Sicheng Mo
  • Hongzi Zhu
  • Minyi Guo
  • Bolei Zhou

Controllable synthetic data generation can substantially lower the annotation cost of training data. Prior works use diffusion models to generate driving images conditioned on the 3D object layout. However, those models are trained on small-scale datasets like nuScenes, which lack appearance and layout diversity. Moreover, overfitting often happens, where the trained models can only generate images based on the layout data from the validation set of the same dataset. In this work, we introduce a simulator-conditioned scene generation framework called SimGen that can learn to generate diverse driving scenes by mixing data from the simulator and the real world. It uses a novel cascade diffusion pipeline to address challenging sim-to-real gaps and multi-condition conflicts. A driving video dataset DIVA is collected to enhance the generative diversity of SimGen, which contains over 147. 5 hours of real-world driving videos from 73 locations worldwide and simulated driving data from the MetaDrive simulator. SimGen achieves superior generation quality and diversity while preserving controllability based on the text prompt and the layout pulled from a simulator. We further demonstrate the improvements brought by SimGen for synthetic data augmentation on the BEV detection and segmentation task and showcase its capability in safety-critical data generation.

TMLR Journal 2024 Journal Article

Unsupervised Discovery of Steerable Factors When Graph Deep Generative Models Are Entangled

  • Shengchao Liu
  • Chengpeng Wang
  • Jiarui Lu
  • Weili Nie
  • Hanchen Wang
  • Zhuoxinran Li
  • Bolei Zhou
  • Jian Tang

Deep generative models (DGMs) have been widely developed for graph data. However, much less investigation has been carried out on understanding the latent space of such pretrained graph DGMs. These understandings possess the potential to provide constructive guidelines for crucial tasks, such as graph controllable generation. Thus in this work, we are interested in studying this problem and propose GraphCG, a method for the unsupervised discovery of steerable factors in the latent space of pretrained graph DGMs. We first examine the representation space of three pretrained graph DGMs with six disentanglement metrics, and we observe that the pretrained representation space is entangled. Motivated by this observation, GraphCG learns the steerable factors via maximizing the mutual information between semantic-rich directions, where the controlled graph moving along the same direction will share the same steerable factors. We quantitatively verify that GraphCG outperforms four competitive baselines on two graph DGMs pretrained on two molecule datasets. Additionally, we qualitatively illustrate seven steerable factors learned by GraphCG on five pretrained DGMs over five graph datasets, including two for molecules and three for point clouds.

TMLR Journal 2023 Journal Article

ChemSpacE: Interpretable and Interactive Chemical Space Exploration

  • Yuanqi Du
  • Xian Liu
  • Nilay Mahesh Shah
  • Shengchao Liu
  • Jieyu Zhang
  • Bolei Zhou

Discovering meaningful molecules in the vast combinatorial chemical space has been a long-standing challenge in many fields, from materials science to drug design. Recent progress in machine learning, especially with generative models, shows great promise for automated molecule synthesis. Nevertheless, most molecule generative models remain black-boxes, whose utilities are limited by a lack of interpretability and human participation in the generation process. In this work, we propose \textbf{Chem}ical \textbf{Spac}e \textbf{E}xplorer (ChemSpacE), a simple yet effective method for exploring the chemical space with pre-trained deep generative models. Our method enables users to interact with existing generative models and steer the molecule generation process. We demonstrate the efficacy of ChemSpacE on the molecule optimization task and the latent molecule manipulation task in single-property and multi-property settings. On the molecule optimization task, the performance of ChemSpacE is on par with previous black-box optimization methods yet is considerably faster and more sample efficient. Furthermore, the interface from ChemSpacE facilitates human-in-the-loop chemical space exploration and interactive molecule design. Code and demo are available at \url{https://github.com/yuanqidu/ChemSpacE}.

ICLR Conference 2023 Conference Paper

Guarded Policy Optimization with Imperfect Online Demonstrations

  • Zhenghai Xue
  • Zhenghao Peng
  • Quanyi Li
  • Zhihan Liu
  • Bolei Zhou

The Teacher-Student Framework (TSF) is a reinforcement learning setting where a teacher agent guards the training of a student agent by intervening and providing online demonstrations. Assuming optimal, the teacher policy has the perfect timing and capability to intervene in the learning process of the student agent, providing safety guarantee and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an Off-Policy Reinforcement Learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy in terms of higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C.

NeurIPS Conference 2023 Conference Paper

Learning from Active Human Involvement through Proxy Value Propagation

  • Zhenghao (Mark) Peng
  • Wenjie Mo
  • Chenda Duan
  • Quanyi Li
  • Bolei Zhou

Learning from active human involvement enables the human subject to actively intervene and demonstrate to the AI agent during training. The interaction and corrective feedback from human brings safety and AI alignment to the learning process. In this work, we propose a new reward-free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents, wherein state- action pairs in the human demonstration are labeled with high values, while those agents’ actions that are intervened receive low values. Through the TD-learning framework, labeled values of demonstrated state-action pairs are further propagated to other unlabeled data generated from agents’ exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human- in-the-loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method can learn to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: https: //metadriverse. github. io/pvp.

NeurIPS Conference 2023 Conference Paper

ScenarioNet: Open-Source Platform for Large-Scale Traffic Scenario Simulation and Modeling

  • Quanyi Li
  • Zhenghao (Mark) Peng
  • Lan Feng
  • Zhizheng Liu
  • Chenda Duan
  • Wenjie Mo
  • Bolei Zhou

Large-scale driving datasets such as Waymo Open Dataset and nuScenes substantially accelerate autonomous driving research, especially for perception tasks such as 3D detection and trajectory forecasting. Since the driving logs in these datasets contain HD maps and detailed object annotations which accurately reflect the real-world complexity of traffic behaviors, we can harvest a massive number of complex traffic scenarios and recreate their digital twins in simulation. Compared to the hand-crafted scenarios often used in existing simulators, data-driven scenarios collected from the real world can facilitate many research opportunities in machine learning and autonomous driving. In this work, we present ScenarioNet, an open-source platform for large-scale traffic scenario modeling and simulation. ScenarioNet defines a unified scenario description format and collects a large-scale repository of real-world traffic scenarios from the heterogeneous data in various driving datasets including Waymo, nuScenes, Lyft L5, and nuPlan datasets. These scenarios can be further replayed and interacted with in multiple views from Bird-Eye-View layout to realistic 3D rendering in MetaDrive simulator. This provides a benchmark for evaluating the safety of autonomous driving stacks in simulation before their real-world deployment. We further demonstrate the strengths of ScenarioNet on large-scale scenario generation, imitation learning, and reinforcement learning in both single-agent and multi-agent settings. Code, demo videos, and website are available at https: //github. com/metadriverse/scenarionet

ICLR Conference 2023 Conference Paper

Towards Smooth Video Composition

  • Qihang Zhang
  • Ceyuan Yang
  • Yujun Shen
  • Yinghao Xu 0001
  • Bolei Zhou

Video generation, with the purpose of producing a sequence of frames, requires synthesizing consistent and persistent dynamic contents over time. This work investigates how to model the temporal relations for composing a video with arbitrary number of frames, from a few to even infinite, using generative adversarial networks (GANs). First, towards composing adjacent frames, we show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, bring a smooth frame transition without harming the per-frame quality. Second, through incorporating a temporal shift module (TSM), which is originally designed for video understanding, into the discriminator, we manage to advance the generator in synthesizing more reasonable dynamics. Third, we develop a novel B-Spline based motion representation to ensure the temporal smoothness, and hence achieve infinite-length video generation, going beyond the frame number used in training. We evaluate our approach on a range of datasets and show substantial improvements over baselines on video generation. Code and models are publicly available at \url{https://genforce.github.io/StyleSV}.

ICRA Conference 2023 Conference Paper

TrafficGen: Learning to Generate Diverse and Realistic Traffic Scenarios

  • Lan Feng
  • Quanyi Li
  • Zhenghao Peng
  • Shuhan Tan
  • Bolei Zhou

Diverse and realistic traffic scenarios are crucial for evaluating the AI safety of autonomous driving systems in simulation. This work introduces a data-driven method called TrafficGen for traffic scenario generation. It learns from the fragmented human driving data collected in the real world and then generates realistic traffic scenarios. TrafficGen is an autoregressive neural generative model with an encoder-decoder architecture. In each autoregressive iteration, it first encodes the current traffic context with the attention mechanism and then decodes a vehicle's initial state followed by generating its long trajectory. We evaluate the trained model in terms of vehicle placement and trajectories, and the experimental result shows our method has substantial improvements over baselines for generating traffic scenarios. After training, TrafficGen can also augment existing traffic scenarios, by adding new vehicles and extending the fragmented trajectories. We further demonstrate that importing the generated scenarios into a simulator as an interactive training environment improves the performance and safety of a driving agent learned from reinforcement learning. Model and data are available at https://metadriverse.github.io/trafficgen.

ICRA Conference 2023 Conference Paper

V2XP-ASG: Generating Adversarial Scenes for Vehicle-to-Everything Perception

  • Hao Xiang
  • Runsheng Xu
  • Xin Xia 0007
  • Zhaoliang Zheng
  • Bolei Zhou
  • Jiaqi Ma 0003

Recent advancements in Vehicle-to-Everything communication technology have enabled autonomous vehicles to share sensory information to obtain better perception performance. With the rapid growth of autonomous vehicles and intelligent infrastructure, the V2X perception systems will soon be deployed at scale, which raises a safety-critical question: how can we evaluate and improve its performance under challenging traffic scenarios before the real-world deployment? Collecting diverse large-scale real-world test scenes seems to be the most straightforward solution, but it is expensive and time-consuming, and the collections can only cover limited scenarios. To this end, we propose the first open adversarial scene generator V2XP-ASG that can produce realistic, challenging scenes for modern LiDAR-based multi-agent perception systems. V2XP-ASG learns to construct an adversarial collaboration graph and simultaneously perturb multiple agents' poses in an adversarial and plausible manner. The experiments demonstrate that V2XP-ASG can effectively identify challenging scenes for a large range of V2X perception systems. Meanwhile, by training on the limited number of generated challenging scenes, the accuracy of V2X perception systems can be further improved by 12. 3% on challenging and 4% on normal scenes. Our code will be released at https://github.com/XHwind/V2XP-ASG.

IJCAI Conference 2022 Conference Paper

AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection

  • Zehui Chen
  • Zhenyu Li
  • Shiquan Zhang
  • Liangji Fang
  • Qinhong Jiang
  • Feng Zhao
  • Bolei Zhou
  • Hang Zhao

Object detection through either RGB images or the LiDAR point clouds has been extensively explored in autonomous driving. However, it remains challenging to make these two data sources complementary and beneficial to each other. In this paper, we propose AutoAlign, an automatic feature fusion strategy for 3D object detection. Instead of establishing deterministic correspondence with camera projection matrix, we model the mapping relationship between the image and point clouds with a learnable alignment map. This map enables our model to automate the alignment of non-homogenous features in a dynamic and data-driven manner. Specifically, a cross-attention feature alignment module is devised to adaptively aggregate pixel-level image features for each voxel. To enhance the semantic consistency during feature alignment, we also design a self-supervised cross-modal feature interaction module, through which the model can learn feature aggregation with instance-level feature guidance. Extensive experimental results show that our approach can lead to 2. 3 mAP and 7. 0 mAP improvements on the KITTI and nuScenes datasets respectively. Notably, our best model reaches 70. 9 NDS on the nuScenes testing leaderboard, achieving competitive performance among various state-of-the-arts.

ICLR Conference 2022 Conference Paper

Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization

  • Quanyi Li
  • Zhenghao Peng
  • Bolei Zhou

Human intervention is an effective way to inject human knowledge into the training loop of reinforcement learning, which can bring fast learning and ensured training safety. Given the very limited budget of human intervention, it remains challenging to design when and how human expert interacts with the learning agent in the training. In this work, we develop a novel human-in-the-loop learning method called Human-AI Copilot Optimization (HACO).To allow the agent's sufficient exploration in the risky environments while ensuring the training safety, the human expert can take over the control and demonstrate how to avoid probably dangerous situations or trivial behaviors. The proposed HACO then effectively utilizes the data both from the trial-and-error exploration and human's partial demonstration to train a high-performing agent. HACO extracts proxy state-action values from partial human demonstration and optimizes the agent to improve the proxy values meanwhile reduce the human interventions. The experiments show that HACO achieves a substantially high sample efficiency in the safe driving benchmark. HACO can train agents to drive in unseen traffic scenarios with a handful of human intervention budget and achieve high safety and generalizability, outperforming both reinforcement learning and imitation learning baselines with a large margin. Code and demo video are included in the supplementary materials.

NeurIPS Conference 2022 Conference Paper

Exploit Reward Shifting in Value-Based Deep-RL: Optimistic Curiosity-Based Exploration and Conservative Exploitation via Linear Reward Shaping

  • Hao Sun
  • Lei Han
  • Rui Yang
  • Xiaoteng Ma
  • Jian Guo
  • Bolei Zhou

In this work, we study the simple yet universally applicable case of reward shaping in value-based Deep Reinforcement Learning (DRL). We show that reward shifting in the form of a linear transformation is equivalent to changing the initialization of the $Q$-function in function approximation. Based on such an equivalence, we bring the key insight that a positive reward shifting leads to conservative exploitation, while a negative reward shifting leads to curiosity-driven exploration. Accordingly, conservative exploitation improves offline RL value estimation, and optimistic value estimation improves exploration for online RL. We validate our insight on a range of RL tasks and show its improvement over baselines: (1) In offline RL, the conservative exploitation leads to improved performance based on off-the-shelf algorithms; (2) In online continuous control, multiple value functions with different shifting constants can be used to tackle the exploration-exploitation dilemma for better sample efficiency; (3) In discrete control tasks, a negative reward shifting yields an improvement over the curiosity-based exploration method.

NeurIPS Conference 2022 Conference Paper

Human-AI Shared Control via Policy Dissection

  • Quanyi Li
  • Zhenghao Peng
  • Haibin Wu
  • Lan Feng
  • Bolei Zhou

Human-AI shared control allows human to interact and collaborate with autonomous agents to accomplish control tasks in complex environments. Previous Reinforcement Learning (RL) methods attempted goal-conditioned designs to achieve human-controllable policies at the cost of redesigning the reward function and training paradigm. Inspired by the neuroscience approach to investigate the motor cortex in primates, we develop a simple yet effective frequency-based approach called Policy Dissection to align the intermediate representation of the learned neural controller with the kinematic attributes of the agent behavior. Without modifying the neural controller or retraining the model, the proposed approach can convert a given RL-trained policy into a human-controllable policy. We evaluate the proposed approach on many RL tasks such as autonomous driving and locomotion. The experiments show that human-AI shared control system achieved by Policy Dissection in driving task can substantially improve the performance and safety in unseen traffic scenes. With human in the inference loop, the locomotion robots also exhibit versatile controllable motion skills even though they are only trained to move forward. Our results suggest the promising direction of implementing human-AI shared autonomy through interpreting the learned representation of the autonomous agents. Code and demo videos are available at https: //metadriverse. github. io/policydissect

NeurIPS Conference 2022 Conference Paper

Improving GANs with A Dynamic Discriminator

  • Ceyuan Yang
  • Yujun Shen
  • Yinghao Xu
  • Deli Zhao
  • Bo Dai
  • Bolei Zhou

Discriminator plays a vital role in training generative adversarial networks (GANs) via distinguishing real and synthesized samples. While the real data distribution remains the same, the synthesis distribution keeps varying because of the evolving generator, and thus effects a corresponding change of the bi-classification task assigned to the discriminator. We argue that a discriminator with an on-the-fly adjustment on its capacity can better accommodate such a time-varying task. A comprehensive empirical study confirms that the proposed training strategy, termed as DynamicD, improves the synthesis performance without incurring any additional computation cost or training objectives. Two capacity adjusting schemes are developed for training GANs under different data regimes: i) given a sufficient amount of training data, the discriminator benefits from a progressively increased learning capacity, and ii) when the training data is limited, gradually decreasing the layer width mitigates the over-fitting issue of the discriminator. Experiments on both 2D and 3D-aware image synthesis tasks conducted on a range of datasets substantiate the generalizability of our DynamicD as well as its substantial improvement over the baselines. Furthermore, DynamicD is synergistic to other discriminator-improving approaches (including data augmentation, regularizers, and pre-training), and brings continuous performance gain when combined with them for learning GANs. Code will be made publicly available.

AAAI Conference 2022 Conference Paper

SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-training for Spatial-Aware Visual Representations

  • Zhenyu Li
  • Zehui Chen
  • Ang Li
  • Liangji Fang
  • Qinhong Jiang
  • Xianming Liu
  • Junjun Jiang
  • Bolei Zhou

Pre-training has become a standard paradigm in many computer vision tasks. However, most of the methods are generally designed on the RGB image domain. Due to the discrepancy between the two-dimensional image plane and the three-dimensional space, such pre-trained models fail to perceive spatial information and serve as sub-optimal solutions for 3D-related tasks. To bridge this gap, we aim to learn a spatial-aware visual representation that can describe the three-dimensional space and is more suitable and effective for these tasks. To leverage point clouds, which are much more superior in providing spatial information compared to images, we propose a simple yet effective 2D Image and 3D Point cloud Unsupervised pre-training strategy, called SimIPU. Specifically, we develop a multi-modal contrastive learning framework that consists of an intra-modal spatial perception module to learn a spatial-aware representation from point clouds and an inter-modal feature interaction module to transfer the capability of perceiving spatial information from the point cloud encoder to the image encoder, respectively. Positive pairs for contrastive losses are established by the matching algorithm and the projection matrix. The whole framework is trained in an unsupervised end-toend fashion. To the best of our knowledge, this is the first study to explore contrastive learning pre-training strategies for outdoor multi-modal datasets, containing paired camera images and LIDAR point clouds.

AAAI Conference 2022 Conference Paper

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

  • Xian Liu
  • Rui Qian
  • Hang Zhou
  • Di Hu
  • Weiyao Lin
  • Ziwei Liu
  • Bolei Zhou
  • Xiaowei Zhou

The task of audiovisual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. However, in real world scenarios, audios are usually contaminated by off screen sound and background noise. They will interfere with the procedure of identifying desired sources and building visual sound connections, making previous studies nonapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audiovisual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals. We thus extend the audio representation with our Audio Instance Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. Then we erase the influence of the audible but off screen sounds and the silent but visible objects by a Cross modal Referrer module with cross modality distillation. Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially under real world scenarios.

NeurIPS Conference 2021 Conference Paper

Data-Efficient Instance Generation from Instance Discrimination

  • Ceyuan Yang
  • Yujun Shen
  • Yinghao Xu
  • Bolei Zhou

Generative Adversarial Networks (GANs) have significantly advanced image synthesis, however, the synthesis quality drops significantly given a limited amount of training data. To improve the data efficiency of GAN training, prior work typically employs data augmentation to mitigate the overfitting of the discriminator yet still learn the discriminator with a bi-classification ($\textit{i. e. }$, real $\textit{vs. }$ fake) task. In this work, we propose a data-efficient Instance Generation ($\textit{InsGen}$) method based on instance discrimination. Concretely, besides differentiating the real domain from the fake domain, the discriminator is required to distinguish every individual image, no matter it comes from the training set or from the generator. In this way, the discriminator can benefit from the infinite synthesized samples for training, alleviating the overfitting problem caused by insufficient training data. A noise perturbation strategy is further introduced to improve its discriminative power. Meanwhile, the learned instance discrimination capability from the discriminator is in turn exploited to encourage the generator for diverse generation. Extensive experiments demonstrate the effectiveness of our method on a variety of datasets and training settings. Noticeably, on the setting of $2K$ training images from the FFHQ dataset, we outperform the state-of-the-art approach with 23. 5\% FID improvement.

AAAI Conference 2021 Conference Paper

HiABP: Hierarchical Initialized ABP for Unsupervised Representation Learning

  • Jiankai Sun
  • Rui Liu
  • Bolei Zhou

Although Markov chain Monte Carlo (MCMC) is useful for generating samples from the posterior distribution, it often suffers from intractability when dealing with large-scale datasets. To address this issue, we propose Hierarchical Initialized Alternating Back-propagation (HiABP) for efficient Bayesian inference. Especially, we endow Alternating Backpropagation (ABP) method with a well-designed initializer and hierarchical structure, composing the pipeline of Initializing, Improving, and Learning back-propagation. It saves much time for the generative model to initialize the latent variable by constraining a sampler to be close to the true posterior distribution. The initialized latent variable is then improved significantly by an MCMC sampler. Thus the proposed method has the strengths of both methods, i. e. , the effectiveness of MCMC and the efficiency of variational inference. Experimental results validate our framework can outperform other popular deep generative models in modeling natural images and learning from incomplete data. We further demonstrate the unsupervised disentanglement of hierarchical latent representation with controllable image synthesis.

NeurIPS Conference 2021 Conference Paper

Learning to Simulate Self-driven Particles System with Coordinated Policy Optimization

  • Zhenghao Peng
  • Quanyi Li
  • Ka Ming Hui
  • Chunxiao Liu
  • Bolei Zhou

Self-Driven Particles (SDP) describe a category of multi-agent systems common in everyday life, such as flocking birds and traffic flows. In a SDP system, each agent pursues its own goal and constantly changes its cooperative or competitive behaviors with its nearby agents. Manually designing the controllers for such SDP system is time-consuming, while the resulting emergent behaviors are often not realistic nor generalizable. Thus the realistic simulation of SDP systems remains challenging. Reinforcement learning provides an appealing alternative for automating the development of the controller for SDP. However, previous multi-agent reinforcement learning (MARL) methods define the agents to be teammates or enemies before hand, which fail to capture the essence of SDP where the role of each agent varies to be cooperative or competitive even within one episode. To simulate SDP with MARL, a key challenge is to coordinate agents' behaviors while still maximizing individual objectives. Taking traffic simulation as the testing bed, in this work we develop a novel MARL method called Coordinated Policy Optimization (CoPO), which incorporates social psychology principle to learn neural controller for SDP. Experiments show that the proposed method can achieve superior performance compared to MARL baselines in various metrics. Noticeably the trained vehicles exhibit complex and diverse social behaviors that improve performance and safety of the population as a whole. Demo video and source code are available at: https: //decisionforce. github. io/CoPO/

AAAI Conference 2020 Conference Paper

Every Frame Counts: Joint Learning of Video Segmentation and Optical Flow

  • Mingyu Ding
  • Zhe Wang
  • Bolei Zhou
  • Jianping Shi
  • Zhiwu Lu
  • Ping Luo

A major challenge for video semantic segmentation is the lack of labeled data. In most benchmark datasets, only one frame of a video clip is annotated, which makes most supervised methods fail to utilize information from the rest of the frames. To exploit the spatio-temporal information in videos, many previous works use pre-computed optical flows, which encode the temporal consistency to improve the video segmentation. However, the video segmentation and optical flow estimation are still considered as two separate tasks. In this paper, we propose a novel framework for joint video semantic segmentation and optical flow estimation. Semantic segmentation brings semantic information to handle occlusion for more robust optical flow estimation, while the nonoccluded optical flow provides accurate pixel-level temporal correspondences to guarantee the temporal consistency of the segmentation. Moreover, our framework is able to utilize both labeled and unlabeled frames in the video through joint training, while no additional calculation is required in inference. Extensive experiments show that the proposed model makes the video semantic segmentation and optical flow estimation benefit from each other and outperforms existing methods under the same settings in both tasks.

NeurIPS Conference 2019 Conference Paper

Policy Continuation with Hindsight Inverse Dynamics

  • Hao Sun
  • Zhizhong Li
  • Xiaotong Liu
  • Bolei Zhou
  • Dahua Lin

Solving goal-oriented tasks is an important but challenging problem in reinforcement learning (RL). For such tasks, the rewards are often sparse, making it difficult to learn a policy effectively. To tackle this difficulty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This approach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay. Enabling the learning process in a self-imitated manner and thus can be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general, which can work in isolation or be combined with other on-policy and off-policy algorithms. On two multi-goal tasks GridWorld and FetchReach, PCHID significantly improves the sample efficiency as well as the final performance.

IROS Conference 2018 Conference Paper

Real-Time Object Pose Estimation with Pose Interpreter Networks

  • Jimmy Wu
  • Bolei Zhou
  • Rebecca L. Russell
  • Vincent Kee
  • Syler Wagner
  • Mitchell Hebert
  • Antonio Torralba 0001
  • David M. S. Johnson

In this work, we introduce pose interpreter networks for 6-DoF object pose estimation. In contrast to other CNN-based approaches to pose estimation that require expensively annotated object pose data, our pose interpreter network is trained entirely on synthetic pose data. We use object masks as an intermediate representation to bridge real and synthetic. We show that when combined with a segmentation model trained on RGB images, our synthetically trained pose interpreter network is able to generalize to real data. Our end-to-end system for object pose estimation runs in real-time (20 Hz) on live RGB data, without using depth information or ICP refinement.

IROS Conference 2017 Conference Paper

SegICP: Integrated deep semantic segmentation and pose estimation

  • Jay Ming Wong
  • Vincent Kee
  • Tiffany Le
  • Syler Wagner
  • Gian Luca Mariottini
  • Abraham Schneider
  • Lei Hamilton
  • Rahul Chipalkatty

Recent robotic manipulation competitions have highlighted that sophisticated robots still struggle to achieve fast and reliable perception of task-relevant objects in complex, realistic scenarios. To improve these systems' perceptive speed and robustness, we present SegICP, a novel integrated solution to object recognition and pose estimation. SegICP couples convolutional neural networks and multi-hypothesis point cloud registration to achieve both robust pixel-wise semantic segmentation as well as accurate and real-time 6-DOF pose estimation for relevant objects. Our architecture achieves 1 cm position error and <; 5° angle error in real time without an initial seed. We evaluate and benchmark SegICP against an annotated dataset generated by motion capture.

ICLR Conference 2015 Conference Paper

Object Detectors Emerge in Deep Scene CNNs

  • Bolei Zhou
  • Aditya Khosla
  • Àgata Lapedriza
  • Aude Oliva
  • Antonio Torralba 0001

With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful objects detectors, representative of the learned scene categories. With object detectors emerging as a result of learning to recognize scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.

NeurIPS Conference 2014 Conference Paper

Learning Deep Features for Scene Recognition using Places Database

  • Bolei Zhou
  • Agata Lapedriza
  • Jianxiong Xiao
  • Antonio Torralba
  • Aude Oliva

Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.