Arrow Research search

Author name cluster

Michael Luo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers

NeurIPS Conference 2025 Conference Paper

SimpleStrat: Diversifying Language Model Generation with Stratification

  • Justin Wong
  • Yury Orlovskiy
  • Alexander Shypula
  • Michael Luo
  • Sanjit Seshia
  • Joseph Gonzalez

Generating diverse responses from large language models (LLMs) is crucial for applications such as adversarial testing, search, and synthetic data generation, where diversity provides distinct answers across generations. Previous approaches rely solely on increasing the temperature, sacrificing quality. Furthermore, the model's next-token probabilities may not be representative of the true answer distribution. To combat these challenges, we propose SimpleStrat, an alternative that uses the language model itself to partition the solution space into strata from which to sample. To measure resampling diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers. We propose measuring resampling diversity as the KL divergence between the response distribution and the uniform distribution over valid ground truth answers and use recall as an alternative when assessing proprietary models. On CoverageQA, SimpleStrat improves diversity across all temperatures, showing orthogonal benefits. Quantifiably, we achieve as much as 4X better recall when applied to GPT-4o, and an average reduction in KL divergence by 0.36 when applied to Llama 3. Furthermore, we show that SimpleStrat achieves more resampling diversity at temperature T=0 than scaling temperature to T=1 on creative writing, an open-ended domain. Implementation and dataset available at https://github.com/jwong8314/simplestrat.
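The two diversity measures named in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of direction KL(P || U), and the decision to drop responses outside the valid answer set are all assumptions made for illustration.

```python
import math
from collections import Counter


def kl_to_uniform(responses, valid_answers):
    """KL(P || U), where P is the empirical distribution of the sampled
    responses over the valid answers and U is uniform over those answers.

    Assumption: responses outside the valid set are ignored.
    """
    valid = set(valid_answers)
    counts = Counter(r for r in responses if r in valid)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no response matched a valid answer")
    n = len(valid)
    # Terms with p = 0 contribute nothing, so only observed answers are summed.
    return sum((c / total) * math.log((c / total) * n) for c in counts.values())


def recall(responses, valid_answers):
    """Fraction of valid answers that appear at least once among the responses."""
    valid = set(valid_answers)
    return len(valid & set(responses)) / len(valid)
```

Under these assumptions, a sampler that always returns the same answer out of four valid ones scores KL = log 4 and recall = 0.25, while one that covers all four equally often scores KL = 0 and recall = 1.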

NeurIPS Conference 2025 Conference Paper

WorldModelBench: Judging Video Generation Models As World Models

  • Dacheng Li
  • Yunhao Fang
  • Yukang Chen
  • Shuo Yang
  • Shiyi Cao
  • Justin Wong
  • Michael Luo
  • Xiaolong Wang

Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitive to nuanced world modeling violations: By incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law, issues overlooked by prior benchmarks. (2) Aligned with large-scale human preferences: We crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, achieving 9.9% lower error in predicting world modeling violations than GPT-4o with 2B parameters. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves the world modeling capability. The dataset is hosted on HuggingFace at https://huggingface.co/datasets/Efficient-Large-Model/worldmodelbench. The code to run evaluation is available at https://github.com/WorldModelBench-Team/WorldModelBench.

NeurIPS Conference 2024 Conference Paper

Stylus: Automatic Adapter Selection for Diffusion Models

  • Michael Luo
  • Justin Wong
  • Brandon Trabucco
  • Yanping Huang
  • Joseph E. Gonzalez
  • Zhifeng Chen
  • Ruslan Salakhutdinov
  • Ion Stoica

Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. To generate high quality images, this paper explores the problem of matching the prompt to a set of relevant adapters, built on recent work that highlights the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then further assembles adapters based on prompts' keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP/FID Pareto efficiency and is twice as preferred, with humans and multimodal models as evaluators, over the base model.

NeurIPS Conference 2021 Conference Paper

Accelerating Quadratic Optimization with Reinforcement Learning

  • Jeffrey Ichnowski
  • Paras Jain
  • Bartolomeo Stellato
  • Goran Banjac
  • Michael Luo
  • Francesco Borrelli
  • Joseph E. Gonzalez
  • Ion Stoica

First-order methods for quadratic optimization such as OSQP are widely used for large-scale machine learning and embedded optimal control, where many related problems must be rapidly solved. These methods face two persistent challenges: manual hyperparameter tuning and convergence time to high-accuracy solutions. To address these, we explore how Reinforcement Learning (RL) can learn a policy to tune parameters to accelerate convergence. In experiments with well-known QP benchmarks we find that our RL policy, RLQP, significantly outperforms state-of-the-art QP solvers by up to 3x. RLQP generalizes surprisingly well to previously unseen problems with varying dimension and structure from different applications, including the QPLIB, Netlib LP and Maros-Mészáros problems. Code, models, and videos are available at https://berkeleyautomation.github.io/rlqp/.

ICLR Conference 2021 Conference Paper

Discovering Non-monotonic Autoregressive Orderings with Variational Inference

  • Xuanlin Li
  • Brandon Trabucco
  • Dong Huk Park
  • Michael Luo
  • Sheng Shen
  • Trevor Darrell
  • Yang Gao

The predominant approach for language modeling is to encode a sequence of tokens from left to right, but this eliminates a source of information: the order by which the sequence was naturally generated. One strategy to recover this information is to decode both the content and ordering of tokens. Some prior work supervises content and ordering with hand-designed loss functions to encourage specific orders or bootstraps from a predefined ordering. These approaches require domain-specific insight. Other prior work searches over valid insertion operations that lead to ground truth sequences during training, which has high time complexity and cannot be efficiently parallelized. We address these limitations with an unsupervised learner that can be trained in a fully-parallelizable manner to discover high-quality autoregressive orders in a data driven way without a domain-specific prior. The learner is a neural network that performs variational inference with the autoregressive ordering as a latent variable. Since the corresponding variational lower bound is not differentiable, we develop a practical algorithm for end-to-end optimization using policy gradients. Strong empirical results with our solution on sequence modeling tasks suggest that our algorithm is capable of discovering various autoregressive orders for different sequences that are competitive with or even better than fixed orders.

ICRA Conference 2021 Conference Paper

Learning Seed Placements and Automation Policies for Polyculture Farming with Companion Plants

  • Yahav Avigal
  • Anna Deza
  • William Wong
  • Sebastian Oehme
  • Mark Presten
  • Mark Theis
  • Jackson Chui
  • Paul Shao

Polyculture farming is a sustainable farming technique based on synergistic interactions between differing plant types that make them more resistant to diseases and pests and better able to retain water. Reduced uniformity can reduce use of pesticides, fertilizer, and water, but is more labor intensive and more challenging to automate. We describe a scaled physical testbed (1.5m×3.0m) that uses a high resolution camera and soil sensors to monitor polyculture plants to facilitate tuning of plant growth, companion effects, and irrigation parameters for a first-order garden simulator. We use this simulator to develop a novel seed placement algorithm that increases coverage and diversity, and a learned pruning policy. In simulation experiments, the seed placement algorithm yields 60% more coverage and 10% more diversity than random seed placement and the learned pruning policy runs 1000X faster than a procedural lookahead policy to achieve high leaf coverage and plant diversity on adversarial gardens that include plant species with diverse growth rates. These models and policies provide the groundwork for a fully-automated system under development. Code, datasets and supplementary material can be found at https://github.com/BerkeleyAutomation/AlphaGarden/.

NeurIPS Conference 2021 Conference Paper

RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem

  • Eric Liang
  • Zhanghao Wu
  • Michael Luo
  • Sven Mika
  • Joseph E. Gonzalez
  • Ion Stoica

Researchers and practitioners in the field of reinforcement learning (RL) frequently leverage parallel computation, which has led to a plethora of new algorithms and systems in the last few years. In this paper, we re-examine the challenges posed by distributed RL and try to view it through the lens of an old idea: distributed dataflow. We show that viewing RL as a dataflow problem leads to highly composable and performant implementations. We propose RLlib Flow, a hybrid actor-dataflow programming model for distributed RL, and validate its practicality by porting the full suite of algorithms in RLlib, a widely adopted distributed RL library. Concretely, RLlib Flow provides 2-9× code savings in real production code and enables the composition of multi-agent algorithms not possible by end users before. The open-source code is available as part of RLlib at https://github.com/ray-project/ray/tree/master/rllib.
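The idea of treating an RL training loop as a dataflow of composable operators can be sketched with plain Python generators. This is only an illustration of the general pattern, not the RLlib Flow API: every name below (`rollouts`, `concat_batches`, `train_one_step`, `fake_env_step`, `fake_update`) is hypothetical, and the sketch runs in a single process with no actors.

```python
from itertools import islice


def rollouts(env_step, num_workers):
    """Source operator: yields one sample batch per worker, round-robin, forever."""
    while True:
        for worker_id in range(num_workers):
            yield env_step(worker_id)


def concat_batches(batches, min_size):
    """Transform operator: buffers incoming batches until at least min_size
    transitions are collected, then emits them as one training batch."""
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        if len(buffer) >= min_size:
            yield buffer
            buffer = []


def train_one_step(batches, update):
    """Sink-like operator: applies the learner update to each training batch
    and yields the resulting metrics."""
    for batch in batches:
        yield update(batch)


def fake_env_step(worker_id):
    # Stand-in for environment sampling: two dummy transitions per call.
    return [(worker_id, 0), (worker_id, 1)]


def fake_update(batch):
    # Stand-in for a gradient step: only reports the batch size.
    return {"num_samples": len(batch)}


# Compose the operators into a pipeline and pull three training iterations.
pipeline = train_one_step(
    concat_batches(rollouts(fake_env_step, num_workers=2), min_size=4),
    fake_update,
)
results = list(islice(pipeline, 3))
```

Because each stage only consumes and produces iterators, swapping the batching rule or the update function changes one operator without touching the rest of the loop, which is the kind of composability the abstract attributes to the dataflow view.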

ICLR Conference 2020 Conference Paper

IMPACT: Importance Weighted Asynchronous Architectures with Clipped Target Networks

  • Michael Luo
  • Jiahao Yao
  • Richard Liaw
  • Eric Liang
  • Ion Stoica

The practical usage of reinforcement learning agents is often bottlenecked by the duration of training time. To accelerate training, practitioners often turn to distributed reinforcement learning architectures to parallelize and accelerate the training process. However, modern methods for scalable reinforcement learning (RL) often trade off between the throughput of samples that an RL agent can learn from (sample throughput) and the quality of learning from each sample (sample efficiency). In these scalable RL architectures, as one increases sample throughput (i.e. increasing parallelization in IMPALA (Espeholt et al., 2018)), sample efficiency drops significantly. To address this, we propose a new distributed reinforcement learning algorithm, IMPACT. IMPACT extends PPO with three changes: a target network for stabilizing the surrogate objective, a circular buffer, and truncated importance sampling. In discrete action-space environments, we show that IMPACT attains higher reward and, simultaneously, achieves up to a 30% decrease in training wall-time compared to IMPALA. For continuous control environments, IMPACT trains faster than existing scalable agents while preserving the sample efficiency of synchronous PPO.
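Two of the ingredients named in this abstract, truncated importance sampling and a PPO-style clipped surrogate measured against a target policy, can be sketched as follows. This is an illustrative approximation and not the paper's exact objective: the function names, the truncation threshold `rho_bar`, the clip range `eps`, and the way the two pieces would be combined are all assumptions.

```python
def truncated_is_weights(target_probs, behavior_probs, rho_bar=1.0):
    """Importance weights pi_target(a|s) / pi_behavior(a|s), truncated at rho_bar
    to bound the variance introduced by stale, off-policy samples."""
    return [min(t / b, rho_bar) for t, b in zip(target_probs, behavior_probs)]


def clipped_surrogate(new_probs, target_probs, advantages, eps=0.2):
    """PPO-style clipped surrogate where the probability ratio is taken against
    a slowly updated target policy (assumed here) rather than the behavior policy."""
    total = 0.0
    for p_new, p_tgt, adv in zip(new_probs, target_probs, advantages):
        ratio = p_new / p_tgt
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Pessimistic bound: take the smaller of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

In this sketch the truncated weights would typically rescale the advantage estimates before they are passed to the surrogate; how IMPACT itself wires these together should be checked against the paper.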

RLDM Conference 2019 Conference Abstract

Accelerating Distributed Deep Reinforcement Learning

  • Andrew Tan
  • Vishal Satish
  • Michael Luo

Recent advances in the field of Reinforcement Learning (RL) have allowed agents to accomplish complex tasks with human-level performance such as beating the world champion at Go. However, these agents require immense amounts of training data and capturing this data is both time consuming and computationally expensive. One proposed solution to speed up this process is to distribute it among many workers, which may span multiple machines. This has led to distributed RL algorithms such as IMPALA and A3C, along with distributed frameworks such as Ray. Although increasing the amount of compute can reduce learning time, it is not sustainable as this can become extremely expensive for large tasks. Thus there is a growing need for sample and timestep efficient distributed RL algorithms that can reach the same performance as earlier methods but with smaller amounts of data and fewer timesteps. Furthermore, compute is often not used efficiently; thus there is a need for more optimized algorithms that can use the provided hardware more efficiently. In order to tackle these problems, we explore combinations and improvements of the best parts of pre-existing distributed RL algorithms into a single algorithm that performs better than any pre-existing algorithm alone, similar to Rainbow. We start with IMPALA and propose an asynchronous Proximal Policy Optimization (PPO) loss for IMPALA that is able to learn in fewer timesteps than the original vanilla policy gradient. We also add a distributed replay buffer to improve sample efficiency and integrate an auto-encoder into the policy graph in order to reduce the input to a smaller latent space, which reduces policy computation and also helps with timestep efficiency. Finally, at the systems level we implement parallel data loading to improve GPU utilization. With all these changes, we find that our final improved IMPALA can solve Atari Pong in 1.7 million timesteps in under 3 minutes with 128 CPU workers and 2 GPU learners.