Arrow Research search

Author name cluster

Anastasios N. Angelopoulos

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

14

ICML Conference 2025 Conference Paper

AutoEval Done Right: Using Synthetic Data for Model Evaluation

  • Pierre Boyeau
  • Anastasios N. Angelopoulos
  • Tianle Li
  • Nir Yosef
  • Jitendra Malik
  • Michael I. Jordan

The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased.

ICML Conference 2025 Conference Paper

Copilot Arena: A Platform for Code LLM Evaluation in the Wild

  • Wayne Chi
  • Valerie Chen
  • Anastasios N. Angelopoulos
  • Wei-Lin Chiang
  • Aditya Mittal
  • Naman Jain
  • Tianjun Zhang
  • Ion Stoica

Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no existing solution. We introduce Copilot Arena, a platform to collect user preferences through native integration into a developer’s working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy to reduce experienced latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4. 5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the unique distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human preferences on code such as an observed consistency in user preference across programming languages yet significant variation in preference due to task category. We open-source Copilot Arena and release data to enable human-centric evaluations and improve understanding of coding assistants.

ICML Conference 2025 Conference Paper

Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

  • Yangsibo Huang
  • Milad Nasr
  • Anastasios N. Angelopoulos
  • Nicholas Carlini
  • Wei-Lin Chiang
  • Christopher A. Choquette-Choo
  • Daphne Ippolito
  • Matthew Jagielski

It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login are being integrated to strengthen the security in Chatbot Arena.

JMLR Journal 2025 Journal Article

Gradient Equilibrium in Online Learning: Theory and Applications

  • Anastasios N. Angelopoulos
  • Michael I. Jordan
  • Ryan J. Tibshirani

We present a new perspective on online learning that we refer to as gradient equilibrium: a sequence of iterates achieves gradient equilibrium if the average of gradients of losses along the sequence converges to zero. In general, this condition is not implied by, nor implies, sublinear regret. It turns out that gradient equilibrium is achievable by standard online learning methods such as gradient descent and mirror descent with constant step sizes (rather than decaying step sizes, as is usually required for no regret). Further, as we show through examples, gradient equilibrium translates into an interpretable and meaningful property in online prediction problems spanning regression, classification, quantile estimation, and others. Notably, we show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions under arbitrary distribution shift, based on simple post hoc online descent updates. We also show that post hoc gradient updates can be used to calibrate predicted quantiles under distribution shift, and that the framework leads to unbiased Elo scores for pairwise preference prediction. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2025. ( edit, beta )

ICLR Conference 2025 Conference Paper

How to Evaluate Reward Models for RLHF

  • Evan Frick
  • Tianle Li
  • Connor Chen
  • Wei-Lin Chiang
  • Anastasios N. Angelopoulos
  • Jiantao Jiao
  • Banghua Zhu
  • Joseph E. Gonzalez

We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline and directly probe downstream LLM performance. However, this process is prohibitively expensive. To address this, we build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. These proxy tasks consist of a large-scale human preference and a verifiable correctness preference dataset, in which we measure 12 metrics across 12 domains. To investigate which reward model metrics are most correlated to gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowd-sourced human preference platform to view real reward model downstream performance as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance, which we opensource for public use and further development at https://github.com/lmarena/PPE.

ICML Conference 2025 Conference Paper

Prompt-to-Leaderboard: Prompt-Adaptive LLM Evaluations

  • Evan Frick
  • Connor Chen
  • Joseph Tennyson
  • Tianle Li
  • Wei-Lin Chiang
  • Anastasios N. Angelopoulos
  • Ion Stoica

Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt or set of prompts. The core idea is to train an LLM taking natural language prompts as input to output a vector of Bradley-Terry coefficients which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L’s ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot on the Chatbot Arena leaderboard. Our code is available at this GitHub link: https: //github. com/lmarena/p2l.

ICML Conference 2024 Conference Paper

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

  • Wei-Lin Chiang
  • Lianmin Zheng
  • Ying Sheng 0007
  • Anastasios N. Angelopoulos
  • Tianle Li
  • Dacheng Li
  • Banghua Zhu
  • Hao Zhang 0025

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowd-sourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. The platform is publicly available at https: //chat. lmsys. org.

ICRA Conference 2024 Conference Paper

Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions

  • Jordan Lekeufack
  • Anastasios N. Angelopoulos
  • Andrea Bajcsy
  • Michael I. Jordan
  • Jitendra Malik

We introduce Conformal Decision Theory, a framework for producing safe autonomous decisions despite imperfect machine learning predictions. Examples of such decisions are ubiquitous, from robot planning algorithms that rely on pedestrian predictions, to calibrating autonomous manufacturing to exhibit high throughput and low error, to the choice of trusting a nominal policy versus switching to a safe backup policy at run-time. The decisions produced by our algorithms are safe in the sense that they come with provable statistical guarantees of having low risk without any assumptions on the world model whatsoever; the observations need not be I. I. D. and can even be adversarial. The theory extends results from conformal prediction to calibrate decisions directly, without requiring the construction of prediction sets. Experiments demonstrate the utility of our approach in robot motion planning around humans and robot manufacturing.

ICRA Conference 2024 Conference Paper

Conformal Policy Learning for Sensorimotor Control under Distribution Shifts

  • Huang Huang
  • Satvik Sharma
  • Antonio Loquercio
  • Anastasios N. Angelopoulos
  • Ken Goldberg
  • Jitendra Malik

This paper focuses on the problem of detecting and reacting to changes in the distribution of a sensorimotor controller’s observables. The key idea is the design of policies that can take conformal quantiles as input, to detect distribution shifts with formal statistical guarantees, which we define as conformal policy learning. We show how to design such policies by using conformal quantiles to switch between base policies with different characteristics, e. g. safety or speed, or directly augmenting a policy observation with a quantile and training it with reinforcement learning. Theoretically, we show that such policies achieve the formal convergence guarantees in finite time. In addition, we thoroughly evaluate their advantages and limitations on two use cases: simulated autonomous driving and active perception with a physical quadruped. Empirical results demonstrate that our approach outperforms five baselines. It is also the simplest of the baseline strategies besides one ablation. Being easy to use, flexible, and with formal guarantees, our work demonstrates how conformal prediction can be an effective tool for sensorimotor learning under uncertainty.

ICLR Conference 2024 Conference Paper

Conformal Risk Control

  • Anastasios N. Angelopoulos
  • Stephen Bates
  • Adam Fisch
  • Lihua Lei
  • Tal Schuster

We extend conformal prediction to control the expected value of any monotone loss function. The algorithm generalizes split conformal prediction together with its coverage guarantee. Like conformal prediction, the conformal risk control procedure is tight up to an $\mathcal{O}(1/n)$ factor. We also introduce extensions of the idea to distribution shift, quantile risk control, multiple and adversarial risk control, and expectations of U-statistics. Worked examples from computer vision and natural language processing demonstrate the usage of our algorithm to bound the false negative rate, graph distance, and token-level F1-score.

JMLR Journal 2024 Journal Article

Label Noise Robustness of Conformal Prediction

  • Bat-Sheva Einbinder
  • Shai Feldman
  • Stephen Bates
  • Anastasios N. Angelopoulos
  • Asaf Gendler
  • Yaniv Romano

We study the robustness of conformal prediction, a powerful tool for uncertainty quantification, to label noise. Our analysis tackles both regression and classification problems, characterizing when and how it is possible to construct uncertainty sets that correctly cover the unobserved noiseless ground truth labels. We further extend our theory and formulate the requirements for correctly controlling a general loss function, such as the false negative proportion, with noisy labels. Our theory and experiments suggest that conformal prediction and risk-controlling techniques with noisy labels attain conservative risk over the clean ground truth labels whenever the noise is dispersive and increases variability. In other adversarial cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure achieving the correct risk of the ground truth labels without score or data regularity. [abs] [ pdf ][ bib ] &copy JMLR 2024. ( edit, beta )

ICML Conference 2024 Conference Paper

Online conformal prediction with decaying step sizes

  • Anastasios N. Angelopoulos
  • Rina Barber
  • Stephen Bates

We introduce a method for online conformal prediction with decaying step sizes. Like previous methods, ours possesses a retrospective guarantee of coverage for arbitrary sequences. However, unlike previous methods, we can simultaneously estimate a population quantile when it exists. Our theory and experiments indicate substantially improved practical properties: in particular, when the distribution is stable, the coverage is close to the desired level for every time point, not just on average over the observed sequence.

ICML Conference 2022 Conference Paper

Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging

  • Anastasios N. Angelopoulos
  • Amit Pal Singh Kohli
  • Stephen Bates
  • Michael I. Jordan
  • Jitendra Malik
  • Thayer Alshaabi
  • Srigokul Upadhyayula
  • Yaniv Romano

Image-to-image regression is an important learning task, used frequently in biological imaging. Current algorithms, however, do not generally offer statistical guarantees that protect against a model’s mistakes and hallucinations. To address this, we develop uncertainty quantification techniques with rigorous statistical guarantees for image-to-image regression problems. In particular, we show how to derive uncertainty intervals around each pixel that are guaranteed to contain the true value with a user-specified confidence probability. Our methods work in conjunction with any base machine learning model, such as a neural network, and endow it with formal mathematical guarantees{—}regardless of the true unknown data distribution or choice of model. Furthermore, they are simple to implement and computationally inexpensive. We evaluate our procedure on three image-to-image regression tasks: quantitative phase microscopy, accelerated magnetic resonance imaging, and super-resolution transmission electron microscopy of a Drosophila melanogaster brain.

ICLR Conference 2021 Conference Paper

Uncertainty Sets for Image Classifiers using Conformal Prediction

  • Anastasios N. Angelopoulos
  • Stephen Bates
  • Michael I. Jordan
  • Jitendra Malik

Convolutional image classifiers can achieve high predictive accuracy, but quanti fying their uncertainty remains an unresolved challenge, hindering their deployment in consequential settings. Existing uncertainty quantification techniques, such as Platt scaling, attempt to calibrate the network’s probability estimates, but they do not have formal guarantees. We present an algorithm that modifies any classifier to output a predictive set containing the true label with a user-specified probability, such as 90%. The algorithm is simple and fast like Platt scaling, but provides a formal finite-sample coverage guarantee for every model and dataset. Our method modifies an existing conformal prediction algorithm to give more sta ble predictive sets by regularizing the small scores of unlikely classes after Platt scaling. In experiments on both Imagenet and Imagenet-V2 with ResNet-152 and other classifiers, our scheme outperforms existing approaches, achieving coverage with sets that are often factors of 5 to 10 smaller than a stand-alone Platt scaling baseline.