Arrow Research · Search

Author name cluster

Flavio Calmon

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches; it is not a full author-identity disambiguation profile.

12 papers
1 author row

Possible papers (12)

NeurIPS 2025 · Conference Paper

HeavyWater and SimplexWater: Distortion-free LLM Watermarks for Low-Entropy Distributions

  • Dor Tsur
  • Carol Long
  • Claudio Mayrink Verdun
  • Sajani Vithana
  • Hsiang Hsu
  • Richard Chen
  • Haim Permuter
  • Flavio Calmon

Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks -- such as coding -- where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading off detection accuracy against text distortion. They can also be applied to any LLM and are agnostic to how the side information is generated. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory.
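
To make the mechanism concrete, below is a minimal sketch of the generic side-information recipe the abstract describes: hash the previously generated tokens into pseudorandom per-token scores, tilt the next-token distribution toward high-score tokens, and detect by checking whether the generated text scores unusually high. This illustrates the general setup only, not HeavyWater or SimplexWater (which are distortion-free and derived from the paper's optimization framework); the function names and the tilting rule here are assumptions.

```python
import hashlib
import numpy as np

def side_info(prev_tokens, vocab_size, key="secret"):
    # Hash the previously generated tokens (plus a secret key) into a
    # pseudorandom score in [0, 1) for every vocabulary entry.
    digest = hashlib.sha256((key + "," + ",".join(map(str, prev_tokens))).encode()).hexdigest()
    rng = np.random.default_rng(int(digest, 16) % (2**32))
    return rng.random(vocab_size)

def watermarked_next_token(probs, prev_tokens, delta=2.0, key="secret"):
    # Tilt the LLM's next-token distribution toward high-score tokens and
    # renormalize; delta trades detectability against text distortion.
    scores = side_info(prev_tokens, len(probs), key)
    tilted = probs * np.exp(delta * scores)
    tilted /= tilted.sum()
    return int(np.random.default_rng().choice(len(probs), p=tilted))

def detection_statistic(tokens, vocab_size, key="secret"):
    # A detector holding the key recomputes each step's scores; watermarked
    # text has an average score noticeably above the ~0.5 expected otherwise.
    total = sum(side_info(tokens[:i], vocab_size, key)[t] for i, t in enumerate(tokens))
    return total / max(len(tokens), 1)
```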

NeurIPS 2025 · Conference Paper

Inference-Time Reward Hacking in Large Language Models

  • Hadi Khalaf
  • Claudio Mayrink Verdun
  • Alex Oesterling
  • Himabindu Lakkaraju
  • Flavio Calmon

A common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to an LLM’s output that indicates, for example, how likely it is to align with user preferences or safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance -- a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP), which provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, we introduce $\texttt{HedgeTune}$, an efficient algorithm to find the optimal inference-time parameter. We demonstrate that hedging mitigates reward hacking and achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference setups.
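
For reference, here is a rough sketch of the inference-time selection rules named above, under my own assumptions about the interfaces (a sampler generate(prompt, rng) and a scorer proxy_reward(text)); the exact BoP construction and the HedgeTune procedure are defined in the paper.

```python
import numpy as np

def best_of_n(prompt, generate, proxy_reward, n=8, rng=None):
    # Best-of-n (BoN): draw n candidate responses and keep the one the proxy
    # reward model scores highest.
    rng = rng or np.random.default_rng()
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=proxy_reward)

def soft_best_of_n(prompt, generate, proxy_reward, n=8, temperature=1.0, rng=None):
    # Soft Best-of-n (SBoN): instead of the argmax, sample a candidate with
    # probability proportional to exp(reward / temperature); a higher temperature
    # hedges more against an imperfect proxy reward.
    rng = rng or np.random.default_rng()
    candidates = [generate(prompt, rng) for _ in range(n)]
    rewards = np.array([proxy_reward(c) for c in candidates])
    weights = np.exp((rewards - rewards.max()) / temperature)
    return candidates[rng.choice(n, p=weights / weights.sum())]

def best_of_poisson(prompt, generate, proxy_reward, lam=8.0, rng=None):
    # Best-of-Poisson (BoP), as I read the abstract: randomize the number of
    # candidates with a Poisson draw (at least one), then take the best.
    rng = rng or np.random.default_rng()
    return best_of_n(prompt, generate, proxy_reward, n=1 + rng.poisson(lam), rng=rng)
```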

NeurIPS 2025 · Conference Paper

Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor

  • Alexandra Olteanu
  • Su Lin Blodgett
  • Agathe Balayn
  • Angelina Wang
  • Fernando Diaz
  • Flavio Calmon
  • Margaret Mitchell
  • Michael Ekstrand

In AI research and practice, rigor remains largely understood in terms of methodological rigor---such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about the capabilities of AI systems. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception---in addition to a more expansive understanding of 1) methodological rigor---should include aspects related to 2) what background knowledge informs what to work on (epistemic rigor); 3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); 4) how clearly articulated the theoretical constructs under use are (conceptual rigor); 5) what is reported and how (reporting rigor); and 6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also provide useful language and a framework for much needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders.

NeurIPS 2025 · Conference Paper

Unifying Re-Identification, Attribute Inference, and Data Reconstruction Risks in Differential Privacy

  • Bogdan Kulynych
  • Juan Gomez
  • Georgios Kaissis
  • Jamie Hayes
  • Borja Balle
  • Flavio Calmon
  • Jean Raisaro

Differentially private (DP) mechanisms are difficult to interpret and calibrate because existing methods for mapping standard privacy parameters to concrete privacy risks—re-identification, attribute inference, and data reconstruction—are both overly pessimistic and inconsistent. In this work, we use the hypothesis-testing interpretation of DP ($f$-DP), and determine that bounds on attack success can take the same unified form across re-identification, attribute inference, and data reconstruction risks. Our unified bounds are (1) consistent across a multitude of attack settings, and (2) tunable, enabling practitioners to evaluate risk with respect to arbitrary, including worst-case, levels of baseline risk. Empirically, our results are tighter than prior methods using $\varepsilon$-DP, R\'enyi DP, and concentrated DP. As a result, calibrating noise using our bounds can reduce the required noise by 20% at the same risk level, which yields, e.g., an accuracy increase from 52% to 70% in a text classification task. Overall, this unifying perspective provides a principled framework for interpreting and calibrating the degree of protection in DP against specific levels of re-identification, attribute inference, or data reconstruction risk.
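
As a point of reference for how such baseline-dependent bounds look, the block below states the classical bound for pure $\varepsilon$-DP; it is a simpler special case than the unified $f$-DP bounds derived in the paper, and the notation is mine.

```latex
% Classical attack-success bound under pure eps-DP (not the paper's f-DP bound):
% if the attacker's baseline (prior) success probability is beta, the likelihood
% ratio between outputs on neighboring datasets is at most e^eps, so Bayes' rule gives
\[
  \Pr[\text{attack succeeds}] \;\le\; \frac{\beta\, e^{\varepsilon}}{\beta\, e^{\varepsilon} + 1 - \beta},
\]
% which reduces to e^eps / (1 + e^eps) at the balanced prior beta = 1/2.
```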

NeurIPS 2023 · Conference Paper

Adapting Fairness Interventions to Missing Values

  • Raymond Feng
  • Flavio Calmon
  • Hao Wang

Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values, in which data is first imputed and the imputed data is then used for classification (a procedure referred to as "impute-then-classify"), can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier from imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputation discards the data's missingness pattern, which often conveys information about the predictive label. We present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving information encoded within the missing patterns. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.
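
The contrast the abstract draws can be illustrated in a few lines of scikit-learn: imputation alone erases the missingness pattern, while also passing binary missingness indicators keeps it. This is only an illustration of why the pattern matters, not the paper's adaptive fairness algorithms.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# "Impute-then-classify": imputation erases the missing pattern before training.
impute_then_classify = make_pipeline(
    SimpleImputer(strategy="mean"),
    LogisticRegression(max_iter=1000),
)

# Keeping the (possibly label-informative) pattern: also feed the classifier a
# binary missingness indicator per feature alongside the imputed values.
keep_missing_pattern = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    LogisticRegression(max_iter=1000),
)
```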

NeurIPS 2023 · Conference Paper

Aleatoric and Epistemic Discrimination: Fundamental Limits of Fairness Interventions

  • Hao Wang
  • Luxi He
  • Rui Gao
  • Flavio Calmon

Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions made during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy when fairness constraints are applied and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing fairness interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination on standard (overused) tabular datasets. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.

NeurIPS 2023 · Conference Paper

Individual Arbitrariness and Group Fairness

  • Carol Long
  • Hsiang Hsu
  • Wael Alghamdi
  • Flavio Calmon

Machine learning tasks may admit multiple competing models that achieve similar performance yet produce conflicting outputs for individual samples---a phenomenon known as predictive multiplicity. We demonstrate that fairness interventions in machine learning optimized solely for group fairness and accuracy can exacerbate predictive multiplicity. Consequently, state-of-the-art fairness interventions can mask high predictive multiplicity behind favorable group fairness and accuracy metrics. We argue that a third axis of "arbitrariness" should be considered when deploying models to aid decision-making in applications with individual-level impact. To address this challenge, we propose an ensemble algorithm applicable to any fairness intervention that provably ensures more consistent predictions.
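
As a sketch of the quantities involved (not the paper's ensemble algorithm or its guarantees), one can measure per-sample disagreement among competing models and average their scores before thresholding:

```python
import numpy as np

def ambiguity(hard_predictions):
    # hard_predictions: shape (n_models, n_samples), 0/1 outputs of competing
    # models with similar accuracy. Returns the fraction of samples on which
    # at least two models disagree; one simple predictive-multiplicity measure.
    hard_predictions = np.asarray(hard_predictions)
    return float(np.mean(hard_predictions.min(axis=0) != hard_predictions.max(axis=0)))

def ensemble_scores(score_list):
    # Averaging the competing models' probabilistic scores before thresholding
    # yields a single, more consistent prediction per sample.
    return np.mean(np.asarray(score_list), axis=0)
```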

NeurIPS 2022 · Conference Paper

Beyond Adult and COMPAS: Fair Multi-Class Prediction via Information Projection

  • Wael Alghamdi
  • Hsiang Hsu
  • Haewon Jeong
  • Hao Wang
  • Peter Michalak
  • Shahab Asoodeh
  • Flavio Calmon

We consider the problem of producing fair probabilistic classifiers for multi-class classification tasks. We formulate this problem in terms of "projecting" a pre-trained (and potentially unfair) classifier onto the set of models that satisfy target group-fairness requirements. The new, projected model is given by post-processing the outputs of the pre-trained classifier by a multiplicative factor. We provide a parallelizable, iterative algorithm for computing the projected classifier and derive both sample complexity and convergence guarantees. Comprehensive numerical comparisons with state-of-the-art benchmarks demonstrate that our approach maintains competitive performance in terms of accuracy-fairness trade-off curves, while achieving favorable runtime on large datasets. We also evaluate our method at scale on an open dataset with multiple classes, multiple intersectional groups, and over 1M samples.
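
A minimal sketch of the post-processing step described above; the multiplicative factor itself is what the paper's iterative algorithm computes, and the values below are hypothetical.

```python
import numpy as np

def project_scores(probs, multiplier):
    # Post-process a pre-trained classifier's class probabilities with a
    # multiplicative correction factor, then renormalize each row.
    adjusted = np.asarray(probs) * np.asarray(multiplier)
    return adjusted / adjusted.sum(axis=-1, keepdims=True)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3]])
multiplier = np.array([0.8, 1.3, 1.0])     # hypothetical per-class factors
print(project_scores(probs, multiplier))   # rows still sum to 1
```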

NeurIPS 2022 · Conference Paper

On the Epistemic Limits of Personalized Prediction

  • Lucas Monteiro Paes
  • Carol Long
  • Berk Ustun
  • Flavio Calmon

Machine learning models are often personalized by using group attributes that encode personal characteristics (e.g., sex, age group, HIV status). In such settings, individuals expect to receive more accurate predictions in return for disclosing group attributes to the personalized model. We study when we can tell that a personalized model upholds this principle for every group who provides personal data. We introduce a metric called the benefit of personalization (BoP) to measure the smallest gain in accuracy that any group expects to receive from a personalized model. We describe how the BoP can be used to carry out basic routines to audit a personalized model, including: (i) hypothesis tests to check that a personalized model improves performance for every group; (ii) estimation procedures to bound the minimum gain in personalization. We characterize the reliability of these routines in a finite-sample regime and present minimax bounds on both the probability of error for BoP hypothesis tests and the mean-squared error of BoP estimates. Our results show that we can only claim that personalization improves performance for each group who provides data when we explicitly limit the number of group attributes used by a personalized model. In particular, we show that it is impossible to reliably verify that a personalized classifier with $k \geq 19$ binary group attributes will benefit every group who provides personal data using a dataset of $n = 8\times10^9$ samples -- one for each person in the world.
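
A plug-in, finite-sample version of the BoP described above might look as follows; the paper works with the population quantity and analyzes when this kind of estimate can be trusted.

```python
import numpy as np

def empirical_bop(y_true, pred_personalized, pred_generic, groups):
    # Benefit of personalization (BoP), plug-in estimate: the smallest per-group
    # accuracy gain from using the personalized model instead of the generic one.
    # A negative value flags a group that is worse off after disclosing attributes.
    y_true, groups = np.asarray(y_true), np.asarray(groups)
    pred_personalized, pred_generic = np.asarray(pred_personalized), np.asarray(pred_generic)
    gains = []
    for g in np.unique(groups):
        mask = groups == g
        acc_p = np.mean(pred_personalized[mask] == y_true[mask])
        acc_g = np.mean(pred_generic[mask] == y_true[mask])
        gains.append(acc_p - acc_g)
    return float(min(gains))
```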

NeurIPS 2022 · Conference Paper

Rashomon Capacity: A Metric for Predictive Multiplicity in Classification

  • Hsiang Hsu
  • Flavio Calmon

Predictive multiplicity occurs when classification models with statistically indistinguishable performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new metric, called Rashomon Capacity, to measure predictive multiplicity in probabilistic classification. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure and report predictive multiplicity prior to model deployment.
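
My reading of the construction is that, for a fixed sample, one forms a channel whose rows are the competing models' predicted class distributions and takes its capacity (as I understand it, the reported quantity is $2^{C}$, which ranges from 1 under full agreement up to the number of classes). Below is a sketch using the standard Blahut-Arimoto iteration; consult the paper for the exact definition and estimation procedure.

```python
import numpy as np

def channel_capacity_bits(P, tol=1e-9, max_iter=1000):
    # Blahut-Arimoto: capacity (in bits) of the channel with transition matrix P,
    # rows = inputs (here, competing models), columns = outputs (predicted classes).
    P = np.asarray(P, dtype=float)
    r = np.full(P.shape[0], 1.0 / P.shape[0])        # distribution over models
    for _ in range(max_iter):
        q = r @ P                                    # induced output distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            d = np.where(P > 0, P * np.log2(P / q), 0.0).sum(axis=1)
        r_new = r * np.exp2(d)
        r_new /= r_new.sum()
        if np.max(np.abs(r_new - r)) < tol:
            r = r_new
            break
        r = r_new
    q = r @ P
    with np.errstate(divide="ignore", invalid="ignore"):
        d = np.where(P > 0, P * np.log2(P / q), 0.0).sum(axis=1)
    return float(np.sum(r * d))

# Predicted distributions for one sample from three near-indistinguishable models
P = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])
c = channel_capacity_bits(P)
print(c, 2 ** c)  # 2**c lies between 1 (models agree) and the number of classes
```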

NeurIPS 2021 · Conference Paper

Analyzing the Generalization Capability of SGLD Using Properties of Gaussian Channels

  • Hao Wang
  • Yizhe Huang
  • Rui Gao
  • Flavio Calmon

Optimization is a key component for training machine learning models and has a strong impact on their generalization. In this paper, we consider a particular optimization method---the stochastic gradient Langevin dynamics (SGLD) algorithm---and investigate the generalization of models trained by SGLD. We derive a new generalization bound by connecting SGLD with Gaussian channels found in information and communication theory. Our bound can be computed from the training data and incorporates the variance of gradients for quantifying a particular kind of "sharpness" of the loss landscape. We also consider an algorithm closely related to SGLD, namely differentially private SGD (DP-SGD). We prove that the generalization capability of DP-SGD can be amplified by iteration. Specifically, our bound can be sharpened by including a time-decaying factor if the DP-SGD algorithm outputs the last iterate while keeping other iterates hidden. This decay factor causes the contribution of early iterations to our bound to diminish over time and is established by strong data processing inequalities---a fundamental tool in information theory. We demonstrate our bound through numerical experiments, showing that it can predict the behavior of the true generalization gap.
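
For context, the SGLD update the analysis concerns has the standard form below (notation mine), which makes the link to additive Gaussian noise channels explicit.

```latex
% Standard SGLD update at step t, with learning rate eta_t, inverse temperature
% beta, and a mini-batch gradient g_t of the empirical loss:
\[
  \theta_{t+1} \;=\; \theta_t \;-\; \eta_t\, g_t \;+\; \sqrt{\tfrac{2\eta_t}{\beta}}\,\xi_t,
  \qquad \xi_t \sim \mathcal{N}(0, I_d).
\]
% Each step thus passes \theta_t - \eta_t g_t through an additive Gaussian noise
% channel, which is the connection to information theory that the bound exploits.
```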

NeurIPS 2017 · Conference Paper

Optimized Pre-Processing for Discrimination Prevention

  • Flavio Calmon
  • Dennis Wei
  • Bhanukiran Vinzamuri
  • Karthikeyan Natesan Ramamurthy
  • Kush Varshney

Non-discrimination is a recognized objective in algorithmic decision making. In this paper, we introduce a novel probabilistic formulation of data pre-processing for reducing discrimination. We propose a convex optimization for learning a data transformation with three goals: controlling discrimination, limiting distortion in individual data samples, and preserving utility. We characterize the impact of limited sample size in accomplishing this objective. Two instances of the proposed optimization are applied to datasets, including a real-world criminal recidivism dataset. Results show that discrimination can be greatly reduced at a small cost in classification accuracy.
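
Schematically, the three goals translate into a program of roughly the following shape (notation mine; the paper specifies the exact utility, discrimination, and distortion measures and establishes convexity):

```latex
% Learn a randomized pre-processing map p_{\hat X, \hat Y | X, Y, D}, where D is
% the protected attribute, that
\[
\begin{aligned}
  \min_{p_{\hat X,\hat Y\mid X,Y,D}}\quad
    & \Delta\big(p_{\hat X,\hat Y},\, p_{X,Y}\big)
      && \text{(preserve utility)}\\
  \text{s.t.}\quad
    & J\big(p_{\hat Y\mid D}(\cdot\mid d),\, p_{\hat Y}\big) \le \epsilon
      \;\; \forall d
      && \text{(control discrimination)}\\
    & \mathbb{E}\big[\delta\big((\hat X,\hat Y),(x,y)\big)\,\big|\, D=d, X=x, Y=y\big] \le c
      \;\; \forall (d,x,y)
      && \text{(limit per-sample distortion)}
\end{aligned}
\]
```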