NeurIPS 2025 Conference Paper
On Optimal Steering to Achieve Exact Fairness
- Mohit Sharma
- Amit Deshpande
- Chiranjib Bhattacharyya
- Rajiv Ratn Shah
To fix the `bias in, bias out' problem in fair machine learning, it is important to steer feature distributions of data or internal representations of Large Language Models (LLMs) to \emph{ideal} ones that guarantee group-fair outcomes. Previous work on fair generative models and representation steering could greatly benefit from provable fairness guarantees on the model output. We define a distribution as \emph{ideal} if the minimizer of any cost-sensitive risk on it is guaranteed to have exact group-fair outcomes (e.g., demographic parity, equal opportunity)---in other words, it has no fairness-utility trade-off. We formulate an optimization program for optimal steering that finds the nearest \emph{ideal} distribution in KL-divergence, and provide efficient algorithms for it when the underlying distributions come from well-known parametric families (e.g., normal, log-normal). Empirically, our optimal steering techniques improve fairness on both synthetic and real-world datasets without diminishing utility (and sometimes even improving it). We demonstrate affine steering of LLM representations to reduce bias in multi-class classification, e.g., occupation prediction from a short biography in the Bios dataset (De-Arteaga et al.). Furthermore, we steer internal representations of LLMs towards desired outputs so that they work equally well across different groups.
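As a minimal sketch of the two ingredients the abstract mentions (not the paper's actual algorithm), the code below illustrates (a) the closed-form KL-divergence between two multivariate normals, the objective one would minimize when searching for the nearest ideal distribution within the normal family, and (b) a standard affine map that transports samples of one Gaussian onto another, the kind of transformation used in affine steering of representations. All function names and the choice of target distribution here are illustrative assumptions.

```python
import numpy as np

def kl_gaussians(mu1, S1, mu2, S2):
    """Closed-form KL( N(mu1, S1) || N(mu2, S2) ) for multivariate normals."""
    d = len(mu1)
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(S2_inv @ S1)
                  + diff @ S2_inv @ diff
                  - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

def _sym_power(S, p):
    """Symmetric matrix power S^p via eigendecomposition (S assumed SPD)."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** p) @ V.T

def affine_steer(X, mu_src, S_src, mu_tgt, S_tgt):
    """Affine map sending N(mu_src, S_src) exactly onto N(mu_tgt, S_tgt).

    Uses A = S_tgt^{1/2} S_src^{-1/2}, so Cov(A x) = S_tgt for x ~ N(mu_src, S_src).
    """
    A = _sym_power(S_tgt, 0.5) @ _sym_power(S_src, -0.5)
    return (X - mu_src) @ A.T + mu_tgt
```

For group-conditional representations, one would estimate `(mu, S)` per group and steer each group onto a common (ideal) target; the KL formula quantifies how far each group's distribution must move.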