
Author name cluster

Dhruv Rohatgi

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches; it is not a full identity-disambiguation profile.

17 papers
2 author rows

Possible papers

17

ICLR 2025 Conference Paper

Self-Improvement in Language Models: The Sharpening Mechanism

  • Audrey Huang
  • Adam Block
  • Dylan J. Foster
  • Dhruv Rohatgi
  • Cyril Zhang
  • Max Simchowitz
  • Jordan T. Ash
  • Akshay Krishnamurthy

Recent work in language modeling has raised the possibility of “self-improvement,” where an LLM evaluates and refines its own generations to achieve higher performance without external feedback. It is impossible for this self-improvement to create information that is not already in the model, so why should we expect that this will lead to improved capabilities? We offer a new theoretical perspective on the capabilities of self-improvement through a lens we refer to as “sharpening.” Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training in order to ‘sharpen’ the model to one placing large mass on high-quality sequences, thereby amortizing the expensive inference-time computation of generating good sequences. We begin by introducing a new statistical framework for sharpening in which the learner has sample access to a pre-trained base policy. Then, we analyze two natural families of self-improvement algorithms based on SFT and RLHF. We find that (i) the SFT-based approach is minimax optimal whenever the initial model has sufficient coverage, but (ii) the RLHF-based approach can improve over SFT-based self-improvement by leveraging online exploration, bypassing the need for coverage. We view these findings as a starting point toward a foundational understanding that can guide the design and evaluation of self-improvement algorithms.
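
The "sharpening" mechanism can be made concrete with a toy best-of-n experiment: sample several responses from the base policy, keep the one the model's own verifier scores highest, and fit a new policy to the winners, mirroring the SFT-based self-improvement analyzed in the paper. The sketch below is a minimal illustration under strong simplifying assumptions (a tabular policy over a few symbolic responses, with the model's own log-probability standing in for the self-verification score); it is not the paper's algorithm or experimental setup.

```python
# Minimal sketch of SFT-style "sharpening" on a toy discrete policy (not the
# paper's algorithm or code). Assumptions: responses are symbols 0..V-1, the
# base policy is an explicit probability table, and the model's own
# log-probability serves as the self-verification score.
import numpy as np

rng = np.random.default_rng(0)
V = 8                                   # vocabulary of candidate responses
base = rng.dirichlet(np.ones(V))        # pre-trained base policy pi(y)
self_reward = np.log(base)              # self-verifier: model's own log-prob

def best_of_n(policy, n):
    """Sample n responses from the policy and keep the one the verifier prefers."""
    candidates = rng.choice(V, size=n, p=policy)
    return candidates[np.argmax(self_reward[candidates])]

# SFT-sharpening: collect best-of-n winners, then fit a new policy to them.
n, num_rollouts = 8, 20000
winners = np.array([best_of_n(base, n) for _ in range(num_rollouts)])
sharpened = np.bincount(winners, minlength=V) / num_rollouts

print("base policy      :", np.round(base, 3))
print("sharpened policy :", np.round(sharpened, 3))
print("argmax of base   :", base.argmax(), "| mass after sharpening:",
      round(sharpened[base.argmax()], 3))
```

With enough rollouts, the sharpened distribution visibly concentrates mass on the responses the verifier prefers, without any external feedback entering the loop.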

NeurIPS 2025 Conference Paper

To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable RL

  • Yuda Song
  • Dhruv Rohatgi
  • Aarti Singh
  • J. Bagnell

Partial observability is a notorious challenge in reinforcement learning (RL), due to the need to learn complex, history-dependent policies. Recent empirical successes have used privileged expert distillation -- which leverages availability of latent state information during training (e.g., from a simulator) to learn and imitate the optimal latent, Markovian policy -- to disentangle the task of "learning to see" from "learning to act". While expert distillation is more computationally efficient than RL without latent state information, it also has well-documented failure modes. In this paper -- through a simple but instructive theoretical model called the perturbed Block MDP, and controlled experiments on challenging simulated locomotion tasks -- we investigate the algorithmic trade-off between privileged expert distillation and standard RL without privileged information. Our main findings are: (1) The trade-off empirically hinges on the stochasticity of the latent dynamics, as theoretically predicted by contrasting approximate decodability with belief contraction in the perturbed Block MDP; and (2) The optimal latent policy is not always the best latent policy to distill. Our results suggest new guidelines for effectively exploiting privileged information, potentially advancing the efficiency of policy learning across many practical partially observable domains.
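
Privileged expert distillation, as described above, amounts to supervised imitation: the latent state (privileged information) determines the expert's action at training time, while the distilled policy must act from noisy observations alone. The sketch below is a deliberately simple one-step caricature of that pipeline in a Block-MDP-like setting; the emission model, the logistic-regression student, and all dimensions are illustrative assumptions, not the paper's perturbed Block MDP experiments.

```python
# Minimal sketch of privileged expert distillation in a toy Block-MDP-like
# setting (illustrative only). Latent states are privileged information
# available at training time; the distilled policy must act from noisy
# observations alone.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_latent, obs_dim, n = 4, 8, 5000

# Privileged data: latent state s, observation o = emission(s) + noise,
# and the optimal *latent* policy's action a*(s) (here a fixed lookup table).
emission = rng.normal(size=(num_latent, obs_dim))
latent_optimal_action = np.array([0, 1, 2, 0])

s = rng.integers(0, num_latent, size=n)
o = emission[s] + 0.5 * rng.normal(size=(n, obs_dim))
a_star = latent_optimal_action[s]

# Distillation = supervised imitation of the latent expert from observations.
student = LogisticRegression(max_iter=1000).fit(o, a_star)
print("imitation accuracy of distilled policy:", student.score(o, a_star))
```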

ICML 2025 Conference Paper

Towards characterizing the value of edge embeddings in Graph Neural Networks

  • Dhruv Rohatgi
  • Tanya Marwah
  • Zachary C. Lipton
  • Jianfeng Lu 0001
  • Ankur Moitra
  • Andrej Risteski

Graph neural networks (GNNs) are the dominant approach to solving machine learning problems defined over graphs. Despite much theoretical and empirical work in recent years, our understanding of finer-grained aspects of architectural design for GNNs remains impoverished. In this paper, we consider the benefits of architectures that maintain and update edge embeddings. On the theoretical front, under a suitable computational abstraction for a layer in the model, as well as memory constraints on the embeddings, we show that there are natural tasks on graphical models for which architectures leveraging edge embeddings can be much shallower. Our techniques are inspired by results on time-space tradeoffs in theoretical computer science. Empirically, we show architectures that maintain edge embeddings almost always improve on their node-based counterparts—frequently significantly so in topologies that have "hub" nodes.
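
To make the architectural distinction concrete, the sketch below implements one message-passing layer that maintains explicit edge embeddings: each edge embedding is updated from the embeddings of its two endpoints, and each node then aggregates the updated embeddings of its incident edges. The weight shapes, activation, and aggregation rule are illustrative assumptions rather than the specific architectures studied in the paper.

```python
# Minimal sketch of a message-passing layer that maintains *edge* embeddings
# alongside node embeddings (illustrative; architectural details here are
# assumptions, not the paper's exact model).
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def edge_gnn_layer(h, e, edges, W_edge, W_node):
    """One layer: update each edge from its endpoints, then each node from
    the updated embeddings of its incident edges.

    h: (num_nodes, d) node embeddings      e: (num_edges, d) edge embeddings
    edges: list of (u, v) pairs            W_edge: (3d, d), W_node: (2d, d)
    """
    e_new = np.empty_like(e)
    for idx, (u, v) in enumerate(edges):
        e_new[idx] = relu(np.concatenate([e[idx], h[u], h[v]]) @ W_edge)

    h_agg = h.copy()
    for idx, (u, v) in enumerate(edges):          # aggregate edge messages
        h_agg[u] = h_agg[u] + e_new[idx]
        h_agg[v] = h_agg[v] + e_new[idx]
    return relu(np.concatenate([h, h_agg], axis=1) @ W_node), e_new

# Tiny graph with a "hub" node 0 connected to everyone else.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
d, n = 4, 5
h = rng.normal(size=(n, d))
e = rng.normal(size=(len(edges), d))
W_edge, W_node = rng.normal(size=(3 * d, d)), rng.normal(size=(2 * d, d))

h, e = edge_gnn_layer(h, e, edges, W_edge, W_node)
print(h.shape, e.shape)   # (5, 4) (4, 4)
```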

FOCS 2024 Conference Paper

Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning

  • Noah Golowich
  • Ankur Moitra
  • Dhruv Rohatgi

Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free exploration is provably computationally harder than the associated regression problem. We also show that there is no computationally efficient algorithm for reward-directed RL in block MDPs, even when given access to an oracle for this regression problem. It is known that being able to perform regression in block MDPs is necessary for finding a good policy; our results suggest that it is not sufficient. Our separation lower bound uses a new robustness property of the Learning Parities with Noise (LPN) hardness assumption, which is crucial in handling the dependent nature of RL data. We argue that separations and oracle lower bounds, such as ours, are a more meaningful way to prove hardness of learning because the constructions better reflect the practical reality that supervised learning by itself is often not the computational bottleneck.

NeurIPS 2024 Conference Paper

Online Control in Population Dynamics

  • Noah Golowich
  • Elad Hazan
  • Zhou Lu
  • Dhruv Rohatgi
  • Y. Jennifer Sun

The study of population dynamics originated with early sociological works but has since extended into many fields, including biology, epidemiology, evolutionary game theory, and economics. Most studies on population dynamics focus on the problem of prediction rather than control. Existing mathematical models for population control are often restricted to specific, noise-free dynamics, while real-world population changes can be complex and adversarial. To address this gap, we propose a new framework based on the paradigm of online control. We first characterize a set of linear dynamical systems that can naturally model evolving populations. We then give an efficient gradient-based controller for these systems, with near-optimal regret bounds with respect to a broad class of linear policies. Our empirical evaluations demonstrate the effectiveness of the proposed algorithm for population control even in non-linear models such as SIR and replicator dynamics.
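
As a concrete instance of the control-of-population-dynamics viewpoint, the sketch below runs a gradient-based online controller on non-linear SIR epidemic dynamics: the control scales down the infection rate, and the controller updates its parameter by online gradient descent on a per-step cost trading off infections against intervention strength. The dynamics discretization, cost, and finite-difference gradient are assumptions for illustration; this is not the paper's algorithm or its regret analysis.

```python
# Minimal sketch of online gradient-based control of SIR epidemic dynamics
# (illustrative only; controller and cost are assumptions, not the paper's
# algorithm). The control u in [0, 1] scales down the infection rate, and the
# per-step cost trades off infections against intervention strength.
import numpy as np

beta, gamma, dt = 0.4, 0.1, 1.0          # SIR parameters and step size

def sir_step(state, u):
    """One Euler step of SIR dynamics with contact-rate reduction u."""
    S, I, R = state
    new_inf = (1.0 - u) * beta * S * I * dt
    rec = gamma * I * dt
    return np.array([S - new_inf, I + new_inf - rec, R + rec])

def step_cost(state, u):
    next_state = sir_step(state, u)
    return next_state[1] + 0.5 * u**2     # infections + control effort

state = np.array([0.99, 0.01, 0.0])       # population fractions (S, I, R)
theta, lr, eps = 0.0, 0.5, 1e-4

for t in range(120):
    u = float(np.clip(theta, 0.0, 1.0))
    # Online gradient descent with a finite-difference gradient estimate.
    grad = (step_cost(state, u + eps) - step_cost(state, u - eps)) / (2 * eps)
    theta -= lr * grad
    state = sir_step(state, u)

print("final control:", round(float(np.clip(theta, 0.0, 1.0)), 3),
      "| infected fraction:", round(float(state[1]), 4))
```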

NeurIPS 2023 Conference Paper

Feature Adaptation for Sparse Linear Regression

  • Jonathan Kelner
  • Frederic Koehler
  • Raghu Meka
  • Dhruv Rohatgi

Sparse linear regression is a central problem in high-dimensional statistics. We study the correlated random design setting, where the covariates are drawn from a multivariate Gaussian $N(0, \Sigma)$, and we seek an estimator with small excess risk. If the true signal is $t$-sparse, information-theoretically, it is possible to achieve strong recovery guarantees with only $O(t\log n)$ samples. However, computationally efficient algorithms have sample complexity linear in (some variant of) the *condition number* of $\Sigma$. Classical algorithms such as the Lasso can require significantly more samples than necessary even if there is only a single sparse approximate dependency among the covariates. We provide a polynomial-time algorithm that, given $\Sigma$, automatically adapts the Lasso to tolerate a small number of approximate dependencies. In particular, we achieve near-optimal sample complexity when the sparsity is constant and $\Sigma$ has few "outlier" eigenvalues. Our algorithm fits into a broader framework of *feature adaptation* for sparse linear regression with ill-conditioned covariates. With this framework, we additionally provide the first polynomial-factor improvement over brute-force search for constant sparsity $t$ and arbitrary covariance $\Sigma$.
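
The sketch below sets up the correlated random-design model described above: covariates drawn from $N(0, \Sigma)$ where $\Sigma$ contains a single sparse approximate dependency, a sparse signal, and the excess prediction risk of a plain Lasso fit. It is meant only to make the setting and the risk functional concrete; the covariance, regularization strength, and use of an off-the-shelf Lasso are assumptions, and the paper's feature-adaptation algorithm is not implemented.

```python
# Minimal sketch of the correlated random-design setup: covariates drawn from
# N(0, Sigma) with one near-dependency among coordinates, a sparse signal, and
# the excess prediction risk (w_hat - w*)^T Sigma (w_hat - w*) of the Lasso.
# Illustrative only; plain Lasso, not the paper's feature-adaptation algorithm.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_features, n_samples, sparsity = 50, 200, 3

# Sigma with a single sparse approximate dependency: x_0 ~ x_1 (highly correlated).
Sigma = np.eye(n_features)
Sigma[0, 1] = Sigma[1, 0] = 0.999

X = rng.multivariate_normal(np.zeros(n_features), Sigma, size=n_samples)
w_star = np.zeros(n_features)
w_star[:sparsity] = 1.0
y = X @ w_star + 0.1 * rng.normal(size=n_samples)

w_hat = Lasso(alpha=0.05).fit(X, y).coef_
excess_risk = (w_hat - w_star) @ Sigma @ (w_hat - w_star)
print("excess prediction risk of the Lasso:", round(float(excess_risk), 4))
```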

NeurIPS 2023 Conference Paper

Provable benefits of score matching

  • Chirag Pabbaraju
  • Dhruv Rohatgi
  • Anish Prasad Sevekari
  • Holden Lee
  • Ankur Moitra
  • Andrej Risteski

Score matching is an alternative to maximum likelihood (ML) for estimating a probability distribution parametrized up to a constant of proportionality. By fitting the "score" of the distribution, it sidesteps the need to compute this constant of proportionality (which is often intractable). While score matching and variants thereof are popular in practice, a precise theoretical understanding of the benefits and tradeoffs with maximum likelihood---both computational and statistical---has been lacking. In this work, we give the first example of a natural exponential family of distributions such that the score matching loss is computationally efficient to optimize, and has statistical efficiency comparable to ML, while the ML loss is intractable to optimize using a gradient-based method. The family consists of exponentials of polynomials of fixed degree, and our result can be viewed as a continuous analogue of recent developments in the discrete setting. Precisely, we show: (1) Designing a zeroth-order or first-order oracle for optimizing the maximum likelihood loss is NP-hard. (2) Maximum likelihood has a statistical efficiency polynomial in the ambient dimension and the radius of the parameters of the family. (3) Minimizing the score matching loss is both computationally and statistically efficient, with complexity polynomial in the ambient dimension.
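
The appeal of score matching is easiest to see in one dimension. For $p_\theta(x) \propto \exp(\theta_1 x + \theta_2 x^2)$, a degree-2 exponential of a polynomial, the Hyvärinen score-matching loss is quadratic in $\theta$ with a closed-form minimizer, and the normalizing constant never appears. The sketch below is an illustrative toy fit on Gaussian data; the paper's results concern higher-degree families in general dimension.

```python
# Minimal sketch of Hyvarinen score matching for the 1-D exponential family
# p_theta(x) ∝ exp(theta_1 * x + theta_2 * x^2). The score-matching loss is
# quadratic in theta, so it has a closed-form minimizer; no normalizing
# constant is ever computed. (Illustrative setup only.)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100000)   # data from N(1, 4)

# Sufficient statistics f(x) = [x, x^2]; score s_theta(x) = theta . f'(x).
f_prime = np.stack([np.ones_like(x), 2 * x], axis=1)    # f'(x) = [1, 2x]
f_second = np.array([0.0, 2.0])                         # f''(x) = [0, 2]

# J(theta) = E[0.5 * (theta . f'(x))^2 + theta . f''(x)]  =>  theta = -A^{-1} b
A = f_prime.T @ f_prime / len(x)
b = f_second
theta_hat = -np.linalg.solve(A, b)

print("estimated theta:", np.round(theta_hat, 3))
print("true theta     :", [1.0 / 4.0, -1.0 / (2 * 4.0)])   # [mu/sigma^2, -1/(2 sigma^2)]
```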

ICLR 2023 Conference Paper

Provably Auditing Ordinary Least Squares in Low Dimensions

  • Ankur Moitra
  • Dhruv Rohatgi

Auditing the stability of a machine learning model to small changes in the training procedure is critical for engendering trust in practical applications. For example, a model should not be overly sensitive to removing a small fraction of its training data. However, algorithmically validating this property seems computationally challenging, even for the simplest of models: Ordinary Least Squares (OLS) linear regression. Concretely, recent work defines the stability of a regression as the minimum number of samples that need to be removed so that rerunning the analysis overturns the conclusion (Broderick et al., 2020), specifically meaning that the sign of a particular coefficient of the OLS regressor changes. But the only known approach for estimating this metric, besides the obvious exponential-time algorithm, is a greedy heuristic that may produce severe overestimates and therefore cannot certify stability. We show that stability can be efficiently certified in the low-dimensional regime: when the number of covariates is a constant but the number of samples is large, there are polynomial-time algorithms for estimating (a fractional version of) stability, with provable approximation guarantees. Applying our algorithms to the Boston Housing dataset, we exhibit regression analyses where our estimator outperforms the greedy heuristic, and can successfully certify stability even in the regime where a constant fraction of the samples are dropped.
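
The greedy heuristic mentioned above is easy to state: repeatedly drop the single sample whose removal pushes the coefficient of interest furthest toward a sign flip, and report how many removals were needed. The sketch below implements that heuristic on synthetic data; as the abstract notes, it only upper-bounds the true stability and can badly overestimate it, and it is not the paper's certification algorithm. The data-generating process, target coefficient, and stopping rule are illustrative assumptions.

```python
# Minimal sketch of the greedy removal heuristic for OLS stability: repeatedly
# drop the single sample whose removal pushes a chosen coefficient furthest
# toward a sign flip, and count how many removals are needed. This only
# upper-bounds the true stability metric; it is not the paper's certifier.
import numpy as np

rng = np.random.default_rng(0)
n, d, target = 60, 3, 0                       # target = index of the coefficient of interest

X = rng.normal(size=(n, d))
y = X @ np.array([0.3, 1.0, -1.0]) + rng.normal(size=n)

def ols_coef(X, y, j):
    return np.linalg.lstsq(X, y, rcond=None)[0][j]

remaining = list(range(n))
sign0 = np.sign(ols_coef(X, y, target))
removals = 0
while len(remaining) > d and np.sign(ols_coef(X[remaining], y[remaining], target)) == sign0:
    # Drop the sample whose removal moves the target coefficient most toward a flip.
    scores = [ols_coef(X[remaining[:i] + remaining[i + 1:]],
                       y[remaining[:i] + remaining[i + 1:]], target)
              for i in range(len(remaining))]
    best = int(np.argmin(sign0 * np.array(scores)))
    remaining.pop(best)
    removals += 1

print(f"greedy heuristic: sign of coefficient {target} flips after removing {removals} samples")
```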

NeurIPS 2022 Conference Paper

Learning in Observable POMDPs, without Computationally Intractable Oracles

  • Noah Golowich
  • Ankur Moitra
  • Dhruv Rohatgi

Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g., deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.

NeurIPS 2022 Conference Paper

Lower Bounds on Randomly Preconditioned Lasso via Robust Sparse Designs

  • Jonathan Kelner
  • Frederic Koehler
  • Raghu Meka
  • Dhruv Rohatgi

Sparse linear regression with ill-conditioned Gaussian random covariates is widely believed to exhibit a statistical/computational gap, but there is surprisingly little formal evidence for this belief. Recent work has shown that, for certain covariance matrices, the broad class of Preconditioned Lasso programs provably cannot succeed on polylogarithmically sparse signals with a sublinear number of samples. However, this lower bound only holds against deterministic preconditioners, and in many contexts randomization is crucial to the success of preconditioners. We prove a stronger lower bound that rules out randomized preconditioners. For an appropriate covariance matrix, we construct a single signal distribution on which any invertibly-preconditioned Lasso program fails with high probability, unless it receives a linear number of samples. Surprisingly, at the heart of our lower bound is a new robustness result in compressed sensing. In particular, we study recovering a sparse signal when a few measurements can be erased adversarially. To our knowledge, this natural question has not been studied before for sparse measurements. We show that standard sparse Bernoulli measurements are almost-optimally robust to adversarial erasures: if $b$ measurements are erased, then all but $O(b)$ of the coordinates of the signal are identifiable.

NeurIPS 2022 Conference Paper

Robust Generalized Method of Moments: A Finite Sample Viewpoint

  • Dhruv Rohatgi
  • Vasilis Syrgkanis

For many inference problems in statistics and econometrics, the unknown parameter is identified by a set of moment conditions. A generic method of solving moment conditions is the Generalized Method of Moments (GMM). However, classical GMM estimation is potentially very sensitive to outliers. Robustified GMM estimators have been developed in the past, but suffer from several drawbacks: computational intractability, poor dimension-dependence, and no quantitative recovery guarantees in the presence of a constant fraction of outliers. In this work, we develop the first computationally efficient GMM estimator (under intuitive assumptions) that can tolerate a constant $\epsilon$ fraction of adversarially corrupted samples, and that has an $\ell_2$ recovery guarantee of $O(\sqrt{\epsilon})$. To achieve this, we draw upon and extend a recent line of work on algorithmic robust statistics for related but simpler problems such as mean estimation, linear regression and stochastic optimization. As a special case, we apply our algorithm to instrumental variables linear regression with heterogeneous treatment effects, and experimentally demonstrate that it can tolerate as much as $10$ -- $15\%$ corruption, significantly improving upon baseline methods.
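
For context, the classical (non-robust) GMM estimator that the paper robustifies can be sketched in a few lines for instrumental-variables linear regression, where the parameter is identified by the moment condition $E[z(y - x^{\top}\beta)] = 0$. In the just-identified case the sample moment equation can be solved exactly, correcting the bias that OLS incurs from an unobserved confounder. The data-generating process below is an illustrative assumption, and the paper's robust estimator (which must additionally tolerate adversarial corruptions) is not implemented.

```python
# Minimal sketch of the classical, non-robust Generalized Method of Moments for
# instrumental-variables linear regression: the parameter is identified by the
# moment condition E[z (y - x^T beta)] = 0, solved exactly in the
# just-identified case. Illustrative only; the paper's *robust* GMM estimator
# is not implemented here.
import numpy as np

rng = np.random.default_rng(0)
n, beta_true = 20000, 2.0

z = rng.normal(size=n)                     # instrument
u = rng.normal(size=n)                     # unobserved confounder
x = z + u + 0.5 * rng.normal(size=n)       # endogenous regressor
y = beta_true * x + u + 0.5 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])       # regressors (intercept, x)
Z = np.column_stack([np.ones(n), z])       # instruments (intercept, z)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)     # biased by the confounder
beta_gmm = np.linalg.solve(Z.T @ X, Z.T @ y)     # solves the sample moment condition

print("OLS estimate of beta:", round(float(beta_ols[1]), 3))   # noticeably above 2.0
print("GMM/IV estimate     :", round(float(beta_gmm[1]), 3))
```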

FOCS 2021 Conference Paper

On the Power of Preconditioning in Sparse Linear Regression

  • Jonathan A. Kelner
  • Frederic Koehler
  • Raghu Meka
  • Dhruv Rohatgi

Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to efficiently solve it without restrictive conditions on the design matrix. We consider the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian $N(0, \Sigma)$, for some $n\times n$ positive semi-definite matrix $\Sigma$, and seek estimators $\hat{w}$ minimizing $(\hat{w}-w^{\ast})^{T}\Sigma(\hat{w}-w^{\ast})$, where $w^{\ast}$ is the $k$-sparse ground truth. Information-theoretically, one can achieve strong error bounds with only $O(k\log n)$ samples for arbitrary $\Sigma$ and $w^{\ast}$; however, no efficient algorithms are known to match these guarantees even with $o(n)$ samples, without further assumptions on $\Sigma$ or $w^{\ast}$. Yet there is little evidence for this gap in the random design setting: computational lower bounds are only known for worst-case design matrices. To date, random-design instances (i.e., specific covariance matrices $\Sigma$) have only been proven hard against the Lasso program and variants. More precisely, these “hard” instances can often be solved by Lasso after a simple change-of-basis (i.e., preconditioning). In this work, we give both upper and lower bounds clarifying the power of preconditioning as a tool for solving sparse linear regression problems. On the one hand, we show that the preconditioned Lasso can solve a large class of sparse linear regression problems nearly optimally: it succeeds whenever the dependency structure of the covariates, in the sense of the Markov property, has low treewidth -- even if $\Sigma$ is highly ill-conditioned. This upper bound builds on ideas from the wavelet and signal processing literature. As a special case of this result, we give an algorithm for sparse linear regression with covariates from an autoregressive time series model, where we also show that the (usual) Lasso provably fails. On the other hand, we construct (for the first time) random-design instances which are provably hard even for an optimally preconditioned Lasso. In fact, we complete our treewidth classification by proving that for any treewidth-$t$ graph, there exists a Gaussian Markov Random Field on this graph such that the preconditioned Lasso, with any choice of preconditioner, requires $\Omega(t^{1/20})$ samples to recover $O(\log n)$-sparse signals when covariates are drawn from this model.

NeurIPS 2020 Conference Paper

Constant-Expansion Suffices for Compressed Sensing with Generative Priors

  • Constantinos Daskalakis
  • Dhruv Rohatgi
  • Emmanouil Zampetakis

Generative neural networks have been empirically found very promising in providing effective structural priors for compressed sensing, since they can be trained to span low-dimensional data manifolds in high-dimensional signal spaces. Despite the non-convexity of the resulting optimization problem, it has also been shown theoretically that, for neural networks with random Gaussian weights, a signal in the range of the network can be efficiently, approximately recovered from a few noisy measurements. However, a major bottleneck of these theoretical guarantees is a network \emph{expansivity} condition: that each layer of the neural network must be larger than the previous by a logarithmic factor. Our main contribution is to break this strong expansivity assumption, showing that \emph{constant} expansivity suffices to get efficient recovery algorithms, and is moreover information-theoretically necessary. To overcome the theoretical bottleneck in existing approaches we prove a novel uniform concentration theorem for random functions that might not be Lipschitz but satisfy a relaxed notion which we call "pseudo-Lipschitzness." Using this theorem we show that a matrix concentration inequality known as the \emph{Weight Distribution Condition (WDC)}, which was previously only known to hold for Gaussian matrices with logarithmic aspect ratio, in fact holds for constant aspect ratios too. Since the WDC is a fundamental matrix concentration inequality at the heart of all existing theoretical guarantees on this problem, our tighter bound immediately yields improvements in all known results in the literature on compressed sensing with deep generative priors, including one-bit recovery, phase retrieval, and more.
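
The underlying recovery problem can be illustrated directly: the signal is $x^* = G(z^*)$ for a small random-weight ReLU network $G$, we observe compressive measurements $y = AG(z^*)$, and we search for $z$ by gradient descent on $\|AG(z) - y\|^2$ with a few random restarts. The network sizes, weight scalings, and plain gradient-descent solver in the sketch below are illustrative assumptions (and correspond to an easy, expansive regime), not the paper's construction or analysis.

```python
# Minimal sketch of compressed sensing with a generative prior: the signal is
# x* = G(z*) for a small random-weight ReLU network G, we observe y = A x*,
# and we recover z by gradient descent on ||A G(z) - y||^2 with random
# restarts. Illustrative assumptions throughout.
import numpy as np

rng = np.random.default_rng(0)
k, d1, d2, m = 4, 40, 80, 50                         # latent, hidden, signal, measurement dims

def unit_spectral(M):
    return M / np.linalg.norm(M, 2)                  # normalize largest singular value to 1

W1 = unit_spectral(rng.normal(size=(d1, k)))         # random-weight ReLU generator G
W2 = unit_spectral(rng.normal(size=(d2, d1)))
A = unit_spectral(rng.normal(size=(m, d2)))          # compressive measurement matrix

def G(z):
    return np.maximum(W2 @ np.maximum(W1 @ z, 0.0), 0.0)

z_star = rng.normal(size=k)
y = A @ G(z_star)                                    # noiseless measurements

def descend(z, steps=8000, lr=1.0):
    """Gradient descent on 0.5 * ||A G(z) - y||^2 (manual backprop through G)."""
    for _ in range(steps):
        h1 = W1 @ z; a1 = np.maximum(h1, 0.0)
        h2 = W2 @ a1; x = np.maximum(h2, 0.0)
        r = A @ x - y                                # residual
        g2 = (A.T @ r) * (h2 > 0)                    # backprop through the ReLU layers
        g1 = (W2.T @ g2) * (h1 > 0)
        z = z - lr * (W1.T @ g1)
    return z

# A few random restarts; keep the reconstruction with the smallest residual.
z_hat = min((descend(rng.normal(size=k)) for _ in range(5)),
            key=lambda z: np.linalg.norm(A @ G(z) - y))
print("relative reconstruction error:",
      round(float(np.linalg.norm(G(z_hat) - G(z_star)) / np.linalg.norm(G(z_star))), 4))
```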

SODA 2020 Conference Paper

Near-Optimal Bounds for Online Caching with Machine Learned Advice

  • Dhruv Rohatgi

In the model of online caching with machine learned advice, introduced by Lykouris and Vassilvitskii, the goal is to solve the caching problem with an online algorithm that has access to next-arrival predictions: when each input element arrives, the algorithm is given a prediction of the next time when the element will reappear. The traditional model for online caching suffers from an Ω(log k) competitive ratio lower bound (on a cache of size k). In contrast, the augmented model admits algorithms which beat this lower bound when the predictions have low error, and asymptotically match the lower bound when the predictions have high error, even if the algorithms are oblivious to the prediction error. In particular, Lykouris and Vassilvitskii showed that there is a prediction-augmented caching algorithm whose competitive ratio degrades gracefully with η/OPT, where η is a bound on the overall ℓ1 prediction error and OPT is the cost of the optimal offline algorithm. The dependence on k in the competitive ratio is optimal, but the dependence on η/OPT may be far from optimal. In this work, we make progress towards closing this gap. Our contributions are twofold. First, we provide an improved algorithm with a competitive ratio of O(1 + min((η/OPT)/k, 1) log k). Second, we provide a lower bound of Ω(log min((η/OPT)/(k log k), k)).
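
The prediction-augmented model is easy to simulate: each request comes with a (possibly noisy) prediction of when that element will next reappear, and the natural baseline evicts the cached element whose predicted next request is furthest in the future, i.e., Belady's offline rule applied to predictions. The sketch below implements only that baseline (not the paper's algorithm or Lykouris and Vassilvitskii's), with a synthetic request stream and additive prediction noise as illustrative assumptions; with exact predictions it coincides with the offline optimum, and its miss count typically grows as the predictions degrade.

```python
# Minimal sketch of caching with next-arrival predictions: on a miss with a
# full cache, evict the page whose *predicted* next request is furthest in the
# future (Belady's rule applied to predictions). A "trust the predictions"
# baseline, illustrative only; neither the paper's algorithm nor
# Lykouris-Vassilvitskii's.
import numpy as np

rng = np.random.default_rng(0)
k, num_pages, T = 4, 12, 400
requests = rng.integers(0, num_pages, size=T)

# True next-arrival time of each request (T if the page never reappears).
next_arrival = np.full(T, T)
last_seen = {}
for i in range(T - 1, -1, -1):
    next_arrival[i] = last_seen.get(requests[i], T)
    last_seen[requests[i]] = i

def simulate(predictions):
    cache, predicted_next, misses = set(), {}, 0
    for i, page in enumerate(requests):
        predicted_next[page] = predictions[i]       # prediction given at request time
        if page not in cache:
            misses += 1
            if len(cache) >= k:
                cache.remove(max(cache, key=lambda p: predicted_next[p]))
            cache.add(page)
    return misses

for noise in [0, 20, 200]:
    preds = next_arrival + rng.integers(0, noise + 1, size=T)
    print(f"prediction noise up to {noise:3d}: {simulate(preds)} misses "
          f"(offline Belady: {simulate(next_arrival)} misses)")
```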

NeurIPS 2020 Conference Paper

Truncated Linear Regression in High Dimensions

  • Constantinos Daskalakis
  • Dhruv Rohatgi
  • Emmanouil Zampetakis

As in standard linear regression, in truncated linear regression we are given access to observations $(A_i, y_i)$ whose dependent variable equals $y_i = A_i^{\top} x^* + \eta_i$, where $x^*$ is some fixed unknown vector of interest and $\eta_i$ is independent noise; except we are only given an observation if its dependent variable $y_i$ lies in some "truncation set" $S \subset \mathbb{R}$. The goal is to recover $x^*$ under some favorable conditions on the $A_i$'s and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $n$-dimensional vectors $x^*$ from $m$ truncated samples, which attains an optimal $\ell_2$ reconstruction error of $O(\sqrt{(k \log n)/m})$. As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, such as may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) [Daskalakis et al. 2018], where no regularization is needed due to the low dimensionality of the data, and (2) [Wainwright 2009], where the objective function is simple due to the absence of truncation. In order to deal with both truncation and high dimensionality at the same time, we develop new techniques that not only generalize the existing ones but, we believe, are of independent interest.
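
The truncated observation model itself is simple to simulate: generate $(A_i, y_i)$ from a sparse linear model and then discard every sample whose response falls outside the truncation set $S$. The sketch below does exactly that with $S = [1, \infty)$ and shows that a naive Lasso run on the surviving samples is biased relative to a Lasso run on the untruncated data; the truncation set, noise level, and regularization are illustrative assumptions, and the paper's truncation-aware SGD estimator is not implemented.

```python
# Minimal sketch of the truncated-sample observation model: we only see
# (A_i, y_i) when y_i falls in the truncation set S (here S = [1, inf)), and a
# naive sparse regression on the truncated sample is biased. Illustrative only;
# the paper's truncation-aware SGD/Lasso estimator is not implemented here.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k, threshold = 200, 4000, 3, 1.0

x_star = np.zeros(n)
x_star[:k] = 1.0

A = rng.normal(size=(m, n))
y = A @ x_star + rng.normal(size=m)

observed = y >= threshold                      # truncation set S = [threshold, inf)
naive = Lasso(alpha=0.05).fit(A[observed], y[observed]).coef_
oracle = Lasso(alpha=0.05).fit(A, y).coef_     # what we could do without truncation

print(f"kept {observed.sum()} of {m} samples after truncation")
print("l2 error, naive Lasso on truncated data:", round(float(np.linalg.norm(naive - x_star)), 3))
print("l2 error, Lasso on untruncated data    :", round(float(np.linalg.norm(oracle - x_star)), 3))
```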