Author name cluster

Adel Javanmard

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers

2 author rows

ICML Conference 2025 Conference Paper

DeepCrossAttention: Supercharging Transformer Residual Connections

Mike Heddes
Adel Javanmard
Kyriakos Axiotis
Gang Fu
MohammadHossein Bateni
Vahab Mirrokni

Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters (e. g. , 0. 2%). Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.

Details

ICML Conference 2025 Conference Paper

Improving the Variance of Differentially Private Randomized Experiments through Clustering

Adel Javanmard
Vahab Mirrokni
Jean Pouget-Abadie

Estimating causal effects from randomized experiments is only possible if participants are willing to disclose their potentially sensitive responses. Differential privacy, a widely used framework for ensuring an algorithm’s privacy guarantees, can encourage participants to share their responses without the risk of de-anonymization. However, many mechanisms achieve differential privacy by adding noise to the original dataset, which reduces the precision of causal effect estimation. This introduces a fundamental trade-off between privacy and variance when performing causal analyses on differentially private data. In this work, we propose a new differentially private mechanism, Cluster-DP, which leverages a given cluster structure in the data to improve the privacy-variance trade-off. While our results apply to any clustering, we demonstrate that selecting higher-quality clusters—according to a quality metric we introduce—can decrease the variance penalty without compromising privacy guarantees. Finally, we evaluate the theoretical and empirical performance of our Cluster-DP algorithm on both real and simulated data, comparing it to common baselines, including two special cases of our algorithm: its unclustered version and a uniform-prior version.

Details

ICML Conference 2025 Conference Paper

Integer Programming for Generalized Causal Bootstrap Designs

Jennifer Rogers Brennan
Sébastien Lahaie
Adel Javanmard
Nick Doudchenko
Jean Pouget-Abadie

In experimental causal inference, we distinguish between two sources of uncertainty: design uncertainty, due to the treatment assignment mechanism, and sampling uncertainty, when the sample is drawn from a super-population. This distinction matters in settings with small fixed samples and heterogeneous treatment effects, as in geographical experiments. The standard bootstrap procedure most often used by practitioners primarily estimates sampling uncertainty, and the causal bootstrap procedure, which accounts for design uncertainty, was developed for the completely randomized design and the difference-in-means estimator, whereas non-standard designs and estimators are often used in these low-power regimes. We address this gap by proposing an integer program which computes numerically the worst-case copula used as an input to the causal bootstrap method, in a wide range of settings. Specifically, we prove the asymptotic validity of our approach for unconfounded, conditionally unconfounded, and and individualistic with bounded confoundedness assignments, as well as generalizing to any linear-in-treatment and quadratic-in-treatment estimators. We demonstrate the refined confidence intervals achieved through simulations of small geographical experiments.

Details

ICML Conference 2025 Conference Paper

Retraining with Predicted Hard Labels Provably Increases Model Accuracy

Rudrajit Das
Inderjit S. Dhillon
Alessandro Epasto
Adel Javanmard
Jieming Mao
Vahab Mirrokni
Sujay Sanghavi
Peilin Zhong

The performance of a model trained with noisy labels is often improved by simply retraining the model with its own predicted hard labels (i. e. , $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable binary classification setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP), which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. For example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain more than $6$% improvement in accuracy with consensus-based retraining.

Details

ICLR Conference 2025 Conference Paper

Robust Feature Learning for Multi-Index Models in High Dimensions

Alireza Mousavi Hosseini
Adel Javanmard
Murat A. Erdogdu

Recently, there have been numerous studies on feature learning with neural networks, specifically on learning single- and multi-index models where the target is a function of a low-dimensional projection of the input. Prior works have shown that in high dimensions, the majority of the compute and data resources are spent on recovering the low-dimensional projection; once this subspace is recovered, the remainder of the target can be learned independently of the ambient dimension. However, implications of feature learning in adversarial settings remain unexplored. In this work, we take the first steps towards understanding adversarially robust feature learning with neural networks. Specifically, we prove that the hidden directions of a multi-index model offer a Bayes optimal low-dimensional projection for robustness against $\ell_2$-bounded adversarial perturbations under the squared loss, assuming that the multi-index coordinates are statistically independent from the rest of the coordinates. Therefore, robust learning can be achieved by first performing standard feature learning, then robustly tuning a linear readout layer on top of the standard representations. In particular, we show that adversarially robust learning is just as easy as standard learning. Specifically, the additional number of samples needed to robustly learn multi-index models when compared to standard learning, does not depend on dimensionality.

Details

NeurIPS Conference 2025 Conference Paper

Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing

Adel Javanmard
Rudrajit Das
Alessandro Epasto
Vahab Mirrokni

Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model’s performance. While prior works have demonstrated the benefits of specific heuristic retraining schemes, the question of how to optimally combine the model's predictions and the provided labels remains largely open. This paper addresses this fundamental question for binary classification tasks. We develop a principled framework based on approximate message passing (AMP) to analyze iterative retraining procedures for two ground truth settings: Gaussian mixture model (GMM) and generalized linear model (GLM). Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels, which when used to retrain the same model, minimizes its prediction error. We also quantify the performance of this optimal retraining strategy over multiple rounds. We complement our theoretical results by proposing a practically usable version of the theoretically-optimal aggregator function and demonstrate its superiority over baseline methods under different label noise models.

PDF Details

ICLR Conference 2024 Conference Paper

Learning from Aggregate responses: Instance Level versus Bag Level Loss Functions

Adel Javanmard
Lin Chen
Vahab Mirrokni
Ashwinkumar Badanidiyuru
Gang Fu

Due to the rise of privacy concerns, in many practical applications, the training data is aggregated before being shared with the learner to protect the privacy of users' sensitive responses. In an aggregate learning framework, the dataset is grouped into bags of samples, where each bag is available only with an aggregate response, providing a summary of individuals' responses in that bag. In this paper, we study two natural loss functions for learning from aggregate responses: the bag-level loss and the instance-level loss. In the former, the model is learned by minimizing a loss between the aggregate responses and aggregate model predictions, while in the latter, the model aims to fit individual predictions to the aggregate responses. In this work, we show that the instance-level loss can be perceived as a regularized form of the bag-level loss. This observation allows us to compare the two approaches with respect to the bias and variance of the resulting estimators and to introduce a novel interpolating estimator that combines the two approaches. For linear regression tasks, we provide a precise characterization of the risk of the interpolating estimator in an asymptotic regime where the size of the training set grows in proportion to the feature dimension. Our analysis enables us to theoretically understand the effect of different factors, such as bag size, on the model's prediction risk. Additionally, we propose a mechanism for differentially private learning from aggregate responses and derive the optimal bag size in terms of the prediction risk-privacy trade-off. We also carry out thorough experiments to corroborate our theory and show the efficacy of the interpolating estimator.

Details

ICML Conference 2024 Conference Paper

PriorBoost: An Adaptive Algorithm for Learning from Aggregate Responses

Adel Javanmard
Matthew Fahrbach
Vahab Mirrokni

This work studies algorithms for learning from aggregate responses. We focus on the construction of aggregation sets (called bags in the literature) for event-level loss functions. We prove for linear regression and generalized linear models (GLMs) that the optimal bagging problem reduces to one-dimensional size-constrained $k$-means clustering. Further, we theoretically quantify the advantage of using curated bags over random bags. We then propose the $\texttt{PriorBoost}$ algorithm, which adaptively forms bags of samples that are increasingly homogeneous with respect to (unobserved) individual responses to improve model quality. We study label differential privacy for aggregate learning, and we also provide extensive experiments showing that $\texttt{PriorBoost}$ regularly achieves optimal model quality for event-level predictions, in stark contrast to non-adaptive algorithms.

Details

JMLR Journal 2024 Journal Article

Structured Dynamic Pricing: Optimal Regret in a Global Shrinkage Model

Rashmi Ranjan Bhuyan
Adel Javanmard
Sungchul Kim
Gourab Mukherjee
Ryan A. Rossi
Tong Yu
Handong Zhao

We consider dynamic pricing strategies in a streamed longitudinal data set-up where the objective is to maximize, over time, the cumulative profit across a large number of customer segments. We consider a dynamic model with the consumers’ preferences as well as price sensitivity varying over time. Building on the well-known finding that consumers sharing similar characteristics act in similar ways, we consider a global shrinkage structure, which assumes that the consumers’ preferences across the different segments can be well approximated by a spatial autoregressive (SAR) model. In such a streamed longitudinal setup, we measure the performance of a dynamic pricing policy via regret, which is the expected revenue loss compared to a clairvoyant that knows the sequence of model parameters in advance. We propose a pricing policy based on penalized stochastic gradient descent (PSGD) and explicitly characterize its regret as functions of time, the temporal variability in the model parameters as well as the strength of the auto-correlation network structure spanning the varied customer segments. Our regret analysis results not only demonstrate asymptotic optimality of the proposed policy but also show that for policy planning it is essential to incorporate available structural information as policies based on unshrunken models are highly sub-optimal in the aforementioned set-up. We conduct simulation experiments across a wide range of regimes as well as real-world networks based studies and report encouraging performance for our proposed method. [abs] [ pdf ][ bib ] &copy JMLR 2024. ( edit, beta )

PDF Details

NeurIPS Conference 2023 Conference Paper

Anonymous Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization

Adel Javanmard
Vahab Mirrokni

While personalized recommendations systems have become increasingly popular, ensuring user data protection remains a top concern in the development of these learning systems. A common approach to enhancing privacy involves training models using anonymous data rather than individual data. In this paper, we explore a natural technique called "look-alike clustering", which involves replacing sensitive features of individuals with the cluster's average values. We provide a precise analysis of how training models using anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the features dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components on the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a regularization and improves generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments where we observe a perfect match when the sample size is only of order of a few hundreds.

PDF Details

ICML Conference 2023 Conference Paper

Learning Rate Schedules in the Presence of Distribution Shift

Matthew Fahrbach
Adel Javanmard
Vahab Mirrokni
Pratik Worah

We design learning rate schedules that minimize regret for SGD-based online learning in the presence of a changing data distribution. We fully characterize the optimal learning rate schedule for online linear regression via a novel analysis with stochastic differential equations. For general convex loss functions, we propose new learning rate schedules that are robust to distribution shift, and give upper and lower bounds for the regret that only differ by constants. For non-convex loss functions, we define a notion of regret based on the gradient norm of the estimated models and propose a learning schedule that minimizes an upper bound on the total expected regret. Intuitively, one expects changing loss landscapes to require more exploration, and we confirm that optimal learning rate schedules typically have higher learning rates in the presence of distribution shift. Finally, we provide experiments that illustrate these learning rate schedules and their regret.

Details

ICML Conference 2021 Conference Paper

Fundamental Tradeoffs in Distributionally Adversarial Training

Mohammad Mehrabi
Adel Javanmard
Ryan A. Rossi
Anup B. Rao
Tung Mai

Adversarial training is among the most effective techniques to improve robustness of models against adversarial perturbations. However, the full effect of this approach on models is not well understood. For example, while adversarial training can reduce the adversarial risk (prediction error against an adversary), it sometimes increase standard risk (generalization error when there is no adversary). In this paper, we focus on \emph{distribution perturbing} adversary framework wherein the adversary can change the test distribution within a neighborhood of the training data distribution. The neighborhood is defined via Wasserstein distance between distributions and the radius of the neighborhood is a measure of adversary’s manipulative power. We study the tradeoff between standard risk and adversarial risk and derive the Pareto-optimal tradeoff, achievable over specific classes of models, in the infinite data limit with features dimension kept fixed. We consider three learning settings: 1) Regression with the class of linear models; 2) Binary classification under the Gaussian mixtures data model, with the class of linear classifiers; 3) Regression with the class of random features model (which can be equivalently represented as two-layer neural network with random first-layer weights). We show that a tradeoff between standard and adversarial risk is manifested in all three settings. We further characterize the Pareto-optimal tradeoff curves and discuss how a variety of factors, such as features correlation, adversary’s power or the width of two-layer neural network would affect this tradeoff.

Details

JMLR Journal 2020 Journal Article

Perturbation Bounds for Procrustes, Classical Scaling, and Trilateration, with Applications to Manifold Learning

Ery Arias-Castro
Adel Javanmard
Bruno Pelletier

One of the common tasks in unsupervised learning is dimensionality reduction, where the goal is to find meaningful low-dimensional structures hidden in high-dimensional data. Sometimes referred to as manifold learning, this problem is closely related to the problem of localization, which aims at embedding a weighted graph into a low-dimensional Euclidean space. Several methods have been proposed for localization, and also manifold learning. Nonetheless, the robustness property of most of them is little understood. In this paper, we obtain perturbation bounds for classical scaling and trilateration, which are then applied to derive performance bounds for Isomap, Landmark Isomap, and Maximum Variance Unfolding. A new perturbation bound for procrustes analysis plays a key role. [abs] [ pdf ][ bib ] &copy JMLR 2020. ( edit, beta )

PDF Details

NeurIPS Conference 2019 Conference Paper

Dynamic Incentive-Aware Learning: Robust Pricing in Contextual Auctions

Negin Golrezaei
Adel Javanmard
Vahab Mirrokni

Motivated by pricing in ad exchange markets, we consider the problem of robust learning of reserve prices against strategic buyers in repeated contextual second-price auctions. Buyers' valuations \new{for} an item depend on the context that describes the item. However, the seller is not aware of the relationship between the context and buyers' valuations, i. e. , buyers' preferences. The seller's goal is to design a learning policy to set reserve prices via observing the past sales data, and her objective is to minimize her regret for revenue, where the regret is computed against a clairvoyant policy that knows buyers' heterogeneous preferences. Given the seller's goal, utility-maximizing buyers have the incentive to bid untruthfully in order to manipulate the seller's learning policy. We propose two learning policies that are robust to such strategic behavior. These policies use the outcomes of the auctions, rather than the submitted bids, to estimate the preferences while controlling the long-term effect of the outcome of each auction on the future reserve prices. The first policy called Contextual Robust Pricing (CORP) is designed for the setting where the market noise distribution is known to the seller and achieves a T-period regret of $O(d\log(Td) \log (T))$, where $d$ is the dimension of {the} contextual information. The second policy, which is a variant of the first policy, is called Stable CORP (SCORP). This policy is tailored to the setting where the market noise distribution is unknown to the seller and belongs to an ambiguity set. We show that the SCORP policy has a T-period regret of $O(\sqrt{d\log(Td)}\; T^{2/3})$.

PDF Details

JMLR Journal 2019 Journal Article

Dynamic Pricing in High-dimensions

Adel Javanmard
Hamid Nazerzadeh

We study the pricing problem faced by a firm that sells a large number of products, described via a wide range of features, to customers that arrive over time. Customers independently make purchasing decisions according to a general choice model that includes products features and customers' characteristics, encoded as $d$-dimensional numerical vectors, as well as the price offered. The parameters of the choice model are a priori unknown to the firm, but can be learned as the (binary-valued) sales data accrues over time. The firm's objective is to maximize its revenue. We benchmark the performance using the classic regret minimization framework where the regret is defined as the expected revenue loss against a clairvoyant policy that knows the parameters of the choice model in advance, and always offers the revenue-maximizing price. This setting is motivated in part by the prevalence of online marketplaces that allow for real-time pricing. We assume a structured choice model, parameters of which depend on $s_0$ out of the $d$ product features. Assuming that the market noise distribution is known, we propose a dynamic policy, called Regularized Maximum Likelihood Pricing (RMLP) that leverages the (sparsity) structure of the high-dimensional model and obtains a logarithmic regret in $T$. More specifically, the regret of our algorithm is of $O(s_0 \log d \cdot \log T)$. Furthermore, we show that no policy can obtain regret better than $O(s_0 (\log d + \log T))$. {In addition, we propose a generalization of our policy to a setting that the market noise distribution is unknown but belongs to a parametrized family of distributions. This policy obtains regret of $O(\sqrt{(\log d)T})$. We further show that no policy can obtain regret better than $\Omega(\sqrt{T})$ in such environments.} [abs] [ pdf ][ bib ] &copy JMLR 2019. ( edit, beta )

PDF Details

JMLR Journal 2017 Journal Article

Perishability of Data: Dynamic Pricing under Varying-Coefficient Models

Adel Javanmard

We consider a firm that sells a large number of products to its customers in an online fashion. Each product is described by a high dimensional feature vector, and the market value of a product is assumed to be linear in the values of its features. Parameters of the valuation model are unknown and can change over time. The firm sequentially observes a product's features and can use the historical sales data (binary sale/no sale feedbacks) to set the price of current product, with the objective of maximizing the collected revenue. We measure the performance of a dynamic pricing policy via regret, which is the expected revenue loss compared to a clairvoyant that knows the sequence of model parameters in advance. We propose a pricing policy based on projected stochastic gradient descent (PSGD) and characterize its regret in terms of time $T$, features dimension $d$, and the temporal variability in the model parameters, $\delta_t$. We consider two settings. In the first one, feature vectors are chosen antagonistically by nature and we prove that the regret of PSGD pricing policy is of order $O(\sqrt{T} + \sum_{t=1}^T \sqrt{t}\delta_t)$. In the second setting (referred to as stochastic features model), the feature vectors are drawn independently from an unknown distribution. We show that in this case, the regret of PSGD pricing policy is of order $O(d^2 \log T + \sum_{t=1}^T t\delta_t/d)$. [abs] [ pdf ][ bib ] &copy JMLR 2017. ( edit, beta )

PDF Details

JMLR Journal 2014 Journal Article

Confidence Intervals and Hypothesis Testing for High-Dimensional Regression

Adel Javanmard
Andrea Montanari

Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance as confidence intervals or $p$-values for these models. We consider here high- dimensional linear regression problem, and propose an efficient algorithm for constructing confidence intervals and $p$-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a `de-biased' version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. We test our method on synthetic data and a high- throughput genomic data set about riboflavin production rate, made publicly available by BÃ¼hlmann et al. (2014). [abs] [ pdf ][ bib ] &copy JMLR 2014. ( edit, beta )

PDF Details

NeurIPS Conference 2013 Conference Paper

Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

Adel Javanmard
Andrea Montanari

Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty' associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance as confidence intervals or p-values. We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a de-biased' version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. Furthermore, proofs are remarkably simple. We test our method on a diabetes prediction problem.

PDF Details

ICML Conference 2013 Conference Paper

Learning Linear Bayesian Networks with Latent Variables

Anima Anandkumar
Daniel J. Hsu
Adel Javanmard
Sham M. Kakade

This work considers the problem of learning linear Bayesian networks when some of the variables are unobserved. Identifiability and efficient recovery from low-order observable moments are established under a novel graphical constraint. The constraint concerns the expansion properties of the underlying directed acyclic graph (DAG) between observed and unobserved variables in the network, and it is satisfied by many natural families of DAGs that include multi-level DAGs, DAGs with effective depth one, as well as certain families of polytrees.

Details

NeurIPS Conference 2013 Conference Paper

Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition

Adel Javanmard
Andrea Montanari

In the high-dimensional regression model a response variable is linearly related to $p$ covariates, but the sample size $n$ is smaller than $p$. We assume that only a small subset of covariates is `active' (i. e. , the corresponding coefficients are non-zero), and consider the model-selection problem of identifying the active covariates. A popular approach is to estimate the regression coefficients through the Lasso ($\ell_1$-regularized least squares). This is known to correctly identify the active set only if the irrelevant covariates are roughly orthogonal to the relevant ones, as quantified through the so called `irrepresentability' condition. In this paper we study the `Gauss-Lasso' selector, a simple two-stage method that first solves the Lasso, and then performs ordinary least squares restricted to the Lasso active set. We formulate `generalized irrepresentability condition' (GIC), an assumption that is substantially weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recovers the active set.

PDF Details

NeurIPS Conference 2012 Conference Paper

Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

Morteza Ibrahimi
Adel Javanmard
Benjamin Roy

We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, an asymptotic regret bound of $\tilde{O}(\sqrt{T})$ was shown for $T \gg p$ where $p$ is the dimension of the state space. In this work we consider the case where the matrices describing the dynamic of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that for $p \gg 1$ and $T \gg \polylog(p)$ achieves a regret bound of $\tilde{O}(p \sqrt{T})$. In particular, our algorithm has an average cost of $(1+\eps)$ times the optimum cost after $T = \polylog(p) O(1/\eps^2)$. This is in comparison to previous work on the dense dynamics where the algorithm needs $\Omega(p)$ samples before it can estimate the unknown dynamic with any significant accuracy. We believe our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.

PDF Details