Arrow Research search

Author name cluster

Raman Arora

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

58 papers
2 author rows

Possible papers

58

ICML Conference 2025 Conference Paper

Backdoor Attacks in Token Selection of Attention Mechanism

  • Yunjuan Wang
  • Raman Arora

Despite the remarkable success of large foundation models across a range of tasks, they remain susceptible to security threats such as backdoor attacks. By injecting poisoned data containing specific triggers during training, adversaries can manipulate model predictions in a targeted manner. While prior work has focused on empirically designing and evaluating such attacks, a rigorous theoretical understanding of when and why they succeed is lacking. In this work, we analyze backdoor attacks that exploit the token selection process within attention mechanisms–a core component of transformer-based architectures. We show that single-head self-attention transformers trained via gradient descent can interpolate poisoned training data. Moreover, we prove that when the backdoor triggers are sufficiently strong but not overly dominant, attackers can successfully manipulate model predictions. Our analysis characterizes how adversaries manipulate token selection to alter outputs and identifies the theoretical conditions under which these attacks succeed. We validate our findings through experiments on synthetic datasets.

ICML Conference 2025 Conference Paper

Policy-Regret Minimization in Markov Games with Function Approximation

  • Thanh Nguyen-Tang
  • Raman Arora

We study policy-regret minimization problem in dynamically evolving environments, modeled as Markov games between a learner and a strategic, adaptive opponent. We propose a general algorithmic framework that achieves the optimal $\mathcal{O}(\sqrt{T})$ policy regret for a wide class of large-scale problems characterized by an Eluder-type condition–extending beyond the tabular settings of previous work. Importantly, our framework uncovers a simpler yet powerful algorithmic approach for handling reactive adversaries, demonstrating that leveraging opponent learning in such settings is key to attaining the optimal $\mathcal{O}(\sqrt{T})$ policy regret.

NeurIPS Conference 2025 Conference Paper

When Does Curriculum Learning Help? A Theoretical Perspective

  • Raman Arora
  • Yunjuan Wang
  • Kaibo Zhang

Curriculum learning has emerged as an effective strategy to enhance the training efficiency and generalization of machine learning models. However, its theoretical underpinnings remain relatively underexplored. In this work, we develop a theoretical framework for curriculum learning based on biased regularized empirical risk minimization (RERM), identifying conditions under which curriculum learning provably improves generalization. We introduce a sufficient condition that characterizes a "good" curriculum and analyze a multi-task curriculum framework, where solving a sequence of convex tasks can facilitate better generalization. We also demonstrate how these theoretical insights translate to practical benefits when using stochastic gradient descent (SGD) as an optimization method. Beyond convex settings, we explore the utility of curriculum learning for non-convex tasks. Empirical evaluations on synthetic datasets and MNIST validate our theoretical findings and highlight the practical efficacy of curriculum-based training.

ICML Conference 2024 Conference Paper

Adversarially Robust Hypothesis Transfer Learning

  • Yunjuan Wang
  • Raman Arora

In this work, we explore Hypothesis Transfer Learning (HTL) under adversarial attacks. In this setting, a learner has access to a training dataset of size $n$ from an underlying distribution $\mathcal{D}$ and a set of auxiliary hypotheses. These auxiliary hypotheses, which can be viewed as prior information originating either from expert knowledge or as pre-trained foundation models, are employed as an initialization for the learning process. Our goal is to develop an adversarially robust model for $\mathcal{D}$. We begin by examining an adversarial variant of the regularized empirical risk minimization learning rule that we term A-RERM. Assuming a non-negative smooth loss function with a strongly convex regularizer, we establish a bound on the robust generalization error of the hypothesis returned by A-RERM in terms of the robust empirical loss and the quality of the initialization. If the initialization is good, i. e. , there exists a weighted combination of auxiliary hypotheses with a small robust population loss, the bound exhibits a fast rate of $\mathcal{O}(1/n)$. Otherwise, we get the standard rate of $\mathcal{O}(1/\sqrt{n})$. Additionally, we provide a bound on the robust excess risk which is similar in nature, albeit with a slightly worse rate. We also consider solving the problem using a practical variant, namely proximal stochastic adversarial training, and present a bound that depends on the initialization. This bound has the same dependence on the sample size as the ARERM bound, except for an additional term that depends on the size of the adversarial perturbation.

NeurIPS Conference 2024 Conference Paper

Adversarially Robust Multi-task Representation Learning

  • Austin Watkins
  • Thanh Nguyen-Tang
  • Enayat Ullah
  • Raman Arora

We study adversarially robust transfer learning, wherein, given labeled data on multiple (source) tasks, the goal is to train a model with small robust error on a previously unseen (target) task. In particular, we consider a multi-task representation learning (MTRL) setting, i. e. , we assume that the source and target tasks admit a simple (linear) predictor on top of a shared representation (e. g. , the final hidden layer of a deep neural network). In this general setting, we provide rates on~the excess adversarial (transfer) risk for Lipschitz losses and smooth nonnegative losses. These rates show that learning a representation using adversarial training on diverse tasks helps protect against inference-time attacks in data-scarce environments. Additionally, we provide novel rates for the single-task setting.

ICML Conference 2024 Conference Paper

Benign Overfitting in Adversarial Training of Neural Networks

  • Yunjuan Wang
  • Kaibo Zhang
  • Raman Arora

Benign overfitting is the phenomenon wherein none of the predictors in the hypothesis class can achieve perfect accuracy (i. e. , non-realizable or noisy setting), but a model that interpolates the training data still achieves good generalization. A series of recent works aim to understand this phenomenon for regression and classification tasks using linear predictors as well as two-layer neural networks. In this paper, we study such a benign overfitting phenomenon in an adversarial setting. We show that under a distributional assumption, interpolating neural networks found using adversarial training generalize well despite inference-time attacks. Specifically, we provide convergence and generalization guarantees for adversarial training of two-layer networks (with smooth as well as non-smooth activation functions) showing that under moderate $\ell_2$ norm perturbation budget, the trained model has near-zero robust training loss and near-optimal robust generalization error. We support our theoretical findings with an empirical study on synthetic and real-world data.

NeurIPS Conference 2024 Conference Paper

Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms

  • Thanh Nguyen-Tang
  • Raman Arora

We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner's strategies. While most existing works in Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversaries are adaptive. In this work, we focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policy, in hindsight. We show that if the opponent has unbounded memory or if it is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emph{consistent} adaptive adversaries, wherein, the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.

NeurIPS Conference 2024 Conference Paper

Offline Multitask Representation Learning for Reinforcement Learning

  • Haque Ishfaq
  • Thanh Nguyen-Tang
  • Songtao Feng
  • Raman Arora
  • Mengdi Wang
  • Ming Yin
  • Doina Precup

We study offline multitask representation learning in reinforcement learning (RL), where a learner is provided with an offline dataset from different tasks that share a common representation and is asked to learn the shared representation. We theoretically investigate offline multitask low-rank RL, and propose a new algorithm called MORL for offline multitask representation learning. Furthermore, we examine downstream RL in reward-free, offline and online scenarios, where a new task is introduced to the agent that shares the same representation as the upstream offline tasks. Our theoretical results demonstrate the benefits of using the learned representation from the upstream offline task instead of directly learning the representation of the low-rank model.

NeurIPS Conference 2024 Conference Paper

On the Stability and Generalization of Meta-Learning

  • Yunjuan Wang
  • Raman Arora

We focus on developing a theoretical understanding of meta-learning. Given multiple tasks drawn i. i. d. from some (unknown) task distribution, the goal is to find a good pre-trained model that can be adapted to a new, previously unseen, task with little computational and statistical overhead. We introduce a novel notion of stability for meta-learning algorithms, namely uniform meta-stability. We instantiate two uniformly meta-stable learning algorithms based on regularized empirical risk minimization and gradient descent and give explicit generalization bounds for convex learning problems with smooth losses and for weakly convex learning problems with non-smooth losses. Finally, we extend our results to stochastic and adversarially robust variants of our meta-learning algorithm.

ICML Conference 2024 Conference Paper

On The Statistical Complexity of Offline Decision-Making

  • Thanh Nguyen-Tang
  • Raman Arora

We study the statistical complexity of offline decision-making with function approximation, establishing (near) minimax-optimal rates for stochastic contextual bandits and Markov decision processes. The performance limits are captured by the pseudo-dimension of the (value) function class and a new characterization of the behavior policy that strictly subsumes all the previous notions of data coverage in the offline decision-making literature. In addition, we seek to understand the benefits of using offline data in online decision-making and show nearly minimax-optimal rates in a wide range of regimes.

NeurIPS Conference 2024 Conference Paper

Public-data Assisted Private Stochastic Optimization: Power and Limitations

  • Enayat Ullah
  • Michael Menart
  • Raef Bassily
  • Cristóbal Guzmán
  • Raman Arora

We study the limits and capability of public-data assisted differentially private (PA-DP) algorithms. Specifically, we focus on the problem of stochastic convex optimization (SCO) with either labeled or unlabeled public data. For complete/labeled public data, we show that any $(\epsilon, \delta)$-PA-DP has excess risk $\tilde{\Omega}\big(\min(\frac{1}{\sqrt{n_{\text{pub}}}}, \frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon} ) \big)$, where $d$ is the dimension, ${n_{\text{pub}}}$ is the number of public samples, ${n_{\text{priv}}}$ is the number of private samples, and $n={n_{\text{pub}}}+{n_{\text{priv}}}$. These lower bounds are established via our new lower bounds for PA-DP mean estimation, which are of a similar form. Up to constant factors, these lower bounds show that the simple strategy of either treating all data as private or discarding the private data, is optimal. We also study PA-DP supervised learning with \textit{unlabeled} public samples. In contrast to our previous result, we here show novel methods for leveraging public data in private supervised learning. For generalized linear models (GLM) with unlabeled public data, we show an efficient algorithm which, given $\tilde{O}({n_{\text{priv}}}\epsilon)$ unlabeled public samples, achieves the dimension independent rate $\tilde{O}\big(\frac{1}{\sqrt{{n_{\text{priv}}}}} + \frac{1}{\sqrt{{n_{\text{priv}}}\epsilon}}\big)$. We develop new lower bounds for this setting which shows that this rate cannot be improved with more public samples, and any fewer public samples leads to a worse rate. Finally, we provide extensions of this result to general hypothesis classes with finite \textit{fat-shattering dimension} with applications to neural networks and non-Euclidean geometries.

NeurIPS Conference 2024 Conference Paper

Stability and Generalization of Adversarial Training for Shallow Neural Networks with Smooth Activation

  • Kaibo Zhang
  • Yunjuan Wang
  • Raman Arora

Adversarial training has emerged as a popular approach for training models that are robust to inference-time adversarial attacks. However, our theoretical understanding of why and when it works remains limited. Prior work has offered generalization analysis of adversarial training, but they are either restricted to the Neural Tangent Kernel (NTK) regime or they make restrictive assumptions about data such as (noisy) linear separability or robust realizability. In this work, we study the stability and generalization of adversarial training for two-layer networks without any data distribution assumptions and beyond the NTK regime. Our findings suggest that for networks with any given initialization and sufficiently large width, the generalization bound can be effectively controlled via early stopping. We further improve the generalization bound by leveraging smoothing using Moreau’s envelope.

AAAI Conference 2023 Conference Paper

A Risk-Sensitive Approach to Policy Optimization

  • Jared Markowitz
  • Ryan W. Gardner
  • Ashley Llorens
  • Raman Arora
  • I-Jeng Wang

Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy. This differs from human decision-making, where gains and losses are valued differently and outlying outcomes are given increased consideration. It also fails to capitalize on opportunities to improve safety and/or performance through the incorporation of distributional context. Several approaches to distributional DRL have been investigated, with one popular strategy being to evaluate the projected distribution of returns for possible actions. We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized. This approach allows for outcomes to be weighed based on relative quality, can be used for both continuous and discrete action spaces, and may naturally be applied in both constrained and unconstrained settings. We show how to compute an asymptotically consistent estimate of the policy gradient for a broad class of risk-sensitive objectives via sampling, subsequently incorporating variance reduction and regularization measures to facilitate effective on-policy learning. We then demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies. We test the approach using different risk profiles in six OpenAI Safety Gym environments, comparing to state of the art on-policy methods. Without cost constraints, we find that pessimistic risk profiles can be used to reduce cost while improving total reward accumulation. With cost constraints, they are seen to provide higher positive rewards than risk-neutral approaches at the prescribed allowable cost.

TMLR Journal 2023 Journal Article

Clustering using Approximate Nearest Neighbour Oracles

  • Enayat Ullah
  • Harry Lang
  • Raman Arora
  • Vladimir Braverman

We study the problem of clustering data points in a streaming setting when one has access to the geometry of the space only via approximate nearest neighbour (ANN) oracles. In this setting, we present algorithms for streaming $O(1)$-approximate $k$-median clustering and its (streaming) coreset construction. In certain domains of interest, such as spaces with constant expansion, our algorithms improve upon the best-known runtime of both these problems. Furthermore, our results extend to cost functions satisfying the approximate triangle inequality, which subsumes $k$-means clustering and $M$-estimators. Finally, we run experiments on Census1990 dataset wherein the results empirically support our theory.

ICML Conference 2023 Conference Paper

Faster Rates of Convergence to Stationary Points in Differentially Private Optimization

  • Raman Arora
  • Raef Bassily
  • Tomás González
  • Cristóbal Guzmán
  • Michael Menart
  • Enayat Ullah

We study the problem of approximating stationary points of Lipschitz and smooth functions under $(\varepsilon, \delta)$-differential privacy (DP) in both the finite-sum and stochastic settings. A point $\widehat{w}$ is called an $\alpha$-stationary point of a function $F: \mathbb{R}^d\rightarrow\mathbb{R}$ if $\|\nabla F(\widehat{w})\|\leq \alpha$. We give a new construction that improves over the existing rates in the stochastic optimization setting, where the goal is to find approximate stationary points of the population risk given $n$ samples. Our construction finds a $\tilde{O}\big(\frac{1}{n^{1/3}} + \big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$-stationary point of the population risk in time linear in $n$. We also provide an efficient algorithm that finds an $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{2/3}\big)$-stationary point in the finite-sum setting. This improves on the previous best rate of $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$. Furthermore, under the additional assumption of convexity, we completely characterize the sample complexity of finding stationary points of the population risk (up to polylog factors) and show that the optimal rate on population stationarity is $\tilde \Theta\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\big)$. Finally, we show that our methods can be used to provide dimension-independent rates of $O\big(\frac{1}{\sqrt{n}}+\min\big(\big[\frac{\sqrt{rank}}{n\varepsilon}\big]^{2/3}, \frac{1}{(n\varepsilon)^{2/5}}\big)\big)$ on population stationarity for Generalized Linear Models (GLM), where $rank$ is the rank of the design matrix, which improves upon the previous best known rate.

ICML Conference 2023 Conference Paper

From Adaptive Query Release to Machine Unlearning

  • Enayat Ullah
  • Raman Arora

We formalize the problem of machine unlearning as design of efficient unlearning algorithms corresponding to learning algorithms which perform a selection of adaptive queries from structured query classes. We give efficient unlearning algorithms for linear and prefix-sum query classes. As applications, we show that unlearning in many problems, in particular, stochastic convex optimization (SCO), can be reduced to the above, yielding improved guarantees for the problem. In particular, for smooth Lipschitz losses and any $\rho>0$, our results yield an unlearning algorithm with excess population risk of $\tilde O\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\rho}\big)$ with unlearning query (gradient) complexity $\tilde O(\rho \cdot \text{Retraining Complexity})$, where $d$ is the model dimensionality and $n$ is the initial number of samples. For non-smooth Lipschitz losses, we give an unlearning algorithm with excess population risk $\tilde O\big(\frac{1}{\sqrt{n}}+\big(\frac{\sqrt{d}}{n\rho}\big)^{1/2}\big)$ with the same unlearning query (gradient) complexity. Furthermore, in the special case of Generalized Linear Models (GLMs), such as those in linear and logistic regression, we get dimension-independent rates of $\tilde O\big(\frac{1}{\sqrt{n}} +\frac{1}{(n\rho)^{2/3}}\big)$ and $\tilde O\big(\frac{1}{\sqrt{n}} +\frac{1}{(n\rho)^{1/3}}\big)$ for smooth Lipschitz and non-smooth Lipschitz losses respectively. Finally, we give generalizations of the above from one unlearning request to dynamic streams consisting of insertions and deletions.

TMLR Journal 2023 Journal Article

Generalization bounds for Kernel Canonical Correlation Analysis

  • Enayat Ullah
  • Raman Arora

We study the problem of multiview representation learning using kernel canonical correlation analysis (KCCA) and establish non-asymptotic bounds on generalization error for regularized empirical risk minimization. In particular, we give fine-grained high-probability bounds on generalization error ranging from $O(n^{-1/6})$ to $O(n^{-1/5})$ depending on underlying distributional properties, where $n$ is the number of data samples. For the special case of finite-dimensional Hilbert spaces (such as linear CCA), our rates improve, ranging from $O(n^{-1/2})$ to $O(n^{-1})$. Finally, our results generalize to the problem of functional canonical correlation analysis over abstract Hilbert spaces.

NeurIPS Conference 2023 Conference Paper

Multi-Agent Learning with Heterogeneous Linear Contextual Bandits

  • Anh Do
  • Thanh Nguyen-Tang
  • Raman Arora

As trained intelligent systems become increasingly pervasive, multiagent learning has emerged as a popular framework for studying complex interactions between autonomous agents. Yet, a formal understanding of how and when learners in heterogeneous environments benefit from sharing their respective experiences is far from complete. In this paper, we seek answers to these questions in the context of linear contextual bandits. We present a novel distributed learning algorithm based on the upper confidence bound (UCB) algorithm, which we refer to as H-LINUCB, wherein agents cooperatively minimize the group regret under the coordination of a central server. In the setting where the level of heterogeneity or dissimilarity across the environments is known to the agents, we show that H-LINUCB is provably optimal in regimes where the tasks are highly similar or highly dissimilar.

AAAI Conference 2023 Conference Paper

On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation

  • Thanh Nguyen-Tang
  • Ming Yin
  • Sunil Gupta
  • Svetha Venkatesh
  • Raman Arora

Sample-efficient offline reinforcement learning (RL) with linear function approximation has been studied extensively recently. Much of the prior work has yielded instance-independent rates that hold even for the worst-case realization of problem instances. This work seeks to understand instance-dependent bounds for offline RL with linear function approximation. We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI), which leverages data bootstrapping and constrained optimization on top of pessimism. We show that under a partial data coverage assumption, that of concentrability with respect to an optimal policy, the proposed algorithm yields a fast rate for offline RL when there is a positive gap in the optimal Q-value functions, even if the offline data were collected adaptively. Moreover, when the linear features of the optimal actions in the states reachable by an optimal policy span those reachable by the behavior policy and the optimal actions are unique, offline RL achieves absolute zero sub-optimality error when the number of episodes exceeds a (finite) instance-dependent threshold. To the best of our knowledge, these are the first results that give a fast rate bound on the sub-optimality and an absolute zero sub-optimality bound for offline RL with linear function approximation from adaptive data with partial coverage. We also provide instance-agnostic and instance-dependent information-theoretical lower bounds to complement our upper bounds.

NeurIPS Conference 2023 Conference Paper

On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling and Beyond

  • Thanh Nguyen-Tang
  • Raman Arora

We seek to understand what facilitates sample-efficient learning from historical datasets for sequential decision-making, a problem that is popularly known as offline reinforcement learning (RL). Further, we are interested in algorithms that enjoy sample efficiency while leveraging (value) function approximation. In this paper, we address these fundamental questions by (i) proposing a notion of data diversity that subsumes the previous notions of coverage measures in offline RL and (ii) using this notion to \emph{unify} three distinct classes of offline RL algorithms based on version spaces (VS), regularized optimization (RO), and posterior sampling (PS). We establish that VS-based, RO-based, and PS-based algorithms, under standard assumptions, achieve \emph{comparable} sample efficiency, which recovers the state-of-the-art sub-optimality bounds for finite and linear model classes with the standard assumptions. This result is surprising, given that the prior work suggested an unfavorable sample complexity of the RO-based algorithm compared to the VS-based algorithm, whereas posterior sampling is rarely considered in offline RL due to its explorative nature. Notably, our proposed model-free PS-based algorithm for offline RL is \emph{novel}, with sub-optimality bounds that are \emph{frequentist} (i. e. , worst-case) in nature.

NeurIPS Conference 2023 Conference Paper

Optimistic Rates for Multi-Task Representation Learning

  • Austin Watkins
  • Enayat Ullah
  • Thanh Nguyen-Tang
  • Raman Arora

We study the problem of transfer learning via Multi-Task Representation Learning (MTRL), wherein multiple source tasks are used to learn a good common representation, and a predictor is trained on top of it for the target task. Under standard regularity assumptions on the loss function and task diversity, we provide new statistical rates on the excess risk of the target task, which demonstrate the benefit of representation learning. Importantly, our rates are optimistic, i. e. , they interpolate between the standard $O(m^{-1/2})$ rate and the fast $O(m^{-1})$ rate, depending on the difficulty of the learning task, where $m$ is the number of samples for the target task. Besides the main result, we make several new contributions, including giving optimistic rates for excess risk of source tasks (multi-task learning (MTL)), a local Rademacher complexity theorem for MTRL and MTL, as well as a chain rule for local Rademacher complexity for composite predictor classes.

NeurIPS Conference 2023 Conference Paper

Robustness Guarantees for Adversarially Trained Neural Networks

  • Poorya Mianjy
  • Raman Arora

We study robust adversarial training of two-layer neural networks as a bi-level optimization problem. In particular, for the inner loop that implements the adversarial attack during training using projected gradient descent (PGD), we propose maximizing a \emph{lower bound} on the $0/1$-loss by reflecting a surrogate loss about the origin. This allows us to give a convergence guarantee for the inner-loop PGD attack. Furthermore, assuming the data is linearly separable, we provide precise iteration complexity results for end-to-end adversarial training, which holds for any width and initialization. We provide empirical evidence to support our theoretical results.

ICLR Conference 2023 Conference Paper

VIPeR: Provably Efficient Algorithm for Offline RL with Neural Function Approximation

  • Thanh Nguyen-Tang
  • Raman Arora

We propose a novel algorithm for offline reinforcement learning called Value Iteration with Perturbed Rewards (VIPeR), which amalgamates the pessimism principle with random perturbations of the value function. Most current offline RL algorithms explicitly construct statistical confidence regions to obtain pessimism via lower confidence bounds (LCB), which cannot easily scale to complex problems where a neural network is used to estimate the value functions. Instead, VIPeR implicitly obtains pessimism by simply perturbing the offline data multiple times with carefully-designed i.i.d. Gaussian noises to learn an ensemble of estimated state-action {value functions} and acting greedily with respect to the minimum of the ensemble. The estimated state-action values are obtained by fitting a parametric model (e.g., neural networks) to the perturbed datasets using gradient descent. As a result, VIPeR only needs $\mathcal{O}(1)$ time complexity for action selection, while LCB-based algorithms require at least $\Omega(K^2)$, where $K$ is the total number of trajectories in the offline data. We also propose a novel data-splitting technique that helps remove a factor involving the log of the covering number in our bound. We prove that VIPeR yields a provable uncertainty quantifier with overparameterized neural networks and enjoys a bound on sub-optimality of $\tilde{\mathcal{O}}( { \kappa H^{5/2} \tilde{d} }/{\sqrt{K}})$, where $\tilde{d}$ is the effective dimension, $H$ is the horizon length and $\kappa$ measures the distributional shift. We corroborate the statistical and computational efficiency of VIPeR with an empirical evaluation on a wide set of synthetic and real-world datasets. To the best of our knowledge, VIPeR is the first algorithm for offline RL that is provably efficient for general Markov decision processes (MDPs) with neural network function approximation.

NeurIPS Conference 2022 Conference Paper

Adversarial Robustness is at Odds with Lazy Training

  • Yunjuan Wang
  • Enayat Ullah
  • Poorya Mianjy
  • Raman Arora

Recent works show that adversarial examples exist for random neural networks [Daniely and Schacham, 2020] and that these examples can be found using a single step of gradient ascent [Bubeck et al. , 2021]. In this work, we extend this line of work to ``lazy training'' of neural networks -- a dominant model in deep learning theory in which neural networks are provably efficiently learnable. We show that over-parametrized neural networks that are guaranteed to generalize well and enjoy strong computational guarantees remain vulnerable to attacks generated using a single step of gradient ascent.

NeurIPS Conference 2022 Conference Paper

Differentially Private Generalized Linear Models Revisited

  • Raman Arora
  • Raef Bassily
  • Cristóbal Guzmán
  • Michael Menart
  • Enayat Ullah

We study the problem of $(\epsilon, \delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert^2}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^\ast\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}} + {\min\left\{\frac{\Vert w^*\Vert^{4/3}}{(n\epsilon)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{n\epsilon}\right\}}\right)$. We also revisit the previously studied case of Lipschitz losses \cite{SSTT21}. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{n\epsilon}}, \frac{\sqrt{\text{rank}}\Vert w^*\Vert}{n\epsilon}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\Vert w^*\Vert$.

ICML Conference 2021 Conference Paper

Dropout: Explicit Forms and Capacity Control

  • Raman Arora
  • Peter L. Bartlett
  • Poorya Mianjy
  • Nathan Srebro

We investigate the capacity control provided by dropout in various machine learning problems. First, we study dropout for matrix completion, where it induces a distribution-dependent regularizer that equals the weighted trace-norm of the product of the factors. In deep learning, we show that the distribution-dependent regularizer due to dropout directly controls the Rademacher complexity of the underlying class of deep neural networks. These developments enable us to give concrete generalization error bounds for the dropout algorithm in both matrix completion as well as training deep neural networks.

ICML Conference 2021 Conference Paper

Robust Learning for Data Poisoning Attacks

  • Yunjuan Wang
  • Poorya Mianjy
  • Raman Arora

We investigate the robustness of stochastic approximation approaches against data poisoning attacks. We focus on two-layer neural networks with ReLU activation and show that under a specific notion of separability in the RKHS induced by the infinite-width network, training (finite-width) networks with stochastic gradient descent is robust against data poisoning attacks. Interestingly, we find that in addition to a lower bound on the width of the network, which is standard in the literature, we also require a distribution-dependent upper bound on the width for robust generalization. We provide extensive empirical evaluations that support and validate our theoretical results.

NeurIPS Conference 2020 Conference Paper

Adversarial Robustness of Supervised Sparse Coding

  • Jeremias Sulam
  • Ramchandran Muthukumar
  • Raman Arora

Several recent results provide theoretical insights into the phenomena of adversarial examples. Existing results, however, are often limited due to a gap between the simplicity of the models studied and the complexity of those deployed in practice. In this work, we strike a better balance by considering a model that involves learning a representation while at the same time giving a precise generalization bound and a robustness certificate. We focus on the hypothesis class obtained by combining a sparsity-promoting encoder coupled with a linear classifier, and show an interesting interplay between the expressivity and stability of the (supervised) representation map and a notion of margin in the feature space. We bound the robust risk (to $\ell_2$-bounded perturbations) of hypotheses parameterized by dictionaries that achieve a mild encoder gap on training data. Furthermore, we provide a robustness certificate for end-to-end classification. We demonstrate the applicability of our analysis by computing certified accuracy on real data, and compare with other alternatives for certified robustness.

ICML Conference 2020 Conference Paper

FetchSGD: Communication-Efficient Federated Learning with Sketching

  • Daniel Rothchild
  • Ashwinee Panda
  • Enayat Ullah
  • Nikita Ivkin
  • Ion Stoica
  • Vladimir Braverman
  • Joseph E. Gonzalez
  • Raman Arora

Existing approaches to federated learning suffer from a communication bottleneck as well as convergence issues due to sparse client participation. In this paper we introduce a novel algorithm, called FetchSGD, to overcome these challenges. FetchSGD compresses model updates using a Count Sketch, and then takes advantage of the mergeability of sketches to combine model updates from many workers. A key insight in the design of FetchSGD is that, because the Count Sketch is linear, momentum and error accumulation can both be carried out within the sketch. This allows the algorithm to move momentum and error accumulation from clients to the central aggregator, overcoming the challenges of sparse client participation while still achieving high compression rates and good convergence. We prove that FetchSGD has favorable convergence guarantees, and we demonstrate its empirical effectiveness by training two residual networks and a transformer model.

NeurIPS Conference 2020 Conference Paper

On Convergence and Generalization of Dropout Training

  • Poorya Mianjy
  • Raman Arora

We study dropout in two-layer neural networks with rectified linear unit (ReLU) activations. Under mild overparametrization and assuming that the limiting kernel can separate the data distribution with a positive margin, we show that the dropout training with logistic loss achieves $\epsilon$-suboptimality in the test error in $O(1/\epsilon)$ iterations.

NeurIPS Conference 2019 Conference Paper

Bandits with Feedback Graphs and Switching Costs

  • Raman Arora
  • Teodor Vanislavov Marinov
  • Mehryar Mohri

We study the adversarial multi-armed bandit problem where the learner is supplied with partial observations modeled by a \emph{feedback graph} and where shifting to a new action incurs a fixed \emph{switching cost}. We give two new algorithms for this problem in the informed setting. Our best algorithm achieves a pseudo-regret of $\tilde O(\gamma(G)^{\frac{1}{3}}T^{\frac{2}{3}})$, where $\gamma(G)$ is the domination number of the feedback graph. This significantly improves upon the previous best result for the same problem, which was based on the independence number of $G$. We also present matching lower bounds for our result that we describe in detail. Finally, we give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.

NeurIPS Conference 2019 Conference Paper

Communication-efficient Distributed SGD with Sketching

  • Nikita Ivkin
  • Daniel Rothchild
  • Enayat Ullah
  • Vladimir Braverman
  • Ion Stoica
  • Raman Arora

Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we introduce Sketched-SGD, an algorithm for carrying out distributed SGD by communicating sketches instead of full gradients. We show that \ssgd has favorable convergence rates on several classes of functions. When considering all communication -- both of gradients and of updated model weights -- Sketched-SGD reduces the amount of communication required compared to other gradient compression methods from $\mathcal{O}(d)$ or $\mathcal{O}(W)$ to $\mathcal{O}(\log d)$, where $d$ is the number of model parameters and $W$ is the number of workers participating in training. We run experiments on a transformer model, an LSTM, and a residual network, demonstrating up to a 40x reduction in total communication cost with no loss in final model performance. We also show experimentally that Sketched-SGD scales to at least 256 workers without increasing communication cost or degrading model performance.

NeurIPS Conference 2019 Conference Paper

Efficient Convex Relaxations for Streaming PCA

  • Raman Arora
  • Teodor Vanislavov Marinov

We revisit two algorithms, matrix stochastic gradient (MSG) and $\ell_2$-regularized MSG (RMSG), that are instances of stochastic gradient descent (SGD) on a convex relaxation to principal component analysis (PCA). These algorithms have been shown to outperform Oja’s algorithm, empirically, in terms of the iteration complexity, and to have runtime comparable with Oja’s. However, these findings are not supported by existing theoretical results. While the iteration complexity bound for $\ell_2$-RMSG was recently shown to match that of Oja’s algorithm, its theoretical efficiency was left as an open problem. In this work, we give improved bounds on per iteration cost of mini-batched variants of both MSG and $\ell_2$-RMSG and arrive at an algorithm with total computational complexity matching that of Oja's algorithm.

NeurIPS Conference 2019 Conference Paper

On Differentially Private Graph Sparsification and Applications

  • Raman Arora
  • Jalaj Upadhyay

In this paper, we study private sparsification of graphs. In particular, we give an algorithm that given an input graph, returns a sparse graph which approximates the spectrum of the input graph while ensuring differential privacy. This allows one to solve many graph problems privately yet efficiently and accurately. This is exemplified with application of the proposed meta-algorithm to graph algorithms for privately answering cut-queries, as well as practical algorithms for computing {\scshape MAX-CUT} and {\scshape SPARSEST-CUT} with better accuracy than previously known. We also give the first efficient private algorithm to learn Laplacian eigenmap on a graph.

ICML Conference 2019 Conference Paper

On Dropout and Nuclear Norm Regularization

  • Poorya Mianjy
  • Raman Arora

We give a formal and complete characterization of the explicit regularizer induced by dropout in deep linear networks with squared loss. We show that (a) the explicit regularizer is composed of an $\ell_2$-path regularizer and other terms that are also re-scaling invariant, (b) the convex envelope of the induced regularizer is the squared nuclear norm of the network map, and (c) for a sufficiently large dropout rate, we characterize the global optima of the dropout objective. We validate our theoretical findings with empirical results.

UAI Conference 2019 Conference Paper

On Fast Convergence of Proximal Algorithms for SQRT-Lasso Optimization: Don't Worry About its Nonsmooth Loss Function

  • Xingguo Li
  • Haoming Jiang
  • Jarvis D. Haupt
  • Raman Arora
  • Han Liu 0001
  • Mingyi Hong 0001
  • Tuo Zhao

Many machine learning techniques sacrifice convenient computational structures to gain estimation robustness and modeling flexibility. However, by exploring the modeling structures, we find these “sacrifices” do not always require more computational efforts. To shed light on such a “free-lunch” phenomenon, we study the square-root-Lasso (SQRT-Lasso) type regression problem. Specifically, we show that the nonsmooth loss functions of SQRT-Lasso type regression ease tuning effort and gain adaptivity to inhomogeneous noise, but is not necessarily more challenging than Lasso in computation. We can directly apply proximal algorithms (e. g. proximal gradient descent, proximal Newton, and proximal quasi-Newton algorithms) without worrying about the nonsmoothness of the loss function. Theoretically, we prove that the proximal algorithms combined with the pathwise optimization scheme enjoy fast convergence guarantees with high probability. Numerical results are provided to support our theory.

ICRA Conference 2019 Conference Paper

Visual Robot Task Planning

  • Chris Paxton 0001
  • Yotam Barnoy
  • Kapil D. Katyal
  • Raman Arora
  • Gregory D. Hager

Prospection is key to solving challenging problems in new environments, but it has not been deeply explored as applied to task planning for perception-driven robotics. We propose visual robot task planning, where we take in an input image and must generate a sequence of high-level actions and associated observations that achieve some task. In this paper, we describe a neural network architecture and associated planning algorithm that (1) learns a representation of the world that can generate prospective futures, (2) uses this generative model to simulate the result of sequences of high-level actions in a variety of environments, and (3) evaluates these actions via a variant of Monte Carlo Tree Search to find a viable solution to a particular problem. Our approach allows us to visualize intermediate motion goals and learn to plan complex activity from visual information, and used this to generate and visualize task plans on held-out examples of a block-stacking simulation.

NeurIPS Conference 2018 Conference Paper

Differentially Private Robust Low-Rank Approximation

  • Raman Arora
  • Vladimir Braverman
  • Jalaj Upadhyay

In this paper, we study the following robust low-rank matrix approximation problem: given a matrix $A \in \R^{n \times d}$, find a rank-$k$ matrix $B$, while satisfying differential privacy, such that $ \norm{ A - B }_p \leq \alpha \mathsf{OPT}_k(A) + \tau, $ where $\norm{ M }_p$ is the entry-wise $\ell_p$-norm and $\mathsf{OPT}_k(A): =\min_{\mathsf{rank}(X) \leq k} \norm{ A - X}_p$. It is well known that low-rank approximation w. r. t. entrywise $\ell_p$-norm, for $p \in [1, 2)$, yields robustness to gross outliers in the data. We propose an algorithm that guarantees $\alpha=\widetilde{O}(k^2), \tau=\widetilde{O}(k^2(n+kd)/\varepsilon)$, runs in $\widetilde O((n+d)\poly~k)$ time and uses $O(k(n+d)\log k)$ space. We study extensions to the streaming setting where entries of the matrix arrive in an arbitrary order and output is produced at the very end or continually. We also study the related problem of differentially private robust principal component analysis (PCA), wherein we return a rank-$k$ projection matrix $\Pi$ such that $\norm{ A - A \Pi }_p \leq \alpha \mathsf{OPT}_k(A) + \tau. $

JMLR Journal 2018 Journal Article

On Faster Convergence of Cyclic Block Coordinate Descent-type Methods for Strongly Convex Minimization

  • Xingguo Li
  • Tuo Zhao
  • Raman Arora
  • Han Liu
  • Mingyi Hong

The cyclic block coordinate descent-type (CBCD-type) methods, which perform iterative updates for a few coordinates (a block) simultaneously throughout the procedure, have shown remarkable computational performance for solving strongly convex minimization problems. Typical applications include many popular statistical machine learning methods such as elastic-net regression, ridge penalized logistic regression, and sparse additive regression. Existing optimization literature has shown that for strongly convex minimization, the CBCD-type methods attain iteration complexity of $\mathcal{O}(p\log(1/\epsilon))$, where $\epsilon$ is a pre-specified accuracy of the objective value, and $p$ is the number of blocks. However, such iteration complexity explicitly depends on $p$, and therefore is at least $p$ times worse than the complexity $\mathcal{O}(\log(1/\epsilon))$ of gradient descent (GD) methods. To bridge this theoretical gap, we propose an improved convergence analysis for the CBCD-type methods. In particular, we first show that for a family of quadratic minimization problems, the iteration complexity $\mathcal{O}(\log^2(p)\cdot\log(1/\epsilon))$ of the CBCD-type methods matches that of the GD methods in term of dependency on $p$, up to a $\log^2 p$ factor. Thus our complexity bounds are sharper than the existing bounds by at least a factor of $p/\log^2(p)$. We also provide a lower bound to confirm that our improved complexity bounds are tight (up to a $\log^2 (p)$ factor), under the assumption that the largest and smallest eigenvalues of the Hessian matrix do not scale with $p$. Finally, we generalize our analysis to other strongly convex minimization problems beyond quadratic ones. [abs] [ pdf ][ bib ] &copy JMLR 2018. ( edit, beta )

ICML Conference 2018 Conference Paper

On the Implicit Bias of Dropout

  • Poorya Mianjy
  • Raman Arora
  • René Vidal

Algorithmic approaches endow deep learning systems with implicit bias that helps them generalize even in over-parametrized settings. In this paper, we focus on understanding such a bias induced in learning through dropout, a popular technique to avoid overfitting in deep learning. For single hidden-layer linear neural networks, we show that dropout tends to make the norm of incoming/outgoing weight vectors of all the hidden nodes equal. In addition, we provide a complete characterization of the optimization landscape induced by dropout.

NeurIPS Conference 2018 Conference Paper

Policy Regret in Repeated Games

  • Raman Arora
  • Michael Dinitz
  • Teodor Vanislavov Marinov
  • Mehryar Mohri

The notion of policy regret'' in online learning is supposed to capture the reactions of the adversary to the actions taken by the learner, which more traditional notions such as external regret do not take into account. We revisit this notion of policy regret, and first show that there are online learning settings in which policy regret and external regret are incompatible: any sequence of play which does well with respect to one must do poorly with respect to the other. We then focus on the game theoretic setting, when the adversary is a self-interested agent. In this setting we show that the external regret and policy regret are not in conflict, and in fact that a wide class of algorithms can ensure both as long as the adversary is also using such an algorithm. We also define a new notion of equilibrium which we call a policy equilibrium'', and show that no-policy regret algorithms will have play which converges to such an equilibrium. Relating this back to external regret, we show that coarse correlated equilibria (which no-external regret players will converge to) are a strict subset of policy equilibria. So in game-theoretic settings every sequence of play with no external regret also has no policy regret, but the converse is not true.

ICML Conference 2018 Conference Paper

Stochastic PCA with 𝓁 2 and 𝓁 1 Regularization

  • Poorya Mianjy
  • Raman Arora

We revisit convex relaxation based methods for stochastic optimization of principal component analysis (PCA). While methods that directly solve the nonconvex problem have been shown to be optimal in terms of statistical and computational efficiency, the methods based on convex relaxation have been shown to enjoy comparable, or even superior, empirical performance – this motivates the need for a deeper formal understanding of the latter. Therefore, in this paper, we study variants of stochastic gradient descent for a convex relaxation of PCA with (a) $\ell_2$, (b) $\ell_1$, and (c) elastic net ($\ell_1+\ell_2)$ regularization in the hope that these variants yield (a) better iteration complexity, (b) better control on the rank of the intermediate iterates, and (c) both, respectively. We show, theoretically and empirically, that compared to previous work on convex relaxation based methods, the proposed variants yield faster convergence and improve overall runtime to achieve a certain user-specified $\epsilon$-suboptimality on the PCA objective. Furthermore, the proposed methods are shown to converge both in terms of the PCA objective as well as the distance between subspaces. However, there still remains a gap in computational requirements for the proposed methods when compared with existing nonconvex approaches.

NeurIPS Conference 2018 Conference Paper

Streaming Kernel PCA with $\tilde{O}(\sqrt{n})$ Random Features

  • Enayat Ullah
  • Poorya Mianjy
  • Teodor Vanislavov Marinov
  • Raman Arora

We study the statistical and computational aspects of kernel principal component analysis using random Fourier features and show that under mild assumptions, $O(\sqrt{n} \log n)$ features suffices to achieve $O(1/\epsilon^2)$ sample complexity. Furthermore, we give a memory efficient streaming algorithm based on classical Oja's algorithm that achieves this rate

ICML Conference 2018 Conference Paper

Streaming Principal Component Analysis in Noisy Settings

  • Teodor V. Marinov
  • Poorya Mianjy
  • Raman Arora

We study streaming algorithms for principal component analysis (PCA) in noisy settings. We present computationally efficient algorithms with sub-linear regret bounds for PCA in the presence of noise, missing data, and gross outliers.

NeurIPS Conference 2018 Conference Paper

The Physical Systems Behind Optimization Algorithms

  • Lin Yang
  • Raman Arora
  • Vladimir Braverman
  • Tuo Zhao

We use differential equations based approaches to provide some {\it \textbf{physics}} insights into analyzing the dynamics of popular optimization algorithms in machine learning. In particular, we study gradient descent, proximal gradient descent, coordinate gradient descent, proximal coordinate gradient, and Newton's methods as well as their Nesterov's accelerated variants in a unified framework motivated by a natural connection of optimization algorithms to physical systems. Our analysis is applicable to more general algorithms and optimization problems {\it \textbf{beyond}} convexity and strong convexity, e. g. Polyak-\L ojasiewicz and error bound conditions (possibly nonconvex).

NeurIPS Conference 2017 Conference Paper

Stochastic Approximation for Canonical Correlation Analysis

  • Raman Arora
  • Teodor Vanislavov Marinov
  • Poorya Mianjy
  • Nati Srebro

We propose novel first-order stochastic approximation algorithms for canonical correlation analysis (CCA). Algorithms presented are instances of inexact matrix stochastic gradient (MSG) and inexact matrix exponentiated gradient (MEG), and achieve $\epsilon$-suboptimality in the population objective in $\operatorname{poly}(\frac{1}{\epsilon})$ iterations. We also consider practical variants of the proposed algorithms and compare them with other methods for CCA both theoretically and empirically.

NeurIPS Conference 2016 Conference Paper

Disease Trajectory Maps

  • Peter Schulam
  • Raman Arora

Medical researchers are coming to appreciate that many diseases are in fact complex, heterogeneous syndromes composed of subpopulations that express different variants of a related complication. Longitudinal data extracted from individual electronic health records (EHR) offer an exciting new way to study subtle differences in the way these diseases progress over time. In this paper, we focus on answering two questions that can be asked using these databases of longitudinal EHR data. First, we want to understand whether there are individuals with similar disease trajectories and whether there are a small number of degrees of freedom that account for differences in trajectories across the population. Second, we want to understand how important clinical outcomes are associated with disease trajectories. To answer these questions, we propose the Disease Trajectory Map (DTM), a novel probabilistic model that learns low-dimensional representations of sparse and irregularly sampled longitudinal data. We propose a stochastic variational inference algorithm for learning the DTM that allows the model to scale to large modern medical datasets. To demonstrate the DTM, we analyze data collected on patients with the complex autoimmune disease, scleroderma. We find that DTM learns meaningful representations of disease trajectories and that the representations are significantly associated with important clinical outcomes.

ICML Conference 2016 Conference Paper

Stochastic Optimization for Multiview Representation Learning using Partial Least Squares

  • Raman Arora
  • Poorya Mianjy
  • Teodor V. Marinov

Partial Least Squares (PLS) is a ubiquitous statistical technique for bilinear factor analysis. It is used in many data analysis, machine learning, and information retrieval applications to model the covariance structure between a pair of data matrices. In this paper, we consider PLS for representation learning in a multiview setting where we have more than one view in data at training time. Furthermore, instead of framing PLS as a problem about a fixed given data set, we argue that PLS should be studied as a stochastic optimization problem, especially in a "big data" setting, with the goal of optimizing a population objective based on sample. This view suggests using Stochastic Approximation (SA) approaches, such as Stochastic Gradient Descent (SGD) and enables a rigorous analysis of their benefits. In this paper, we develop SA approaches to PLS and provide iteration complexity bounds for the proposed algorithms.

ICML Conference 2016 Conference Paper

Stochastic Variance Reduced Optimization for Nonconvex Sparse Learning

  • Xingguo Li
  • Tuo Zhao
  • Raman Arora
  • Han Liu 0001
  • Jarvis D. Haupt

We propose a stochastic variance reduced optimization algorithm for solving a class of large-scale nonconvex optimization problems with cardinality constraints, and provide sufficient conditions under which the proposed algorithm enjoys strong linear convergence guarantees and optimal estimation accuracy in high dimensions. Numerical experiments demonstrate the efficiency of our method in terms of both parameter estimation and computational performance.

ICML Conference 2015 Conference Paper

On Deep Multi-View Representation Learning

  • Weiran Wang
  • Raman Arora
  • Karen Livescu
  • Jeff A. Bilmes

We consider learning representations (features) in the setting in which we have access to multiple unlabeled views of the data for representation learning while only one view is available at test time. Previous work on this problem has proposed several techniques based on deep neural networks, typically involving either autoencoder-like networks with a reconstruction objective or paired feedforward networks with a correlation-based objective. We analyze several techniques based on prior work, as well as new variants, and compare them experimentally on visual, speech, and language domains. To our knowledge this is the first head-to-head comparison of a variety of such techniques on multiple tasks. We find an advantage for correlation-based representation learning, while the best results on most tasks are obtained with our new variant, deep canonically correlated autoencoders (DCCAE).

NeurIPS Conference 2014 Conference Paper

Accelerated Mini-batch Randomized Block Coordinate Descent Method

  • Tuo Zhao
  • Mo Yu
  • Yiming Wang
  • Raman Arora
  • Han Liu

We consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problems in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. Thus they need all data to be accessible so that the partial gradient of the block gradient can be exactly obtained. However, such a ``batch setting may be computationally expensive in practice. In this paper, we propose a mini-batch randomized block coordinate descent (MRBCD) method, which estimates the partial gradient of the selected block based on a mini-batch of randomly sampled data in each iteration. We further accelerate the MRBCD method by exploiting the semi-stochastic optimization scheme, which effectively reduces the variance of the partial gradient estimators. Theoretically, we show that for strongly convex functions, the MRBCD method attains lower overall iteration complexity than existing RBCD methods. As an application, we further trim the MRBCD method to solve the regularized sparse learning problems. Our numerical experiments shows that the MRBCD method naturally exploits the sparsity structure and achieves better computational performance than existing methods. "

ICML Conference 2013 Conference Paper

Deep Canonical Correlation Analysis

  • Galen Andrew
  • Raman Arora
  • Jeff A. Bilmes
  • Karen Livescu

We introduce Deep Canonical Correlation Analysis (DCCA), a method to learn complex nonlinear transformations of two views of data such that the resulting representations are highly linearly correlated. Parameters of both transformations are jointly learned to maximize the (regularized) total correlation. It can be viewed as a nonlinear extension of the linear method \emphcanonical correlation analysis (CCA). It is an alternative to the nonparametric method \emphkernel canonical correlation analysis (KCCA) for learning correlated nonlinear transformations. Unlike KCCA, DCCA does not require an inner product, and has the advantages of a parametric method: training time scales well with data size and the training data need not be referenced when computing the representations of unseen instances. In experiments on two real-world datasets, we find that DCCA learns representations with significantly higher correlation than those learned by CCA and KCCA. We also introduce a novel non-saturating sigmoid function based on the cube root that may be useful more generally in feedforward neural networks.

JMLR Journal 2013 Journal Article

Similarity-based Clustering by Left-Stochastic Matrix Factorization

  • Raman Arora
  • Maya R. Gupta
  • Amol Kapila
  • Maryam Fazel

For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. [abs] [ pdf ][ bib ] &copy JMLR 2013. ( edit, beta )

NeurIPS Conference 2013 Conference Paper

Stochastic Optimization of PCA with Capped MSG

  • Raman Arora
  • Andy Cotter
  • Nati Srebro

We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as Matrix Stochastic Gradient'' (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically. "

UAI Conference 2012 Conference Paper

Deterministic MDPs with Adversarial Rewards and Bandit Feedback

  • Raman Arora
  • Ofer Dekel
  • Ambuj Tewari

We consider a Markov decision process with deterministic state transition dynamics, adversarially generated rewards that change arbitrarily from round to round, and a bandit feedback model in which the decision maker only observes the rewards it receives. In this setting, we present a novel and efficient online decision making algorithm named MarcoPolo. Under mild assumptions on the structure of the transition dynamics, we prove that MarcoPolo enjoys a regret of O(T3/4 √ log T) against the best deterministic policy in hindsight. Specifically, our analysis does not rely on the stringent unichain assumption, which dominates much of the previous work on this topic.

NeurIPS Conference 2009 Conference Paper

On Learning Rotations

  • Raman Arora

An algorithm is presented for online learning of rotations. The proposed algorithm involves matrix exponentiated gradient updates and is motivated by the Von Neumann divergence. The additive updates are skew-symmetric matrices with trace zero which comprise the Lie algebra of the rotation group. The orthogonality and unit determinant of the matrix parameter are preserved using matrix logarithms and exponentials and the algorithm lends itself to interesting interpretations in terms of the computational topology of the compact Lie groups. The stability and the computational complexity of the algorithm are discussed.