Arrow Research search

Author name cluster

Sanjeev Arora

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

98 papers
2 author rows

Possible papers

98

TMLR Journal 2026 Journal Article

Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set

  • Yikai Wu
  • Haoyu Zhao
  • Sanjeev Arora

AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on Maximum Independent Set (MIS). Strikingly, even on in-distribution random graphs, leading AI-inspired methods are consistently outperformed by the state-of-the-art classical solver KaMIS running on a single CPU, and some AI-inspired methods frequently fail to surpass even the simplest degree-based greedy heuristic. Even with post-processing techniques like local search, AI-inspired methods still perform worse than CPU-based solvers. To better understand the source of these failures, we introduce a novel analysis, serialization, which reveals that non-backtracking AI-inspired methods, e.g. LTFT (which is based on GFlowNets), end up reasoning similarly to the simplest degree-based greedy heuristic, and thus perform worse than KaMIS. More generally, our findings suggest a need to rethink current approaches in AI for CO, advocating for more rigorous benchmarking and the principled integration of classical heuristics. Additionally, we find that the CPU-based algorithm KaMIS has strong performance on sparse random graphs, which appears to show that the shattering-threshold conjecture for large independent sets proposed by Coja-Oghlan & Efthymiou (2015) is either false or does not apply at real-life sizes (such as $10^6$ nodes).
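
The degree-based greedy baseline mentioned in this abstract is simple enough to sketch. Below is a minimal illustrative version (the graph size, edge probability, and use of the networkx library are our choices for illustration, not the paper's code):

```python
import networkx as nx

def degree_greedy_mis(graph: nx.Graph) -> set:
    """Greedy MIS baseline: repeatedly pick a minimum-degree vertex,
    add it to the independent set, then delete it and its neighbors."""
    g = graph.copy()
    independent_set = set()
    while g.number_of_nodes() > 0:
        v = min(g.nodes, key=g.degree)          # vertex of minimum remaining degree
        independent_set.add(v)
        g.remove_nodes_from(list(g.neighbors(v)) + [v])
    return independent_set

if __name__ == "__main__":
    G = nx.erdos_renyi_graph(n=1000, p=0.01, seed=0)   # toy random graph
    mis = degree_greedy_mis(G)
    # sanity check: no two chosen vertices are adjacent
    assert all(not G.has_edge(u, v) for u in mis for v in mis if u != v)
    print(f"independent set of size {len(mis)} on {G.number_of_nodes()} nodes")
```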

ICML Conference 2025 Conference Paper

Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

  • Simon Park 0002
  • Abhishek Panigrahi
  • Yun Cheng
  • Dingli Yu
  • Anirudh Goyal
  • Sanjeev Arora

Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning, even compared to LLMs on the same tasks presented in text form, giving rise to perceptions of modality imbalance or brittleness. Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that (1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text, and (2) this conversion can be internalized at test time. We also report results of a mechanistic study of this phenomenon. We identify measures of gradient alignment that predict which training strategies promote better S2H generalization. Ablations highlight the importance of chain-of-thought.

NeurIPS Conference 2025 Conference Paper

Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving of Inequalities

  • Haoyu Zhao
  • Yihan Geng
  • Shange Tang
  • Yong Lin
  • Bohan Lyu
  • Hongzhou Lin
  • Chi Jin
  • Sanjeev Arora

LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question in the context of mathematical inequalities, specifically the prover's ability to recognize that a given problem simplifies by applying a known inequality such as AM/GM. In particular, we are interested in their ability to do this in a {\em compositional setting} where multiple inequalities must be applied as part of a solution. We introduce \ineqcomp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers---including Goedel, STP, and Kimina-7B---struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness, but still suffers a 20\% performance drop (pass@32). Even for the DeepSeek-Prover-V2-671B model, a gap between compositional variants and seed problems remains, implying that simply scaling up model size does not fully solve the compositional weakness. Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of the weakness is indeed compositional reasoning. Our results expose a persistent gap between the generalization behavior of current AI provers and human mathematical intuition. All data and evaluation code can be found at \url{https://github.com/haoyuzhao123/LeanIneqComp}.

ICLR Conference 2025 Conference Paper

Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning

  • Simran Kaur 0001
  • Simon Park 0002
  • Anirudh Goyal
  • Sanjeev Arora

We introduce INSTRUCT-SKILLMIX, an automated approach for creating diverse, high-quality SFT data for instruction-following. The pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core “skills” for instruction-following by directly prompting the model. This is inspired by the “LLM metacognition” of Didolkar et al. (2024); (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. The estimated cost of creating the dataset is under $600. Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from INSTRUCT-SKILLMIX leads to strong gains on instruction-following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves a 42.76% length-controlled win rate on AlpacaEval 2.0, a level similar to frontier models like Claude 3 Opus and LLaMA-3.1-405B-Instruct. Ablation studies also suggest plausible reasons why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult. In our dataset, adding 20% low-quality answers (“shirkers”) causes a noticeable degradation in performance. The INSTRUCT-SKILLMIX pipeline seems flexible and adaptable to other settings.
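
The second stage described above (sampling a random pair of skills and prompting a strong LLM) can be sketched as follows; the skill names and prompt wording here are hypothetical placeholders, since the paper extracts its skill list by querying an LLM:

```python
import random

# Hypothetical skill list; INSTRUCT-SKILLMIX extracts skills by prompting a strong LLM.
SKILLS = [
    "multi-step arithmetic", "persuasive argument", "metaphor",
    "counterfactual reasoning", "formal tone", "structured list output",
]

def make_generation_prompt(k: int = 2, seed=None) -> str:
    """Data-generation sketch: pick a random k-subset of skills and ask a
    powerful LLM to write an (instruction, response) pair exhibiting all of them."""
    rng = random.Random(seed)
    chosen = rng.sample(SKILLS, k)
    return (
        "Write one instruction and a high-quality response. "
        f"The response must demonstrate these skills: {', '.join(chosen)}."
    )

print(make_generation_prompt(seed=0))
```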

NeurIPS Conference 2025 Conference Paper

LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

  • Zihan Zheng
  • Zerui Cheng
  • Zeyu Shen
  • Shang Zhou
  • Kaiyuan Liu
  • Hansen He
  • Dongruixuan Li
  • Stanley Wei

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53\% pass@1 on medium-difficulty problems and 0\% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.

TMLR Journal 2025 Journal Article

Low Compute Unlearning via Sparse Representations

  • Vedant Shah
  • Frederik Träuble
  • Ashish Malik
  • Hugo Larochelle
  • Michael Curtis Mozer
  • Sanjeev Arora
  • Yoshua Bengio
  • Anirudh Goyal

Machine unlearning, which involves erasing knowledge about a \emph{forget set} from a trained model, can prove costly and infeasible using existing techniques. We propose a low-compute unlearning technique based on a discrete representational bottleneck. We show that the proposed technique efficiently unlearns the forget set and incurs negligible damage to the model's performance on the rest of the dataset. We evaluate the proposed technique on the problem of class unlearning using four datasets: CIFAR-10, CIFAR-100, LACUNA-100, and ImageNet-1k. We compare the proposed technique to SCRUB, a state-of-the-art approach that uses knowledge distillation for unlearning. Across all four datasets, the proposed technique performs as well as, if not better than, SCRUB while incurring almost no computational cost.

ICML Conference 2025 Conference Paper

On the Power of Context-Enhanced Learning in LLMs

  • Xingyu Zhu
  • Abhishek Panigrahi
  • Sanjeev Arora

We formalize a new concept for LLMs, context-enhanced learning. It involves standard gradient-based learning on text except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works. Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be exponentially more sample-efficient than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal. We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.

ICLR Conference 2025 Conference Paper

Provable unlearning in topic modeling and downstream tasks

  • Stanley Wei
  • Sadhika Malladi
  • Sanjeev Arora
  • Amartya Sanyal

Machine unlearning algorithms are increasingly important as legal concerns arise around the provenance of training data, but verifying the success of unlearning is often difficult. Provable guarantees for unlearning are often limited to supervised learning settings. In this paper, we provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm by studying topic models, simple bag-of-words language models that can be adapted to solve downstream tasks like retrieval and classification. First, we design a provably effective unlearning algorithm for topic models that incurs a computational overhead independent of the size of the original dataset. Our analysis additionally quantifies the deletion capacity of the model -- i.e., the number of examples that can be unlearned without incurring a significant cost in model performance. Finally, we formally extend our analyses to account for adaptation to a given downstream task. In particular, we design an efficient algorithm to perform unlearning after fine-tuning the topic model via a linear head. Notably, we show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.

ICLR Conference 2025 Conference Paper

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

  • Noam Razin
  • Sadhika Malladi
  • Adithya Bhaskar
  • Danqi Chen 0001
  • Sanjeev Arora
  • Boris Hanin

Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counterintuitive phenomenon, which we term *likelihood displacement*. We demonstrate that likelihood displacement can be *catastrophic*, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can *unintentionally lead to unalignment*, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a *centered hidden embedding similarity (CHES)* score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
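
The abstract's filtering step uses an embedding-similarity score over preference pairs. The sketch below shows one illustrative way such a score could be computed from pooled hidden embeddings; it is a stand-in for intuition only and is not the paper's exact CHES definition:

```python
import torch

def ches_like_score(h_pref: torch.Tensor, h_dis: torch.Tensor) -> torch.Tensor:
    """Illustrative embedding-similarity score for preference pairs.

    h_pref, h_dis: (batch, hidden) pooled hidden embeddings of the preferred
    and dispreferred responses. Center each across the batch and take the
    per-pair inner product; pairs with large scores are the ones one would
    flag as likely drivers of likelihood displacement.
    NOTE: this is an assumption-laden sketch, not the paper's CHES formula.
    """
    h_pref_c = h_pref - h_pref.mean(dim=0, keepdim=True)
    h_dis_c = h_dis - h_dis.mean(dim=0, keepdim=True)
    return (h_pref_c * h_dis_c).sum(dim=-1)

scores = ches_like_score(torch.randn(8, 16), torch.randn(8, 16))
print(scores.shape)  # torch.Size([8])
```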

ICML Conference 2025 Conference Paper

Weak-to-Strong Generalization Even in Random Feature Networks, Provably

  • Marko Medvedev
  • Kaifeng Lyu
  • Dingli Yu
  • Sanjeev Arora
  • Zhiyuan Li 0005
  • Nathan Srebro

Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a complex, pretrained learner like GPT-4 and can arise even in simple non-pretrained models, simply due to the size advantage of the student. But we also show that there are inherent limits to the extent of such weak-to-strong generalization. We consider students and teachers that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A 'weak' teacher, with a small number of units (i.e., random features), is trained on the population, and a 'strong' student, with a much larger number of units (i.e., random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though it is trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. We then show the quantitative limits of weak-to-strong generalization in this model, and in fact in a much broader class of models, for arbitrary teacher and student feature spaces and a broad class of learning rules, including when the student features are pre-trained or otherwise more informative. In particular, we show that in such models the student's error can only approach zero if the teacher's error approaches zero, and a strong student cannot “boost” a slightly-better-than-chance teacher to obtain a small error.

NeurIPS Conference 2025 Conference Paper

What Makes a Reward Model a Good Teacher? An Optimization Perspective

  • Noam Razin
  • Zixuan Wang
  • Hubert Strauss
  • Stanley Wei
  • Jason Lee
  • Sanjeev Arora

The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.

ICLR Conference 2024 Conference Paper

A Quadratic Synchronization Rule for Distributed Deep Learning

  • Xinran Gu
  • Kaifeng Lyu
  • Sanjeev Arora
  • Jingzhao Zhang
  • Longbo Huang

In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for $H$ steps without synchronizing with others, hence reducing communication frequency. While $H$ has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can lead to generalization improvement. Yet, selecting a proper $H$ is elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared to the standard data parallel training, QSR enables Local AdamW to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves 1.16% or 0.84% higher top-1 validation accuracy.
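
The rule described above sets the synchronization interval $H$ in proportion to $1/\eta^2$. A minimal sketch of that schedule (the base constants and clamping bounds are illustrative choices, not values from the paper):

```python
def qsr_sync_interval(lr: float, base_lr: float = 0.1, base_h: int = 4,
                      h_min: int = 1, h_max: int = 512) -> int:
    """Quadratic Synchronization Rule sketch: keep H proportional to 1 / lr^2.

    base_h is the interval chosen at a reference learning rate base_lr;
    these constants are placeholders for illustration.
    """
    h = base_h * (base_lr / lr) ** 2
    return int(min(max(round(h), h_min), h_max))

# As the learning rate decays over training, workers synchronize less often.
for lr in (0.1, 0.05, 0.025, 0.0125):
    print(lr, qsr_sync_interval(lr))
```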

NeurIPS Conference 2024 Conference Paper

Can Models Learn Skill Composition from Examples?

  • Haoyu Zhao
  • Simran Kaur
  • Dingli Yu
  • Anirudh Goyal
  • Sanjeev Arora

As large language models (LLMs) become increasingly advanced, their ability to exhibit compositional generalization---the capacity to combine learned skills in novel ways not encountered during training---has garnered significant attention. This type of generalization, particularly in scenarios beyond training data, is also of great interest in the study of AI safety and alignment. A recent study introduced the Skill-Mix evaluation, where models are tasked with composing a short paragraph demonstrating the use of a specified $k$-tuple of language skills. While small models struggled with composing even with $k=3$, larger models like GPT-4 performed reasonably well with $k=5$ and $6$. In this paper, we employ a setup akin to Skill-Mix to evaluate the capacity of smaller models to learn compositional generalization from examples. Utilizing a diverse set of language skills---including rhetorical, literary, reasoning, theory of mind, and common sense---GPT was used to generate text samples that exhibit random subsets of $k$ skills. Subsequent fine-tuning of 7B and 13B parameter models on these combined skill texts, for increasing values of $k$, revealed the following findings: (1) Training on combinations of $k=2$ and $3$ skills results in noticeable improvements in the ability to compose texts with $k=4$ and $5$ skills, despite models never having seen such examples during training. (2) When skill categories are split into training and held-out groups, models significantly improve at composing texts with held-out skills during testing despite having only seen training skills during fine-tuning, illustrating the efficacy of the training approach even with previously unseen skills. This study also suggests that incorporating skill-rich (potentially synthetic) text into training can substantially enhance the compositional capabilities of models.

NeurIPS Conference 2024 Conference Paper

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

  • Zirui Wang
  • Mengzhou Xia
  • Luxi He
  • Howard Chen
  • Yitao Liu
  • Richard Zhu
  • Kaiqu Liang
  • Xindi Wu

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an overly optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions deteriorates performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from scientific papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope that CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project website: https://charxiv.github.io/

NeurIPS Conference 2024 Conference Paper

ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty

  • Xindi Wu
  • Dingli Yu
  • Yangsibo Huang
  • Olga Russakovsky
  • Sanjeev Arora

Compositionality is a critical capability in Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, limiting their diversity and complexity, and yielding low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark which automatically evaluates the compositional generation ability of T2I models. This is done in two stages. First, ConceptMix generates the text prompts: concretely, using categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k-tuples of visual concepts, then uses GPT-4o to generate text prompts for image generation based on these sampled concepts. Second, ConceptMix evaluates the images generated in response to these prompts: concretely, it checks how many of the k concepts actually appeared in the image by generating one question per visual concept and using a strong VLM to answer them. Through administering ConceptMix to a diverse set of T2I models (proprietary as well as open ones) using increasing values of k, we show that ConceptMix has higher discriminative power than earlier benchmarks. Specifically, ConceptMix reveals that the performance of several models, especially open models, drops dramatically with increased k. Importantly, it also provides insight into the lack of prompt diversity in widely-used training datasets. Additionally, we conduct extensive human studies to validate the design of ConceptMix and compare our automatic grading with human judgement. We hope it will guide future T2I model development.

NeurIPS Conference 2024 Conference Paper

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

  • Kaifeng Lyu
  • Haoyu Zhao
  • Xinran Gu
  • Dingli Yu
  • Anirudh Goyal
  • Sanjeev Arora

Public LLMs such as Llama 2-Chat underwent alignment training and were considered safe. Recently, Qi et al. (2024) reported that even benign fine-tuning on seemingly safe datasets can give rise to unsafe behaviors in the models. The current paper is about methods and best practices to mitigate such loss of alignment. We focus on the setting where a public model is fine-tuned before serving users for a specific use case, where the model should improve on the downstream task while maintaining alignment. Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the “Pure Tuning, Safe Testing” (PTST) strategy: fine-tune models without a safety prompt, but include it at test time. This seemingly counterintuitive strategy incorporates an intended distribution shift to encourage alignment preservation. Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of unsafe behaviors.
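
A minimal sketch of the PTST idea: build training prompts without the safety prompt and inference prompts with it. The safety-prompt text and chat-template markup below are placeholders; real models use their own templates:

```python
# Illustrative PTST ("Pure Tuning, Safe Testing") prompt construction.
SAFETY_PROMPT = "You are a helpful assistant. Refuse unsafe or harmful requests."

def build_prompt(user_msg: str, include_safety_prompt: bool) -> str:
    system = SAFETY_PROMPT if include_safety_prompt else ""
    return f"<s>[SYSTEM]{system}[/SYSTEM]\n[USER]{user_msg}[/USER]\n[ASSISTANT]"

# PTST: fine-tune on prompts WITHOUT the safety prompt ...
train_example = build_prompt("Solve: 17 * 24 = ?", include_safety_prompt=False)
# ... but serve users WITH the safety prompt at test time.
test_example = build_prompt("Solve: 17 * 24 = ?", include_safety_prompt=True)
print(train_example)
print(test_example)
```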

ICML Conference 2024 Conference Paper

Language Models as Science Tutors

  • Alexis Chevalier
  • Jiayi Geng
  • Alexander Wettig
  • Howard Chen 0003
  • Sebastian Mizera
  • Toni Annala
  • Max Jameson Aragon
  • Arturo Rodríguez Fanlo

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations publicly.

ICML Conference 2024 Conference Paper

LESS: Selecting Influential Data for Targeted Instruction Tuning

  • Mengzhou Xia
  • Sadhika Malladi
  • Suchin Gururangan
  • Sanjeev Arora
  • Danqi Chen 0001

Instruction tuning has unlocked powerful capabilities in large language models (LLMs), using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application. To facilitate future work, we release code and data at princeton-nlp/LESS.
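
The selection step described above (matching candidate examples to few-shot target examples in a low-dimensional gradient-feature space) can be sketched as follows. This is a simplified stand-in, not the released LESS implementation; in particular the feature construction and aggregation over target examples are assumptions:

```python
import numpy as np

def select_by_gradient_similarity(train_feats: np.ndarray,
                                  target_feats: np.ndarray,
                                  fraction: float = 0.05) -> np.ndarray:
    """LESS-style selection sketch.

    train_feats:  (N, d) low-dimensional gradient features for candidate examples.
    target_feats: (M, d) gradient features of few-shot examples embodying the
                  target capability.
    Returns indices of the top `fraction` of candidates by maximum cosine
    similarity to any target example.
    """
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    q = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    scores = (t @ q.T).max(axis=1)            # best match over target examples
    k = max(1, int(fraction * len(scores)))
    return np.argsort(-scores)[:k]

idx = select_by_gradient_similarity(np.random.randn(1000, 64),
                                    np.random.randn(5, 64))
print(len(idx), idx[:10])
```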

NeurIPS Conference 2024 Conference Paper

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

  • Aniket Didolkar
  • Anirudh Goyal
  • Nan R. Ke
  • Siyuan Guo
  • Michal Valko
  • Timothy Lillicrap
  • Danilo Rezende
  • Yoshua Bengio

\emph{Metacognitive knowledge} refers to humans' intuitive knowledge of their own thinking and reasoning processes. Today's best LLMs clearly possess some reasoning processes. The paper gives evidence that they also have metacognitive knowledge, including the ability to name skills and procedures to apply given a task. We explore this primarily in the context of math reasoning, developing a prompt-guided interaction procedure to get a powerful LLM to assign sensible skill labels to math questions, followed by having it perform semantic clustering to obtain coarser families of skill labels. These coarse skill labels look interpretable to humans. To validate that these skill labels are meaningful and relevant to the LLM's reasoning processes, we perform the following experiments. (a) We ask GPT-4 to assign skill labels to training questions in the math datasets GSM8K and MATH. (b) When using an LLM to solve the test questions, we present it with the full list of skill labels and ask it to identify the skill needed. It is then presented with randomly selected exemplar solved questions associated with that skill label. This improves accuracy on GSM8K and MATH for several strong LLMs, including code-assisted models. The methodology presented is domain-agnostic, even though this article applies it to math problems.

ICLR Conference 2024 Conference Paper

SKILL-MIX: a Flexible and Expandable Family of Evaluations for AI Models

  • Dingli Yu
  • Simran Kaur 0001
  • Arushi Gupta
  • Jonah Brown-Cohen
  • Anirudh Goyal
  • Sanjeev Arora

With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces SKILL-MIX, a new evaluation to measure ability to combine skills. Using a list of $N$ skills the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of SKILL-MIX to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a SKILL-MIX based eco-system of open evaluations for AI capabilities of future models. We maintain a leaderboard of SKILL-MIX at [https://skill-mix.github.io](https://skill-mix.github.io).
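
The core combinatorial point of the abstract (the number of $k$-subsets of $N$ skills grows rapidly, so repeated queries rarely hit text seen in training) can be illustrated with a tiny sampler; the skill names here are placeholders, not the actual SKILL-MIX inventory:

```python
import random
from math import comb

skills = [f"skill_{i}" for i in range(50)]   # toy inventory with N = 50

def sample_skill_subset(k: int, rng: random.Random) -> list:
    """Pick a random k-subset of skills for one SKILL-MIX-style query."""
    return rng.sample(skills, k)

rng = random.Random(0)
for k in (2, 3, 4, 5):
    # comb(N, k) counts the distinct k-subsets the evaluator can draw from
    print(k, comb(len(skills), k), sample_skill_subset(k, rng))
```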

ICML Conference 2024 Conference Paper

Trainable Transformer in Transformer

  • Abhishek Panigrahi
  • Sadhika Malladi
  • Mengzhou Xia
  • Sanjeev Arora

Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose a new efficient construction, Transformer in Transformer (in short, TINT), that allows a transformer to simulate and fine-tune more complex models during inference (e.g., pre-trained language models). In particular, we introduce innovative approximation techniques that allow a TINT model with fewer than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TINT accommodates many common transformer variants and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TINT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe that TINT for an OPT-125M model improves performance by 4-16% absolute on average compared to OPT-125M. These findings suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, a modular and extensible codebase for TINT is included.

ICML Conference 2023 Conference Paper

A Kernel-Based View of Language Model Fine-Tuning

  • Sadhika Malladi
  • Alexander Wettig
  • Dingli Yu
  • Danqi Chen 0001
  • Sanjeev Arora

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK)—which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization—describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.

NeurIPS Conference 2023 Conference Paper

Fine-Tuning Language Models with Just Forward Passes

  • Sadhika Malladi
  • Tianyu Gao
  • Eshaan Nichani
  • Alex Damian
  • Jason D. Lee
  • Danqi Chen
  • Sanjeev Arora

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion-parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12× memory reduction and up to 2× GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
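
The in-place, two-forward-pass idea can be sketched in a few lines: perturbations are regenerated from a shared random seed so no gradients or extra parameter copies are stored. This is a simplified illustration in the spirit of MeZO, not the official implementation:

```python
import torch

def mezo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, seed=None):
    """One illustrative MeZO-style update (simplified sketch)."""
    seed = seed if seed is not None else torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)                 # regenerate the same z each time
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1)
        loss_plus = loss_fn(model, batch)       # forward pass at theta + eps*z
        perturb(-2)
        loss_minus = loss_fn(model, batch)      # forward pass at theta - eps*z
        perturb(+1)                             # restore theta exactly
        grad_scalar = (loss_plus - loss_minus) / (2 * eps)
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_scalar * z)  # SGD-style update along z

# Usage sketch with a tiny model and a squared-error "loss":
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
mezo_step(model, lambda m, b: ((m(b[0]) - b[1]) ** 2).mean(), (x, y))
```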

ICML Conference 2023 Conference Paper

Task-Specific Skill Localization in Fine-tuned Language Models

  • Abhishek Panigrahi
  • Nikunj Saunshi
  • Haoyu Zhao
  • Sanjeev Arora

Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-shot settings. Thus fine-tuning allows the model to quickly pick up task-specific "skills," but there has been limited study of where these newly-learnt skills reside inside the massive model. This paper introduces the term skill localization for this problem and proposes a solution. Given the downstream task and a model fine-tuned on that task, a simple optimization is used to identify a very small subset of parameters ($\sim$0.01% of model parameters) responsible for ($>$95%) of the model's performance, in the sense that grafting the fine-tuned values for just this tiny subset onto the pre-trained model gives performance almost as good as the fine-tuned model. While reminiscent of recent works on parameter-efficient fine-tuning, the novel aspects here are that: (i) No further retraining is needed on the subset (unlike, say, with lottery tickets). (ii) Notable improvements are seen over vanilla fine-tuning with respect to calibration of predictions in-distribution (40-90% error reduction) as well as quality of predictions out-of-distribution (OOD). In models trained on multiple tasks, a stronger notion of skill localization is observed, where the sparse regions corresponding to different tasks are almost disjoint, and their overlap (when it happens) is a proxy for task similarity. Experiments suggest that localization via grafting can assist certain forms of continual learning.
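
The grafting operation itself is easy to sketch: copy fine-tuned values only where a sparse mask is set, keeping pre-trained values elsewhere. The mask construction (the paper's sparsity-constrained optimization) is omitted here and the mask below is just hand-set for illustration:

```python
import torch

def graft(pretrained_state: dict, finetuned_state: dict, mask: dict) -> dict:
    """Grafting sketch: take fine-tuned values where mask == 1, pre-trained elsewhere."""
    return {
        name: torch.where(mask[name].bool(),
                          finetuned_state[name],
                          pretrained_state[name])
        for name in pretrained_state
    }

pre = {"w": torch.zeros(4, 4)}
fin = {"w": torch.ones(4, 4)}
m = {"w": torch.zeros(4, 4)}
m["w"][0, 0] = 1                      # a tiny hand-picked "skill" region
print(graft(pre, fin, m)["w"])
```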

ICLR Conference 2023 Conference Paper

Understanding Influence Functions and Datamodels via Harmonic Analysis

  • Nikunj Saunshi
  • Arushi Gupta
  • Mark Braverman
  • Sanjeev Arora

Influence functions estimate the effect of individual data points on the model's predictions on test data and were adapted to deep learning in \cite{koh2017understanding}. They have been used for detecting data poisoning, detecting helpful and harmful examples, estimating the influence of groups of datapoints, etc. Recently, \cite{ilyas2022datamodels} introduced a linear regression method they termed {\em datamodels} to predict the effect of training points on outputs on test data. The current paper seeks to provide a better theoretical understanding of such interesting empirical phenomena. The primary tool is harmonic analysis and the idea of {\em noise stability}. Contributions include: (a) Exact characterization of the learnt datamodel in terms of Fourier coefficients. (b) An efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel. (c) New insights into when influences of groups of datapoints may or may not add up linearly.

ICLR Conference 2023 Conference Paper

Why (and When) does Local SGD Generalize Better than SGD?

  • Xinran Gu
  • Kaifeng Lyu
  • Longbo Huang
  • Sanjeev Arora

Local SGD is a communication-efficient variant of SGD for large-scale training, where multiple GPUs perform SGD independently and average the model parameters periodically. It has been recently observed that Local SGD can not only achieve the design goal of reducing the communication overhead but also lead to higher test accuracy than the corresponding SGD baseline (Lin et al., 2020b), though the training regimes for this to happen are still in debate (Ortiz et al., 2021). This paper aims to understand why (and when) Local SGD generalizes better based on Stochastic Differential Equation (SDE) approximation. The main contributions of this paper include (i) the derivation of an SDE that captures the long-term behavior of Local SGD in the small learning rate regime, showing how noise drives the iterate to drift and diffuse after it has reached close to the manifold of local minima, (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term that can result in a stronger effect of regularization, e.g., a faster reduction of sharpness, and (iii) empirical evidence validating that having a small learning rate and long enough training time enables the generalization improvement over SGD but removing either of the two conditions leads to no improvement.

NeurIPS Conference 2022 Conference Paper

Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent

  • Zhiyuan Li
  • Tianhao Wang
  • Jason D. Lee
  • Sanjeev Arora

As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a complete characterization of this phenomenon under a notion termed commuting parametrization, which encompasses all the previous results in this setting. It is shown that gradient flow with any commuting parametrization is equivalent to continuous mirror descent with a related mirror map. Conversely, continuous mirror descent with any mirror map can be viewed as gradient flow with a related commuting parametrization. The latter result relies upon Nash's embedding theorem.

NeurIPS Conference 2022 Conference Paper

New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

  • Arushi Gupta
  • Nikunj Saunshi
  • Dingli Yu
  • Kaifeng Lyu
  • Sanjeev Arora

Saliency methods compute heat maps that highlight portions of an input that were most important for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new masked input by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with "uninformative" pixels, and checking if the net's output is mostly unchanged. This is usually seen as an explanation of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by logic concepts of completeness & soundness, it observes that the above type of evaluation focuses on completeness of the explanation, but ignores soundness. New evaluation metrics are introduced to capture both notions, while staying in an intrinsic framework---i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling.
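
The masking protocol described in the first sentences of this abstract (keep the $k$ highest-saliency pixels, replace the rest) is sketched below; the fill value and array shapes are arbitrary illustration choices, and checking the net's output on the masked input is left out:

```python
import numpy as np

def mask_by_saliency(image: np.ndarray, saliency: np.ndarray, k: int,
                     fill_value: float = 0.0) -> np.ndarray:
    """Keep the k highest-saliency pixels; replace the rest with an
    "uninformative" fill value."""
    flat = saliency.ravel()
    keep = np.zeros_like(flat, dtype=bool)
    keep[np.argpartition(-flat, k - 1)[:k]] = True
    keep = keep.reshape(saliency.shape)
    masked = np.full_like(image, fill_value)
    masked[keep] = image[keep]
    return masked

img = np.random.rand(8, 8)
sal = np.random.rand(8, 8)
print(mask_by_saliency(img, sal, k=16).round(2))
```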

ICLR Conference 2022 Conference Paper

On Predicting Generalization using GANs

  • Yi Zhang 0074
  • Arushi Gupta
  • Nikunj Saunshi
  • Sanjeev Arora

Research on generalization bounds for deep networks seeks to give ways to predict test error using just the training dataset and the network parameters. While generalization bounds can give many insights about architecture design, training algorithms etc., what they do not currently do is yield good predictions for actual test error. A recently introduced Predicting Generalization in Deep Learning competition aims to encourage discovery of methods to better predict test error. The current paper investigates a simple idea: can test error be predicted using {\em synthetic data,} produced using a Generative Adversarial Network (GAN) that was trained on the same training dataset? Upon investigating several GAN models and architectures, we find that this turns out to be the case. In fact, using GANs pre-trained on standard datasets, the test error can be predicted without requiring any additional hyper-parameter tuning. This result is surprising because GANs have well-known limitations (e.g. mode collapse) and are known to not learn the data distribution accurately. Yet the generated samples are good enough to substitute for test data. Several additional experiments are presented to explore reasons why GANs do well at this task. In addition to a new approach for predicting generalization, the counter-intuitive phenomena presented in our work may also call for a better understanding of GANs' strengths and limitations.

NeurIPS Conference 2022 Conference Paper

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

  • Sadhika Malladi
  • Kaifeng Lyu
  • Abhishek Panigrahi
  • Sanjeev Arora

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
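
For the learning rate, the square-root scaling idea amounts to the helper below; the paper's full rule covers the adaptive optimizers' hyperparameters more generally, so treat this as an illustrative sketch of the learning-rate part only:

```python
def sqrt_scale_lr(lr: float, batch_size: int, new_batch_size: int) -> float:
    """Square-root scaling sketch: if the batch size grows by a factor kappa,
    scale the learning rate by sqrt(kappa). Consult the paper for the full
    prescription covering the other RMSprop/Adam hyperparameters."""
    kappa = new_batch_size / batch_size
    return lr * kappa ** 0.5

print(sqrt_scale_lr(lr=3e-4, batch_size=256, new_batch_size=1024))  # 6e-4
```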

ICML Conference 2022 Conference Paper

Understanding Contrastive Learning Requires Incorporating Inductive Biases

  • Nikunj Saunshi
  • Jordan T. Ash
  • Surbhi Goel
  • Dipendra Misra
  • Cyril Zhang
  • Sanjeev Arora
  • Sham M. Kakade
  • Akshay Krishnamurthy

Contrastive learning is a popular form of self-supervised learning that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. Recent attempts to theoretically explain the success of contrastive learning on downstream classification tasks prove guarantees depending on properties of augmentations and the value of contrastive loss of representations. We demonstrate that such analyses, which ignore inductive biases of the function class and training algorithm, cannot adequately explain the success of contrastive learning, even provably leading to vacuous guarantees in some settings. Extensive experiments on image and text domains highlight the ubiquity of this problem: different function classes and algorithms behave very differently on downstream tasks, despite having the same augmentations and contrastive losses. Theoretical analysis is presented for the class of linear representations, where incorporating inductive biases of the function class allows contrastive learning to work with less stringent conditions compared to prior analyses.

ICML Conference 2022 Conference Paper

Understanding Gradient Descent on the Edge of Stability in Deep Learning

  • Sanjeev Arora
  • Zhiyuan Li 0005
  • Abhishek Panigrahi

Deep learning experiments by \citet{cohen2021gradient} using deterministic Gradient Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and sharpness (i.e., the largest eigenvalue of the Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/\text{LR}$ and loss goes up and down across iterations, yet still with an overall downward trend. The current paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to the non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias that rely either on infinitesimal updates or on noise in the gradient. Formally, for any smooth function $L$ with a certain regularity condition, this effect is demonstrated for (1) Normalized GD, i.e., GD with a varying LR $\eta_t = \frac{\eta}{\|\nabla L(x(t))\|}$ and loss $L$; (2) GD with constant LR and loss $\sqrt{L - \min_x L(x)}$. Both provably enter the Edge of Stability, with the associated flow on the manifold minimizing $\lambda_{1}(\nabla^2 L)$. The above theoretical results have been corroborated by an experimental study.

NeurIPS Conference 2022 Conference Paper

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

  • Kaifeng Lyu
  • Zhiyuan Li
  • Sanjeev Arora

Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight-decay) encourages GD to reduce the sharpness of the loss surface. Here ``sharpness'' is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.

ICLR Conference 2022 Conference Paper

What Happens after SGD Reaches Zero Loss? --A Mathematical Framework

  • Zhiyuan Li 0005
  • Tianhao Wang 0017
  • Sanjeev Arora

Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, $\text{tr}[\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization of the regularization effect of SGD around such a manifold---i.e., the "implicit bias"---using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a *global* analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing *arbitrary* noise covariance. As an application, we show that with arbitrarily large initialization, label noise SGD can always escape the kernel regime and only requires $O(\kappa\ln d)$ samples for learning a $\kappa$-sparse overparametrized linear model in $\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\Omega(d)$ samples. This upper bound is minimax optimal and improves the previous $\widetilde{O}(\kappa^2)$ upper bound (HaoChen et al., 2020).

ICLR Conference 2021 Conference Paper

A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks

  • Nikunj Saunshi
  • Sadhika Malladi
  • Sanjeev Arora

Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards (2) and show that language models that are $\epsilon$-optimal in cross-entropy (log-perplexity) learn features that can linearly solve such classification tasks with $\mathcal{O}(\sqrt{\epsilon})$ error, thus demonstrating that doing well on language modeling can be beneficial for downstream tasks. We experimentally verify various assumptions and theoretical findings, and also use insights from the analysis to design a new objective function that performs well on some classification tasks.

NeurIPS Conference 2021 Conference Paper

Evaluating Gradient Inversion Attacks and Defenses in Federated Learning

  • Yangsibo Huang
  • Samyak Gupta
  • Zhao Song
  • Kai Li
  • Sanjeev Arora

Gradient inversion attack (or input recovery from gradient) is an emerging threat to the security and privacy preservation of Federated learning, whereby malicious eavesdroppers or participants in the protocol can recover (partially) the clients' private data. This paper evaluates existing attacks and defenses. We find that some attacks make strong assumptions about the setup. Relaxing such assumptions can substantially weaken these attacks. We then evaluate the benefits of three proposed defense mechanisms against gradient inversion attacks. We show the trade-offs of privacy leakage and data utility of these defense methods, and find that combining them in an appropriate manner makes the attack less effective, even under the original strong assumptions. We also estimate the computation cost of end-to-end recovery of a single image under each evaluated defense. Our findings suggest that the state-of-the-art attacks can currently be defended against with minor data utility loss, as summarized in a list of potential strategies.

NeurIPS Conference 2021 Conference Paper

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

  • Kaifeng Lyu
  • Zhiyuan Li
  • Runzhe Wang
  • Sanjeev Arora

The generalization mystery of overparametrized deep nets has motivated efforts to understand how gradient descent (GD) converges to low-loss solutions that generalize well. Real-life neural networks are initialized from small random values and trained with cross-entropy loss for classification (unlike the "lazy" or "NTK" regime of training where analysis was more successful), and a recent sequence of results (Lyu and Li, 2020; Chizat and Bach, 2020; Ji and Telgarsky, 2020) provide theoretical evidence that GD may converge to the "max-margin" solution with zero loss, which presumably generalizes well. However, the global optimality of margin is proved only in some settings where neural nets are infinitely or exponentially wide. The current paper is able to establish this global optimality for two-layer Leaky ReLU nets trained with gradient flow on linearly separable and symmetric data, regardless of the width. The analysis also gives some theoretical justification for recent empirical findings (Kalimeris et al., 2019) on the so-called simplicity bias of GD towards linear or other "simple" classes of solutions, especially early in training. On the pessimistic side, the paper suggests that such results are fragile. A simple data manipulation can make gradient flow converge to a linear classifier with suboptimal margin.

NeurIPS Conference 2021 Conference Paper

On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

  • Zhiyuan Li
  • Sadhika Malladi
  • Sanjeev Arora

It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Itô Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., Li et al., 2019) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Itô SDE approximation. (b) A theoretically motivated testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
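
For reference, the linear scaling rule mentioned above (Goyal et al., 2017) prescribes scaling the learning rate proportionally to the batch size; a one-line sketch with placeholder base values:

    def linearly_scaled_lr(base_lr, base_batch_size, new_batch_size):
        # Linear scaling rule: keep lr / batch_size constant across batch sizes.
        return base_lr * (new_batch_size / base_batch_size)

    # e.g. a recipe tuned at lr=0.1 with batch size 256, run at batch size 1024:
    print(linearly_scaled_lr(0.1, 256, 1024))  # -> 0.4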

ICLR Conference 2021 Conference Paper

Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?

  • Zhiyuan Li 0005
  • Yi Zhang 0074
  • Sanjeev Arora

Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of “better inductive bias.” However, this has not been made mathematically rigorous, and the hurdle is that a sufficiently wide fully-connected net can always simulate the convolutional net. Thus the training algorithm plays a role. The current work describes a natural task on which a provable sample complexity gap can be shown, for standard training algorithms. We construct a single natural distribution on $\mathbb{R}^d\times\{\pm 1\}$ on which any orthogonal-invariant algorithm (i.e. fully-connected networks trained with most gradient-based methods from Gaussian initialization) requires $\Omega(d^2)$ samples to generalize while $O(1)$ samples suffice for convolutional architectures. Furthermore, we demonstrate a single target function, learning which on all possible distributions leads to an $O(1)$ vs $\Omega(d^2/\varepsilon)$ gap. The proof relies on the fact that SGD on a fully-connected network is orthogonal equivariant. Similar results are achieved for $\ell_2$ regression and adaptive training algorithms, e.g. Adam and AdaGrad, which are only permutation equivariant.

ICML Conference 2020 Conference Paper

A Sample Complexity Separation between Non-Convex and Convex Meta-Learning

  • Nikunj Saunshi
  • Yi Zhang 0074
  • Mikhail Khodak
  • Sanjeev Arora

One popular trend in meta-learning is to learn from many training tasks a common initialization that a gradient-based method can use to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for \emph{convex models}. This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black-box, specifically at properties of the optimization trajectory. We construct a simple meta-learning instance that captures the problem of one-dimensional subspace learning. For the convex formulation of linear regression on this instance, we show that the new task sample complexity of any \emph{initialization-based meta-learning} algorithm is $\Omega(d)$, where $d$ is the input dimension. In contrast, for the non-convex formulation of a two layer linear network on the same instance, we show that both Reptile and multi-task representation learning can have new task sample complexity of $O(1)$, demonstrating a separation from convex meta-learning. Crucially, analyses of the training dynamics of these methods reveal that they can meta-learn the correct subspace onto which the data should be projected.

ICLR Conference 2020 Conference Paper

An Exponential Learning Rate Schedule for Deep Learning

  • Zhiyuan Li 0005
  • Sanjeev Arora

Intriguing empirical evidence exists that deep learning can work well with exotic schedules for varying the learning rate. This paper suggests that the phenomenon may be due to Batch Normalization, or BN (Ioffe & Szegedy, 2015), which is ubiquitous and provides benefits in optimization and generalization across all standard architectures. The following new results are shown about BN with weight decay and momentum (in other words, the typical use case, which was not considered in earlier theoretical analyses of stand-alone BN (Ioffe & Szegedy, 2015; Santurkar et al., 2018; Arora et al., 2018)):
  • Training can be done using SGD with momentum and an exponentially increasing learning rate schedule, i.e., the learning rate increases by some (1 + α) factor in every epoch for some α > 0. (Precise statement in the paper.) To the best of our knowledge this is the first time such a rate schedule has been successfully used, let alone for highly successful architectures. As expected, such training rapidly blows up network weights, but the net stays well-behaved due to normalization.
  • Mathematical explanation of the success of the above rate schedule: a rigorous proof that it is equivalent to the standard setting of BN + SGD + Standard Rate Tuning + Weight Decay + Momentum. This equivalence holds for other normalization layers as well: Group Normalization (Wu & He, 2018), Layer Normalization (Ba et al., 2016), Instance Norm (Ulyanov et al., 2016), etc.
  • A worked-out toy example illustrating the above linkage of hyperparameters: using either weight decay or BN alone reaches a global minimum, but convergence fails when both are used.
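
A minimal sketch of the exponentially increasing schedule described in the first bullet above; the base learning rate and α are placeholder values, and the precise equivalence with weight decay and momentum is stated in the paper, not reproduced here.

    # Sketch only: grow the learning rate by a (1 + alpha) factor every epoch.
    base_lr, alpha = 0.1, 0.05  # placeholder values

    def lr_at_epoch(epoch):
        return base_lr * (1 + alpha) ** epoch

    # With a PyTorch optimizer this could be wired up, for example, as:
    #   scheduler = torch.optim.lr_scheduler.LambdaLR(
    #       optimizer, lr_lambda=lambda epoch: (1 + alpha) ** epoch)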

ICLR Conference 2020 Conference Paper

Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

  • Sanjeev Arora
  • Simon S. Du
  • Zhiyuan Li 0005
  • Ruslan Salakhutdinov
  • Ruosong Wang
  • Dingli Yu

Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under $\ell_2$ loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et al. (2019a), which allowed studying performance of infinitely wide nets on datasets like CIFAR-10. However, super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting neural tangent kernels perform strongly on low-data tasks. 1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets. 2. On CIFAR-10 with 10 – 640 training samples, Convolutional NTK consistently beats ResNet-34 by 1% - 3%. 3. On VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance. 4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis (Arora et al., 2019a). NTK’s efficacy may trace to lower variance of output.

ICML Conference 2020 Conference Paper

InstaHide: Instance-hiding Schemes for Private Distributed Learning

  • Yangsibo Huang
  • Zhao Song 0002
  • Kai Li 0001
  • Sanjeev Arora

How can multiple distributed entities train a shared deep net on their private data while protecting data privacy? This paper introduces InstaHide, a simple encryption of training images. Encrypted images can be used in standard deep learning pipelines (PyTorch, Federated Learning, etc.) with no additional setup or infrastructure. The encryption has a minor effect on test accuracy (unlike differential privacy). Encryption consists of mixing the image with a set of other images (in the sense of the Mixup data augmentation technique (Zhang et al., 2018)) followed by applying a random pixel-wise mask on the mixed image. Other contributions of this paper are: (a) Use of a large public dataset of images (e.g., ImageNet) for mixing during encryption; this improves security. (b) Experiments demonstrating effectiveness in protecting privacy against known attacks while preserving model accuracy. (c) Theoretical analysis showing that successfully attacking privacy requires attackers to solve a difficult computational problem. (d) Demonstration that Mixup alone is insecure (contrary to recent proposals), by showing some efficient attacks. (e) Release of a challenge dataset to allow design of new attacks.
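
A minimal sketch of the encryption step as described above (mixup with other images followed by a random pixel-wise sign mask); the Dirichlet mixing weights and the ±1 mask are illustrative choices, not the scheme's exact parameters.

    import numpy as np

    def encrypt_image(image, other_images, rng):
        # Mixup step: a random convex combination of the private image and others.
        pool = [image] + list(other_images)
        weights = rng.dirichlet(np.ones(len(pool)))
        mixed = sum(w * img for w, img in zip(weights, pool))
        # Mask step: flip the sign of each pixel independently at random.
        mask = rng.choice([-1.0, 1.0], size=mixed.shape)
        return mask * mixed

    # Usage: encrypt_image(x, public_batch, np.random.default_rng(0))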

NeurIPS Conference 2020 Conference Paper

Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality

  • Yi Zhang
  • Orestis Plevrakis
  • Simon S. Du
  • Xingguo Li
  • Zhao Song
  • Sanjeev Arora

Adversarial training is a popular method to give neural nets robustness against adversarial perturbations. In practice adversarial training leads to low robust training loss. However, a rigorous explanation for why this happens under natural conditions is still missing. Recently a convergence theory of standard (non-adversarial) supervised training was developed by various groups for {\em very overparametrized} nets. It is unclear how to extend these results to adversarial training because of the min-max objective. Recently, a first step towards this direction was made by Gao et al. using tools from online learning, but they require the width of the net to be \emph{exponential} in input dimension $d$, and with an unnatural activation function. Our work proves convergence to low robust training loss for \emph{polynomial} width instead of exponential, under natural assumptions and with ReLU activations. A key element of our proof is showing that ReLU networks near initialization can approximate the step function, which may be of independent interest.

ICML Conference 2020 Conference Paper

Provable Representation Learning for Imitation Learning via Bi-level Optimization

  • Sanjeev Arora
  • Simon S. Du
  • Sham M. Kakade
  • Yuping Luo
  • Nikunj Saunshi

A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts’ trajectories are available. We formulate representation learning as a bi-level optimization problem where the “outer” optimization tries to learn the joint representation and the “inner” optimization encodes the imitation learning setup and tries to learn task-specific parameters. We instantiate this framework for the imitation learning settings of behavior cloning and observation-alone. Theoretically, we show using our framework that representation learning can provide sample complexity benefits for imitation learning in both settings. We also provide proof-of-concept experiments to verify our theory.

NeurIPS Conference 2020 Conference Paper

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

  • Zhiyuan Li
  • Kaifeng Lyu
  • Sanjeev Arora

Recent works (e.g., Li & Arora, 2020) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., the use of exponentially increasing learning rates. The current paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints, and then initiates a formal framework for studying their mathematics via a suitable adaptation of the conventional framework, namely modeling the SGD-induced training trajectory via a suitable stochastic differential equation (SDE) with a noise term that captures gradient noise. This yields: (a) A new “intrinsic learning rate” parameter that is the product of the normal learning rate $\eta$ and weight decay factor $\lambda$. Analysis of the SDE shows how the effective speed of learning varies and equilibrates over time under the control of the intrinsic LR. (b) A challenge---via theory and experiments---to the popular belief that good generalization requires large learning rates at the start of training. (c) New experiments, backed by mathematical intuition, suggesting the number of steps to equilibrium (in function space) scales as the inverse of the intrinsic learning rate, as opposed to the exponential time convergence bound implied by SDE analysis. We name it the \emph{Fast Equilibrium Conjecture} and suggest it holds the key to why Batch Normalization is effective.

ICML Conference 2019 Conference Paper

A Theoretical Analysis of Contrastive Unsupervised Representation Learning

  • Nikunj Saunshi
  • Orestis Plevrakis
  • Sanjeev Arora
  • Mikhail Khodak
  • Hrishikesh Khandeparkar

Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically “similar” data points and “negative samples,” the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory.
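
A minimal sketch of a contrastive objective of the kind analyzed above, with one negative sample per anchor and a logistic loss; the encoder producing the representations is assumed to exist elsewhere.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(f_x, f_pos, f_neg):
        # f_x, f_pos, f_neg: representations of the anchor, a semantically similar
        # point, and a negative sample (each of shape batch x dim).
        pos_sim = (f_x * f_pos).sum(dim=1)   # inner product with the similar point
        neg_sim = (f_x * f_neg).sum(dim=1)   # inner product with the negative
        # Logistic loss pushes pos_sim above neg_sim on average.
        return F.softplus(neg_sim - pos_sim).mean()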

NeurIPS Conference 2019 Conference Paper

Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

  • Rohith Kuditipudi
  • Xiang Wang
  • Holden Lee
  • Yi Zhang
  • Zhiyuan Li
  • Wei Hu
  • Rong Ge
  • Sanjeev Arora

Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima---at least those discovered by gradient-based optimization---turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.

ICML Conference 2019 Conference Paper

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

  • Sanjeev Arora
  • Simon S. Du
  • Wei Hu 0014
  • Zhiyuan Li 0005
  • Ruosong Wang

Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR’17]. (ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size. (iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent. The key idea is to track dynamics of training and generalization via properties of a related kernel.

NeurIPS Conference 2019 Conference Paper

Implicit Regularization in Deep Matrix Factorization

  • Sanjeev Arora
  • Nadav Cohen
  • Wei Hu
  • Yuping Luo

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low “complexity.” We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.

NeurIPS Conference 2019 Conference Paper

On Exact Computation with an Infinitely Wide Neural Net

  • Sanjeev Arora
  • Simon Du
  • Wei Hu
  • Zhiyuan Li
  • Russ Salakhutdinov
  • Ruosong Wang

How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its “width” — namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers — is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.

ICML Conference 2018 Conference Paper

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

  • Sanjeev Arora
  • Nadav Cohen
  • Elad Hazan

Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization: linear neural networks, a well-studied model. Theoretical analysis, as well as experiments, show that here depth acts as a preconditioner which may accelerate convergence. Even on simple convex problems such as linear regression with $\ell_p$ loss, $p>2$, gradient descent can benefit from transitioning to a non-convex overparameterized objective, more than it would from some common acceleration schemes. We also prove that it is mathematically impossible to obtain the acceleration effect of overparametrization via gradients of any regularizer.
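
A minimal sketch of the reparameterization idea above: replace a linear predictor w by a depth-2 product w2·w1 and run plain gradient descent on an ℓ_p regression loss with p > 2. The data, dimensions, and step size are placeholders chosen only to illustrate the chain rule through the product, not the paper's experiments.

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((100, 20)), rng.standard_normal(100)
    p, lr = 4, 1e-3  # l_p loss with p > 2, small constant step size

    # Overparameterized (deep linear) model: effective weights are w2 * w1.
    w1, w2 = 0.1 * rng.standard_normal(20), 1.0
    for _ in range(1000):
        residual = X @ (w2 * w1) - y
        # Gradient of (1/n) * sum |residual_i|^p with respect to the effective weights.
        grad_eff = X.T @ (p * np.sign(residual) * np.abs(residual) ** (p - 1)) / len(y)
        g1, g2 = w2 * grad_eff, float(w1 @ grad_eff)  # chain rule through w2 * w1
        w1, w2 = w1 - lr * g1, w2 - lr * g2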

ICML Conference 2018 Conference Paper

Stronger Generalization Bounds for Deep Nets via a Compression Approach

  • Sanjeev Arora
  • Rong Ge 0001
  • Behnam Neyshabur
  • Yi Zhang 0074

Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that are orders of magnitude better in practice. These rely upon new succinct reparametrizations of the trained net — a compression that is explicit and efficient. These yield generalization bounds via a simple compression-based framework introduced here. Our results also provide some theoretical justification for widespread empirical success in compressing deep nets. Analysis of correctness of our compression relies upon some newly identified noise stability properties of trained deep nets, which are also experimentally verified. The study of these properties and resulting generalization bounds are also extended to convolutional nets, which had eluded earlier attempts on proving generalization.

ICML Conference 2017 Conference Paper

Generalization and Equilibrium in Generative Adversarial Nets (GANs)

  • Sanjeev Arora
  • Rong Ge 0001
  • Yingyu Liang
  • Tengyu Ma 0001
  • Yi Zhang 0074

It is shown that training of a generative adversarial network (GAN) may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from the target distribution in standard metrics. However, generalization does occur for a weaker metric called neural net distance. It is also shown that an approximate pure equilibrium exists in the discriminator/generator game for a natural training objective (Wasserstein) when generator capacity and training set sizes are moderate. This existence of equilibrium inspires the MIX+GAN protocol, which can be combined with any existing GAN training and is empirically shown to improve some of them.

ICML Conference 2016 Conference Paper

Provable Algorithms for Inference in Topic Models

  • Sanjeev Arora
  • Rong Ge 0001
  • Frederic Koehler
  • Tengyu Ma 0001
  • Ankur Moitra

Recently, there has been considerable progress on designing algorithms with provable guarantees, typically using linear algebraic methods, for parameter learning in latent variable models. Designing provable algorithms for inference has proved more difficult. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds implying that for shorter documents it can be information-theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling.

ICML Conference 2014 Conference Paper

Provable Bounds for Learning Some Deep Representations

  • Sanjeev Arora
  • Aditya Bhaskara
  • Rong Ge 0001
  • Tengyu Ma 0001

We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an n-node multilayer neural net that has degree at most n^γ for some γ < 1, and each edge has a random edge weight in [-1, 1]. Our algorithm learns almost all networks in this class with polynomial running time. The sample complexity is quadratic or cubic depending upon the details of the model. The algorithm uses layerwise learning. It is based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure. The analysis of the algorithm reveals interesting structure of neural nets with random edge weights.

ICML Conference 2013 Conference Paper

A Practical Algorithm for Topic Modeling with Provable Guarantees

  • Sanjeev Arora
  • Rong Ge 0001
  • Yonatan Halpern
  • David M. Mimno
  • Ankur Moitra
  • David A. Sontag
  • Yichen Wu
  • Michael Zhu

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for learning topic models that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.

FOCS Conference 2013 Conference Paper

Towards a Better Approximation for Sparsest Cut?

  • Sanjeev Arora
  • Rong Ge 0001
  • Ali Kemal Sinop

We give a new (1 + ε)-approximation for the SPARSEST CUT problem on graphs where small sets expand significantly more than the sparsest cut (the expansion of sets of size n/r exceeds that of the sparsest cut by a factor √(log n log r), for some small r; this condition holds for many natural graph families). We give two different algorithms. One involves Guruswami-Sinop rounding on the level-r Lasserre relaxation. The other is combinatorial and involves a new notion called Small Set Expander Flows (inspired by the expander flows of [1]), which we show exist in the input graph. Both algorithms run in time 2^{O(r)} poly(n). We also show similar approximation algorithms in graphs with genus g with an analogous local expansion condition. This is the first algorithm we know of that achieves a (1 + ε)-approximation on such a general family of graphs.

FOCS Conference 2012 Conference Paper

Learning Topic Models - Going beyond SVD

  • Sanjeev Arora
  • Rong Ge 0001
  • Ankur Moitra

Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the {\em span} of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. Perhaps the most attractive feature of our algorithm is that it generalizes to yet more realistic models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD -- just as NMF has come to replace SVD in many applications.
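
For orientation only (this is off-the-shelf NMF, not the separability-based algorithm of the paper): a sketch factoring a placeholder document-word count matrix into per-document topic weights and per-topic word weights.

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    # Placeholder document-word count matrix (500 documents, 1000-word vocabulary).
    counts = rng.poisson(0.05, size=(500, 1000)).astype(float)

    nmf = NMF(n_components=20, init="nndsvd", max_iter=300)
    doc_topic = nmf.fit_transform(counts)   # per-document topic weights
    topic_word = nmf.components_            # per-topic word weights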

NeurIPS Conference 2012 Conference Paper

Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

  • Sanjeev Arora
  • Rong Ge
  • Ankur Moitra
  • Sushant Sachdeva

We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form $y = Ax + \eta$ where $A$ is an unknown $n \times n$ matrix, $x$ is chosen uniformly at random from $\{+1, -1\}^n$, and $\eta$ is an $n$-dimensional Gaussian random variable with unknown covariance $\Sigma$. We give an algorithm that provably recovers $A$ and $\Sigma$ up to an additive $\epsilon$ and whose running time and sample complexity are polynomial in $n$ and $1 / \epsilon$. To accomplish this, we introduce a novel ``quasi-whitening'' step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one), and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of $A$ one by one via local search.

FOCS Conference 2010 Conference Paper

Subexponential Algorithms for Unique Games and Related Problems

  • Sanjeev Arora
  • Boaz Barak
  • David Steurer

We give a subexponential time approximation algorithm for the Unique Games problem. The algorithms run in time that is exponential in an arbitrarily small polynomial of the input size, n^ε. The approximation guarantee depends on ε, but not on the alphabet size or the number of variables. We also obtain subexponential algorithms with improved approximations for SMALL-SET EXPANSION and MULTICUT. For MAX CUT, SPARSEST CUT, and VERTEX COVER, we give subexponential algorithms with improved approximations on some interesting subclasses of instances. Khot's Unique Games Conjecture (UGC) states that it is NP-hard to achieve approximation guarantees such as ours for Unique Games. While our results stop short of refuting the UGC, they do suggest that Unique Games is significantly easier than NP-hard problems such as MAX 3SAT, MAX 3LIN, Label Cover and more, which are believed not to have a subexponential algorithm achieving a non-trivial approximation ratio. The main component in our algorithms is a new result on graph decomposition that may have other applications. Namely, we show that for every ε > 0 and every regular n-vertex graph G, by changing at most an ε fraction of G's edges, one can break G into disjoint parts so that the stochastic adjacency matrix of the induced graph on each part has at most n^ε eigenvalues larger than 1 - η, where η depends polynomially on ε.

STOC Conference 2006 Conference Paper

New approximation guarantee for chromatic number

  • Sanjeev Arora
  • Eden Chlamtac

We describe how to color every 3-colorable graph with O(n^0.2111) colors, thus improving an algorithm of Blum and Karger from almost a decade ago. Our analysis uses new geometric ideas inspired by the recent work of Arora, Rao, and Vazirani on SPARSEST CUT, and these ideas show promise of leading to further improvements.

STOC Conference 2005 Conference Paper

Euclidean distortion and the sparsest cut

  • Sanjeev Arora
  • James R. Lee
  • Assaf Naor

We prove that every n-point metric space of negative type (in particular, every n-point subset of L_1) embeds into a Euclidean space with distortion O(√log n · log log n), a result which is tight up to the O(log log n) factor. As a consequence, we obtain the best known polynomial-time approximation algorithm for the Sparsest Cut problem with general demands. If the demand is supported on a subset of size k, we achieve an approximation ratio of O(√log k · log log k).

FOCS Conference 2005 Conference Paper

Fast Algorithms for Approximate Semidefinite Programming using the Multiplicative Weights Update Method

  • Sanjeev Arora
  • Elad Hazan
  • Satyen Kale

Semidefinite programming (SDP) relaxations appear in many recent approximation algorithms but the only general technique for solving such SDP relaxations is via interior point methods. We use a Lagrangian-relaxation based technique (modified from the papers of Plotkin, Shmoys, and Tardos (PST), and Klein and Lu) to derive faster algorithms for approximately solving several families of SDP relaxations. The algorithms are based upon some improvements to the PST ideas - which lead to new results even for their framework - as well as improvements in approximate eigenvalue computations by using random sampling.

FOCS Conference 2005 Conference Paper

On Non-Approximability for Quadratic Programs

  • Sanjeev Arora
  • Eli Berger
  • Elad Hazan
  • Guy Kindler
  • Muli Safra

This paper studies the computational complexity of the following type of quadratic programs: given an arbitrary matrix M whose diagonal elements are zero, find x ∈ {-1, 1}^n that maximizes x^T M x. This problem recently attracted attention due to its application in various clustering settings, as well as an intriguing connection to the famous Grothendieck inequality. It is approximable to within a factor of O(log n), and known to be NP-hard to approximate within any factor better than 13/11 - ε for all ε > 0. We show that it is quasi-NP-hard to approximate to a factor better than O(log^γ n) for some γ > 0. The integrality gap of the natural semidefinite relaxation for this problem is known as the Grothendieck constant of the complete graph, and known to be Θ(log n). The proof of this fact was nonconstructive, and did not yield an explicit problem instance where this integrality gap is achieved. Our techniques yield an explicit instance for which the integrality gap is Ω(log n / log log n), essentially answering one of the open problems of Alon et al. [AMMN].

STOC Conference 2005 Conference Paper

Towards strong nonapproximability results in the Lovasz-Schrijver hierarchy

  • Michael Alekhnovich
  • Sanjeev Arora
  • Iannis Tourlakis

Lovász and Schrijver described a generic method of tightening the LP and SDP relaxation for any 0-1 optimization problem. These tightened relaxations were the basis of several celebrated approximation algorithms (such as for max-cut, max-3sat, and sparsest cut). We prove strong inapproximability results in this model for well-known problems such as max-3sat, hypergraph vertex cover and set cover. We show that the relaxations produced by as many as Ω(n) rounds of the LS+ procedure do not allow nontrivial approximation, thus ruling out the possibility that the LS+ approach gives even slightly subexponential approximation algorithms for these problems. We also point out why our results are somewhat incomparable to known inapproximability results proved using PCPs, and formalize several interesting open questions.

FOCS Conference 2004 Conference Paper

O(√log n) Approximation to SPARSEST CUT in Õ(n^2) Time

  • Sanjeev Arora
  • Elad Hazan
  • Satyen Kale

We show how to compute O(√log n)-approximations to the SPARSEST CUT and BALANCED SEPARATOR problems in Õ(n^2) time, thus improving upon the recent algorithm of Arora, Rao and Vazirani (2004). Their algorithm uses semidefinite programming and required Õ(n^4.5) time. Our algorithm relies on efficiently finding expander flows in the graph and does not solve semidefinite programs. The existence of expander flows was also established by Arora, Rao, and Vazirani.

STOC Conference 2004 Conference Paper

Expander flows, geometric embeddings and graph partitioning

  • Sanjeev Arora
  • Satish Rao
  • Umesh V. Vazirani

We give an O(√log n)-approximation algorithm for the sparsest cut, balanced separator, and graph conductance problems. This improves the O(log n)-approximation of Leighton and Rao (1988). We use a well-known semidefinite relaxation with triangle inequality constraints. Central to our analysis is a geometric theorem about projections of point sets in R^d, whose proof makes essential use of a phenomenon called measure concentration. We also describe an interesting and natural "certificate" for a graph's expansion, by embedding an n-node expander in it with appropriate dilation and congestion. We call this an expander flow.

NeurIPS Conference 2002 Conference Paper

A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

  • Eric Allender
  • Sanjeev Arora
  • Michael Kearns
  • Cristopher Moore
  • Alexander Russell

We establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.

FOCS Conference 2002 Conference Paper

Proving Integrality Gaps without Knowing the Linear Program

  • Sanjeev Arora
  • Béla Bollobás
  • László Lovász 0001

Proving integrality gaps for linear relaxations of NP optimization problems is a difficult task and usually undertaken on a case-by-case basis. We initiate a more systematic approach. We prove an integrality gap of 2-o(1) for three families of linear relaxations for vertex cover, and our methods seem relevant to other problems as well.

STOC Conference 2001 Conference Paper

Learning mixtures of arbitrary gaussians

  • Sanjeev Arora
  • Ravindran Kannan

Mixtures of gaussian (or normal) distributions arise in a variety of application areas. Many techniques have been proposed for the task of finding the component gaussians given samples from the mixture, such as the EM algorithm, a local-search heuristic from Dempster, Laird and Rubin (1977). However, such heuristics are known to require time exponential in the dimension (i.e., number of variables) in the worst case, even when the number of components is $2$. This paper presents the first algorithm that provably learns the component gaussians in time that is polynomial in the dimension. The gaussians may have arbitrary shape provided they satisfy a “nondegeneracy” condition, which requires their high-probability regions to be not “too close” together.

FOCS Conference 1997 Conference Paper

Nearly Linear Time Approximation Schemes for Euclidean TSP and other Geometric Problems

  • Sanjeev Arora

We present a randomized polynomial time approximation scheme for Euclidean TSP in R^2 that is substantially more efficient than our earlier scheme (1996) (and the scheme of Mitchell (1996)). For any fixed c > 1 and any set of n nodes in the plane, the new scheme finds a (1 + 1/c)-approximation to the optimum traveling salesman tour in O(n (log n)^{O(c)}) time. (Our earlier scheme ran in n^{O(c)} time.) For points in R^d the algorithm runs in O(n (log n)^{(O(√d c))^{d-1}}) time. This time is polynomial (actually nearly linear) for every fixed c, d. Designing such a polynomial-time algorithm was an open problem (our earlier algorithm (1996) ran in superpolynomial time for d ≥ 3). The algorithm generalizes to the same set of Euclidean problems handled by the previous algorithm, including Steiner Tree, k-TSP, k-MST, etc., although for k-TSP and k-MST the running time gets multiplied by k. We also use our ideas to design nearly-linear time approximation schemes for Euclidean versions of problems that are known to be in P, such as Minimum Spanning Tree and Min Cost Perfect Matching. All our algorithms can be derandomized, though the running time then increases by O(n^d) in R^d. They also have simple parallel implementations (say, in NC^2).

FOCS Conference 1996 Conference Paper

A New Rounding Procedure for the Assignment Problem with Applications to Dense Graph Arrangement Problems

  • Sanjeev Arora
  • Alan M. Frieze
  • Haim Kaplan

We present a randomized procedure for rounding fractional perfect matchings to (integral) matchings. If the original fractional matching satisfies any linear inequality, then with high probability, the new matching satisfies that linear inequality in an approximate sense. This extends the well-known LP rounding procedure of Raghavan and Thompson (1987), which is usually used to round fractional solutions of linear programs. It also solves an open problem of Luby and Nisan (1993) ("Design an NC procedure for converting near-optimum fractional matchings to near-optimum matchings."). We use the rounding procedure to design n^{O(log n / ε^2)} time algorithms for the following: (i) an additive approximation to the 0-1 Quadratic Assignment problem (QAP); (ii) a (1 + ε)-approximation for "dense" instances of many well-known NP-hard problems, including (an optimization formulation of) GRAPH-ISOMORPHISM, MIN-CUT-LINEAR-ARRANGEMENT, MAX-ACYCLIC-SUBGRAPH, MIN-LINEAR-ARRANGEMENT, and BETWEENNESS. (A "dense" graph is one in which the number of edges is Ω(n^2); denseness for the other problems is defined in an analogous way.)

FOCS Conference 1996 Conference Paper

Polynomial Time Approximation Schemes for Euclidean TSP and Other Geometric Problems

  • Sanjeev Arora

We present a polynomial time approximation scheme for Euclidean TSP in R^2. Given any n nodes in the plane and ε > 0, the scheme finds a (1 + ε)-approximation to the optimum traveling salesman tour in time n^{O(1/ε)}. When the nodes are in R^d, the running time increases to n^{(Õ(log^{d-2} n))/ε^{d-1}}. The previous best approximation algorithm for the problem (due to Christofides (1976)) achieves a 3/2-approximation in polynomial time. We also give similar approximation schemes for a host of other Euclidean problems, including Steiner Tree, k-TSP, Minimum degree-k spanning tree, k-MST, etc. (This list may get longer; our techniques are fairly general.) The previous best approximation algorithms for all these problems achieved a constant-factor approximation. All our algorithms also work, with almost no modification, when distance is measured using any geometric norm (such as ℓ_p for p ≥ 1 or other Minkowski norms).

FOCS Conference 1995 Conference Paper

Reductions, Codes, PCPs, and Inapproximability

  • Sanjeev Arora

Many recent results show the hardness of approximating NP-hard functions. We formalize, in a very simple way, what these results involve: a code-like Levin reduction. Assuming a well-known complexity assumption, we show that such reductions cannot prove the NP-hardness of the following problems, where ε is any positive fraction: (i) achieving an approximation ratio n^{1/2 + ε} for Clique, (ii) achieving an approximation ratio 1.5 + ε for Vertex Cover, and (iii) coloring a 3-colorable graph with O(log n) colors. In fact, we explain why current reductions cannot prove the NP-hardness of coloring 3-colorable graphs with 9 colors. Our formalization of a code-like reduction, together with our justification of why such reductions are natural, also clarifies why current proofs of inapproximability results use error-correcting codes.

FOCS Conference 1993 Conference Paper

The Hardness of Approximate Optima in Lattices, Codes, and Systems of Linear Equations

  • Sanjeev Arora
  • László Babai
  • Jacques Stern
  • Elizabeth Sweedyk

We prove the following about the Nearest Lattice Vector Problem (in any ℓ_p norm), the Nearest Codeword Problem for binary codes, the problem of learning a halfspace in the presence of errors, and some other problems. 1. Approximating the optimum within any constant factor is NP-hard. 2. If for some ε > 0 there exists a polynomial time algorithm that approximates the optimum within a factor of 2^{log^{0.5 - ε} n}, then NP is in quasi-polynomial deterministic time: NP ⊆ DTIME(n^{poly(log n)}). Moreover, we show that result 2 also holds for the Shortest Lattice Vector Problem in the ℓ_∞ norm. Improving the factor 2^{log^{0.5 - ε} n} to √dim for either of the lattice problems would imply the hardness of the Shortest Vector Problem in the ℓ_2 norm, an old open problem. Our proofs use reductions from few-prover, one-round interactive proof systems, either directly, or through a set-cover problem.

FOCS Conference 1992 Conference Paper

Probabilistic Checking of Proofs; A New Characterization of NP

  • Sanjeev Arora
  • Muli Safra

The authors give a new characterization of NP: the class NP contains exactly those languages L for which membership proofs (a proof that an input x is in L) can be verified probabilistically in polynomial time using a logarithmic number of random bits and a sub-logarithmic number of queries to the proof. This is a non-relativizing characterization of NP. They discuss implications of this characterization; specifically, they show that approximating clique (or independent set) is NP-hard.

FOCS Conference 1992 Conference Paper

Proof Verification and Hardness of Approximation Problems

  • Sanjeev Arora
  • Carsten Lund
  • Rajeev Motwani 0001
  • Madhu Sudan 0001
  • Mario Szegedy

The class PCP(f(n), g(n)) consists of all languages L for which there exists a polynomial-time probabilistic oracle machine that uses O(f(n)) random bits, queries O(g(n)) bits of its oracle and behaves as follows: if x is in L then there exists an oracle y such that the machine accepts for all random choices, but if x is not in L then for every oracle y the machine rejects with high probability. Arora and Safra (1992) characterized NP as PCP(log n, (log log n)^{O(1)}). The authors improve on their result by showing that NP = PCP(log n, 1). The result has the following consequences: (1) MAXSNP-hard problems (e.g., metric TSP, MAX-SAT, MAX-CUT) do not have polynomial time approximation schemes unless P = NP; and (2) for some ε > 0 the size of the maximal clique in a graph cannot be approximated within a factor of n^ε unless P = NP.