Author name cluster

Simon Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers

1 author row

NeurIPS Conference 2025 Conference Paper

A Minimalist Example of Edge-of-Stability and Progressive Sharpening

Liming Liu
Zixuan Zhang
Simon Du
Tuo Zhao

Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current research approaches, using either generalist frameworks or minimalist examples, face significant limitations in explaining these phenomena. This paper advances the minimalist approach by introducing a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant. Through this model, we rigorously prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish non-asymptotic analysis of the training dynamics and sharpness along the entire GD trajectory. Besides, we connect our minimalist example to existing works by reconciling the existence of a well-behaved "stable set" between minimalist and generalist analyses, and extending the analysis of Gradient Flow Solution sharpness to our two-dimensional input scenario. These findings provide new insights into the EoS phenomenon from both parameter and input data distribution perspectives, potentially informing more effective optimization strategies in deep learning practice.

NeurIPS Conference 2025 Conference Paper

Deployment Efficient Reward-Free Exploration with Linear Function Approximation

Zihan Zhang
Yuxin Chen
Jason Lee
Simon Du
Lin Yang
Ruosong Wang

We study deployment-efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) without revealing the reward function, while minimizing the number of distinct policies implemented during learning. By ``deployment efficient'', we mean algorithms that require few policies deployed during exploration -- crucial in real-world applications where such deployments are costly or disruptive. We design a novel reinforcement learning algorithm that achieves near-optimal deployment efficiency for linear MDPs in the reward-free setting, using at most $H$ exploration policies during execution (where $H$ is the horizon length), while maintaining sample complexity polynomial in feature dimension and horizon length. Unlike previous approaches with similar deployment efficiency guarantees, our algorithm's sample complexity is independent of the reachability or explorability coefficients of the underlying MDP, which can be arbitrarily small and lead to unbounded sample complexity in certain cases -- directly addressing an open problem from prior work. Our technical contributions include a data-dependent method for truncating state-action pairs in linear MDPs, efficient offline policy evaluation and optimization algorithms for these truncated MDPs, and a careful integration of these components to implement reward-free exploration with linear function approximation without sacrificing deployment efficiency.

NeurIPS Conference 2025 Conference Paper

Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

Siting Li
Xiang Gao
Simon Du

While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9, 112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: Pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15\% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8\% improvement when prompts are only available during inference.

NeurIPS Conference 2025 Conference Paper

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang
Qing Yang
Zhiyuan Zeng
Liliang Ren
Liyuan Liu
Baolin Peng
Hao Cheng
Xuehai He

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2. 5-Math-1. 5B, we identify a single example that elevates model performance on MATH500 from 36. 0\% to 73. 6\% (8. 6\% improvement beyond format correction), and improves the average performance across six common mathematical reasoning benchmarks from 17. 6\% to 35. 7\% (7. 0\% non-format gain). This result matches the performance obtained using the 1. 2k DeepScaleR subset (MATH500: 73. 6\%, average: 35. 9\%), which contains the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74. 8\%, average: 36. 6\%). Similar substantial improvements are observed across various models (Qwen2. 5-Math-7B, Llama3. 2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1. 5B), RL algorithms (GRPO and PPO), and different math examples. In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term \textit{post-saturation generalization}. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e. g. , by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, models, and data are open source at https: //github. com/ypwang61/One-Shot-RLVR.

NeurIPS Conference 2025 Conference Paper

Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs

Shulun Chen
Runlong Zhou
Zihan Zhang
Maryam Fazel
Simon Du

We consider gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm (Zhang et al. [2024]) achieves a variance-aware gap-dependent regret bound of $$\tilde{O}\left(\left(\sum_{\Delta_h(s, a)>0} \frac{H^2 \log K \land \mathtt{Var}\_{\max}^{\textup{c}}}{\Delta_h(s, a)} +\sum_{\Delta_h(s, a)=0}\frac{ H^2 \land \mathtt{Var}\_{\max}^{\textup{c}}}{\Delta_{\mathrm{min}}} + SAH^4 (S \lor H) \right) \log K\right), $$ where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\tilde{O}$ hides $\mathsf{poly} \log (S, A, H, 1 / \Delta\_{\mathrm{min}}, 1 / \delta)$ terms. Here, $\Delta_h(s, a) =V_h^* (a) - Q_h^* (s, a)$ represents the suboptimality gap and $\Delta_{\mathrm{min}}: = \min_{\Delta_h (s, a) > 0} \Delta_h(s, a)$. The term $\mathtt{Var}\_{\max}^{\textup{c}}$ denotes the maximum conditional total variance, calculated as the maximum over all $(\pi, h, s)$ tuples of the expected total variance under policy $\pi$ conditioned on trajectories visiting state $s$ at step $h$. $\mathtt{Var}\_{\max}^{\textup{c}}$ characterizes the maximum randomness encountered when learning any $(h, s)$ pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gap and can be potentially adapted for other algorithms. To complement the study, we establish a lower bound of $$\Omega \left( \sum_{\Delta_h(s, a)>0} \frac{H^2 \land \mathtt{Var}\_{\max}^{\textup{c}}}{\Delta_h(s, a)}\cdot \log K\right), $$ demonstrating the necessity of dependence on $\mathtt{Var}\_{\max}^{\textup{c}}$ even when the maximum unconditional total variance (without conditioning on $(h, s)$) approaches zero.

NeurIPS Conference 2025 Conference Paper

Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Divyansh Pareek
Sewoong Oh
Simon Du

The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i. e. , teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta\in(0, 1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{\eta \sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.

EWRL Workshop 2022 Workshop Paper

Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies

Rui Yuan
Robert M Gower
Alessandro Lazaric
Simon Du
Lin Xiao

We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, NPG and Q-NPG with log-linear policies can be written as approximate versions of the policy mirror descent (PMD) method. By extending a recent analysis of PMD in the tabular setting, we obtain linear convergence rates and O(1/ 2 ) sample complexities for both NPG and Q-NPG with log-linear policy parametrization using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. As a byproduct, we obtain sublinear convergence rates for both NPG and Q-NPG with arbitrary large constant step sizes.

NeurIPS Conference 2019 Conference Paper

Acceleration via Symplectic Discretization of High-Resolution Differential Equations

Bin Shi
Simon Du
Weijie Su
Michael Jordan

We study first-order optimization algorithms obtained by discretizing ordinary differential equations (ODEs) corresponding to Nesterov’s accelerated gradient methods (NAGs) and Polyak’s heavy-ball method. We consider three discretization schemes: symplectic Euler (S), explicit Euler (E) and implicit Euler (I) schemes. We show that the optimization algorithm generated by applying the symplectic scheme to a high-resolution ODE proposed by Shi et al. [2018] achieves the accelerated rate for minimizing both strongly convex function and convex function. On the other hand, the resulting algorithm either fails to achieve acceleration or is impractical when the scheme is implicit, the ODE is low-resolution, or the scheme is explicit.

NeurIPS Conference 2019 Conference Paper

Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels

Simon Du
Kangcheng Hou
Russ Salakhutdinov
Barnabas Poczos
Ruosong Wang
Keyulu Xu

While graph kernels (GKs) are easy to train and enjoy provable theoretical guarantees, their practical performances are limited by their expressive power, as the kernel function often depends on hand-crafted combinatorial features of graphs. Compared to graph kernels, graph neural networks (GNNs) usually achieve better practical performance, as GNNs use multi-layer architectures and non-linear activation functions to extract high-order information of graphs as features. However, due to the large number of hyper-parameters and the non-convex nature of the training procedure, GNNs are harder to train. Theoretical guarantees of GNNs are also not well-understood. Furthermore, the expressive power of GNNs scales with the number of parameters, and thus it is hard to exploit the full power of GNNs when computing resources are limited. The current paper presents a new class of graph kernels, Graph Neural Tangent Kernels (GNTKs), which correspond to \emph{infinitely wide} multi-layer GNNs trained by gradient descent. GNTKs enjoy the full expressive power of GNNs and inherit advantages of GKs. Theoretically, we show GNTKs provably learn a class of smooth functions on graphs. Empirically, we test GNTKs on graph classification datasets and show they achieve strong performance.

NeurIPS Conference 2019 Conference Paper

On Exact Computation with an Infinitely Wide Neural Net

Sanjeev Arora
Simon Du
Wei Hu
Zhiyuan Li
Russ Salakhutdinov
Ruosong Wang

How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its “width”— namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers — is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al. , 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al. , 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.

NeurIPS Conference 2019 Conference Paper

Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle

Simon Du
Yuping Luo
Ruosong Wang
Hanrui Zhang

Q-learning with function approximation is one of the most popular methods in reinforcement learning. Though the idea of using function approximation was proposed at least 60 years ago, even in the simplest setup, i. e, approximating Q-functions with linear functions, it is still an open problem how to design a provably efficient algorithm that learns a near-optimal policy. The key challenges are how to efficiently explore the state space and how to decide when to stop exploring in conjunction with the function approximation scheme. The current paper presents a provably efficient algorithm for Q-learning with linear function approximation. Under certain regularity assumptions, our algorithm, Difference Maximization Q-learning, combined with linear function approximation, returns a near-optimal policy using polynomial number of trajectories. Our algorithm introduces a new notion, the Distribution Shift Error Checking (DSEC) oracle. This oracle tests whether there exists a function in the function class that predicts well on a distribution $\mathcal{D}_1$, but predicts poorly on another distribution $\mathcal{D}_2$, where $\mathcal{D}_1$ and $\mathcal{D}_2$ are distributions over states induced by two different exploration policies. For the linear function class, this oracle is equivalent to solving a top eigenvalue problem. We believe our algorithmic insights, especially the DSEC oracle, are also useful in designing and analyzing reinforcement learning algorithms with general function approximation.

NeurIPS Conference 2019 Conference Paper

Towards Understanding the Importance of Shortcut Connections in Residual Networks

Tianyi Liu
Minshuo Chen
Mo Zhou
Simon Du
Enlu Zhou
Tuo Zhao

Residual Network (ResNet) is undoubtedly a milestone in deep learning. ResNet is equipped with shortcut connections between layers, and exhibits efficient training using simple first order algorithms. Despite of the great empirical success, the reason behind is far from being well understood. In this paper, we study a two-layer non-overlapping convolutional ResNet. Training such a network requires solving a non-convex optimization problem with a spurious local optimum. We show, however, that gradient descent combined with proper normalization, avoids being trapped by the spurious local optimum, and converges to a global optimum in polynomial time, when the weight of the first layer is initialized at 0, and that of the second layer is initialized arbitrarily in a ball. Numerical experiments are provided to support our theory.

NeurIPS Conference 2018 Conference Paper

Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced

Simon Du
Wei Hu
Jason Lee

We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We rigorously prove that gradient flow (i. e. gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes $\eta_t=O(t^{−(1/2+\delta)}) (0<\delta\le1/2)$ automatically balances two low-rank factors and converges to a bounded global optimum. Furthermore, for rank-1 asymmetric matrix factorization we give a finer analysis showing gradient descent with constant step size converges to the global minimum at a globally linear rate. We believe that the idea of examining the invariance imposed by first order algorithms in learning homogeneous models could serve as a fundamental building block for studying optimization for learning deep models.

NeurIPS Conference 2018 Conference Paper

How Many Samples are Needed to Estimate a Convolutional Neural Network?

Simon Du
Yining Wang
Xiyu Zhai
Sivaraman Balakrishnan
Russ Salakhutdinov
Aarti Singh

A widespread folklore for explaining the success of Convolutional Neural Networks (CNNs) is that CNNs use a more compact representation than the Fully-connected Neural Network (FNN) and thus require fewer training samples to accurately estimate their parameters. We initiate the study of rigorously characterizing the sample complexity of estimating CNNs. We show that for an $m$-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\epsilon$ is $\widetilde{O(m/\epsilon^2)$, whereas the sample-complexity for its FNN counterpart is lower bounded by $\Omega(d/\epsilon^2)$ samples. Since, in typical settings $m \ll d$, this result demonstrates the advantage of using a CNN. We further consider the sample complexity of estimating a one-hidden-layer CNN with linear activation where both the $m$-dimensional convolutional filter and the $r$-dimensional output weights are unknown. For this model, we show that the sample complexity is $\widetilde{O}\left((m+r)/\epsilon^2\right)$ when the ratio between the stride size and the filter size is a constant. For both models, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main tools for deriving these results are a localized empirical process analysis and a new lemma characterizing the convolutional structure. We believe that these tools may inspire further developments in understanding CNNs.

NeurIPS Conference 2017 Conference Paper

Gradient Descent Can Take Exponential Time to Escape Saddle Points

Simon Du
Chi Jin
Jason Lee
Michael Jordan
Aarti Singh
Barnabas Poczos

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al. , 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al. , 2015, Jin et al. , 2017] is not slowed down by saddle points—it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.

NeurIPS Conference 2017 Conference Paper

Hypothesis Transfer Learning via Transformation Functions

Simon Du
Jayanth Koushik
Aarti Singh
Barnabas Poczos

We consider the Hypothesis Transfer Learning (HTL) problem where one incorporates a hypothesis trained on the source domain into the learning procedure of the target domain. Existing theoretical analysis either only studies specific algorithms or only presents upper bounds on the generalization error but not on the excess risk. In this paper, we propose a unified algorithm-dependent framework for HTL through a novel notion of transformation functions, which characterizes the relation between the source and the target domains. We conduct a general risk analysis of this framework and in particular, we show for the first time, if two domains are related, HTL enjoys faster convergence rates of excess risks for Kernel Smoothing and Kernel Ridge Regression than those of the classical non-transfer learning settings. We accompany this framework with an analysis of cross-validation for HTL to search for the best transfer technique and gracefully reduce to non-transfer learning when HTL is not helpful. Experiments on robotics and neural imaging data demonstrate the effectiveness of our framework.

NeurIPS Conference 2017 Conference Paper

On the Power of Truncated SVD for General High-rank Matrix Estimation Problems

Simon Du
Yining Wang
Aarti Singh

We show that given an estimate $\widehat{\mat A}$ that is close to a general high-rank positive semi-definite (PSD) matrix $\mat A$ in spectral norm (i. e. , $\|\widehat{\mat A}-\mat A\|_2 \leq \delta$), the simple truncated Singular Value Decomposition of $\widehat{\mat A}$ produces a multiplicative approximation of $\mat A$ in Frobenius norm. This observation leads to many interesting results on general high-rank matrix estimation problems: 1. High-rank matrix completion: we show that it is possible to recover a {general high-rank matrix} $\mat A$ up to $(1+\varepsilon)$ relative error in Frobenius norm from partial observations, with sample complexity independent of the spectral gap of $\mat A$. 2. High-rank matrix denoising: we design algorithms that recovers a matrix $\mat A$ with relative error in Frobenius norm from its noise-perturbed observations, without assuming $\mat A$ is exactly low-rank. 3. Low-dimensional estimation of high-dimensional covariance: given $N$ i. i. d. ~samples of dimension $n$ from $\mathcal N_n(\mat 0, \mat A)$, we show that it is possible to estimate the covariance matrix $\mat A$ with relative error in Frobenius norm with $N\approx n$, improving over classical covariance estimation results which requires $N\approx n^2$.

NeurIPS Conference 2016 Conference Paper

Efficient Nonparametric Smoothness Estimation

Shashank Singh
Simon Du
Barnabas Poczos

Sobolev quantities (norms, inner products, and distances) of probability density functions are important in the theory of nonparametric statistics, but have rarely been used in practice, partly due to a lack of practical estimators. They also include, as special cases, L^2 quantities which are used in many applications. We propose and analyze a family of estimators for Sobolev quantities of unknown probability density functions. We bound the finite-sample bias and variance of our estimators, finding that they are generally minimax rate-optimal. Our estimators are significantly more computationally tractable than previous estimators, and exhibit a statistical/computational trade-off allowing them to adapt to computational constraints. We also draw theoretical connections to recent work on fast two-sample testing and empirically validate our estimators on synthetic data.