Author name cluster

Sifan Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

2 author rows

ICML Conference 2025 Conference Paper

Dimension-Free Adaptive Subgradient Methods with Frequent Directions

Sifan Yang
Yuanyu Wan
Peijia Li
Yibo Wang 0005
Xiao Zhang
Zhewei Wei
Lijun Zhang 0005

In this paper, we investigate the acceleration of adaptive subgradient methods through frequent directions (FD), a widely-used matrix sketching technique. The state-of-the-art regret bound exhibits a linear dependence on the dimensionality $d$, leading to unsatisfactory guarantees for high-dimensional problems. Additionally, it suffers from an $O(\tau^2 d)$ time complexity per round, which scales quadratically with the sketching size $\tau$. To overcome these issues, we first propose an algorithm named FTSL, achieving a tighter regret bound that is independent of the dimensionality. The key idea is to integrate FD with adaptive subgradient methods under the primal-dual framework and add the cumulative discarded information of FD back. To reduce its time complexity, we further utilize fast FD to expedite FTSL, yielding a better complexity of $O(\tau d)$ while maintaining the same regret bound. Moreover, to mitigate the computational cost for optimization problems involving matrix variables (e. g. , training neural networks), we adapt FD to Shampoo, a popular optimization algorithm that accounts for the structure of decision, and give a novel analysis under the primal-dual framework. Our proposed method obtains an improved dimension-free regret bound. Experimental results have verified the efficiency and effectiveness of our approaches.

Details

ICML Conference 2025 Conference Paper

Learning without Isolation: Pathway Protection for Continual Learning

Zhikang Chen
Abudukelimu Wuerkaixi
Sen Cui
Haoxuan Li 0001
Ding Li
Jingfeng Zhang
Bo Han 0003
Gang Niu 0001

Deep networks are prone to catastrophic forgetting during sequential task learning, i. e. , losing the knowledge about old tasks upon learning new tasks. To this end, continual learning (CL) has emerged, whose existing methods focus mostly on regulating or protecting the parameters associated with the previous tasks. However, parameter protection is often impractical, since the size of parameters for storing the old-task knowledge increases linearly with the number of tasks, otherwise it is hard to preserve the parameters related to the old-task knowledge. In this work, we bring a dual opinion from neuroscience and physics to CL: in the whole networks, the pathways matter more than the parameters when concerning the knowledge acquired from the old tasks. Following this opinion, we propose a novel CL framework, learning without isolation (LwI), where model fusion is formulated as graph matching and the pathways occupied by the old tasks are protected without being isolated. Thanks to the sparsity of activation channels in a deep network, LwI can adaptively allocate available pathways for a new task, realizing pathway protection and addressing catastrophic forgetting in a parameter-effcient manner. Experiments on popular benchmark datasets demonstrate the superiority of the proposed LwI.

Details

AAAI Conference 2025 Conference Paper

Online Nonsubmodular Optimization with Delayed Feedback in the Bandit Setting

Sifan Yang
Yuanyu Wan
Lijun Zhang

We investigate the online nonsubmodular optimization with delayed feedback in the bandit setting, where the loss function is α-weakly DR-submodular and β-weakly DR-supermodular. Previous work has established an (α,β)-regret bound of O(nd^⅓T^⅔), where n is the dimensionality and d is the maximum delay. However, its regret bound relies on the maximum delay and is thus sensitive to irregular delays. Additionally, it couples the effects of delays and bandit feedback as its bound is the product of the delay term and the O(nT^⅔) regret bound in the bandit setting without delayed feedback. In this paper, we develop two algorithms to address these limitations, respectively. Firstly, we propose a novel method, namely DBGD-NF, which employs the one-point gradient estimator and utilizes all the available estimated gradients in each round to update the decision. It achieves a better O(nd̅^⅓T^⅔) regret bound, which is relevant to the average delay d̅ = 1/T ∑ₜ₌₁ᵀ dₜ <= d. Secondly, we extend DBGD-NF by employing a blocking update mechanism to decouple the joint effect of the delays and bandit feedback, which enjoys an O(n(T^⅔+ √(dT))) regret bound. When d = O(T^⅓), our regret bound matches the O(nT^⅔) bound in the bandit setting without delayed feedback. Compared to our first O(nd̅^⅓T^⅔) bound, it is more advantageous when the maximum delay d = o(d̅^⅔T^⅓). Finally, we conduct experiments on structured sparse learning to demonstrate the superiority of our methods.

PDF Details DOI

IJCAI Conference 2025 Conference Paper

Smoothed Online Convex Optimization with Delayed Feedback

Sifan Yang
Wenhao Yang
Wei Jiang
Yuanyu Wan
Lijun Zhang

Smoothed online convex optimization (SOCO), in which the online player incurs both a hitting cost and a switching cost for changing its decisions, has garnered significant attention in recent years. While existing studies typically assume that the gradient information is revealed immediately, such an assumption may not hold in some real-world applications. To overcome this limitation, we investigate SOCO with delayed feedback, and develop two online algorithms that can minimize the dynamic regret with switching cost. Firstly, we extend Mild-OGD, an existing algorithm that adopts the meta-expert framework for online convex optimization with delayed feedback, to account for switching cost. Specifically, we analyze the switching cost in the expert-algorithm of Mild-OGD, and then modify its meta-algorithm to incorporate this cost when assigning the weight to each expert. We demonstrate that our proposed method, Smelt-DOGD can achieve an O(√(dT(P_T+1))) dynamic regret bound with switching cost, where d is the maximum delay and P_T is the path-length. Secondly, we develop an efficient variant to reduce the number of projections per round from O(log T) to 1, yet maintaining the same theoretical guarantee. The key idea is to construct a new surrogate loss defined over a simpler domain for expert-algorithms so that these experts do not need to perform the complex projection operations in each round. Finally, we conduct experiments to validate the effectiveness and efficiency of our algorithms.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Adaptive Variance Reduction for Stochastic Optimization under Weaker Assumptions

Wei Jiang
Sifan Yang
Yibo Wang
Lijun Zhang

This paper explores adaptive variance reduction methods for stochastic optimization based on the STORM technique. Existing adaptive extensions of STORM rely on strong assumptions like bounded gradients and bounded function values, or suffer an additional $\mathcal{O}(\log T)$ term in the convergence rate. To address these limitations, we introduce a novel adaptive STORM method that achieves an optimal convergence rate of $\mathcal{O}(T^{-1/3})$ for non-convex functions with our newly designed learning rate strategy. Compared with existing approaches, our method requires weaker assumptions and attains the optimal convergence rate without the additional $\mathcal{O}(\log T)$ term. We also extend the proposed technique to stochastic compositional optimization, obtaining the same optimal rate of $\mathcal{O}(T^{-1/3})$. Furthermore, we investigate the non-convex finite-sum problem and develop another innovative adaptive variance reduction method that achieves an optimal convergence rate of $\mathcal{O}(n^{1/4} T^{-1/2} )$, where $n$ represents the number of component functions. Numerical experiments across various tasks validate the effectiveness of our method.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Efficient Sign-Based Optimization: Accelerating Convergence via Variance Reduction

Wei Jiang
Sifan Yang
Wenhao Yang
Lijun Zhang

Sign stochastic gradient descent (signSGD) is a communication-efficient method that transmits only the sign of stochastic gradients for parameter updating. Existing literature has demonstrated that signSGD can achieve a convergence rate of $\mathcal{O}(d^{1/2}T^{-1/4})$, where $d$ represents the dimension and $T$ is the iteration number. In this paper, we improve this convergence rate to $\mathcal{O}(d^{1/2}T^{-1/3})$ by introducing the Sign-based Stochastic Variance Reduction (SSVR) method, which employs variance reduction estimators to track gradients and leverages their signs to update. For finite-sum problems, our method can be further enhanced to achieve a convergence rate of $\mathcal{O}(m^{1/4}d^{1/2}T^{-1/2})$, where $m$ denotes the number of component functions. Furthermore, we investigate the heterogeneous majority vote in distributed settings and introduce two novel algorithms that attain improved convergence rates of $\mathcal{O}(d^{1/2}T^{-1/2} + dn^{-1/2})$ and $\mathcal{O}(d^{1/4}T^{-1/4})$ respectively, outperforming the previous results of $\mathcal{O}(dT^{-1/4} + dn^{-1/2})$ and $\mathcal{O}(d^{3/8}T^{-1/8})$, where $n$ represents the number of nodes. Numerical experiments across different tasks validate the effectiveness of our proposed methods.

PDF Details DOI

ICML Conference 2024 Conference Paper

Projection-Free Variance Reduction Methods for Stochastic Constrained Multi-Level Compositional Optimization

Wei Jiang 0029
Sifan Yang
Wenhao Yang
Yibo Wang 0005
Yuanyu Wan
Lijun Zhang 0005

This paper investigates projection-free algorithms for stochastic constrained multi-level optimization. In this context, the objective function is a nested composition of several smooth functions, and the decision set is closed and convex. Existing projection-free algorithms for solving this problem suffer from two limitations: 1) they solely focus on the gradient mapping criterion and fail to match the optimal sample complexities in unconstrained settings; 2) their analysis is exclusively applicable to non-convex functions, without considering convex and strongly convex objectives. To address these issues, we introduce novel projection-free variance reduction algorithms and analyze their complexities under different criteria. For gradient mapping, our complexities improve existing results and match the optimal rates for unconstrained problems. For the widely-used Frank-Wolfe gap criterion, we provide theoretical guarantees that align with those for single-level problems. Additionally, by using a stage-wise adaptation, we further obtain complexities for convex and strongly convex functions. Finally, numerical experiments on different tasks demonstrate the effectiveness of our methods.

Details