Arrow Research search

Author name cluster

Yan Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

TMLR Journal 2026 Journal Article

Statistical Inference for Generative Model Comparison

  • Zijun Gao
  • Han Su
  • Yan Sun

Generative models have achieved remarkable success across a range of applications, yet their evaluation still lacks principled uncertainty quantification. In this paper, we develop a method for comparing how close different generative models are to the underlying distribution of test samples. In particular, our approach employs the Kullback-Leibler (KL) divergence to measure the distance between a generative model and the unknown test distribution, as KL requires no tuning parameters such as the kernels used by RKHS-based distances. Moreover, the relative KL divergence is the only $f$-divergence that admits a crucial cancellation of the hard-to-estimate term, enabling faithful uncertainty quantification. Furthermore, we extend our method to comparing conditional generative models and leverage Edgeworth expansions to address limited-data settings. On simulated datasets with known ground truth, we show that our approach attains effective coverage rates and has higher power than kernel-based methods. When applied to generative models on image and text datasets, our procedure yields conclusions consistent with benchmark metrics but with statistical confidence. The source code to reproduce our experiments is available at https://github.com/sylydya/compare-generative-models.
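One way to read the cancellation described above: for a test distribution $p$ and models $q_1, q_2$, the difference $\mathrm{KL}(p\|q_1) - \mathrm{KL}(p\|q_2) = \mathbb{E}_p[\log q_2(X) - \log q_1(X)]$, so the hard-to-estimate $\mathbb{E}_p[\log p]$ term drops out. A minimal sketch of a CLT-based confidence interval for this difference (the function name and interface are illustrative assumptions, not the paper's code):

```python
import numpy as np
from math import sqrt

def relative_kl_ci(logq1, logq2, z=1.96):
    """CI for KL(p||q1) - KL(p||q2) = E_p[log q2(X) - log q1(X)].

    The entropy term E_p[log p] cancels in the difference, so only the
    two models' log-likelihoods on the test samples are needed.
    logq1, logq2: per-test-sample log-likelihoods under each model.
    """
    d = np.asarray(logq2, dtype=float) - np.asarray(logq1, dtype=float)
    m = d.mean()
    se = d.std(ddof=1) / sqrt(len(d))   # standard error of the mean
    return m - z * se, m + z * se
```

A positive interval would suggest model 1 is farther from the test distribution than model 2 in KL; the sketch omits the paper's Edgeworth correction for small samples.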

TMLR Journal 2026 Journal Article

Subspace based Federated Unlearning

  • Guanghao Li
  • Li Shen
  • Yan Sun
  • Yue Hu
  • Han Hu
  • Dacheng Tao

Federated learning (FL) enables collaborative machine learning among multiple clients while preserving user data privacy by preventing the exchange of local data. However, when users request to leave the FL system, the trained FL model may still retain information about their contributions. To comply with the right to be forgotten, federated unlearning has been proposed, which aims to remove a designated client's influence from the FL model. Existing federated unlearning methods typically rely on storing historical parameter updates, which may be impractical in resource-constrained FL settings. In this paper, we propose a Subspace-based Federated Unlearning method (SFU) that addresses this challenge without requiring additional storage. SFU updates the model via gradient ascent constrained within a subspace, specifically the orthogonal complement of the gradient descent directions derived from the remaining clients. By projecting the ascending gradient of the target client onto this subspace, SFU can mitigate the contribution of the target client while maintaining model performance on the remaining clients. SFU is communication-efficient, requiring only one round of local training per client to transmit gradient information to the server for model updates. Extensive empirical evaluations on multiple datasets demonstrate that SFU achieves competitive unlearning performance while preserving model utility. Compared to representative baseline methods, SFU consistently shows promising results under various experimental settings.
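The projection step the abstract describes, constraining gradient ascent to the orthogonal complement of the remaining clients' descent directions, can be sketched with a QR-based projector (a hedged illustration, not the paper's implementation; `B` stacking the remaining clients' gradient directions as columns is an assumption):

```python
import numpy as np

def project_orthogonal(g, B):
    """Project vector g onto the orthogonal complement of span(columns of B).

    Hypothetical reading of SFU's constraint: the target client's ascent
    gradient keeps no component along the remaining clients' descent
    directions, so their performance is less affected by unlearning.
    """
    Q, _ = np.linalg.qr(B)             # orthonormal basis of span(B)
    return g - Q @ (Q.T @ g)           # remove the in-span component
```

By construction the returned vector is orthogonal to every column of `B`, which is the property the subspace constraint needs.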

NeurIPS Conference 2025 Conference Paper

Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives

  • Qixin Zhang
  • Yan Sun
  • Can Jin
  • Xikun Zhang
  • Yao Shu
  • Puning Zhao
  • Li Shen
  • Dacheng Tao

In this paper, we present two effective policy learning algorithms for the multi-agent online coordination (MA-OC) problem. The first one, **MA-SPL**, not only achieves the optimal $(1-\frac{c}{e})$-approximation guarantee for the MA-OC problem with submodular objectives but also can handle the unexplored $\alpha$-weakly DR-submodular and $(\gamma, \beta)$-weakly submodular scenarios, where $c$ is the curvature of the investigated submodular functions, $\alpha$ denotes the diminishing-return (DR) ratio and the tuple $(\gamma, \beta)$ represents the submodularity ratios. Subsequently, in order to reduce the reliance on the unknown parameters $\alpha, \gamma, \beta$ inherent in the **MA-SPL** algorithm, we introduce a second online algorithm named **MA-MPL**. This **MA-MPL** algorithm is entirely *parameter-free* and simultaneously maintains the same approximation ratio as **MA-SPL**. The core of our **MA-SPL** and **MA-MPL** algorithms is a novel continuous-relaxation technique termed policy-based continuous extension. Compared with the well-established multi-linear extension, a notable advantage of this new policy-based continuous extension is its ability to provide a lossless rounding scheme for any set function, thereby enabling us to tackle the challenging weakly submodular objective functions. Finally, extensive simulations are conducted to demonstrate the effectiveness of our proposed algorithms.

NeurIPS Conference 2025 Conference Paper

Foundations of Top-$k$ Decoding for Language Models

  • Georgy Noarov
  • Soham Mallick
  • Tao Wang
  • Sunay Joshi
  • Yan Sun
  • Yangxinyu Xie
  • Mengxin Yu
  • Edgar Dobriban

Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token-probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We introduce *Bregman decoders* obtained by minimizing a separable Bregman divergence (for both the *primal* and *dual* cases) with a sparsity-inducing $\ell_0$-regularization; in particular, these decoders are *adaptive* in the sense that the sparsity parameter $k$ is chosen depending on the underlying token distribution. Despite the combinatorial nature of the sparse Bregman objective, we show how to optimize it efficiently for a large class of divergences. We prove that (i) the optimal decoding strategies are greedy, and further that (ii) the objective is discretely convex in $k$, such that the optimal $k$ can be identified in logarithmic time. We note that standard top-$k$ decoding arises as a special case for the KL divergence, and construct new decoding strategies with substantially different behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
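The truncate-and-renormalize rule that standard top-$k$ decoding applies at each token can be sketched directly (an illustrative helper, not code from the paper):

```python
import numpy as np

def top_k_sample(probs, k, rng=None):
    """Sample a token index via top-k decoding.

    Keep the k largest next-token probabilities, renormalize them to
    sum to unity, and sample from the truncated distribution.
    """
    rng = rng or np.random.default_rng(0)
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-k:]       # indices of the k largest entries
    kept = probs[top]
    kept = kept / kept.sum()           # renormalize to sum to 1
    return int(rng.choice(top, p=kept))
```

With `k = 1` this reduces to greedy decoding; the paper's Bregman decoders generalize the renormalization step and choose $k$ adaptively.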

ICLR Conference 2025 Conference Paper

In-context Time Series Predictor

  • Jiecheng Lu
  • Yan Sun
  • Shihao Yang 0002

Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate "time series forecasting tasks" as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms and is more parameter-efficient without the need of using pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.

NeurIPS Conference 2025 Conference Paper

Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning

  • Zihao Jing
  • Yan Sun
  • Yan Yi Li
  • Sugitha Janarthanan
  • Alana Deng
  • Pingzhao Hu

Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: github.com/selmiss/MuMo.

IROS Conference 2025 Conference Paper

Towards Robust Sensor-Fusion Ground SLAM: A Comprehensive Benchmark and A Resilient Framework

  • Deteng Zhang
  • Junjie Zhang
  • Yan Sun
  • Tao Li
  • Hao Yin
  • Hongzhao Xie
  • Jie Yin

Considerable advancements have been achieved in SLAM methods tailored for structured environments, yet their robustness under challenging corner cases remains a critical limitation. Although multi-sensor fusion approaches integrating diverse sensors have shown promising performance improvements, the research community faces two key barriers. On the one hand, the lack of standardized and configurable benchmarks that systematically evaluate SLAM algorithms under diverse degradation scenarios hinders comprehensive performance assessment. On the other hand, existing SLAM frameworks primarily focus on fusing a limited set of sensor types, without effectively addressing adaptive sensor selection strategies for varying environmental conditions. To bridge these gaps, we make three key contributions. First, we introduce the M3DGR dataset: a sensor-rich benchmark with systematically induced degradation patterns including visual challenge, LiDAR degeneracy, wheel slippage and GNSS denial. Second, we conduct a comprehensive evaluation of forty SLAM systems on M3DGR, providing critical insights into their robustness and limitations under challenging real-world conditions. Third, we develop a resilient modular multi-sensor fusion framework named Ground-Fusion++, which demonstrates robust performance by coupling GNSS, RGB-D, LiDAR, IMU (Inertial Measurement Unit) and wheel odometry. Code and datasets are publicly available.

ICLR Conference 2025 Conference Paper

Unsupervised Multiple Kernel Learning for Graphs via Ordinality Preservation

  • Yan Sun
  • Stanley Kok

Learning effective graph similarities is crucial for tasks like clustering, yet selecting the optimal kernel to evaluate such similarities in unsupervised settings remains a major challenge. Despite the development of various graph kernels, determining the most appropriate one for a specific task is particularly difficult in the absence of labeled data. Existing methods often struggle to handle the complex structure of graph data and rely on heuristic approaches that fail to adequately capture the global relationships between graphs. To overcome these limitations, we propose Unsupervised Multiple Kernel Learning for Graphs (UMKL-G), a model that combines multiple graph kernels without requiring labels or predefined local neighbors. Our approach preserves the topology of the data by maintaining ordinal relationships among graphs through a probability simplex, allowing for a unified and adaptive kernel learning process. We provide theoretical guarantees on the stability, robustness, and generalization of our method. Empirical results demonstrate that UMKL-G outperforms individual kernels and other state-of-the-art methods, offering a robust solution for unsupervised graph analysis.

ICML Conference 2025 Conference Paper

WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

  • Jiecheng Lu
  • Xu Han
  • Yan Sun
  • Shihao Yang 0002

We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.

NeurIPS Conference 2025 Conference Paper

ZeroS: Zero-Sum Linear Attention for Efficient Transformers

  • Jiecheng Lu
  • Xu Han
  • Yan Sun
  • Viresh Pati
  • Yubin Kim
  • Siddhartha Somani
  • Shihao Yang

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.

NeurIPS Conference 2024 Conference Paper

A-FedPD: Aligning Dual-Drift is All Federated Primal-Dual Learning Needs

  • Yan Sun
  • Li Shen
  • Dacheng Tao

As a popular paradigm for juggling data privacy and collaborative training, federated learning (FL) is flourishing as a way to distributively process large-scale heterogeneous datasets on edge clients. Due to bandwidth limitations and security considerations, it ingeniously splits the original problem into multiple subproblems to be solved in parallel, which gives primal-dual solutions great application value in FL. In this paper, we review the recent development of classical federated primal-dual methods and point out a serious common defect of such methods in non-convex scenarios, which we term ``dual drift'', caused by the dual hysteresis of longstanding inactive clients under partial-participation training. To address this problem, we propose a novel Aligned Federated Primal Dual (A-FedPD) method, which constructs virtual dual updates to align global consensus and local dual variables for protracted unparticipating local clients. Meanwhile, we provide a comprehensive analysis of the optimization and generalization efficiency of the A-FedPD method on smooth non-convex objectives, which confirms its high efficiency and practicality. Extensive experiments are conducted on several classical FL setups to validate the effectiveness of our proposed method.

ICML Conference 2024 Conference Paper

CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables

  • Jiecheng Lu
  • Xu Han
  • Yan Sun
  • Shihao Yang 0002

For Multivariate Time Series Forecasting (MTSF), recent deep learning applications show that univariate models frequently outperform multivariate ones. To address the deficiency in multivariate models, we introduce a method to Construct Auxiliary Time Series (CATS) that functions like a 2D temporal-contextual attention mechanism, which generates Auxiliary Time Series (ATS) from Original Time Series (OTS) to effectively represent and incorporate inter-series relationships for forecasting. Key principles of ATS—continuity, sparsity, and variability—are identified and implemented through different modules. Even with a basic 2-layer MLP as the core predictor, CATS achieves state-of-the-art performance while significantly reducing complexity and parameters compared to previous multivariate models, marking it as an efficient and transferable MTSF solution.

ICLR Conference 2024 Conference Paper

Deep Orthogonal Hypersphere Compression for Anomaly Detection

  • Yunhe Zhang 0001
  • Yan Sun
  • Jinyu Cai
  • Jicong Fan 0001

Many well-known and effective anomaly detection methods assume that a reasonable decision boundary has a hypersphere shape, which however is difficult to obtain in practice and is not sufficiently compact, especially when the data are in high-dimensional spaces. In this paper, we first propose a novel deep anomaly detection model that improves the original hypersphere learning through an orthogonal projection layer, which ensures that the training data distribution is consistent with the hypersphere hypothesis, thereby increasing the true positive rate and decreasing the false negative rate. Moreover, we propose a bi-hypersphere compression method to obtain a hyperspherical shell that yields a more compact decision region than a hyperball, which is demonstrated theoretically and numerically. The proposed methods are not confined to common datasets such as image and tabular data, but are also extended to a more challenging but promising scenario, graph-level anomaly detection, which learns graph representation with maximum mutual information between the substructure and global structure features while exploring orthogonal single- or bi-hypersphere anomaly decision boundaries. The numerical and visualization results on benchmark datasets demonstrate the superiority of our methods in comparison to many baselines and state-of-the-art methods.

ICLR Conference 2024 Conference Paper

MMD Graph Kernel: Effective Metric Learning for Graphs via Maximum Mean Discrepancy

  • Yan Sun
  • Jicong Fan 0001

This paper focuses on graph metric learning. First, we present a class of maximum mean discrepancy (MMD) based graph kernels, called MMD-GK. These kernels are computed by applying MMD to the node representations of two graphs with message-passing propagation. Secondly, we provide a class of deep MMD-GKs that are able to learn graph kernels and implicit graph features adaptively in an unsupervised manner. Thirdly, we propose a class of supervised deep MMD-GKs that are able to utilize label information of graphs and hence yield more discriminative metrics. Besides the algorithms, we provide theoretical analysis for the proposed methods. The proposed methods are evaluated in comparison to many baselines such as graph kernels and graph neural networks in the tasks of graph clustering and graph classification. The numerical results demonstrate the effectiveness and superiority of our methods.
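The basic ingredient of MMD-GK, applying MMD to two sets of (propagated) node representations, can be sketched with an RBF kernel and the biased V-statistic estimator (a minimal illustration; the paper's kernel choices and message-passing propagation are not reproduced here):

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared MMD between sample sets X and Y under an RBF kernel.

    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]; here X and Y play
    the role of node representations of two graphs. Uses the biased
    V-statistic estimate (diagonal terms included).
    """
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Identical embedding sets give an MMD of exactly zero, and the value grows as the two graphs' node representations separate, which is the similarity signal a kernel built on MMD exploits.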

TMLR Journal 2024 Journal Article

Visual Prompt Based Personalized Federated Learning

  • Guanghao Li
  • Wansen Wu
  • Yan Sun
  • Li Shen
  • Baoyuan Wu
  • Dacheng Tao

As a popular paradigm of distributed learning, personalized federated learning (PFL) allows personalized models to improve generalization ability and robustness by utilizing knowledge from all distributed clients. Most existing PFL algorithms tackle personalization in a model-centric way, such as personalized layer partition, model regularization, and model interpolation, which all fail to take into account the data characteristics of distributed clients. In this paper, we propose a novel PFL framework for image classification tasks, dubbed pFedPT, that leverages personalized visual prompts to implicitly represent local data distribution information of clients and provides that information to the aggregation model to help with classification tasks. Specifically, in each round of pFedPT training, each client generates a local personalized prompt related to local data distribution. Then, the local model is trained on the input composed of raw data and a visual prompt to learn the distribution information contained in the prompt. During model testing, the aggregated model obtains client-specific knowledge of the data distributions based on the prompts, which can be seen as an adaptive fine-tuning of the aggregation model to improve model performances on different clients. Furthermore, the visual prompt can be added as an orthogonal method to implement personalization on the client for existing FL methods to boost their performance. Experiments on the CIFAR10 and CIFAR100 datasets show that pFedPT outperforms several state-of-the-art (SOTA) PFL algorithms by a large margin in various settings. The code is available at: https://github.com/hkgdifyu/pFedPT.

NeurIPS Conference 2023 Conference Paper

Active Negative Loss Functions for Learning with Noisy Labels

  • Xichen Ye
  • Xiaoqiang Li
  • Songmin Dai
  • Tong Liu
  • Yan Sun
  • Weiqin Tong

Robust loss functions are essential for training deep neural networks in the presence of noisy labels. Some robust loss functions use Mean Absolute Error (MAE) as a necessary component. For example, the recently proposed Active Passive Loss (APL) uses MAE as its passive loss function. However, MAE treats every sample equally, slows down convergence and can make training difficult. In this work, we propose a new class of theoretically robust passive loss functions different from MAE, namely Normalized Negative Loss Functions (NNLFs), which focus more on memorized clean samples. By replacing the MAE in APL with our proposed NNLFs, we improve APL and propose a new framework called Active Negative Loss (ANL). Experimental results on benchmark and real-world datasets demonstrate that the new set of loss functions created by our ANL framework can outperform state-of-the-art methods. The code is available at https://github.com/Virusdoll/Active-Negative-Loss.

ICML Conference 2023 Conference Paper

Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape

  • Yan Sun
  • Li Shen 0008
  • Shixiang Chen
  • Liang Ding 0006
  • Dacheng Tao

In federated learning (FL), a cluster of local clients is coordinated by the global server to cooperatively train one model with privacy protection. Due to the multiple local updates and the isolated non-iid datasets, clients are prone to overfit to their own optima, which deviate substantially from the global objective and significantly undermine performance. Most previous works focus only on enhancing the consistency between the local and global objectives to alleviate this prejudicial client drift from the optimization perspective, and their performance deteriorates markedly under high heterogeneity. In this work, we propose a novel and general algorithm, FedSMOO, that jointly considers the optimization and generalization targets to efficiently improve performance in FL. Concretely, FedSMOO adopts a dynamic regularizer to steer the local optima towards the global objective, which is meanwhile revised by the global Sharpness Aware Minimization (SAM) optimizer to search for consistent flat minima. Our theoretical analysis indicates that FedSMOO achieves a fast $\mathcal{O}(1/T)$ convergence rate with a low generalization bound. Extensive numerical studies are conducted on real-world datasets to verify its efficiency and generality.
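FedSMOO builds on Sharpness Aware Minimization; the generic SAM ascent step (not FedSMOO's full dynamic-regularizer variant) perturbs the weights toward the locally worst point in a $\rho$-ball before evaluating the descent gradient:

```python
import numpy as np

def sam_perturb(w, grad, rho=0.05):
    """Generic SAM ascent step: w + rho * grad / ||grad||.

    The descent gradient is then computed at this perturbed point,
    biasing training toward flat minima. A small epsilon guards
    against a zero-norm gradient.
    """
    n = np.linalg.norm(grad)
    return w + rho * grad / (n + 1e-12)
```

In FL variants such as FedSMOO and DFedSAM, each client applies this kind of perturbation locally so the aggregated model sits in a region of uniformly low loss.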

ICLR Conference 2023 Conference Paper

FedSpeed: Larger Local Interval, Less Communication Round, and Higher Generalization Accuracy

  • Yan Sun
  • Li Shen 0008
  • Tiansheng Huang
  • Liang Ding 0006
  • Dacheng Tao

Federated learning (FL) is an emerging distributed machine learning framework which jointly trains a global model via a large number of local devices with data privacy protections. Its performance suffers from the non-vanishing biases introduced by inconsistent local optima and the rugged client drift caused by local over-fitting. In this paper, we propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems. Concretely, FedSpeed applies a prox-correction term to the current local updates to efficiently reduce the biases introduced by the prox-term, a necessary regularizer for maintaining strong local consistency. Furthermore, FedSpeed merges the vanilla stochastic gradient with a perturbation computed from an extra gradient ascent step in the neighborhood, thereby alleviating local over-fitting. Our theoretical analysis indicates that the convergence rate is related to both the communication rounds $T$ and local intervals $K$, with a tighter upper bound $\mathcal{O}(\frac{1}{T})$ if $K=\mathcal{O}(T)$. Moreover, we conduct extensive experiments on real-world datasets to demonstrate the efficiency of our proposed FedSpeed, which converges significantly faster than several baselines including FedAvg, FedProx, FedCM, FedAdam, SCAFFOLD, FedDyn and FedADMM, and achieves state-of-the-art (SOTA) performance in general FL experimental settings.

TMLR Journal 2023 Journal Article

Fusion of Global and Local Knowledge for Personalized Federated Learning

  • Tiansheng Huang
  • Li Shen
  • Yan Sun
  • Weiwei Lin
  • Dacheng Tao

Personalized federated learning, as a variant of federated learning, trains customized models for clients using their heterogeneously distributed data. However, it is still inconclusive about how to design personalized models with better representation of shared global knowledge and personalized pattern. To bridge the gap, we in this paper explore personalized models with low-rank and sparse decomposition. Specifically, we employ proper regularization to extract a low-rank global knowledge representation (GKR), so as to distill global knowledge into a compact representation. Subsequently, we employ a sparse component over the obtained GKR to fuse the personalized pattern into the global knowledge. As a solution, we propose a two-stage proximal-based algorithm named \textbf{Fed}erated learning with mixed \textbf{S}parse and \textbf{L}ow-\textbf{R}ank representation (FedSLR) to efficiently search for the mixed models. Theoretically, under proper assumptions, we show that the GKR trained by FedSLR can at least sub-linearly converge to a stationary point of the regularized problem, and that the sparse component being fused can converge to its stationary point under proper settings. Extensive experiments also demonstrate the superior empirical performance of FedSLR. Moreover, FedSLR reduces the number of parameters, and lowers the down-link communication complexity, which are all desirable for federated learning algorithms. Source code is available in \url{https://github.com/huangtiansheng/fedslr}.

JBHI Journal 2023 Journal Article

GCCN: Graph Capsule Convolutional Network for Progressive Mild Cognitive Impairment Prediction and Pathogenesis Identification Based on Imaging Genetic Data

  • Junliang Shang
  • Qi Zou
  • Qianqian Ren
  • Boxin Guan
  • Feng Li
  • Jin-Xing Liu
  • Yan Sun

In this study, we proposed a novel method called the graph capsule convolutional network (GCCN) to predict the progression from mild cognitive impairment to dementia and identify its pathogenesis. First, we proposed a novel risk gene discovery component to indirectly target genes with higher interactions with others. These risk genes and brain regions were collected as nodes to construct heterogeneous pathogenic information association graphs. Second, the graph capsules were established by projecting heterogeneous pathogenic information into a set of disentangled latent components. The orientation and length of capsules are representations of the format and intensity of pathogenic information. Third, a graph capsule convolutional network was used to model the information flows among pathogenic factors and elaborate the convergence of primary capsules to advanced capsules. The advanced capsule is a concept that organizes pathogenic information based on its consistency, and the synergistic effects of advanced capsules directed the development of the disease. Finally, discriminative pathogenic information flows were captured by a straightforward built-in interpretation mechanism, i.e., the dynamic routing mechanism, and applied to the identification of pathogenesis. Experiments show that GCCN significantly outperforms existing methods on public datasets. Further experiments show that the pathogenic factors identified by GCCN are evidential and closely related to progressive mild cognitive impairment.

ICML Conference 2023 Conference Paper

Improving the Model Consistency of Decentralized Federated Learning

  • Yifan Shi
  • Li Shen 0008
  • Kang Wei 0004
  • Yan Sun
  • Bo Yuan 0003
  • Xueqian Wang 0001
  • Dacheng Tao

To mitigate the privacy leakages and communication burdens of Federated Learning (FL), decentralized FL (DFL) discards the central server and each client only communicates with its neighbors in a decentralized communication network. However, existing DFL suffers from high inconsistency among local clients, which results in severe distribution shift and inferior performance compared with centralized FL (CFL), especially on heterogeneous data or sparse communication topologies. To alleviate this issue, we propose two DFL algorithms named DFedSAM and DFedSAM-MGS to improve the performance of DFL. Specifically, DFedSAM leverages gradient perturbation to generate local flat models via Sharpness Aware Minimization (SAM), which searches for models with uniformly low loss values. DFedSAM-MGS further boosts DFedSAM by adopting Multiple Gossip Steps (MGS) for better model consistency, which accelerates the aggregation of local flat models and better balances communication complexity and generalization. Theoretically, we present improved convergence rates $\small \mathcal{O}\big(\frac{1}{\sqrt{KT}}+\frac{1}{T}+\frac{1}{K^{1/2}T^{3/2}(1-\lambda)^2}\big)$ and $\small \mathcal{O}\big(\frac{1}{\sqrt{KT}}+\frac{1}{T}+\frac{\lambda^Q+1}{K^{1/2}T^{3/2}(1-\lambda^Q)^2}\big)$ in non-convex setting for DFedSAM and DFedSAM-MGS, respectively, where $1-\lambda$ is the spectral gap of gossip matrix and $Q$ is the number of MGS. Empirically, our methods can achieve competitive performance compared with CFL methods and outperform existing DFL methods.

NeurIPS Conference 2023 Conference Paper

Sparse Deep Learning for Time Series Data: Theory and Applications

  • Mingxuan Zhang
  • Yan Sun
  • Faming Liang

Sparse deep learning has become a popular technique for improving the performance of deep neural networks in areas such as uncertainty quantification, variable selection, and large-scale network compression. However, most existing research has focused on problems where the observations are independent and identically distributed (i.i.d.), and there has been little work on the problems where the observations are dependent, such as time series data and sequential data in natural language processing. This paper aims to address this gap by studying the theory for sparse deep learning with dependent data. We show that sparse recurrent neural networks (RNNs) can be consistently estimated, and their predictions are asymptotically normally distributed under appropriate assumptions, enabling the prediction uncertainty to be correctly quantified. Our numerical results show that sparse deep learning outperforms state-of-the-art methods, such as conformal predictions, in prediction uncertainty quantification for time series data. Furthermore, our results indicate that the proposed method can consistently identify the autoregressive order for time series data and outperform existing methods in large-scale model compression. Our proposed method has important practical implications in fields such as finance, healthcare, and energy, where both accurate point estimates and prediction uncertainty quantification are of concern.

NeurIPS Conference 2023 Conference Paper

Understanding How Consistency Works in Federated Learning via Stage-wise Relaxed Initialization

  • Yan Sun
  • Li Shen
  • Dacheng Tao

Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model via stage-wise local training processes on heterogeneous datasets. Previous works have observed that FL suffers from the "client-drift" problem, which is caused by the inconsistent optima across local clients. However, a solid theoretical analysis explaining the impact of this local inconsistency is still lacking. To alleviate the negative impact of "client drift" and explore its substance in FL, in this paper we first design an efficient FL algorithm, FedInit, which employs a personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state in the direction opposite to the latest local state. This relaxed initialization helps to revise the local divergence and enhance the level of local consistency. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce excess risk analysis and study the divergence term to investigate the test error of the proposed FedInit method. Our studies show that, on non-convex objectives, the optimization error is not sensitive to this local inconsistency, which instead mainly affects the generalization error bound in FedInit. Extensive experiments are conducted to validate this conclusion. Our proposed FedInit achieves state-of-the-art (SOTA) results compared to several advanced benchmarks without any additional cost. Meanwhile, stage-wise relaxed initialization can also be incorporated into current advanced algorithms to achieve higher performance in the FL paradigm.
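The relaxed initialization rule described above admits a one-line sketch: the client starts the new stage at the global state shifted away from its own latest local state. The coefficient `beta` and the function name below are hypothetical, intended only to illustrate the rule as stated in the abstract, not to reproduce the paper's code.

```python
import numpy as np

def fedinit_local_init(w_global, w_local_prev, beta=0.1):
    """Relaxed initialization: begin local training at the global state,
    moved in the direction opposite the client's latest local state
    (equivalently, away from where this client drifted last stage)."""
    return w_global - beta * (w_local_prev - w_global)

# Example: a client that previously drifted to [2, 0] relative to the
# global state [1, 1] starts the next stage pushed the other way.
w_init = fedinit_local_init(np.array([1.0, 1.0]), np.array([2.0, 0.0]))
```

With `beta = 0` this reduces to standard FedAvg-style initialization at the global state, so the relaxation can be grafted onto existing FL algorithms, as the abstract notes.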

NeurIPS Conference 2022 Conference Paper

Nonlinear Sufficient Dimension Reduction with a Stochastic Neural Network

  • Siqi Liang
  • Yan Sun
  • Faming Liang

Sufficient dimension reduction is a powerful tool for extracting the core information hidden in high-dimensional data and has potentially many important applications in machine learning tasks. However, the existing nonlinear sufficient dimension reduction methods often lack the scalability necessary for dealing with large-scale data. We propose a new type of stochastic neural network under a rigorous probabilistic framework and show that it can be used for sufficient dimension reduction for large-scale data. The proposed stochastic neural network is trained using an adaptive stochastic gradient Markov chain Monte Carlo algorithm, whose convergence is rigorously studied in the paper as well. Through extensive experiments on real-world classification and regression problems, we show that the proposed method compares favorably with the existing state-of-the-art sufficient dimension reduction methods and is computationally more efficient for large-scale data.

NeurIPS Conference 2021 Conference Paper

Sparse Deep Learning: A New Framework Immune to Local Traps and Miscalibration

  • Yan Sun
  • Wenjun Xiong
  • Faming Liang

Deep learning has powered recent successes of artificial intelligence (AI). However, the deep neural network, as the basic model of deep learning, has suffered from issues such as local traps and miscalibration. In this paper, we provide a new framework for sparse deep learning that addresses these issues in a coherent way. In particular, we lay down a theoretical foundation for sparse deep learning and propose prior annealing algorithms for learning sparse neural networks. The former successfully tames the sparse deep neural network into the framework of statistical modeling, enabling prediction uncertainty to be correctly quantified. The latter is asymptotically guaranteed to converge to the global optimum, ensuring the validity of downstream statistical inference. Numerical results indicate the superiority of the proposed method compared to existing ones.

AAAI Conference 2018 Conference Paper

Understanding Image Impressiveness Inspired by Instantaneous Human Perceptual Cues

  • Jufeng Yang
  • Yan Sun
  • Jie Liang
  • Yong-Liang Yang
  • Ming-Ming Cheng

With the explosion of visual information nowadays, millions of digital images are available to users. How to efficiently explore a large set of images and retrieve useful information thus becomes extremely important. Unfortunately, only some of the images can impress the user at first glance. Others that make little sense in human perception are often discarded, while still costing valuable time and space. Therefore, it is important to identify these two kinds of images, both to relieve the load of online repositories and to accelerate the information retrieval process. However, most existing image properties, e.g., memorability and popularity, are based on repeated human interactions, which limits the research and application of evaluating image quality in terms of instantaneous impression. In this paper, we propose a novel image property, called impressiveness, that measures how images impress people after only brief contact. This is based on an impression-driven model inspired by a number of important human perceptual cues. To achieve this, we first collect three datasets in various domains, labeled according to the instantaneous sensation of the annotators. Then we investigate the impressiveness property via six established human perceptual cues as well as the corresponding features from the pixel to the semantic level. Subsequently, we verify the consistency of the impressiveness property, which can be quantitatively measured by multiple visual representations, and evaluate their latent relationships. Finally, we apply the proposed impressiveness property to rank images for an efficient image recommendation system.