Arrow Research search

Author name cluster

Haibo Yang

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact matches on the author name; it is not a full identity-disambiguation profile.

9 papers
1 author row

Possible papers


NeurIPS Conference 2025 Conference Paper

Exact and Linear Convergence for Federated Learning under Arbitrary Client Participation is Attainable

  • Bicheng Ying
  • Zhe Li
  • Haibo Yang

This work tackles the fundamental challenges in Federated Learning (FL) posed by arbitrary client participation and data heterogeneity, prevalent characteristics in practical FL settings. It is well established that popular FedAvg-style algorithms struggle with exact convergence and can suffer from slow convergence rates, since a decaying learning rate is required to mitigate these scenarios. To address these issues, we introduce stochastic matrices and the corresponding time-varying graphs as a novel modeling tool to accurately capture the dynamics of arbitrary client participation and the local update procedure. Leveraging this approach, we offer a fresh perspective on designing FL algorithms, provide a rigorous quantitative analysis of the limitations inherent in the FedAvg algorithm, and present FOCUS (Federated Optimization with Exact Convergence via Push-pull Strategy), a provably convergent algorithm designed to overcome the two aforementioned challenges. More specifically, we provide a rigorous proof that FOCUS achieves exact convergence at a linear rate regardless of arbitrary client participation, establishing it as the first work to demonstrate this significant result.
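The stochastic-matrix viewpoint in this abstract can be made concrete with a small sketch. The toy construction below (one round of participation modeled as a row-stochastic mixing matrix; the helper names and constants are my own choices, not the FOCUS algorithm itself) shows how a round with absent clients becomes a time-varying graph operation.

```python
# Toy sketch (not the FOCUS algorithm): one round of arbitrary client
# participation modeled as a row-stochastic mixing matrix over the server
# (index 0) and the clients, in the spirit of the time-varying-graph view.

def participation_matrix(n_clients, active):
    """Row-stochastic matrix for one round: the server averages with the
    active clients; inactive clients simply keep their current state."""
    n = n_clients + 1                         # index 0 is the server
    W = [[0.0] * n for _ in range(n)]
    group = [0] + [i + 1 for i in active]     # server plus active clients
    w = 1.0 / len(group)
    for i in group:
        for j in group:
            W[i][j] = w                       # uniform averaging inside the group
    for i in range(n):
        if i not in group:
            W[i][i] = 1.0                     # absent client: identity row
    return W

def mix(W, states):
    """Apply one mixing step: states <- W @ states."""
    return [sum(W[i][j] * states[j] for j in range(len(states)))
            for i in range(len(W))]

states = [0.0, 1.0, 2.0, 3.0]                 # server + 3 client models (scalars)
W = participation_matrix(3, active=[0, 2])    # clients 0 and 2 participate this round
states = mix(W, states)                       # participants average; client 1 is untouched
```

Each round then corresponds to one (generally different) mixing matrix, which is exactly the time-varying-graph abstraction the abstract describes.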

IJCAI Conference 2025 Conference Paper

FAST: A Lightweight Mechanism Unleashing Arbitrary Client Participation in Federated Learning

  • Zhe Li
  • Seyedsina Nabavirazavi
  • Bicheng Ying
  • Sitharama Iyengar
  • Haibo Yang

Federated Learning (FL) provides a flexible distributed platform where numerous clients with high data and system heterogeneity can collaborate to learn a model. While previous research has shown that FL can handle diverse data, it often assumes idealized participation conditions. In practice, real-world factors make it hard to predict or design individual client participation. This complexity results in an unknown participation pattern: arbitrary client participation (ACP). Hence, the key open problem is to understand the impact of client participation and to develop a lightweight mechanism that supports ACP in FL. In this paper, we first empirically investigate the influence of client participation in FL, revealing that FL algorithms are adversely impacted by ACP. To alleviate this impact, we propose a lightweight solution, Federated Average with Snapshot (FAST), that supports almost arbitrary client participation and can seamlessly integrate with other classic FL algorithms. Specifically, FAST requires clients to take a snapshot once in a while, which permits ACP for the majority of the training process. We prove that the convergence rates of FAST in the non-convex and strongly-convex cases match those under ideal client participation. Furthermore, we introduce an adaptive strategy that dynamically configures the snapshot frequency, tailored to diverse FL systems. Extensive experiments show that FAST significantly improves performance under ACP and high data heterogeneity.
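The snapshot mechanism can be illustrated with a deliberately simplified toy: scalar quadratic client losses, uniformly random participation, and a forced full-participation "snapshot" round at a fixed interval. All names and constants below are my own assumptions, not the paper's exact update rule or experiments.

```python
# Toy sketch of the snapshot idea (not FAST's exact procedure): FedAvg with
# arbitrary participation, but every `snap_every` rounds all clients report.
import random

def fast_fedavg(targets, rounds=200, lr=0.1, snap_every=5, seed=0):
    """Toy FedAvg on scalar quadratics f_i(x) = (x - targets[i])^2 / 2."""
    rng = random.Random(seed)
    x = 0.0
    for t in range(rounds):
        if t % snap_every == 0:
            active = list(range(len(targets)))      # snapshot round: everyone
        else:
            k = rng.randint(1, len(targets))        # arbitrary participation
            active = rng.sample(range(len(targets)), k)
        grad = sum(x - targets[i] for i in active) / len(active)
        x -= lr * grad
    return x

model = fast_fedavg([1.0, 2.0, 9.0])  # stationary point of the mean loss is 4.0
```

The periodic snapshot rounds anchor the model near the true mean objective even though most rounds see only an arbitrary subset of clients.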

AAAI Conference 2025 Conference Paper

PSMGD: Periodic Stochastic Multi-Gradient Descent for Fast Multi-Objective Optimization

  • Mingjing Xu
  • Peizhong Ju
  • Jia Liu
  • Haibo Yang

Multi-objective optimization (MOO) lies at the core of many machine learning (ML) applications that involve multiple, potentially conflicting objectives (e.g., multi-task learning and multi-objective reinforcement learning, among many others). Despite the long history of MOO, recent years have witnessed a surge of interest within the ML community in developing gradient manipulation algorithms for MOO, thanks to the availability of gradient information in many ML problems. However, existing gradient manipulation methods for MOO often suffer from long training times, primarily due to the need to compute dynamic weights by solving an additional optimization problem that determines a common descent direction capable of decreasing all objectives simultaneously. To address this challenge, we propose a new and efficient algorithm called Periodic Stochastic Multi-Gradient Descent (PSMGD) to accelerate MOO. PSMGD is motivated by the key observation that dynamic weights across objectives exhibit only small changes under minor updates over short intervals during the optimization process. Consequently, our PSMGD algorithm is designed to periodically compute these dynamic weights and reuse them, thereby effectively reducing the computational overhead. Theoretically, we prove that PSMGD can achieve state-of-the-art convergence rates for strongly-convex, general convex, and non-convex functions. Additionally, we introduce a new computational complexity measure, termed backpropagation complexity, and demonstrate that PSMGD achieves an objective-independent backpropagation complexity. Through extensive experiments, we verify that PSMGD provides comparable or superior performance to state-of-the-art MOO algorithms while significantly reducing training time.
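For two objectives, the MGDA-style min-norm weight has a closed form, which makes the periodic-reuse idea easy to sketch. The toy problem, recomputation period, and helper names below are my own choices, not PSMGD's exact procedure.

```python
# Sketch of periodic dynamic-weight reuse (2-objective case; not PSMGD verbatim).

def min_norm_weight(g1, g2):
    """lam in [0, 1] minimizing ||lam*g1 + (1-lam)*g2||^2 (2-objective MGDA)."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5
    lam = sum(-d * b for d, b in zip(diff, g2)) / denom
    return max(0.0, min(1.0, lam))

a, b = (0.0, 0.0), (2.0, 0.0)   # objectives ||x - a||^2 / 2 and ||x - b||^2 / 2
x, lam, solves = [1.0, 3.0], 0.5, 0
for t in range(300):
    g1 = [xi - ai for xi, ai in zip(x, a)]
    g2 = [xi - bi for xi, bi in zip(x, b)]
    if t % 10 == 0:              # recompute the dynamic weight only periodically
        lam = min_norm_weight(g1, g2)
        solves += 1              # count how many weight-solves we actually did
    d = [lam * u + (1 - lam) * v for u, v in zip(g1, g2)]
    x = [xi - 0.1 * di for xi, di in zip(x, d)]
```

After 300 steps the iterate sits on the Pareto segment between the two minimizers, yet only 30 weight-solves were needed instead of 300, which is the overhead reduction the abstract describes.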

NeurIPS Conference 2025 Conference Paper

Towards Straggler-Resilient Split Federated Learning: An Unbalanced Update Approach

  • Dandan Liang
  • Jianing Zhang
  • Evan Chen
  • Zhe Li
  • Rui Li
  • Haibo Yang

Split Federated Learning (SFL) enables scalable training on edge devices by combining the parallelism of Federated Learning (FL) with the computational offloading of Split Learning (SL). Despite its great success, SFL suffers significantly from the well-known straggler issue in distributed learning systems. This problem is exacerbated by the dependency between the Split Server and the clients: the Split Server's model update relies on receiving activations from the clients. This synchronization requirement introduces significant latency, making stragglers a critical bottleneck to the scalability and efficiency of the system. To mitigate this problem, we propose *MU-SplitFed*, a straggler-resilient SFL algorithm that decouples training progress from straggler delays via a simple yet effective unbalanced update mechanism. By enabling the server to perform $\tau$ local updates per client round, *MU-SplitFed* achieves a convergence rate of $\mathcal{O}(\sqrt{d/(\tau T)})$, a linear reduction in communication rounds by a factor of $\tau$. Experiments demonstrate that *MU-SplitFed* consistently outperforms baseline methods in the presence of stragglers and effectively mitigates their impact through adaptive tuning of $\tau$.
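The unbalanced-update mechanism ($\tau$ server-side updates per communication round) can be sketched on a toy scalar problem; `unbalanced_rounds` and all constants are my own illustration, not MU-SplitFed itself.

```python
# Toy illustration (not MU-SplitFed): a scalar "server-side" parameter trained
# with tau server updates per communication round, so the number of rounds
# needed to converge shrinks roughly by a factor of tau.

def unbalanced_rounds(tau, target=5.0, lr=0.2, tol=1e-3):
    """Communication rounds until the server parameter is within tol of target."""
    w, rounds = 0.0, 0
    while abs(w - target) > tol:
        rounds += 1                  # one exchange with a (possibly slow) client
        for _ in range(tau):         # tau server-side updates per round
            w -= lr * (w - target)   # toy quadratic loss stands in for the cut layer
    return rounds

r_tau1 = unbalanced_rounds(tau=1)
r_tau4 = unbalanced_rounds(tau=4)    # fewer rounds: progress decoupled from clients
```

Because each straggler-bound exchange now carries $\tau$ server updates, the round count drops almost linearly in $\tau$, mirroring the $\mathcal{O}(\sqrt{d/(\tau T)})$ round reduction stated in the abstract.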

NeurIPS Conference 2023 Conference Paper

Federated Multi-Objective Learning

  • Haibo Yang
  • Zhuqing Liu
  • Jia Liu
  • Chaosheng Dong
  • Michinari Momma

In recent years, multi-objective optimization (MOO) has emerged as a foundational problem underpinning many multi-agent multi-task learning applications. However, existing algorithms in the MOO literature remain limited to centralized learning settings, which do not satisfy the distributed nature and data privacy needs of such multi-agent multi-task learning applications. This motivates us to propose a new federated multi-objective learning (FMOL) framework, in which multiple clients distributively and collaboratively solve an MOO problem while keeping their training data private. Notably, our FMOL framework allows a different set of objective functions across different clients to support a wide range of applications, which advances and generalizes the MOO formulation to the federated learning paradigm for the first time. For this FMOL framework, we propose two new federated multi-objective optimization (FMOO) algorithms called federated multi-gradient descent averaging (FMGDA) and federated stochastic multi-gradient descent averaging (FSMGDA). Both algorithms allow local updates to significantly reduce communication costs, while achieving the same convergence rates as their algorithmic counterparts in single-objective federated learning. Our extensive experiments also corroborate the efficacy of the proposed FMOO algorithms.
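The FMGDA idea (local updates per client, min-norm combination at the server) can be sketched in the two-objective case, where the MGDA weight has a closed form. The quadratic objectives, constants, and helper names below are mine, not the paper's algorithm verbatim.

```python
# Hedged sketch in the style of FMGDA: two clients, one quadratic objective each;
# clients run local steps, the server combines the returned updates min-norm style.

def local_update(x, target, K=5, lr=0.1):
    """Client: K local steps on f(x) = ||x - target||^2 / 2; return total movement."""
    y = list(x)
    for _ in range(K):
        y = [yi - lr * (yi - ti) for yi, ti in zip(y, target)]
    return [xi - yi for xi, yi in zip(x, y)]  # accumulated descent direction

def min_norm_weight(d1, d2):
    """lam in [0, 1] minimizing ||lam*d1 + (1-lam)*d2||^2 (2-objective MGDA)."""
    diff = [a - b for a, b in zip(d1, d2)]
    denom = sum(t * t for t in diff)
    if denom == 0.0:
        return 0.5
    lam = sum(-t * b for t, b in zip(diff, d2)) / denom
    return max(0.0, min(1.0, lam))

x = [1.0, 3.0]
for _ in range(100):                          # communication rounds
    d1 = local_update(x, (0.0, 0.0))          # client 1's private objective
    d2 = local_update(x, (2.0, 0.0))          # client 2's private objective
    lam = min_norm_weight(d1, d2)             # common descent direction at server
    x = [xi - (lam * u + (1 - lam) * v) for xi, u, v in zip(x, d1, d2)]
```

Each client sees only its own objective and data, matching the abstract's point that different clients may hold different objective sets.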

JBHI Journal 2023 Journal Article

TW-Net: Transformer Weighted Network for Neonatal Brain MRI Segmentation

  • Shengjie Zhang
  • Bohan Ren
  • Ziqi Yu
  • Haibo Yang
  • Xiaoyang Han
  • Xiang Chen
  • Yuan Zhou
  • Dinggang Shen

Accurate neonatal brain MRI segmentation is valuable for investigating brain growth patterns and tracking the progression of neurodevelopmental disorders. However, segmenting neonatal brain structures with intensity-based methods is challenging because of the small contrast differences between brain regions caused by the inherent myelination process. Although convolutional neural networks offer the potential to segment brain structures in an intensity-independent manner, they suffer from a lack of in-plane long-range dependency, which is essential for the segmentation. To solve this problem, we propose a novel Transformer-Weighted network (TW-Net) to incorporate in-plane long-range dependency information. TW-Net employs a conventional encoder-decoder architecture with a Transformer module in the middle. The Transformer module uses a rotate-and-flip layer to better calculate the similarity between two patches in a slice, leveraging the similar geometrical and texture patterns within brain structures. In addition, a deep supervision module and squeeze-and-excitation blocks are introduced to incorporate boundary information of brain structures. Compared with state-of-the-art deep learning algorithms, TW-Net outperforms these methods on multiple-label tasks in 2D and 2.5D configurations on two independent public datasets, demonstrating that TW-Net is a promising method for neonatal brain MRI segmentation.
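One plausible reading of the rotate-and-flip idea, reduced to a toy: compare two square patches under all eight dihedral transforms and keep the best match. TW-Net's actual layer operates on Transformer feature maps, so this is only an illustration of the symmetry trick, with all names and values my own.

```python
# Toy illustration of rotate-and-flip patch matching (my own minimal reading
# of the abstract; the real TW-Net layer works inside a Transformer module).

def rotations_and_flips(p):
    """All 8 dihedral transforms of a square patch given as a list of rows."""
    def rot90(q):                              # rotate 90 degrees counter-clockwise
        return [list(r) for r in zip(*q)][::-1]
    out = []
    for q in (p, [row[::-1] for row in p]):    # identity and horizontal flip
        for _ in range(4):
            out.append(q)
            q = rot90(q)
    return out

def best_similarity(p, q):
    """Max dot-product similarity of q against any transform of p."""
    def dot(a, b):
        return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return max(dot(t, q) for t in rotations_and_flips(p))

patch = [[1, 0], [0, 0]]
rotated = [[0, 0], [0, 1]]    # the same patch rotated by 180 degrees
```

Here a plain dot product between `patch` and `rotated` would be zero, but the transform-invariant similarity recovers the match, which is the kind of repeated-pattern signal the abstract attributes to brain structures.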

NeurIPS Conference 2022 Conference Paper

SAGDA: Achieving $\mathcal{O}(\epsilon^{-2})$ Communication Complexity in Federated Min-Max Learning

  • Haibo Yang
  • Zhuqing Liu
  • Xin Zhang
  • Jia Liu

Federated min-max learning has received increasing attention in recent years thanks to its wide range of applications in various learning paradigms. As with conventional federated learning for empirical risk minimization problems, communication complexity emerges as one of the most critical concerns affecting the future prospects of federated min-max learning. To lower the communication complexity of federated min-max learning, a natural approach is to utilize the idea of infrequent communication (through multiple local updates), as in conventional federated learning. However, due to the more complicated inner-outer problem structure of federated min-max learning, theoretical understanding of communication complexity for federated min-max learning with infrequent communication remains very limited in the literature. This is particularly true for settings with non-i.i.d. datasets and partial client participation. To address this challenge, in this paper we propose a new algorithmic framework called stochastic sampling averaging gradient descent ascent (SAGDA), which i) assembles stochastic gradient estimators from randomly sampled clients as control variates and ii) leverages two learning rates on both server and client sides. We show that SAGDA achieves a linear speedup in terms of both the number of clients and the number of local update steps, which yields an $\mathcal{O}(\epsilon^{-2})$ communication complexity that is orders of magnitude lower than the state of the art. Interestingly, by noting that the standard federated stochastic gradient descent ascent (FSGDA) is in fact a control-variate-free special case of SAGDA, we immediately arrive at an $\mathcal{O}(\epsilon^{-2})$ communication complexity result for FSGDA. Therefore, through the lens of SAGDA, we also advance the current understanding of the communication complexity of the standard FSGDA method for federated min-max learning.
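The two-learning-rate structure (a client rate for local gradient-descent-ascent steps, a server rate applied to the averaged model change) can be sketched on a toy quadratic saddle problem. The control-variate component of SAGDA is omitted here, and the objectives and constants are my own choices; in this quadratic toy the client drift happens to cancel, which is not true in general.

```python
# Sketch of two-sided learning rates for federated min-max (not SAGDA verbatim).

def local_sgda(x, y, shift, steps=5, lr_c=0.1):
    """Client: local gradient descent-ascent on f = 0.5*(x-shift)^2 + x*y - 0.5*y^2."""
    for _ in range(steps):
        gx = (x - shift) + y           # descent direction in x
        gy = x - y                     # ascent direction in y
        x, y = x - lr_c * gx, y + lr_c * gy
    return x, y

shifts = [-1.0, 1.0, 3.0]              # heterogeneous clients; mean saddle is (0.5, 0.5)
x = y = 5.0
lr_s = 0.5                             # server-side learning rate
for _ in range(400):                   # communication rounds
    dx = dy = 0.0
    for s in shifts:
        xs, ys = local_sgda(x, y, s)
        dx += xs - x
        dy += ys - y
    x += lr_s * dx / len(shifts)       # server applies its own rate to the average
    y += lr_s * dy / len(shifts)
```

Decoupling the client and server rates is the structural choice the abstract highlights; the sketch simply shows where each rate enters the loop.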

NeurIPS Conference 2022 Conference Paper

Taming Fat-Tailed (“Heavier-Tailed” with Potentially Infinite Variance) Noise in Federated Learning

  • Haibo Yang
  • Peiwen Qiu
  • Jia Liu

In recent years, federated learning (FL) has emerged as an important distributed machine learning paradigm to collaboratively learn a global model with multiple clients while keeping data local and private. However, a key assumption in most existing works on the convergence analysis of FL algorithms is that the noise in stochastic first-order information has finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., "heavier-tailed" with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap by proposing an algorithmic framework called FAT-Clipping (federated averaging with two-sided learning rates and clipping), which contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). Specifically, for the largest $\alpha \in (1, 2]$ such that the fat-tailed noise in FL still has a bounded $\alpha$-moment, we show that the two variants achieve $\mathcal{O}((mT)^{\frac{2-\alpha}{\alpha}})$ and $\mathcal{O}((mT)^{\frac{1-\alpha}{3\alpha-2}})$ convergence rates in the strongly-convex and general non-convex settings, respectively, where $m$ and $T$ are the numbers of clients and communication rounds. Moreover, at the expense of more clipping operations than FAT-Clipping-PR, FAT-Clipping-PI further enjoys a linear speedup with respect to the number of local updates at each client and is lower-bound-matching (i.e., order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.
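The per-iteration vs per-round distinction can be sketched directly: clip every local stochastic gradient, or run the local steps raw and clip the whole round's model update once. The toy gradients, threshold, and helper names are mine; this shows only the clipping granularity, not the full algorithm or its rates.

```python
# Sketch of the two clipping granularities named in the abstract.

def clip(v, c):
    """Scale v down to Euclidean norm c if it exceeds c."""
    n = sum(x * x for x in v) ** 0.5
    return v if n <= c else [c * x / n for x in v]

def client_update_pi(x, grads, lr=0.1, c=1.0):
    """Per-iteration: clip every local stochastic gradient before stepping."""
    y = list(x)
    for g in grads:
        g = clip(g, c)
        y = [yi - lr * gi for yi, gi in zip(y, g)]
    return y

def client_update_pr(x, grads, lr=0.1, c=1.0):
    """Per-round: run raw local steps, then clip the whole model update once."""
    y = list(x)
    for g in grads:
        y = [yi - lr * gi for yi, gi in zip(y, g)]
    delta = clip([yi - xi for xi, yi in zip(x, y)], c)
    return [xi + di for xi, di in zip(x, delta)]

x0 = [0.0, 0.0]
heavy = [[100.0, 0.0], [0.5, 0.5], [0.5, -0.5]]  # one fat-tailed gradient spike
pi = client_update_pi(x0, heavy)                 # spike tamed at each step
pr = client_update_pr(x0, heavy)                 # spike tamed only at round end
```

Per-iteration clipping neutralizes the spike immediately (more clipping operations), while per-round clipping lets the spike dominate the local trajectory and only bounds the final update, which is the trade-off the abstract's two variants formalize.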

NeurIPS Conference 2021 Conference Paper

STEM: A Stochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

  • Prashant Khanduri
  • Pranay Sharma
  • Haibo Yang
  • Mingyi Hong
  • Jia Liu
  • Ketan Rajawat
  • Pramod Varshney

Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model using local data. Despite extensive research, for a generic non-convex FL problem it is not clear how to choose the WNs' and the server's update directions, the minibatch sizes, and the local update frequency so that the WNs use the minimum number of samples and communication rounds to achieve the desired solution. This work addresses the above question and considers a class of stochastic algorithms where the WNs perform a few local updates before communication. We show that when both the WNs' and the server's directions are chosen based on a certain stochastic momentum estimator, the algorithm requires $\tilde{\mathcal{O}}(\epsilon^{-3/2})$ samples and $\tilde{\mathcal{O}}(\epsilon^{-1})$ communication rounds to compute an $\epsilon$-stationary solution. To the best of our knowledge, this is the first FL algorithm that achieves such near-optimal sample and communication complexities simultaneously. Further, we show that there is a trade-off curve between local update frequencies and local minibatch sizes, on which the above sample and communication complexities can be maintained. Finally, we show that for the classical FedAvg (a.k.a. Local SGD, a momentum-less special case of STEM), a similar trade-off curve exists, albeit with worse sample and communication complexities. Our insights on this trade-off provide guidelines for choosing the four important design elements of FL algorithms, the update directions, minibatch sizes, and local update frequency, to achieve the best performance.
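The momentum estimator referred to here is STORM-style: it evaluates the fresh sample at both the new and the old iterate, so the recursion d_t = g(x_t; xi_t) + (1 - a) * (d_{t-1} - g(x_{t-1}; xi_t)) cancels most of the noise. A toy single-node sketch (noise model and constants are my own; STEM applies this on both the worker and server sides):

```python
# Toy sketch of a STORM-style momentum estimator (not STEM's full algorithm).
import random

def noisy_grad(x, xi):
    """Stochastic gradient of f(x) = x^2 with an additive noise sample xi."""
    return 2.0 * x + xi

rng = random.Random(0)
x, x_prev, d = 3.0, None, None
a, lr = 0.1, 0.05                      # momentum and step-size constants (my choices)
for _ in range(500):
    xi = rng.gauss(0.0, 1.0)           # one fresh sample, reused at both iterates
    if d is None:
        d = noisy_grad(x, xi)          # initialize the estimator
    else:
        d = noisy_grad(x, xi) + (1 - a) * (d - noisy_grad(x_prev, xi))
    x_prev = x
    x -= lr * d
```

Because the same sample xi appears at both iterates, the difference term removes the shared noise and only a fraction `a` of fresh noise enters per step, which is what buys the improved sample complexity without large minibatches.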