Author name cluster

Sunwoo Lee

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers (7)

AAAI 2026 Conference Paper

GEM: A Scale-Aware and Distribution-Sensitive Sparse Fine-Tuning Framework for Effective Downstream Adaptation

  • Sungmin Kang
  • Jisoo Kim
  • Salman Avestimehr
  • Sunwoo Lee

Parameter-efficient fine-tuning (PEFT) has become a popular way to adapt large pre-trained models to new tasks. Most PEFT methods update only a small subset of parameters while freezing the rest, avoiding redundant computation. Because they select updates by absolute magnitude, without regard to each parameter’s original scale, the resulting changes in model behavior can be minimal. In contrast, we maximize updates relative to each parameter’s scale, yielding more meaningful downstream adaptation. We propose Gradient-to-Weight Ratio and Entropy-guided Masking (GEM), a parameter scale-aware, distribution-sensitive sparse fine-tuning framework. GEM prioritizes parameters whose updates are significant in proportion to their initial pre-trained values. It also adaptively determines how many parameters to tune at each layer based on the entropy of parameter values, thereby making the most effective use of the computational budget in PEFT. Our empirical study demonstrates the efficacy of GEM on both general-domain tasks (GLUE and SuperGLUE) and domain-specific tasks (GSM8k and MBPP), achieving up to a 1.6% improvement in fine-tuning accuracy over full fine-tuning while updating only 0.1% of model parameters.
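A minimal sketch of the selection rule the abstract describes, assuming PyTorch tensors: rank each parameter by its gradient-to-weight ratio, and size each layer's share of the tuning budget by the entropy of its parameter magnitudes. The function name `gem_style_mask`, the entropy measure, and the budget split are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def gem_style_mask(params, grads, total_budget):
    """Pick which parameters to fine-tune: rank by gradient-to-weight
    ratio, with each layer's share of the budget set by the entropy of
    its parameter magnitudes (illustrative sketch, not GEM's exact rule)."""
    entropies = {}
    for name, p in params.items():
        probs = p.abs().flatten()
        probs = probs / probs.sum().clamp_min(1e-12)
        entropies[name] = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    total_entropy = max(sum(entropies.values()), 1e-12)

    masks = {}
    for name, p in params.items():
        # Layer budget proportional to its entropy share.
        k = max(1, int(total_budget * entropies[name] / total_entropy))
        # Prefer updates that are large relative to the pre-trained scale.
        ratio = (grads[name].abs() / p.abs().clamp_min(1e-8)).flatten()
        idx = torch.topk(ratio, min(k, ratio.numel())).indices
        mask = torch.zeros_like(ratio, dtype=torch.bool)
        mask[idx] = True
        masks[name] = mask.view_as(p)
    return masks
```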

TIST 2026 Journal Article

Multi-Metric Client Activation Method for Fast and Accurate Federated Learning

  • Jihyun Lim
  • Tuo Zhang
  • Sunwoo Lee

While unbiased gradient estimators ensure unbiased solutions in empirical risk minimization problems, they can significantly hinder optimization efficiency and generalization performance in strongly non-IID Federated Learning environments. Although recent studies have demonstrated promising applications of biased estimators, they typically focus only on convergence rates while overlooking generalization performance. We propose a novel multi-metric bias concept, quantified using both local loss and local gradient norm, along with a client activation method based on this bias concept. The proposed method prioritizes training on local datasets that better represent the global dataset, leading to faster convergence and improved generalization. Our extensive empirical study demonstrates that carefully injecting bias into client activation accelerates federated optimization, achieving a substantially improved validation accuracy within a given epoch budget. In representative machine learning benchmarks, our method achieves up to 12.7% higher accuracy than uniform random sampling and 2.5% higher accuracy than state-of-the-art biased client activation methods.
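A rough sketch of the multi-metric idea: blend each client's local loss and local gradient norm into a single score and sample clients in proportion to it. The weighting `alpha`, the min-max normalization, and the function name are assumptions for illustration; the paper's bias definition may differ.

```python
import numpy as np

def activate_clients(local_losses, local_grad_norms, num_active,
                     alpha=0.5, rng=None):
    """Sample clients with probability proportional to a score that
    blends local loss and local gradient norm (illustrative sketch)."""
    rng = rng or np.random.default_rng()

    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.ones_like(x)

    # Normalize each metric so neither dominates, then blend.
    score = alpha * minmax(local_losses) + (1 - alpha) * minmax(local_grad_norms)
    probs = score / score.sum()
    return rng.choice(len(probs), size=num_active, replace=False, p=probs)
```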

NeurIPS 2025 Conference Paper

Layer-wise Update Aggregation with Recycling for Communication-Efficient Federated Learning

  • Jisoo Kim
  • Sungmin Kang
  • Sunwoo Lee

High communication cost is a common performance bottleneck in Federated Learning (FL), which makes it less appealing in real-world applications. Many communication-efficient FL methods focus on discarding part of the model updates, mostly based on gradient magnitude. In this study, we find that recycling previous updates, rather than simply dropping them, more effectively reduces the communication cost while maintaining FL performance. We propose FedLUAR, a Layer-wise Update Aggregation with Recycling scheme for communication-efficient FL. We first define a useful metric that quantifies the extent to which the aggregated gradients influence the model parameter values in each layer. FedLUAR selects a few layers based on this metric and recycles their previous updates on the server side. Our extensive empirical study demonstrates that the update recycling scheme significantly reduces the communication cost while maintaining model accuracy. For example, our method achieves nearly the same AG News accuracy as FedAvg while reducing the communication cost to just 17%.
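A hedged sketch of server-side update recycling, assuming updates are pseudo-gradients subtracted from the global model: score each layer by how much its aggregated update moves the weights, and reuse the cached previous update for the least-affected layers. The influence metric shown (update norm over parameter norm) is an assumption standing in for the paper's exact metric.

```python
import torch

def select_recycle_layers(params, agg_updates, num_recycle):
    """Rank layers by how little the aggregated update would change the
    layer's parameters, and mark the lowest-ranked ones for recycling
    (illustrative sketch; FedLUAR's exact metric may differ)."""
    influence = {
        name: (agg_updates[name].norm()
               / params[name].norm().clamp_min(1e-12)).item()
        for name in params
    }
    # Layers whose fresh updates barely move the weights are cheapest to skip.
    return sorted(influence, key=influence.get)[:num_recycle]

def aggregate_with_recycling(params, agg_updates, prev_updates, num_recycle):
    recycle = set(select_recycle_layers(params, agg_updates, num_recycle))
    new_updates = {}
    for name in params:
        # Reuse the cached update for low-influence layers instead of
        # communicating a fresh one from the clients.
        new_updates[name] = prev_updates[name] if name in recycle else agg_updates[name]
        params[name] -= new_updates[name]
    return new_updates
```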

AAAI 2024 Conference Paper

FoX: Formation-Aware Exploration in Multi-Agent Reinforcement Learning

  • Yonghyeon Jo
  • Sunwoo Lee
  • Junghyuk Yeom
  • Seungyul Han

Recently, deep multi-agent reinforcement learning (MARL) has gained significant popularity due to its success in various cooperative multi-agent tasks. However, exploration remains a challenging problem in MARL due to the partial observability of the agents and an exploration space that can grow exponentially with the number of agents. First, to address the scalability issue of the exploration space, we define a formation-based equivalence relation on the exploration space and aim to reduce the search space by exploring only meaningful states in different formations. We then propose a novel formation-aware exploration (FoX) framework that encourages partially observable agents to visit states in diverse formations by guiding them to be well aware of their current formation based solely on their own observations. Numerical results show that the proposed FoX framework significantly outperforms state-of-the-art MARL algorithms on Google Research Football (GRF) and sparse-reward StarCraft II Multi-Agent Challenge (SMAC) tasks.
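The formation-equivalence idea can be illustrated with a count-based bonus over formation signatures. The signature below (sorted, rounded pairwise distances, which are invariant to translation, rotation, and agent permutation) and the 1/sqrt(count) bonus are illustrative assumptions only; FoX itself guides agents to infer their formation from their own partial observations rather than from global positions.

```python
import numpy as np

def formation_signature(positions, decimals=1):
    """Map agent positions to a formation-equivalence key: sorted pairwise
    distances, rounded to bin nearby formations together (sketch only)."""
    pos = np.asarray(positions, dtype=float)
    n = len(pos)
    dists = sorted(
        round(float(np.linalg.norm(pos[i] - pos[j])), decimals)
        for i in range(n) for j in range(i + 1, n)
    )
    return tuple(dists)

class FormationCounter:
    """Count-based exploration bonus over formation equivalence classes."""
    def __init__(self):
        self.counts = {}

    def bonus(self, positions):
        key = formation_signature(positions)
        self.counts[key] = self.counts.get(key, 0) + 1
        # Rarely seen formations earn a larger intrinsic reward.
        return 1.0 / np.sqrt(self.counts[key])
```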

AAAI 2023 Conference Paper

Layer-Wise Adaptive Model Aggregation for Scalable Federated Learning

  • Sunwoo Lee
  • Tuo Zhang
  • A. Salman Avestimehr

In Federated Learning (FL), a common approach for aggregating local solutions across clients is periodic full model averaging. It is known, however, that different layers of a neural network can exhibit different degrees of model discrepancy across clients. The conventional full aggregation scheme does not consider such differences and synchronizes all model parameters at once, resulting in inefficient network bandwidth consumption. Aggregating parameters that are already similar across clients increases the communication cost without making meaningful training progress. We propose FedLAMA, a layer-wise adaptive model aggregation scheme for scalable FL. FedLAMA adjusts the aggregation interval in a layer-wise manner, jointly considering the model discrepancy and the communication cost. This fine-grained aggregation strategy makes it possible to reduce the communication cost without significantly harming model accuracy. Our extensive empirical study shows that, as the aggregation interval increases, FedLAMA suffers a remarkably smaller accuracy drop than periodic full aggregation while achieving comparable communication efficiency.
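A small sketch of layer-wise interval selection, assuming each client's model is a dict of tensors: measure each layer's cross-client discrepancy and give low-discrepancy layers a longer aggregation interval. The discrepancy measure and the interval rule below are assumptions; FedLAMA's actual criterion also weighs per-layer communication cost.

```python
import torch

def layer_aggregation_intervals(client_params, base_interval=1, max_interval=8):
    """Assign each layer an aggregation interval that grows as its
    cross-client discrepancy shrinks, so near-identical layers are
    synchronized less often (illustrative sketch)."""
    intervals = {}
    for name in client_params[0]:
        stacked = torch.stack([cp[name].flatten() for cp in client_params])
        mean = stacked.mean(dim=0)
        # Average client distance from the mean model, normalized by scale.
        discrepancy = ((stacked - mean).norm(dim=1).mean()
                       / mean.norm().clamp_min(1e-12)).item()
        # Low discrepancy -> long interval; clamp to [base, max].
        interval = int(max_interval / (1.0 + max_interval * discrepancy))
        intervals[name] = max(base_interval, min(max_interval, interval))
    return intervals
```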

TMLR 2023 Journal Article

mL-BFGS: A Momentum-based L-BFGS for Distributed Large-scale Neural Network Optimization

  • Yue Niu
  • Zalan Fabian
  • Sunwoo Lee
  • Mahdi Soltanolkotabi
  • Salman Avestimehr

Quasi-Newton methods still face significant challenges in training large-scale neural networks due to the additional cost of Hessian-related computations and instability issues in stochastic training. L-BFGS, a well-known method that efficiently approximates the Hessian from a history of parameter and gradient changes, suffers from convergence instability in stochastic training. So far, attempts to adapt L-BFGS to large-scale stochastic training have incurred considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update that greatly reduces stochastic noise in the Hessian, thereby stabilizing convergence during stochastic optimization. For model training at large scale, mL-BFGS approximates a block-wise Hessian, enabling compute and memory costs to be distributed across all computing nodes. We provide a supporting convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS's potential in large-scale DNN training, we train benchmark neural models using mL-BFGS and compare performance with baselines (SGD, Adam, and other quasi-Newton methods). Results show that mL-BFGS achieves noticeable speedups both iteration-wise and in wall-clock time.
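One way to read the momentum scheme, sketched with NumPy: exponentially average the L-BFGS curvature pairs (parameter change s, gradient change y) before storing them, damping stochastic noise in the implied Hessian, then compute the search direction with the standard two-loop recursion. The momentum placement and the `beta` parameter are assumptions, not the paper's exact update.

```python
import numpy as np

def momentum_curvature_pair(s_new, y_new, s_bar, y_bar, beta=0.9):
    """Exponentially average the curvature pair (s = parameter change,
    y = gradient change) before storing it, damping stochastic noise in
    the implied Hessian approximation (illustrative sketch)."""
    s_bar = s_new if s_bar is None else beta * s_bar + (1 - beta) * s_new
    y_bar = y_new if y_bar is None else beta * y_bar + (1 - beta) * y_new
    return s_bar, y_bar

def lbfgs_direction(grad, history):
    """Standard L-BFGS two-loop recursion over the (smoothed) history,
    a list of (s, y) pairs ordered oldest to newest."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(history):
        rho = 1.0 / max(np.dot(y, s), 1e-12)
        a = rho * np.dot(s, q)
        q -= a * y
        alphas.append((rho, a))
    if history:  # scale by an initial Hessian estimate (gamma * I)
        s, y = history[-1]
        q *= np.dot(s, y) / max(np.dot(y, y), 1e-12)
    for (s, y), (rho, a) in zip(history, reversed(alphas)):
        b = rho * np.dot(y, q)
        q += (a - b) * s
    return -q  # descent direction
```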

TMLR 2023 Journal Article

Overcoming Resource Constraints in Federated Learning: Large Models Can Be Trained with only Weak Clients

  • Yue Niu
  • Saurav Prakash
  • Souvik Kundu
  • Sunwoo Lee
  • Salman Avestimehr

Federated Learning (FL) is emerging as a popular, promising decentralized learning framework that enables collaborative training among clients without sharing private data with one another or with a centralized server. However, because many edge clients lack sufficient computing, memory, or communication capabilities, federated learning of large models still faces significant bottlenecks. To keep such weak but crucial clients in the loop, prior works either consider a heterogeneous-client setting where clients train models of different sizes, or offload training to the server. However, the heterogeneous-client setting requires some clients to train the full model, which conflicts with the resource-constrained setting, while offloading breaks FL's privacy promises by sharing intermediate representations or labels with the server. To overcome these limitations, we formulate a realistic but much less explored cross-device FL setting in which no client can train a full large model and no client is willing to share any intermediate information with the remote server. Under this formulation, we develop a principal sub-model (PriSM) training methodology to collaboratively train a full large model while assigning each client a small sub-model that is a probabilistic low-rank approximation of the full server model. When creating sub-models, PriSM first performs a principal kernel analysis in the orthogonal kernel space to obtain the importance of each kernel. PriSM then adopts a novel importance-aware sampling process to select a subset of kernels (i.e., a kernel with higher importance is assigned a higher sampling probability). This sampling process ensures that each sub-model is still a low-rank approximation of the full model, while all sub-models together achieve nearly full coverage of the principal kernels. To further improve memory efficiency while preserving accuracy, PriSM also exploits low-rank structure in the intermediate representations and allows each sub-model to learn only a subset of them. Our evaluations on various datasets and models (CNNs, LSTMs, Transformers) under different resource-constrained settings demonstrate that PriSM yields an accuracy improvement of up to 10% over existing works. More importantly, PriSM does not incur significant accuracy degradation compared to full-model training (e.g., only roughly 2% accuracy drop for ResNet-18/CIFAR-10 when clients train only 0.2× sub-models).
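A minimal sketch of importance-aware kernel sampling, using singular values as stand-in importance scores: each client keeps a small, randomly sampled subset of kernels, with higher-importance kernels sampled more often, so every sub-model stays low-rank while the clients jointly cover the principal kernels. The `keep_ratio` parameter and the proportional-to-importance distribution are illustrative assumptions, not PriSM's exact sampling process.

```python
import numpy as np

def sample_subnet_kernels(importance, keep_ratio, rng=None):
    """Sample one client's kernel subset with probability increasing in
    kernel importance (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    imp = np.asarray(importance, dtype=float)
    k = max(1, int(keep_ratio * len(imp)))
    probs = imp / imp.sum()
    return rng.choice(len(imp), size=k, replace=False, p=probs)

# Example: importance taken from singular values of a layer's weight matrix.
rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 128))
singular_values = np.linalg.svd(weight, compute_uv=False)
kernels = sample_subnet_kernels(singular_values, keep_ratio=0.2, rng=rng)
print(sorted(kernels))
```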