Arrow Research search

Author name cluster

Cong Xie

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers


YNIMG Journal 2026 Journal Article

Communication modality modulates dynamic switching between neural coupling and decoupling during group creative ideation

  • Shate Cheng
  • Wenyu Zhang
  • Cong Xie
  • Ning Hao

While video conferencing is central to modern creative collaboration, its impact on group performance remains inconsistent, suggesting our understanding of the neural mechanisms is limited. We employed functional near-infrared spectroscopy hyperscanning in 51 dyads (102 participants) to investigate the effect of communication modality (face-to-face interaction, video-mediated communication, and text-mediated communication) on dynamic neural mechanisms during a collaborative creative task. The analysis of dynamic inter-brain synchrony revealed three recurring brain states: an 'Inefficient State', a 'Low-Coupling State' for idea generation, and a 'High-Integration State' for information integration. Analysis of collaborative efficiency metrics indicates that the advantage of face-to-face interaction lies in its capacity to maintain a flexible balance between strong neural coupling (High-Integration State) and decoupling (Low-Coupling State). In contrast, the reduced media richness and potential cognitive load of video-mediated communication may disrupt this balance. The increased demands for explicit cognitive alignment may hinder the development of the low-coupling state, resulting in a prolonged reliance on high-integration processing. This suggests a compensatory neural strategy to maintain collaboration despite the medium's constraints. Based on this integration bias, we propose a potential explanation for the inconsistent findings in the literature: creative performance may depend on the match between the cognitive demands of a task and the specific neural processing style induced by the communication modality. Our results emphasize the importance of neural decoupling in collaboration and propose a new research direction: expanding the focus from the degree of brain coupling to the flexibility with which brains transition between dynamic states to meet creative demands.

NeurIPS Conference 2025 Conference Paper

DUO: No Compromise to Accuracy Degradation

  • Jinda Jia
  • Cong Xie
  • Hanlin Lu
  • Fanjiang Ye
  • Hao Feng
  • Daoce Wang
  • Haibin Lin
  • Zhi Zhang

Distributed training often suffers from high communication overhead due to large-scale gradient synchronization. Although gradient compression, particularly at 4-bit or even lower precision, significantly reduces transfer volume, it typically sacrifices precision and degrades the final model accuracy. In this work, we introduce DUO, a distributed training framework designed to mitigate accuracy degradation incurred by gradient compression without involving additional overhead. DUO achieves this by inserting an additional high-precision gradient synchronization step into a previously computation-only phase, so that its communication is fully hidden by computation. We provide a comprehensive theoretical proof of convergence for DUO and validate its effectiveness through extensive pre-training experiments on GPT models. Our results indicate that DUO effectively restores accuracy when using 4-bit gradient compression, achieving performance comparable to uncompressed training. Remarkably, DUO maintains minimal accuracy degradation even under extreme compression scenarios, including 1-bit gradients or complete omission of the low-precision gradient communication step (0-bit transmission).

ICLR Conference 2024 Conference Paper

LEMON: Lossless model expansion

  • Yite Wang
  • Jiahao Su
  • Hanlin Lu
  • Cong Xie
  • Tianyi Liu
  • Jianbo Yuan
  • Haibin Lin
  • Ruoyu Sun

Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present LosslEss MOdel ExpansioN (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.

NeurIPS Conference 2024 Conference Paper

SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

  • Jinda Jia
  • Cong Xie
  • Hanlin Lu
  • Daoce Wang
  • Hao Feng
  • Chengming Zhang
  • Baixi Sun
  • Haibin Lin

Recent years have witnessed a clear trend towards language models with an ever-increasing number of parameters, as well as growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP), which partitions optimizer states among workers, has emerged as a crucial technique to mitigate training time and memory usage. Yet, a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often result in worse accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4Bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth quantization. Furthermore, SDP4Bit presents an algorithm-system co-design with runtime optimization to minimize the computation overhead of compression. In addition to the theoretical guarantees of convergence, we empirically evaluate the accuracy of SDP4Bit on the pre-training of GPT models with up to 6.7 billion parameters, and the results demonstrate a negligible impact on training loss. Furthermore, speed experiments show that SDP4Bit achieves up to 4.08× speedup in end-to-end throughput on a scale of 128 GPUs.

NeurIPS Conference 2022 Conference Paper

SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training

  • Yangrui Chen
  • Cong Xie
  • Meng Ma
  • Juncheng Gu
  • Yanghua Peng
  • Haibin Lin
  • Chuan Wu
  • Yibo Zhu

Data parallelism across multiple machines is widely adopted for accelerating distributed deep learning, but it is hard to achieve linear speedup due to the heavy communication. In this paper, we propose SAPipe, a performant system that pushes the training speed of data parallelism to its fullest extent. By introducing partial staleness, the communication overlaps the computation with minimal staleness in SAPipe. To mitigate additional problems incurred by staleness, SAPipe adopts staleness compensation techniques including weight prediction and delay compensation with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize system overhead for the staleness training pipeline and staleness compensation. We have implemented SAPipe in the BytePS framework, compatible with both TensorFlow and PyTorch. Our experiments show that SAPipe achieves up to 157% speedups over BytePS (non-stale), and outperforms PipeSGD in accuracy by up to 13.7%.

AAAI Conference 2021 Conference Paper

Alternative Baselines for Low-Shot 3D Medical Image Segmentation—An Atlas Perspective

  • Shuxin Wang
  • Shilei Cao
  • Dong Wei
  • Cong Xie
  • Kai Ma
  • Liansheng Wang
  • Deyu Meng
  • Yefeng Zheng

Low-shot (one/few-shot) segmentation has attracted increasing attention as it works well with limited annotation. State-of-the-art low-shot segmentation methods on natural images usually focus on implicit representation learning for each novel class, such as learning prototypes, deriving guidance features via masked average pooling, and segmenting using cosine similarity in feature space. We argue that low-shot segmentation on medical images should step further to explicitly learn dense correspondences between images to utilize the anatomical similarity. The core ideas are inspired by the classical practice of multi-atlas segmentation, where the indispensable parts of atlas-based segmentation, i.e., registration, label propagation, and label fusion are unified into a single framework in our work. Specifically, we propose two alternative baselines, i.e., the Siamese-Baseline and Individual-Difference-Aware Baseline, where the former is targeted at anatomically stable structures (such as brain tissues), and the latter possesses a strong generalization ability to organs suffering large morphological variations (such as abdominal organs). In summary, this work sets up a benchmark for low-shot 3D medical image segmentation and sheds light on further understanding of atlas-based few-shot segmentation.

NeurIPS Conference 2020 Conference Paper

CSER: Communication-efficient SGD with Error Reset

  • Cong Xie
  • Shuai Zheng
  • Sanmi Koyejo
  • Indranil Gupta
  • Mu Li
  • Haibin Lin

The scalability of Distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. The key idea in CSER is first a new technique called “error reset” that adapts arbitrary compressors for SGD, producing bifurcated local models with periodic reset of resulting local residual errors. Second, we introduce partial synchronization for both the gradients and the models, leveraging advantages from them. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that when combined with highly aggressive compressors, the CSER algorithms accelerate the distributed training by nearly 10× for CIFAR-100, and by 4.5× for ImageNet.

ICML Conference 2020 Conference Paper

Zeno++: Robust Fully Asynchronous SGD

  • Cong Xie
  • Sanmi Koyejo
  • Indranil Gupta

We propose Zeno++, a new robust asynchronous Stochastic Gradient Descent (SGD) procedure, intended to tolerate Byzantine failures of workers. In contrast to previous work, Zeno++ removes several unrealistic restrictions on worker-server communication, now allowing for fully asynchronous updates from anonymous workers, for arbitrarily stale worker updates, and for the possibility of an unbounded number of Byzantine workers. The key idea is to estimate the descent of the loss value after the candidate gradient is applied, where large descent values indicate that the update results in optimization progress. We prove the convergence of Zeno++ for non-convex problems under Byzantine failures. Experimental results show that Zeno++ outperforms existing Byzantine-tolerant asynchronous SGD algorithms.
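The descent-estimation idea in the abstract above can be illustrated with a toy sketch. This is not the Zeno++ procedure itself: the quadratic loss, the names `quadratic_loss` and `descent_score`, and the direct re-evaluation of the loss are all assumptions made for illustration only.

```python
def quadratic_loss(x):
    """Toy validation loss (an assumption): f(x) = 0.5 * sum(x_i^2)."""
    return 0.5 * sum(v * v for v in x)


def descent_score(loss_fn, x, candidate_grad, lr):
    """Estimate how much the loss would drop if candidate_grad were applied.

    A large positive score suggests optimization progress; a negative
    score suggests the candidate would make the model worse.
    """
    x_new = [xi - lr * gi for xi, gi in zip(x, candidate_grad)]
    return loss_fn(x) - loss_fn(x_new)


x = [1.0, -2.0]
honest = [1.0, -2.0]      # true gradient of the toy loss at x
byzantine = [-1.0, 2.0]   # sign-flipped (faulty) candidate

honest_score = descent_score(quadratic_loss, x, honest, lr=0.1)
byzantine_score = descent_score(quadratic_loss, x, byzantine, lr=0.1)
```

In this toy setting the honest candidate receives a positive score and the sign-flipped one a negative score, so a server could accept or reject updates by thresholding the score.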

UAI Conference 2019 Conference Paper

Fall of Empires: Breaking Byzantine-tolerant SGD by Inner Product Manipulation

  • Cong Xie
  • Sanmi Koyejo
  • Indranil Gupta

Recently, new defense techniques have been developed to tolerate Byzantine failures for distributed machine learning. The Byzantine model captures workers that behave arbitrarily, including malicious and compromised workers. In this paper, we break two prevailing Byzantine-tolerant techniques. Specifically, we show that two robust aggregation methods for synchronous SGD, namely coordinate-wise median and Krum, can be broken using new attack strategies based on inner product manipulation. We prove our results theoretically and validate them empirically.
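For readers unfamiliar with the first aggregation rule named above, here is a minimal sketch of coordinate-wise median over worker gradients. It shows only the aggregation rule, not the inner product manipulation attack from the paper; the function name and the sample gradients are hypothetical.

```python
from statistics import median


def coordinate_wise_median(gradients):
    """Aggregate worker gradients by taking the median of each coordinate."""
    # zip(*gradients) regroups the values so each tuple holds one coordinate
    # across all workers.
    return [median(coord) for coord in zip(*gradients)]


# Four well-behaved workers and one obvious outlier (hypothetical values).
grads = [
    [1.0, 2.0],
    [1.2, 1.8],
    [0.9, 2.1],
    [1.1, 1.9],
    [100.0, -100.0],
]
agg = coordinate_wise_median(grads)
```

The rule ignores a naive outlier like the last row; the paper's point is that more carefully crafted updates can still defeat it.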

ICML Conference 2019 Conference Paper

Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

  • Cong Xie
  • Sanmi Koyejo
  • Indranil Gupta

We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.

TIST Journal 2018 Journal Article

Visual Analytics of Heterogeneous Data Using Hypergraph Learning

  • Cong Xie
  • Wen Zhong
  • Wei Xu
  • Klaus Mueller

For real-world learning tasks (e.g., classification), graph-based models are commonly used to fuse the information distributed in diverse data sources, which can be heterogeneous, redundant, and incomplete. These models represent the relations in different datasets as pairwise links. However, these links cannot deal with high-order relations which connect multiple objects (e.g., in public health datasets, more than two patient groups admitted by the same hospital in 2014). In this article, we propose a visual analytics approach for the classification on heterogeneous datasets using the hypergraph model. The hypergraph is an extension to traditional graphs in which a hyperedge connects multiple vertices instead of just two. We model various high-order relations in heterogeneous datasets as hyperedges and fuse different datasets with a unified hypergraph structure. We use the hypergraph learning algorithm for predicting missing labels in the datasets. To allow users to inject their domain knowledge into the model-learning process, we augment the traditional learning algorithm in a number of ways. In addition, we also propose a set of visualizations which enable the user to construct the hypergraph structure and the parameters of the learning model interactively during the analysis. We demonstrate the capability of our approach via two real-world cases.
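The hyperedge definition in the abstract above can be made concrete with a small sketch: each hyperedge is stored as a set of vertices, so it can connect more than two objects at once. The hospital and patient-group names are hypothetical, and this shows only the data structure, not the paper's learning algorithm.

```python
# Each hyperedge maps to the set of vertices it connects (hypothetical data).
hyperedges = {
    "hospital_A_2014": {"group1", "group2", "group3"},
    "hospital_B_2014": {"group2", "group4"},
}


def vertex_degrees(edges):
    """Count how many hyperedges each vertex belongs to."""
    degrees = {}
    for members in edges.values():
        for vertex in members:
            degrees[vertex] = degrees.get(vertex, 0) + 1
    return degrees


deg = vertex_degrees(hyperedges)
```

A pairwise graph would need three separate links to relate the three groups of `hospital_A_2014`, whereas a single hyperedge records that they share one relation.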

AAAI Conference 2016 Conference Paper

A Scalable and Extensible Framework for Superposition-Structured Models

  • Shenjian Zhao
  • Cong Xie
  • Zhihua Zhang

In many learning tasks, structural models usually lead to better interpretability and higher generalization performance. In recent years, however, the simple structural models such as lasso are frequently proved to be insufficient. Accordingly, there has been a lot of work on “superposition-structured” models where multiple structural constraints are imposed. To efficiently solve these “superposition-structured” statistical models, we develop a framework based on a proximal Newton-type method. Employing the smoothed conic dual approach with the LBFGS updating formula, we propose a scalable and extensible proximal quasi-Newton (SEP-QN) framework. Empirical analysis on various datasets shows that our framework is potentially powerful, and achieves super-linear convergence rate for optimizing some popular “superposition-structured” statistical models such as the fused sparse group lasso.

AAAI Conference 2016 Conference Paper

Wishart Mechanism for Differentially Private Principal Components Analysis

  • Wuxuan Jiang
  • Cong Xie
  • Zhihua Zhang

We propose a new input perturbation mechanism for publishing a covariance matrix to achieve (ε, 0)-differential privacy. Our mechanism uses a Wishart distribution to generate matrix noise. In particular, we apply this mechanism to principal component analysis (PCA). Our mechanism is able to keep the positive semi-definiteness of the published covariance matrix. Thus, our approach gives rise to a general publishing framework for input perturbation of a symmetric positive semidefinite matrix. Moreover, compared with the classic Laplace mechanism, our method has better utility guarantee. To the best of our knowledge, the Wishart mechanism is the best input perturbation approach for (ε, 0)-differentially private PCA. We also compare our work with previous exponential mechanism algorithms in the literature and provide near optimal bound while having more flexibility and less computational intractability.

NeurIPS Conference 2014 Conference Paper

Distributed Power-law Graph Computing: Theoretical and Empirical Analysis

  • Cong Xie
  • Ling Yan
  • Wu-Jun Li
  • Zhihua Zhang

With the emergence of big graphs in a variety of real applications like social networks, machine learning based on distributed graph-computing (DGC) frameworks has attracted much attention from the big data machine learning community. In DGC frameworks, the graph partitioning (GP) strategy plays a key role in performance, including the workload balance and communication cost. Typically, the degree distributions of natural graphs from real applications follow skewed power laws, which makes GP a challenging task. Recently, many methods have been proposed to solve the GP problem. However, the existing GP methods cannot achieve satisfactory performance for applications with power-law graphs. In this paper, we propose a novel vertex-cut method, called degree-based hashing (DBH), for GP. DBH makes effective use of the skewed degree distributions for GP. We theoretically prove that DBH can achieve lower communication cost than existing methods and can simultaneously guarantee good workload balance. Furthermore, empirical results on several large power-law graphs also show that DBH can outperform the state of the art.
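A minimal sketch of a degree-based vertex-cut partitioner follows. It assumes that each edge is placed by hashing its lower-degree endpoint. The abstract above does not state this rule, and the function name and the CRC32 hash are illustrative choices, not the authors' implementation.

```python
import zlib
from collections import Counter


def dbh_partition(edges, num_machines):
    """Assign each edge to a machine by hashing its lower-degree endpoint.

    Assumption: hashing the lower-degree endpoint keeps a low-degree
    vertex's edges together, while high-degree hubs are the vertices
    that get cut across machines.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    assignment = {}
    for u, v in edges:
        anchor = u if degree[u] <= degree[v] else v
        # CRC32 is deterministic across runs, unlike Python's salted hash().
        assignment[(u, v)] = zlib.crc32(str(anchor).encode()) % num_machines
    return assignment


# Star graph (hypothetical data): hub 0 connected to leaves 1..5.
star = [(0, leaf) for leaf in range(1, 6)]
placement = dbh_partition(star, num_machines=3)
```

In the star example every edge is placed according to its leaf, so the hub's edges can spread across machines instead of piling up on the hub's own hash bucket.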