Author name cluster

Xingyu Xie

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

Shijing Hu
Jingyang Li
Xingyu Xie
Zhihui Lu
Kim-Chuan Toh
Pan Zhou

Speculative decoding accelerates inference in large language models (LLMs) by generating multiple draft tokens simultaneously. However, existing methods often struggle with token misalignment between the training and decoding phases, limiting their performance. To address this, we propose GRIFFIN, a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model to mitigate misalignment. The training strategy employs a loss masking mechanism to exclude highly misaligned tokens during training, preventing them from negatively impacting the draft model's optimization. The token-alignable draft model introduces input tokens to correct inconsistencies in generated features. Experiments on LLaMA, Vicuna, Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 8% and a speedup ratio exceeding 7%, outperforming current speculative decoding state-of-the-art methods. Our code and GRIFFIN's draft models will be released publicly in https://github.com/hsj576/GRIFFIN.

JBHI Journal 2025 Journal Article

Multistage Diffusion Model With Phase Error Correction for Fast PET Imaging

Yunlong Gao
Zhenxing Huang
Xingyu Xie
Wenjie Zhao
Qianyi Yang
Xinlan Yang
Yongfeng Yang
Hairong Zheng

Fast PET imaging is clinically important for reducing motion artifacts and improving patient comfort. While recent diffusion-based deep learning methods have shown promise, they often fail to capture the true PET degradation process, suffer from accumulated inference errors, introduce artifacts, and require extensive reconstruction iterations. To address these challenges, we propose a novel multistage diffusion framework tailored for fast PET imaging. At the coarse level, we design a multistage structure to approximate the temporal non-linear PET degradation process in a data-driven manner, using paired PET images collected under different acquisition durations. A Phase Error Correction Network (PECNet) ensures consistency across stages by correcting accumulated deviations. At the fine level, we introduce a deterministic cold diffusion mechanism, which simulates intra-stage degradation through interpolation between known acquisition durations, significantly reducing reconstruction iterations to as few as 10. Evaluations on [68Ga]FAPI and [18F]FDG PET datasets demonstrate the superiority of our approach, achieving peak PSNRs of 36.2 dB and 39.0 dB, respectively, with average SSIMs over 0.97. Our framework offers high-fidelity PET imaging with fewer iterations, making it practical for accelerated clinical imaging.

ICLR Conference 2025 Conference Paper

SEPARATE: A Simple Low-rank Projection for Gradient Compression in Modern Large-scale Model Training Process

Hanzhen Zhao
Xingyu Xie
Cong Fang 0001
Zhouchen Lin

Training Large Language Models (LLMs) presents a significant communication bottleneck, predominantly due to the growing scale of the gradient to communicate across multi-device clusters. However, how to mitigate communication overhead in practice remains a formidable challenge due to the methodological weaknesses of existing compression methods, especially their neglect of the characteristics of the gradient. In this paper, we consider and demonstrate the low-rank properties of the gradient and Hessian observed in LLM training dynamics, and take advantage of such natural properties to design SEPARATE, a simple low-rank projection for gradient compression in modern large-scale model training processes. SEPARATE realizes dimensional reduction by common random Gaussian variables and an improved moving average error-feedback technique. We theoretically demonstrate that SEPARATE-based optimizers maintain the original convergence rate for SGD and Adam-type optimizers for general non-convex objectives. Experimental results show that SEPARATE accelerates training speed by up to 2× for GPT-2-Medium pre-training, and improves performance on various benchmarks for LLAMA2-7B fine-tuning.

JMLR Journal 2024 Journal Article

Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training

Pan Zhou
Xingyu Xie
Zhouchen Lin
Kim-Chuan Toh
Shuicheng Yan

Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of “how to accelerate adaptive gradient algorithms in a general manner”, and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, per iteration, we construct a dynamical loss that combines the vanilla training loss and a dynamic regularizer inspired by proximal point method, and respectively minimize the first- and second-order Taylor approximations of dynamical loss to update the variable. This yields our Win acceleration that uses a conservative step and an aggressive step to update, and linearly combines these two updates for acceleration. Next, we extend Win into Win2 which uses multiple aggressive update steps for faster convergence. Then we apply Win and Win2 to the popular LAMB and SGD optimizers. Our transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam and LAMB to their non-accelerated counterparts. Experimental results demonstrate the faster convergence speed and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks.

JBHI Journal 2023 Journal Article

Convolutional Feature Descriptor Selection for Mammogram Classification

Dong Li
Lei Zhang
Jianwei Zhang
Xingyu Xie

Breast cancer was the most commonly diagnosed cancer among women worldwide in 2020. Recently, several deep learning-based classification approaches have been proposed to screen breast cancer in mammograms. However, most of these approaches require additional detection or segmentation annotations. Meanwhile, some other image-level label-based methods often pay insufficient attention to lesion areas, which are critical for diagnosis. This study designs a novel deep-learning method for automatically diagnosing breast cancer in mammography, which focuses on the local lesion areas and only utilizes image-level classification labels. In this study, we propose to select discriminative feature descriptors from feature maps instead of identifying lesion areas using precise annotations. We also design a novel adaptive convolutional feature descriptor selection (AFDS) structure based on the distribution of the deep activation map. Specifically, we adopt the triangle threshold strategy to calculate a specific threshold for guiding the activation map to determine which feature descriptors (local areas) are discriminative. Ablation experiments and visualization analysis indicate that the AFDS structure makes it easier for the model to learn the difference between malignant and benign/normal lesions. Furthermore, since the AFDS structure can be regarded as a highly efficient pooling structure, it can be easily plugged into most existing convolutional neural networks with negligible effort and time consumption. Experimental results on two publicly available datasets, INbreast and CBIS-DDSM, indicate that the proposed method performs satisfactorily compared with state-of-the-art methods.

NeurIPS Conference 2023 Conference Paper

Task-Robust Pre-Training for Worst-Case Downstream Adaptation

Jianghui Wang
Yang Chen
Xingyu Xie
Cong Fang
Zhouchen Lin

Pre-training has achieved remarkable success when transferred to downstream tasks. In machine learning, we care about not only the good performance of a model but also its behavior under reasonable shifts of condition. The same philosophy holds when pre-training a foundation model. However, the foundation model may not uniformly behave well for a series of related downstream tasks. This happens, for example, when conducting mask recovery regression where the recovery ability or the training instances diverge like pattern features are extracted dominantly on pre-training, but semantic features are also required on a downstream task. This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks. We call this goal downstream-task robustness. Our method first separates the upstream task into several representative ones and applies a simple minimax loss for pre-training. We then design an efficient algorithm to solve the minimax loss and prove its convergence in the convex setting. In the experiments, we show on both large-scale natural language processing and computer vision datasets that our method increases the metrics on worst-case downstream tasks. Additionally, some theoretical explanations for why our loss is beneficial are provided. Specifically, we show fewer samples are inherently required for the most challenging downstream task in some cases.

ICLR Conference 2023 Conference Paper

Win: Weight-Decay-Integrated Nesterov Acceleration for Adaptive Gradient Algorithms

Pan Zhou 0002
Xingyu Xie
Shuicheng Yan

Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of "how to accelerate adaptive gradient algorithms in a general manner", and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, we minimize a dynamical loss per iteration which combines the vanilla training loss and a dynamic regularizer inspired by proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-alike-acceleration into AdamW and Adam, we respectively use the first- and second-order Taylor approximations of vanilla loss to update the variable twice. In this way, we arrive at our Win acceleration for AdamW and Adam that uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. Next, we extend Win acceleration to LAMB and SGD. Our transparent acceleration derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts by taking AdamW and Adam as examples. Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification tasks and language modeling tasks with both CNN and Transformer backbones. We hope Win shall be a default acceleration option for popular optimizers in deep learning community to improve the training efficiency. Code will be released at https://github.com/sail-sg/win.

ICLR Conference 2022 Conference Paper

Optimization inspired Multi-Branch Equilibrium Models

Mingjie Li 0007
Yisen Wang 0001
Xingyu Xie
Zhouchen Lin

Works have shown the strong connections between some implicit models and optimization problems. However, explorations on such relationships are limited. Most works pay attention to some common mathematical properties, such as sparsity. In this work, we propose a new type of implicit model inspired by the designing of the systems' hidden objective functions, called the Multi-branch Optimization induced Equilibrium networks (MOptEqs). The model architecture is designed based on modelling the hidden objective function for the multi-resolution recognition task. Furthermore, we also propose a new strategy inspired by our understandings of the hidden objective function. In this manner, the proposed model can better utilize the hierarchical patterns for recognition tasks and retain the abilities for interpreting the whole structure as trying to obtain the minima of the problem's goal. Compared with the state-of-the-art models, our MOptEqs not only enjoys better explainability but is also superior to MDEQ with less parameter consumption and better performance on practical tasks. Furthermore, we also implement various experiments to demonstrate the effectiveness of our new methods and explore the applicability of the model's hidden objective function.

ICML Conference 2020 Conference Paper

Maximum-and-Concatenation Networks

Xingyu Xie
Hao Kong
Jianlong Wu
Wayne Zhang 0001
Guangcan Liu
Zhouchen Lin

While successful in many fields, deep neural networks (DNNs) still suffer from some open problems such as bad local minima and unsatisfactory generalization performance. In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) to try eliminating bad local minima and improving generalization ability as well. Remarkably, we prove that MCN has a very nice property; that is, every local minimum of an (l+1)-layer MCN can be better than, or at least as good as, the global minima of the network consisting of its first l layers. In other words, by increasing the network depth, MCN can autonomously improve its local minima’s goodness. What is more, it is easy to plug MCN into an existing deep model to make it also have this property. Finally, under mild conditions, we show that MCN can approximate certain continuous functions arbitrarily well with high efficiency; that is, the covering number of MCN is much smaller than most existing DNNs such as deep ReLU. Based on this, we further provide a tight generalization bound to guarantee the inference ability of MCN when dealing with testing samples.

AAAI Conference 2020 Conference Paper

Unified Graph and Low-Rank Tensor Learning for Multi-View Clustering

Jianlong Wu
Xingyu Xie
Liqiang Nie
Zhouchen Lin
Hongbin Zha

Multi-view clustering aims to take advantage of multiple views information to improve the performance of clustering. Many existing methods compute the affinity matrix by low-rank representation (LRR) and pairwise investigate the relationship between views. However, LRR suffers from the high computational cost in self-representation optimization. Besides, compared with pairwise views, tensor form of all views’ representation is more suitable for capturing the high-order correlations among all views. Towards these two issues, in this paper, we propose the unified graph and low-rank tensor learning (UGLTL) for multi-view clustering. Specifically, on the one hand, we learn the view-specific affinity matrix based on projected graph learning. On the other hand, we reorganize the affinity matrices into tensor form and learn its intrinsic tensor based on low-rank tensor approximation. Finally, we unify these two terms together and jointly learn the optimal projection matrices, affinity matrices and intrinsic low-rank tensor. We also propose an efficient algorithm to iteratively optimize the proposed model. To evaluate the performance of the proposed method, we conduct extensive experiments on multiple benchmarks across different scenarios and sizes. Compared with the state-of-the-art approaches, our method achieves much better performance.

ICML Conference 2019 Conference Paper

Differentiable Linearized ADMM

Xingyu Xie
Jianlong Wu
Guangcan Liu
Zhisheng Zhong
Zhouchen Lin

Recently, a number of learning-based optimization methods that combine data-driven architectures with the classical optimization algorithms have been proposed and explored, showing superior empirical performance in solving various ill-posed inverse problems, but there is still a scarcity of rigorous analysis about the convergence behaviors of learning-based optimization. In particular, most existing analyses are specific to unconstrained problems but cannot apply to the more general cases where some variables of interest are subject to certain constraints. In this paper, we propose Differentiable Linearized ADMM (D-LADMM) for solving the problems with linear constraints. Specifically, D-LADMM is a K-layer LADMM inspired deep neural network, which is obtained by firstly introducing some learnable weights in the classical Linearized ADMM algorithm and then generalizing the proximal operator to some learnable activation function. Notably, we rigorously prove that there exists a set of learnable parameters for D-LADMM to generate globally converged solutions, and we show that those desired parameters can be attained by training D-LADMM in a proper way. To the best of our knowledge, we are the first to provide the convergence analysis for the learning-based optimization method on constrained problems.

IJCAI Conference 2018 Conference Paper

Redundancy-resistant Generative Hashing for Image Retrieval

Changying Du
Xingyu Xie
Changde Du
Hao Wang

By optimizing probability distributions over discrete latent codes, Stochastic Generative Hashing (SGH) bypasses the critical and intractable binary constraints on hash codes. While encouraging results were reported, SGH still suffers from the deficient usage of latent codes, i.e., there often exist many uninformative latent dimensions in the code space, a disadvantage inherited from its auto-encoding variational framework. Motivated by the fact that code redundancy is usually more severe when a more complex decoder network is used, in this paper, we propose a constrained deep generative architecture to simplify the decoder for data reconstruction. Specifically, our new framework forces the latent hashing codes to not only reconstruct data through the generative network but also retain minimal squared L2 difference to the last real-valued network hidden layer. Furthermore, during posterior inference, we propose to regularize the standard auto-encoding objective with an additional term that explicitly accounts for the negative redundancy degree of latent code dimensions. We interpret such modifications as Bayesian posterior regularization and design an adversarial strategy to optimize the generative, the variational, and the redundancy-resisting parameters. Empirical results show that our new method can significantly boost the quality of learned codes and achieve state-of-the-art performance for image retrieval.
