Arrow Research search

Author name cluster

Bin Gu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

51 papers
1 author row

Possible papers

51

AAAI Conference 2026 Conference Paper

Towards Nonlinear Sparse AUC Maximization via Compositional Stochastic Hard Thresholding

  • Wenkang Wang
  • Dongxu Liu
  • Bin Gu

The Area Under the ROC Curve (AUC) is an important evaluation metric for both linear and, in particular, nonlinear classification models, owing to its robustness against class imbalance. Sparse learning with an ℓ₀ constraint can enhance model interpretability and generalization. Prior work has shown that, in the linear setting, the pairwise formulation of AUC maximization can be reformulated as a standard pointwise empirical risk minimization problem, which enables efficient optimization using hard-thresholding gradient descent for ℓ₀-constrained AUC maximization. Extending this approach to the nonlinear setting remains largely unexplored. We establish that pairwise AUC maximization in this setting is equivalent to a pointwise compositional optimization problem; however, designing a compositional optimization algorithm compatible with hard-thresholding operators remains an open challenge. To address this challenge, in this paper, we propose a novel algorithm—Compositional Stochastic Hard Thresholding (CSHT)—for nonlinear sparse AUC maximization. Specifically, CSHT integrates stochastic variance-reduced gradient techniques with hard-thresholding projections to effectively reduce gradient estimation variance while enforcing sparsity. Notably, we provide a rigorous convergence analysis and prove that CSHT achieves linear convergence up to a tolerance bound. To the best of our knowledge, this is the first stochastic hard-thresholding algorithm tailored for nonlinear sparse AUC maximization. Extensive experiments on (a) nonlinear sparse AUC maximization using Random Fourier Feature-based kernel approximation and (b) universal adversarial attack scenarios demonstrate the superior performance of CSHT over existing methods, attributed to its unified treatment of nonlinearity and sparsity.
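
Hard thresholding is the projection step at the heart of CSHT. As a rough illustration (the function name and toy data below are mine, not the paper's), the generic ℓ₀ projection keeps only the k largest-magnitude coordinates:

```python
import numpy as np

def hard_threshold(w: np.ndarray, k: int) -> np.ndarray:
    """Project w onto the l0 ball of radius k: zero out all but the
    k largest-magnitude entries."""
    out = np.zeros_like(w)
    idx = np.argpartition(np.abs(w), -k)[-k:]  # indices of the top-k magnitudes
    out[idx] = w[idx]
    return out

w = np.array([0.1, -2.0, 0.3, 1.5, -0.05])
print(hard_threshold(w, k=2))  # [ 0.  -2.   0.   1.5  0. ]
```

CSHT interleaves this projection with a variance-reduced estimate of the compositional gradient; the estimator itself is specified in the paper.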

NeurIPS Conference 2025 Conference Paper

Accelerated Vertical Federated Adversarial Learning through Decoupling Layer-Wise Dependencies

  • Tianxing Man
  • Yu Bai
  • Ganyu Wang
  • Jinjie Fang
  • Haoran Fang
  • Bin Gu
  • Yi Chang

Vertical Federated Learning (VFL) enables participants to collaboratively train models on aligned samples while keeping their heterogeneous features private and distributed. Despite their utility, VFL models remain vulnerable to adversarial attacks during inference. Adversarial Training (AT), which generates adversarial examples at each training iteration, stands as the most effective defense for improving model robustness. However, applying AT in VFL settings (VFAL) faces significant computational efficiency challenges, as the distributed training framework necessitates iterative propagations across participants. To this end, we propose the DecVFAL framework, which substantially accelerates VFAL training through a dual-level decoupling mechanism applied during adversarial sample generation. Specifically, we first decouple the bottom modules of clients (directly responsible for adversarial updates) from the remaining networks, enabling efficient lazy sequential propagations that reduce communication frequency through delayed gradients. We further introduce decoupled parallel backpropagation to accelerate delayed gradient computation by eliminating idle waiting through parallel processing across modules. Additionally, we are the first to establish a convergence analysis for VFAL, rigorously characterizing how our decoupling mechanism interacts with existing VFL dynamics, and we prove that DecVFAL achieves an $\mathcal{O}(1/\sqrt{K})$ convergence rate matching that of standard VFLs. Experimental results show that DecVFAL ensures competitive robustness while achieving about a $3\sim10\times$ speedup.

AAAI Conference 2025 Conference Paper

Error Analysis Affected by Heavy-Tailed Gradients for Non-Convex Pairwise Stochastic Gradient Descent

  • Jun Chen
  • Hong Chen
  • Bin Gu
  • Guodong Liu
  • Yingjie Wang
  • Weifu Li

In recent years, there have been a growing number of works studying the generalization properties of stochastic gradient descent (SGD) from the perspective of algorithmic stability. However, few of them are devoted to simultaneously studying generalization and optimization in the non-convex setting, especially for pairwise SGD with heavy-tailed gradient noise. This paper considers the impact of heavy-tailed gradient noise obeying a sub-Weibull distribution on the stability-based learning guarantees for non-convex pairwise SGD by investigating its generalization and optimization jointly. Specifically, based on two novel pairwise uniform model stability tools, we first bound the generalization error of pairwise SGD in the general non-convex setting after bridging the quantitative relationships between stability and generalization error. Then, we further consider the practical heavy-tailed sub-Weibull gradient noise condition to establish a refined generalization bound without the bounded gradient condition. Finally, sharper error bounds for generalization and optimization are built by introducing the gradient dominance condition. Comparing these results reveals that sub-Weibull gradient noise brings some positive dependencies on the heavy-tailed strength for generalization and optimization. Furthermore, we extend our analysis to the corresponding pairwise minibatch SGD and derive the first stability-based near-optimal generalization and optimization bounds, which are consistent with many empirical observations.

AAAI Conference 2025 Conference Paper

Leveraging First and Zeroth-Order Gradient to Address Imbalanced Black-Box Prompt Tuning via Minimax Optimization

  • Haozhen Zhang
  • Zhaogeng Liu
  • Bin Gu
  • Yi Chang

Black-box prompt tuning has become a prevalent parameter-efficient paradigm that leverages the capabilities of large language models (LLMs) for customized applications in specific downstream tasks. In practical scenarios, downstream tasks frequently involve data distributions that are heavily imbalanced. Such imbalances tend to impair performance, causing severe performance collapse in minority classes. Conducting effective imbalanced black-box prompt tuning to mitigate the adverse effects of imbalanced data distributions on prompt performance remains a significant challenge. In this paper, we propose black-box prompt tuning with first- and zeroth-order gradients (BPT-FZG) for handling imbalanced data. Specifically, BPT-FZG introduces AUC maximization as the objective for prompt tuning and equivalently formulates it as a nonconvex-concave saddle point problem to avoid the construction of sample pairs from opposite classes. Indeed, BPT-FZG optimizes the latent representation of the continuous prompt in a low-dimensional subspace with the AUC loss and leverages first- and zeroth-order gradients alternately to update the parameters. Furthermore, we establish a theoretical convergence guarantee for BPT-FZG under common assumptions, showing that our method can find a stationary point of the objective function. Our experiments on RoBERTa-large, GPT2-XL, and Llama3 show that BPT-FZG achieves improvements on various imbalanced datasets, emphasizing the effectiveness of our method.
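
The zeroth-order half of BPT-FZG relies on gradient estimates built from function evaluations only. Below is a minimal sketch of the standard two-point Gaussian-smoothing estimator; BPT-FZG's exact estimator, batching, and alternation schedule are specified in the paper, so treat this generic version as an illustrative assumption:

```python
import numpy as np

def zo_gradient(f, x: np.ndarray, mu: float = 1e-3, n_dirs: int = 20) -> np.ndarray:
    """Two-point zeroth-order gradient estimate averaged over random Gaussian
    directions; only evaluations of the black-box f are required."""
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = np.random.randn(x.size)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_dirs

f = lambda x: np.sum(x ** 2)                       # toy black-box objective
print(zo_gradient(f, np.array([1.0, -2.0, 0.5])))  # noisy estimate of 2*x
```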

AAAI Conference 2024 Conference Paper

Dynamic Spiking Graph Neural Networks

  • Nan Yin
  • Mengzhu Wang
  • Zhenghan Chen
  • Giulia De Masi
  • Huan Xiong
  • Bin Gu

The integration of Spiking Neural Networks (SNNs) and Graph Neural Networks (GNNs) is gradually attracting attention due to the low power consumption and high efficiency in processing the non-Euclidean data represented by graphs. However, as a common problem, dynamic graph representation learning faces challenges such as high complexity and large memory overheads. Current work often replaces Recurrent Neural Networks (RNNs) with SNNs, using binary features instead of continuous ones for efficient training, but this overlooks graph structure information and leads to the loss of details during propagation. Additionally, optimizing dynamic spiking models typically requires the propagation of information across time steps, which increases memory requirements. To address these challenges, we present a framework named Dynamic Spiking Graph Neural Networks (Dy-SIGN). To mitigate the information loss problem, Dy-SIGN propagates early-layer information directly to the last layer for information compensation. To accommodate the memory requirements, we apply implicit differentiation on the equilibrium state, which does not rely on the exact reverse of the forward computation. While traditional implicit differentiation methods are usually used for static situations, Dy-SIGN extends them to the dynamic graph setting. Extensive experiments on three large-scale real-world dynamic graph datasets validate the effectiveness of Dy-SIGN on dynamic node classification tasks with lower computational costs.

JBHI Journal 2024 Journal Article

Enhancing Motor Sequence Learning via Transcutaneous Auricular Vagus Nerve Stimulation (taVNS): An EEG Study

  • Long Chen
  • Chenghu Tang
  • Zhongpeng Wang
  • Lei Zhang
  • Bin Gu
  • Xiuyun Liu
  • Dong Ming

Motor learning plays a crucial role in human life, and various neuromodulation methods have been utilized to strengthen or improve it. Transcutaneous auricular vagus nerve stimulation (taVNS) has gained increasing attention due to its non-invasive nature, affordability and ease of implementation. Although the potential of taVNS for regulating motor learning has been suggested, its actual regulatory effect has not yet been fully explored. Electroencephalogram (EEG) analysis provides an in-depth understanding of the cognitive processes involved in motor learning and thus offers methodological support for the regulation of motor learning. To investigate the effect of taVNS on motor learning, this study recruited 22 healthy subjects to participate in a single-blind, sham-controlled, within-subject serial reaction time task (SRTT) experiment. Each subject took part in two sessions at least one week apart and received 20 minutes of active/sham taVNS in each session. Behavioral indicators, as well as EEG characteristics during the task state, were extracted and analyzed. The results revealed that compared to the sham group, the active group showed higher learning performance. Additionally, the EEG results indicated that after taVNS, the motor-related cortical potential amplitudes and alpha-gamma modulation index decreased significantly and functional connectivity based on partial directed coherence towards the frontal lobe was enhanced. These findings suggest that taVNS can improve motor learning, mainly through enhancing cognitive and memory functions rather than simple movement learning. This study confirms the positive regulatory effect of taVNS on motor learning, which is particularly promising as it offers a potential avenue for enhancing motor skills and facilitating rehabilitation.

AAAI Conference 2024 Conference Paper

Enhancing Training of Spiking Neural Network with Stochastic Latency

  • Srinivas Anumasa
  • Bhaskar Mukhoty
  • Velibor Bojkovic
  • Giulia De Masi
  • Huan Xiong
  • Bin Gu

Spiking neural networks (SNNs) have garnered significant attention for their low power consumption when deployed on neuromorphic hardware that operates at orders of magnitude lower power than general-purpose hardware. Direct training methods for SNNs come with an inherent latency for which the SNNs are optimized, and in general, the higher the latency, the better the predictive power of the models, but at the same time, the higher the energy consumption during training and inference. Furthermore, an SNN model optimized for one particular latency does not necessarily perform well at lower latencies, which becomes relevant in scenarios where it is necessary to switch to a lower latency because of the depletion of onboard energy or other operational requirements. In this work, we propose Stochastic Latency Training (SLT), a direct training method for SNNs that optimizes the model for the given latency while keeping the reduction in predictive accuracy minimal when shifted to lower inference latencies. We provide heuristics for our approach with partial theoretical justification and experimental evidence showing the state-of-the-art performance of our models on datasets such as CIFAR-10, DVS-CIFAR-10, CIFAR-100, and DVS-Gesture. Our code is available at https://github.com/srinuvaasu/SLT.

IJCAI Conference 2024 Conference Paper

Fine-grained Analysis of Stability and Generalization for Stochastic Bilevel Optimization

  • Xuelin Zhang
  • Hong Chen
  • Bin Gu
  • Tieliang Gong
  • Feng Zheng

Stochastic bilevel optimization (SBO) has recently been integrated into many machine learning paradigms, including hyperparameter optimization, meta learning, reinforcement learning, etc. Along with this wide range of applications, there have been abundant studies concerning the computational behavior of SBO. However, the generalization guarantees of SBO methods are far less understood through the lens of statistical learning theory. In this paper, we provide a systematic generalization analysis of first-order gradient-based bilevel optimization methods. Firstly, we establish quantitative connections between the on-average argument stability and the generalization gap of SBO methods. Then, we derive upper bounds on the on-average argument stability for single-timescale stochastic gradient descent (SGD) and two-timescale SGD, where three settings (nonconvex-nonconvex (NC-NC), convex-convex (C-C) and strongly-convex-strongly-convex (SC-SC)) are considered respectively. Experimental analysis validates our theoretical findings. Compared with previous algorithmic stability analyses, our results do not require the re-initialization of the inner-level parameters before each iteration and are suitable for more general objective functions.

IJCAI Conference 2024 Conference Paper

Hard-Thresholding Meets Evolution Strategies in Reinforcement Learning

  • Chengqian Gao
  • William de Vazelhes
  • Hualin Zhang
  • Bin Gu
  • Zhiqiang Xu

Evolution Strategies (ES) have emerged as a competitive alternative for model-free reinforcement learning, showcasing exemplary performance in tasks like Mujoco and Atari. Notably, they shine in scenarios with imperfect reward functions, making them invaluable for real-world applications where dense reward signals may be elusive. Yet, an inherent assumption in ES—that all input features are task-relevant—poses challenges, especially when confronted with irrelevant features common in real-world problems. This work scrutinizes this limitation, particularly focusing on the Natural Evolution Strategies (NES) variant. We propose NESHT, a novel approach that integrates Hard-Thresholding (HT) with NES to champion sparsity, ensuring only pertinent features are employed. Backed by rigorous analysis and empirical tests, NESHT demonstrates its promise in mitigating the pitfalls of irrelevant features and shines in complex decision-making problems like noisy Mujoco and Atari tasks. Our code is available at https://github.com/cangcn/NES-HT.
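
To make the NESHT idea concrete, here is a hedged toy sketch (the function name, constants, and toy reward are mine): one vanilla NES search-gradient ascent step followed by the same top-k hard-thresholding projection sketched earlier, so only k parameters stay active.

```python
import numpy as np

def nes_ht_step(f, theta, sigma=0.1, pop=50, lr=0.05, k=3):
    """One NES step: estimate the search gradient from a Gaussian population,
    ascend, then hard-threshold to keep the k largest-magnitude parameters."""
    eps = np.random.randn(pop, theta.size)
    returns = np.array([f(theta + sigma * e) for e in eps])
    grad = (returns[:, None] * eps).mean(axis=0) / sigma  # NES gradient estimate
    theta = theta + lr * grad
    out = np.zeros_like(theta)                            # hard-thresholding step
    idx = np.argpartition(np.abs(theta), -k)[-k:]
    out[idx] = theta[idx]
    return out

reward = lambda th: -np.sum((th - 1.0) ** 2)  # toy reward, maximized at all-ones
theta = np.zeros(10)
for _ in range(100):
    theta = nes_ht_step(reward, theta)
print(np.count_nonzero(theta))  # at most k = 3 features stay active
```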

NeurIPS Conference 2024 Conference Paper

How Does Black-Box Impact the Learning Guarantee of Stochastic Compositional Optimization?

  • Jun Chen
  • Hong Chen
  • Bin Gu

The stochastic compositional optimization (SCO) problem constitutes a class of optimization problems characterized by an objective function with a compositional form, including tasks with known derivatives, such as AUC maximization, and derivative-free tasks exemplified by black-box vertical federated learning (VFL). From the learning theory perspective, the learning guarantees of SCO algorithms with known derivatives have been studied in the literature. However, the potential impacts of the derivative-free setting on the learning guarantees of SCO remain unclear and merit further investigation. This paper aims to reveal these impacts by developing a theoretical analysis for two derivative-free algorithms, black-box SCGD and SCSC. Specifically, we first provide sharper generalization upper bounds for convex SCGD and SCSC based on a new stability analysis framework that is more effective than prior work under some milder conditions, and we further develop it for the non-convex case using the almost co-coercivity property of smooth functions. Then, we derive the learning guarantees of three black-box variants of non-convex SCGD and SCSC with additional optimization analysis. Comparing these results, we theoretically uncover the impacts: a better gradient estimate brings a tighter learning guarantee, and a larger proportion of unknown gradients may lead to a stronger dependence on the gradient estimation quality. Finally, our analysis is applied to two SCO algorithms, FOO-based VFL and VFL-CZOFO, to build the first learning guarantees for VFL that align with the findings for SCGD and SCSC.

JBHI Journal 2024 Journal Article

Influence of Transcutaneous Vagus Nerve Stimulation on Motor Planning: A Resting-State and Task-State EEG Study

  • Long Chen
  • Jiatong He
  • Jiasheng Zhang
  • Zhongpeng Wang
  • Lei Zhang
  • Bin Gu
  • Xiuyun Liu
  • Dong Ming

Transcutaneous vagus nerve stimulation (tVNS) shows a potential regulatory role for motor planning. Still, existing research mainly focuses on behavioral studies, and the neural modulation mechanism needs to be clarified. Therefore, we designed a multi-condition (active or sham, pre or under, difficult or easy, left-hand or right-hand) motor planning experiment to explore the effect of online tVNS (i.e., tVNS and tasks synchronized). Twenty-eight subjects were recruited and randomly assigned to active and sham groups. Both groups performed the same tasks in the experiment and separately collected task-state EEG and 5-min eye-open resting-state EEG. The results showed that the changes in event-related potential (ERP) and movement-related cortical potential (MRCP) amplitudes were more significant for the left-hand difficult task (LD) under active-tVNS. According to the power spectrum results, active-tVNS significantly modulated the activities of the contralateral motor cortex at beta and gamma bands in the resting state. The functional connectivity based on partial directed coherence (PDC) showed significant changes in the parietal lobe after active-tVNS. These findings suggest that tVNS is a promising way to improve motor planning ability.

AAAI Conference 2024 Conference Paper

Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery

  • William de Vazelhes
  • Bhaskar Mukhoty
  • Xiao-Tong Yuan
  • Bin Gu

Sparse recovery is ubiquitous in machine learning and signal processing. Due to the NP-hard nature of sparse recovery, existing methods are known to suffer either from restrictive (or even unknown) applicability conditions, or high computational cost. Recently, iterative regularization methods have emerged as a promising fast approach because they can achieve sparse recovery in one pass through early stopping, rather than the tedious grid-search used in the traditional methods. However, most of those iterative methods are based on the ℓ₁ norm, which requires restrictive applicability conditions and could fail in many cases. Therefore, achieving sparse recovery with iterative regularization methods under a wider range of conditions has yet to be further explored. To address this issue, we propose a novel iterative regularization algorithm, IRKSN, based on the k-support norm regularizer rather than the ℓ₁ norm. We provide conditions for sparse recovery with IRKSN, and compare them with traditional conditions for recovery with ℓ₁ norm regularizers. Additionally, we give an early stopping bound on the model error of IRKSN with explicit constants, achieving the standard linear rate for sparse recovery. Finally, we illustrate the applicability of our algorithm on several experiments, including a support recovery experiment with a correlated design matrix.

AAAI Conference 2024 Conference Paper

Limited Memory Online Gradient Descent for Kernelized Pairwise Learning with Dynamic Averaging

  • Hilal AlQuabeh
  • William de Vazelhes
  • Bin Gu

Pairwise learning, an important domain within machine learning, addresses loss functions defined on pairs of training examples, including those in metric learning and AUC maximization. Acknowledging the quadratic growth in computational complexity accompanying pairwise loss as the sample size grows, researchers have turned to online gradient descent (OGD) methods for enhanced scalability. Recently, an OGD algorithm emerged, employing gradient computation involving prior and most recent examples, a step that effectively reduces algorithmic complexity to O(T), with T being the number of received examples. This approach, however, confines itself to linear models while assuming the independence of example arrivals. We introduce a lightweight OGD algorithm that does not require the independence of examples and generalizes to kernel pairwise learning. Our algorithm builds the gradient based on a random example and a moving average representing the past data, which results in a sub-linear regret bound with a complexity of O(T). Furthermore, through the integration of O(√T log T) random Fourier features, the complexity of kernel calculations is effectively minimized. Several experiments with real-world datasets show that the proposed technique outperforms kernel and linear algorithms in offline and online scenarios.
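
The random Fourier feature trick the abstract invokes can be sketched in a few lines. A hedged numpy version for an RBF kernel follows (parameter names and the toy data are mine):

```python
import numpy as np

def rff_features(X: np.ndarray, n_features: int, gamma: float, seed: int = 0):
    """Random Fourier features whose inner products approximate the RBF
    kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.randn(5, 3)
Z = rff_features(X, n_features=500, gamma=0.5)
# Z @ Z.T approximates the 5x5 RBF kernel matrix of X
```

With O(√T log T) such features, kernel evaluations reduce to cheap inner products, which is how the algorithm keeps its overall O(T) complexity.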

JMLR Journal 2024 Journal Article

On the Intrinsic Structures of Spiking Neural Networks

  • Shao-Qun Zhang
  • Jia-Yi Chen
  • Jin-Hui Wu
  • Gao Zhang
  • Huan Xiong
  • Bin Gu
  • Zhi-Hua Zhou

Recent years have seen a surge of interest in spiking neural networks (SNNs). The performance of SNNs hinges not only on searching apposite architectures and connection weights, similar to conventional artificial neural networks, but also on the meticulous configuration of their intrinsic structures. However, there has been a dearth of comprehensive studies examining the impact of intrinsic structures; thus developers often find it challenging to apply a standardized configuration of SNNs across diverse datasets or tasks. This work delves deep into the intrinsic structures of SNNs. Initially, we draw two key conclusions: (1) the membrane time hyper-parameter is intimately linked to the eigenvalues of the integration operation, dictating the functional topology of spiking dynamics; (2) various hyper-parameters of the firing-reset mechanism govern the overall firing capacity of an SNN, mitigating the injection ratio or sampling density of input data. These findings elucidate why the efficacy of SNNs hinges heavily on the configuration of intrinsic structures and lead to a recommendation that enhancing the adaptability of these structures contributes to improving the overall performance and applicability of SNNs. Inspired by this recognition, we propose two feasible approaches to enhance SNN learning, involving developing self-connection architectures and stochastic spiking neurons to augment the adaptability of the integration operation and firing-reset mechanism, respectively. We theoretically prove that (1) both methods promote the expressive property for universal approximation, (2) the incorporation of self-connection architectures fosters ample solutions and structural stability for SNNs approximating adaptive dynamical systems, (3) the stochastic spiking neurons maintain generalization bounds with an exponential reduction in Rademacher complexity. Empirical experiments conducted on various real-world datasets affirm the effectiveness of our proposed methods.

NeurIPS Conference 2023 Conference Paper

A Unified Solution for Privacy and Communication Efficiency in Vertical Federated Learning

  • Ganyu Wang
  • Bin Gu
  • Qingsong Zhang
  • Xiang Li
  • Boyu Wang
  • Charles X. Ling

Vertical Federated Learning (VFL) is a collaborative machine learning paradigm that enables multiple participants to jointly train a model on their private data without sharing it. To make VFL practical, privacy security and communication efficiency should both be satisfied. Recent research has shown that Zero-Order Optimization (ZOO) in VFL can effectively conceal the internal information of the model without adding costly privacy protective add-ons, making it a promising approach for privacy and efficiency. However, there are still two key problems that have yet to be resolved. First, the convergence rate of ZOO-based VFL is significantly slower compared to gradient-based VFL, resulting in low efficiency in model training and more communication rounds, which hinders its application on large neural networks. Second, although ZOO-based VFL has demonstrated resistance to state-of-the-art (SOTA) attacks, its privacy guarantee lacks a theoretical explanation. To address these challenges, we propose a novel cascaded hybrid optimization approach that employs a zeroth-order (ZO) gradient on the most critical output layer of the clients, with other parts utilizing the first-order (FO) gradient. This approach preserves the privacy protection of ZOO while significantly enhancing convergence. Moreover, we theoretically prove that applying ZOO to the VFL is equivalent to adding a Gaussian Mechanism to the gradient information, which offers an implicit differential privacy guarantee. Experimental results demonstrate that our proposed framework achieves similar utility as the Gaussian mechanism under the same privacy budget, while also having significantly lower communication costs compared with SOTA communication-efficient VFL frameworks.

NeurIPS Conference 2023 Conference Paper

Accelerated On-Device Forward Neural Network Training with Module-Wise Descending Asynchronism

  • Xiaohan Zhao
  • Hualin Zhang
  • Zhouyuan Huo
  • Bin Gu

On-device learning faces memory constraints when optimizing or fine-tuning on edge devices with limited resources. Current techniques for training deep models on edge devices rely heavily on backpropagation. However, its high memory usage calls for a reassessment of its dominance. In this paper, we propose forward gradient descent (FGD) as a potential solution to overcome the memory capacity limitation in on-device learning. However, FGD's dependencies across layers hinder parallel computation and can lead to inefficient resource utilization. To mitigate this limitation, we propose AsyncFGD, an asynchronous framework that decouples dependencies, utilizes module-wise stale parameters, and maximizes parallel computation. We demonstrate its convergence to critical points through rigorous theoretical analysis. Empirical evaluations conducted on NVIDIA's AGX Orin, a popular embedded device, show that AsyncFGD reduces memory consumption and enhances hardware efficiency, offering a novel approach to on-device learning.
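
Forward gradient descent replaces backpropagation with a directional derivative along a random tangent, which yields an unbiased gradient estimate because E[vvᵀ] = I. A hedged sketch follows; AsyncFGD computes the directional derivative exactly via forward-mode AD, so the finite difference below is only a self-contained stand-in:

```python
import numpy as np

def forward_gradient(f, x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Forward gradient: directional derivative of f along a random v,
    scaled back by v; unbiased since E[v v^T] = I for v ~ N(0, I)."""
    v = np.random.randn(x.size)
    jvp = (f(x + eps * v) - f(x)) / eps  # finite-difference stand-in for a JVP
    return jvp * v

f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0])
est = np.mean([forward_gradient(f, x) for _ in range(2000)], axis=0)
print(est)  # close to the true gradient [2., -4.]
```

AsyncFGD's contribution is to run such forward passes asynchronously across modules with stale parameters; that scheduling layer is not reproduced here.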

AAAI Conference 2023 Conference Paper

Denoising Multi-Similarity Formulation: A Self-Paced Curriculum-Driven Approach for Robust Metric Learning

  • Chenkang Zhang
  • Lei Luo
  • Bin Gu

Deep Metric Learning (DML) is a group of techniques that aim to measure the similarity between objects through neural networks. Although the number of DML methods has rapidly increased in recent years, most previous studies cannot effectively handle noisy data, which commonly exists in practical applications and often leads to serious performance deterioration. To overcome this limitation, in this paper, we build a connection between noisy samples and hard samples in the framework of self-paced learning, and propose a Balanced Self-Paced Metric Learning (BSPML) algorithm with a denoising multi-similarity formulation, where noisy samples are treated as extremely hard samples and adaptively excluded from the model training by sample weighting. In particular, due to the pairwise relationship and a new balance regularization term, the sub-problem w.r.t. sample weights is a nonconvex quadratic function. To efficiently solve this nonconvex quadratic problem, we propose a doubly stochastic projection coordinate gradient algorithm. Importantly, we theoretically prove the convergence not only for the doubly stochastic projection coordinate gradient algorithm, but also for our BSPML algorithm. Experimental results on several standard data sets demonstrate that our BSPML algorithm has better generalization ability and robustness than the state-of-the-art robust DML approaches.
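
For intuition, the classic hard self-paced weighting rule that BSPML builds on fits in two lines; BSPML's actual weight sub-problem adds a balance regularization term and is nonconvex, so this minimal rule is shown only for orientation:

```python
import numpy as np

def self_paced_weights(losses: np.ndarray, age: float) -> np.ndarray:
    """Hard self-paced rule: admit a sample iff its loss is below the age
    threshold, so extremely hard (likely noisy) samples are excluded."""
    return (losses <= age).astype(float)

losses = np.array([0.2, 1.5, 0.4, 3.0])
print(self_paced_weights(losses, age=1.0))  # [1. 0. 1. 0.]
```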

NeurIPS Conference 2023 Conference Paper

Direct Training of SNN using Local Zeroth Order Method

  • Bhaskar Mukhoty
  • Velibor Bojkovic
  • William de Vazelhes
  • Xiaohan Zhao
  • Giulia De Masi
  • Huan Xiong
  • Bin Gu

Spiking neural networks are becoming increasingly popular for their low energy requirement in real-world tasks, with accuracy comparable to traditional ANNs. SNN training algorithms face the loss of gradient information and non-differentiability due to the Heaviside function when minimizing the model loss over model parameters. To circumvent this problem, the surrogate method employs a differentiable approximation of the Heaviside function in the backward pass, while the forward pass continues to use the Heaviside as the spiking function. We propose to use the zeroth-order technique at the local or neuron level in training SNNs, motivated by its regularizing and potential energy-efficient effects, and establish a theoretical connection between it and the existing surrogate methods. We perform experimental validation of the technique on standard static datasets (CIFAR-10, CIFAR-100, ImageNet-100) and neuromorphic datasets (DVS-CIFAR-10, DVS-Gesture, N-Caltech-101, NCARS) and obtain results that improve over the state of the art. The proposed method also lends itself to efficient implementations of the back-propagation method, which could provide a 3-4 times overall speedup in training time. The code is available at https://github.com/BhaskarMukhoty/LocalZO.
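
For orientation, a per-neuron (local) zeroth-order estimate of the Heaviside derivative can be formed with the generic 1D two-point Gaussian-smoothing estimator. The paper's exact estimator and its correspondence to surrogate functions differ in detail, so treat this as an assumption-laden sketch:

```python
import numpy as np

heaviside = lambda u: (u >= 0).astype(float)  # exact spike in the forward pass

def local_zo_heaviside_grad(u: float, delta: float = 0.05, n: int = 1000) -> float:
    """Two-point Gaussian-smoothing estimate of the Heaviside derivative at
    membrane potential u, applied locally (per neuron) in the backward pass."""
    z = np.random.randn(n)
    diffs = (heaviside(u + delta * z) - heaviside(u - delta * z)) / (2 * delta)
    return float(np.mean(diffs * z))

print(local_zo_heaviside_grad(0.0))  # large: u sits right on the firing threshold
print(local_zo_heaviside_grad(1.0))  # ~0: far from the threshold
```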

AAAI Conference 2023 Conference Paper

Faster Fair Machine via Transferring Fairness Constraints to Virtual Samples

  • Zhou Zhai
  • Lei Luo
  • Heng Huang
  • Bin Gu

Fair classification is an emerging and important research topic in the machine learning community. Existing methods usually formulate the fairness metrics as additional inequality constraints, and then embed them into the original objective. This makes fair classification problems impossible to tackle effectively with solvers designed for unconstrained optimization. Although many new tailored algorithms have been designed to attempt to overcome this limitation, they often incur additional computational burden and cannot cope with all types of fairness metrics. To address these challenging issues, in this paper, we propose a novel method for fair classification. Specifically, we theoretically demonstrate that all types of fairness with linear and non-linear covariance functions can be transferred to two virtual samples, which makes existing state-of-the-art classification solvers applicable to these cases. Meanwhile, we generalize the proposed method to multiple fairness constraints. We take SVM as an example to show the effectiveness of our new idea. Empirically, we test the proposed method on real-world datasets and all results confirm its excellent performance.

NeurIPS Conference 2023 Conference Paper

Fine-Grained Theoretical Analysis of Federated Zeroth-Order Optimization

  • Jun Chen
  • Hong Chen
  • Bin Gu
  • Hao Deng

The federated zeroth-order optimization (FedZO) algorithm enjoys the advantages of both zeroth-order optimization and federated learning, and has shown exceptional performance on black-box attack and softmax regression tasks. However, there is no generalization analysis for FedZO, and the convergence rate established for it is slower than that of the corresponding first-order optimization setting. This paper aims to establish systematic theoretical assessments of FedZO by developing the analysis technique of on-average model stability. We establish the first generalization error bound of FedZO under the Lipschitz continuity and smoothness conditions. Then, refined generalization and optimization bounds are provided by replacing the bounded gradient assumption with heavy-tailed gradient noise and utilizing the second-order Taylor expansion for gradient approximation. With the help of a new error decomposition strategy, our theoretical analysis is also extended to the asynchronous case. For FedZO, our fine-grained analysis fills the theoretical gap on the generalization guarantees and polishes the convergence characterization of the computing algorithm.

AAAI Conference 2023 Conference Paper

On the Stability and Generalization of Triplet Learning

  • Jun Chen
  • Hong Chen
  • Xue Jiang
  • Bin Gu
  • Weifu Li
  • Tieliang Gong
  • Feng Zheng

Triplet learning, i.e., learning from triplet data, has attracted much attention in computer vision tasks with an extremely large number of categories, e.g., face recognition and person re-identification. Albeit with rapid progress in designing and applying triplet learning algorithms, the theoretical understanding of their generalization performance is still lacking. To fill this gap, this paper investigates the generalization guarantees of triplet learning by leveraging stability analysis. Specifically, we establish the first general high-probability generalization bound for triplet learning algorithms satisfying uniform stability, and then obtain excess risk bounds of order O(log n/√n) for both stochastic gradient descent (SGD) and regularized risk minimization (RRM), where 2n is approximately equal to the number of training samples. Moreover, an optimistic generalization bound in expectation as fast as O(1/n) is derived for RRM in a low noise case via on-average stability analysis. Finally, our results are applied to triplet metric learning to characterize its theoretical underpinning.
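
As a concrete anchor for "learning from triplet data", the standard triplet hinge loss that such analyses apply to looks as follows (a minimal sketch; variable names and toy vectors are mine):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: the positive must be closer to the anchor than the
    negative by at least `margin`, otherwise the gap is penalized."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)

a, p, n = np.zeros(4), np.full(4, 0.1), np.ones(4)
print(triplet_loss(a, p, n))  # 0.0: the negative is already far enough away
```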

AAAI Conference 2023 Conference Paper

Stability-Based Generalization Analysis for Mixtures of Pointwise and Pairwise Learning

  • Jiahuan Wang
  • Jun Chen
  • Hong Chen
  • Bin Gu
  • Weifu Li
  • Xin Tang

Recently, some mixture algorithms of pointwise and pairwise learning (PPL) have been formulated by employing the hybrid error metric of “pointwise loss + pairwise loss” and have shown empirical effectiveness on feature selection, ranking and recommendation tasks. However, to the best of our knowledge, the learning theory foundation of PPL has not been touched in the existing works. In this paper, we try to fill this theoretical gap by investigating the generalization properties of PPL. After extending the definitions of algorithmic stability to the PPL setting, we establish the high-probability generalization bounds for uniformly stable PPL algorithms. Moreover, explicit convergence rates of stochastic gradient descent (SGD) and regularized risk minimization (RRM) for PPL are stated by developing the stability analysis technique of pairwise learning. In addition, the refined generalization bounds of PPL are obtained by replacing uniform stability with on-average stability.
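
A minimal sketch of the hybrid "pointwise loss + pairwise loss" metric under study, using squared losses and a mixing weight alpha (both of which are my illustrative assumptions, not the paper's specification):

```python
import numpy as np

def ppl_objective(w, X, y, alpha=0.5):
    """Mixture of a pointwise squared loss and a pairwise squared loss on
    prediction differences, as in the PPL error metric."""
    pred = X @ w
    pointwise = np.mean((pred - y) ** 2)
    gap = (pred[:, None] - pred[None, :]) - (y[:, None] - y[None, :])
    pairwise = np.mean(gap ** 2)
    return alpha * pointwise + (1 - alpha) * pairwise

X, y = np.random.randn(20, 3), np.random.randn(20)
print(ppl_objective(np.zeros(3), X, y))
```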

AAAI Conference 2023 Conference Paper

When Online Learning Meets ODE: Learning without Forgetting on Variable Feature Space

  • Diyang Li
  • Bin Gu

Machine learning systems that built upon varying feature space are ubiquitous across the world. When the set of practical or virtual features changes, the online learning approach can adjust the learned model accordingly rather than re-training from scratch and has been an attractive area of research. Despite its importance, most studies for algorithms that are capable of handling online features have no ensurance of stationarity point convergence, while the accuracy guaranteed methods are still limited to some simple cases such as L_1 or L_2 norms with square loss. To address this challenging problem, we develop an efficient Dynamic Feature Learning System (DFLS) to perform online learning on the unfixed feature set for more general statistical models and demonstrate how DFLS opens up many new applications. We are the first to achieve accurate & reliable feature-wise online learning for a broad class of models like logistic regression, spline interpolation, group Lasso and Poisson regression. By utilizing DFLS, the updated model is theoretically the same as the model trained from scratch using the entire new feature space. Specifically, we reparameterize the feature-varying procedure and devise the corresponding ordinary differential equation (ODE) system to compute the optimal solutions of the new model status. Simulation studies reveal that the proposed DFLS can substantially ease the computational cost without forgetting.

AAAI Conference 2022 Conference Paper

A Fully Single Loop Algorithm for Bilevel Optimization without Hessian Inverse

  • Junyi Li
  • Bin Gu
  • Heng Huang

In this paper, we propose a novel Hessian-inverse-free Fully Single Loop Algorithm (FSLA) for bilevel optimization problems. Classic algorithms for bilevel optimization admit a double loop structure which is computationally expensive. Recently, several single loop algorithms have been proposed that optimize the inner and outer variables alternately. However, these algorithms do not yet achieve a fully single loop, as they overlook the loop needed to evaluate the hyper-gradient for a given inner and outer state. In order to develop a fully single loop algorithm, we first study the structure of the hyper-gradient and identify a general approximation formulation of hyper-gradient computation that encompasses several previous common approaches, e.g., back-propagation through time, conjugate gradient, etc. Based on this formulation, we introduce a new state variable to maintain the historical hyper-gradient information. Combining our new formulation with the alternating update of the inner and outer variables, we propose an efficient fully single loop algorithm. We theoretically show that the error generated by the new state can be bounded and our algorithm converges at the rate of O(ε⁻²). Finally, we verify the efficacy of our algorithm empirically through multiple bilevel optimization based machine learning tasks. A long version of this paper can be found at https://arxiv.org/abs/2112.04660.

AAAI Conference 2022 Conference Paper

Balanced Self-Paced Learning for AUC Maximization

  • Bin Gu
  • Chenkang Zhang
  • Huan Xiong
  • Heng Huang

Learning to improve AUC performance is an important topic in machine learning. However, AUC maximization algorithms may decrease generalization performance due to noisy data. Self-paced learning is an effective method for handling noisy data. However, existing self-paced learning methods are limited to pointwise learning, while AUC maximization is a pairwise learning problem. To solve this challenging problem, we innovatively propose a balanced self-paced AUC maximization algorithm (BSPAUC). Specifically, we first provide a statistical objective for self-paced AUC. Based on this, we propose our self-paced AUC maximization formulation, where a novel balanced self-paced regularization term is embedded to ensure that the selected positive and negative samples have proper proportions. Notably, the sub-problem with respect to all weight variables may be nonconvex in our formulation, whereas it is normally convex in existing self-paced problems. To address this, we propose a doubly cyclic block coordinate descent method. More importantly, we prove that the sub-problem with respect to all weight variables converges to a stationary point on the basis of closed-form solutions, and our BSPAUC converges to a stationary point of our fixed optimization objective under a mild assumption. Considering both the deep learning and kernel-based implementations, experimental results on several large-scale datasets demonstrate that our BSPAUC has better generalization performance than existing state-of-the-art AUC maximization methods.
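
For reference, the pairwise structure that distinguishes AUC maximization from pointwise learning shows up in a generic squared-hinge surrogate for 1 − AUC. BSPAUC additionally attaches self-paced weights to each sample, so this unweighted sketch is an illustrative assumption:

```python
import numpy as np

def pairwise_auc_loss(scores_pos, scores_neg, margin=1.0):
    """Squared-hinge surrogate for 1 - AUC: penalize every positive/negative
    pair whose score gap falls short of the margin."""
    gaps = scores_pos[:, None] - scores_neg[None, :]  # all pairwise gaps
    return np.mean(np.maximum(0.0, margin - gaps) ** 2)

print(pairwise_auc_loss(np.array([2.0, 1.2]), np.array([0.1, 0.8])))
```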

AAAI Conference 2022 Conference Paper

Chunk Dynamic Updating for Group Lasso with ODEs

  • Diyang Li
  • Bin Gu

Group Lasso is an important sparse regression method in machine learning which encourages selecting key explanatory factors in a grouped manner because of the use of the ℓ₂,₁-norm. In real-world learning tasks, some chunks of data would be added into or removed from the training set in sequence due to the existence of new or obsolete historical data, which is normally called the dynamic or lifelong learning scenario. However, most existing algorithms for group Lasso are limited to offline updating, and the only online algorithm can handle newly added samples only inexactly. Due to the complexity of the ℓ₂,₁-norm, how to achieve accurate chunk incremental and decremental learning efficiently for group Lasso is still an open question. To address this challenging problem, in this paper, we propose a novel accurate dynamic updating algorithm for group Lasso by utilizing the technique of Ordinary Differential Equations (ODEs), which can incorporate or eliminate a chunk of samples from the original training set without retraining the model from scratch. Specifically, we introduce a new formulation to reparameterize the adjustment procedures of chunk incremental and decremental learning simultaneously. Based on the new formulation, we propose a path following algorithm for group Lasso with respect to the adjustment parameter. Importantly, we prove that our path following algorithm can exactly track the piecewise smooth solutions thanks to the technique of ODEs, so that accurate chunk incremental and decremental learning can be achieved. Extensive experimental results not only confirm the effectiveness of the proposed algorithm for chunk incremental and decremental learning, but also validate its efficiency compared to the existing offline and online algorithms.

NeurIPS Conference 2022 Conference Paper

GAGA: Deciphering Age-path of Generalized Self-paced Regularizer

  • Xingyu Qu
  • Diyang Li
  • Xiaohan Zhao
  • Bin Gu

Nowadays self-paced learning (SPL) is an important machine learning paradigm that mimics the cognitive process of humans and animals. The SPL regime involves a self-paced regularizer and a gradually increasing age parameter, which plays a key role in SPL but where to optimally terminate this process is still non-trivial to determine. A natural idea is to compute the solution path w.r.t. the age parameter (i.e., age-path). However, current age-path algorithms are either limited to the simplest regularizer, or lack solid theoretical understanding as well as computational efficiency. To address this challenge, we propose a novel Generalized Age-path Algorithm (GAGA) for SPL with various self-paced regularizers based on ordinary differential equations (ODEs) and sets control, which can learn the entire solution spectrum w.r.t. a range of age parameters. To the best of our knowledge, GAGA is the first exact path-following algorithm tackling the age-path for general self-paced regularizers. Finally, the algorithmic steps of classic SVM and Lasso are described in detail. We demonstrate the performance of GAGA on real-world datasets, and find considerable speedup between our algorithm and competing baselines.

NeurIPS Conference 2022 Conference Paper

Zeroth-Order Hard-Thresholding: Gradient Error vs. Expansivity

  • William de Vazelhes
  • Hualin Zhang
  • Huimin Wu
  • Xiaotong Yuan
  • Bin Gu

$\ell_0$ constrained optimization is prevalent in machine learning, particularly for high-dimensional problems, because it is a fundamental approach to achieve sparse learning. Hard-thresholding gradient descent is a dominant technique to solve this problem. However, first-order gradients of the objective function may be either unavailable or expensive to calculate in a lot of real-world problems, where zeroth-order (ZO) gradients could be a good surrogate. Unfortunately, whether ZO gradients can work with the hard-thresholding operator is still an unsolved problem. To solve this puzzle, in this paper, we focus on the $\ell_0$ constrained black-box stochastic optimization problems, and propose a new stochastic zeroth-order gradient hard-thresholding (SZOHT) algorithm with a general ZO gradient estimator powered by a novel random support sampling. We provide the convergence analysis of SZOHT under standard assumptions. Importantly, we reveal a conflict between the deviation of ZO estimators and the expansivity of the hard-thresholding operator, and provide a theoretical minimal value of the number of random directions in ZO gradients. In addition, we find that the query complexity of SZOHT is independent or weakly dependent on the dimensionality under different settings. Finally, we illustrate the utility of our method on a portfolio optimization problem as well as black-box adversarial attacks.
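
The "random support sampling" the abstract mentions can be sketched as follows. SZOHT's exact estimator and its variance constants are specified in the paper, so this forward-difference version is only indicative; for this sampling scheme E[uuᵀ] = I/d, hence the factor d:

```python
import numpy as np

def zo_grad_support_sampled(f, x, s=2, mu=1e-4, n_dirs=50):
    """Forward-difference ZO gradient estimate whose random unit directions
    live on a random s-sparse support."""
    d = x.size
    g = np.zeros(d)
    for _ in range(n_dirs):
        u = np.zeros(d)
        supp = np.random.choice(d, size=s, replace=False)
        v = np.random.randn(s)
        u[supp] = v / np.linalg.norm(v)
        g += d * (f(x + mu * u) - f(x)) / mu * u  # scale by d since E[u u^T] = I/d
    return g / n_dirs

f = lambda x: np.sum(x ** 2)
print(zo_grad_support_sampled(f, np.array([1.0, -2.0, 0.5, 0.0])))  # ~2*x, noisy
```

Sparse directions keep each query cheap and interact gently with the hard-thresholding projection, which is the tension the paper analyzes.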

NeurIPS Conference 2022 Conference Paper

Zeroth-Order Negative Curvature Finding: Escaping Saddle Points without Gradients

  • Hualin Zhang
  • Huan Xiong
  • Bin Gu

We consider escaping saddle points of nonconvex problems where only the function evaluations can be accessed. Although a variety of works have been proposed, the majority of them require either second or first-order information, and only a few of them have exploited zeroth-order methods, particularly the technique of negative curvature finding with zeroth-order methods which has been proven to be the most efficient method for escaping saddle points. To fill this gap, in this paper, we propose two zeroth-order negative curvature finding frameworks that can replace Hessian-vector product computations without increasing the iteration complexity. We apply the proposed frameworks to ZO-GD, ZO-SGD, ZO-SCSG, ZO-SPIDER and prove that these ZO algorithms can converge to $(\epsilon, \delta)$-approximate second-order stationary points with less query complexity compared with prior zeroth-order works for finding local minima.

JMLR Journal 2021 Journal Article

Black-Box Reductions for Zeroth-Order Gradient Algorithms to Achieve Lower Query Complexity

  • Bin Gu
  • Xiyuan Wei
  • Shangqian Gao
  • Ziran Xiong
  • Cheng Deng
  • Heng Huang

Zeroth-order (ZO) optimization has been the key technique for various machine learning applications, especially black-box adversarial attack, where models need to be learned in a gradient-free manner. Although many ZO algorithms have been proposed, their high function query complexities seriously hinder their application. To address this challenging problem, we propose two stagewise black-box reduction frameworks for ZO algorithms under convex and non-convex settings respectively, which lower the function query complexities of ZO algorithms. Moreover, our frameworks can directly derive the convergence results of ZO algorithms under convex and non-convex settings without extra analyses, as long as convergence results under the strongly convex setting are given. To illustrate the advantages, we further study ZO-SVRG, ZO-SAGA and ZO-Varag under the strongly-convex setting and use our frameworks to directly derive the convergence results under convex and non-convex settings. The function query complexities of these algorithms derived by our frameworks are lower than those of their vanilla counterparts without frameworks, or even lower than those of state-of-the-art algorithms. Finally, we conduct numerical experiments to illustrate the superiority of our frameworks.

AAAI Conference 2021 Conference Paper

Fast and Scalable Adversarial Training of Kernel SVM via Doubly Stochastic Gradients

  • Huimin Wu
  • Zhengmian Hu
  • Bin Gu

Adversarial attacks, which generate examples almost indistinguishable from natural ones, pose a serious threat to learning models. Defending against adversarial attacks is a critical element for a reliable learning system. Support vector machine (SVM) is a classical yet still important learning algorithm even in the current deep learning era. Although a wide range of research has been done in recent years to improve the adversarial robustness of learning models, most of it is limited to deep neural networks (DNNs), and work on kernel SVM is still vacant. In this paper, we focus on kernel SVM and propose adv-SVM to improve its adversarial robustness via adversarial training, which has been demonstrated to be the most promising defense technique. To the best of our knowledge, this is the first work devoted to the fast and scalable adversarial training of kernel SVM. Specifically, we first build a connection between perturbations of samples in the original and kernel spaces, and then give a reduced and equivalent formulation of adversarial training of kernel SVM based on this connection. Next, doubly stochastic gradients (DSG) based on two unbiased stochastic approximations (i.e., one on training points and another on random features) are applied to update the solution of our objective function. Finally, we prove that our algorithm optimized by DSG converges to the optimal solution at the rate of O(1/t) under constant and diminishing step sizes. Comprehensive experimental results show that our adversarial training algorithm enjoys robustness against various attacks and meanwhile has similar efficiency and scalability to the classical DSG algorithm.

AAAI Conference 2021 Conference Paper

Improved Penalty Method via Doubly Stochastic Gradients for Bilevel Hyperparameter Optimization

  • Wanli Shi
  • Bin Gu

Hyperparameter optimization (HO) is an important problem in machine learning which is normally formulated as a bilevel optimization problem. Gradient-based methods are dominant in bilevel optimization due to their high scalability to the number of hyperparameters, especially in deep learning problems. However, traditional gradient-based bilevel optimization methods need intermediate steps to obtain the exact or approximate gradient of the hyperparameters, namely the hypergradient, for the upper-level objective, whose complexity is high especially for high dimensional datasets. Recently, a penalty method has been proposed to avoid the computation of the hypergradient, which speeds up gradient-based BHO methods. However, the penalty method may result in a very large number of constraints, which greatly limits its efficiency, especially for high dimensional data problems. To address this limitation, in this paper, we propose a doubly stochastic gradient descent algorithm (DSGPHO) to improve the efficiency of the penalty method. Importantly, we not only prove that the proposed method can converge to the KKT condition of the original problem in a convex setting, but also provide the convergence rate of DSGPHO, which is the first such result in the literature on gradient-based bilevel optimization as far as we know. We compare our method with three state-of-the-art gradient-based methods on three tasks, i.e., data denoising, few-shot learning, and training data poisoning, using several large-scale benchmark datasets. All the results demonstrate that our method outperforms or is comparable to the existing methods in terms of accuracy and efficiency.

AAAI Conference 2021 Conference Paper

Large Batch Optimization for Deep Learning Using New Complete Layer-Wise Adaptive Rate Scaling

  • Zhouyuan Huo
  • Bin Gu
  • Heng Huang

Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications. Warmup is one of the nontrivial techniques used to stabilize the convergence of large batch training. However, warmup is an empirical method, and it is still unknown whether there is a better algorithm with theoretical underpinnings. In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training. We prove the convergence of our algorithm by introducing a new fine-grained analysis of gradient-based methods. Furthermore, the new analysis also helps to understand two other empirical tricks, layer-wise adaptive rate scaling and linear learning rate scaling. We conduct extensive experiments and demonstrate that the proposed algorithm outperforms the gradual warmup technique by a large margin and surpasses the convergence of the state-of-the-art large-batch optimizer in training advanced deep neural networks (ResNet, DenseNet, MobileNet) on the ImageNet dataset.
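
The layer-wise adaptive rate scaling trick the analysis covers can be stated in one line per layer: scale the step by the trust ratio ||w||/||g||. A hedged sketch of the classic LARS-style rule follows (CLARS refines this rule; the exact correction is in the paper, and the function name and toy layers are mine):

```python
import numpy as np

def layerwise_scaled_step(w, g, base_lr=0.1, eps=1e-8):
    """Layer-wise adaptive rate scaling: the trust ratio ||w|| / ||g|| keeps
    the relative update size comparable across layers of different scales."""
    trust = np.linalg.norm(w) / (np.linalg.norm(g) + eps)
    return w - base_lr * trust * g

layers = [np.random.randn(10), np.random.randn(100)]
grads = [0.01 * np.random.randn(10), 5.0 * np.random.randn(100)]
layers = [layerwise_scaled_step(w, g) for w, g in zip(layers, grads)]
```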

AAAI Conference 2021 Conference Paper

Secure Bilevel Asynchronous Vertical Federated Learning with Backward Updating

  • Qingsong Zhang
  • Bin Gu
  • Cheng Deng
  • Heng Huang

Vertical federated learning (VFL) attracts increasing attention due to the emerging demands of multi-party collaborative modeling and concerns of privacy leakage. In real VFL applications, usually only one or a few parties hold labels, which makes it challenging for all parties to collaboratively learn the model without privacy leakage. Meanwhile, most existing VFL algorithms are trapped in synchronous computations, which leads to inefficiency in their real-world applications. To address these challenging problems, we propose a novel VFL framework integrated with a new backward updating mechanism and a bilevel asynchronous parallel architecture (VFB²), under which three new algorithms, VFB²-SGD, VFB²-SVRG, and VFB²-SAGA, are proposed. We derive theoretical results on the convergence rates of these three algorithms under both strongly convex and nonconvex conditions. We also prove the security of VFB² under semi-honest threat models. Extensive experiments on benchmark datasets demonstrate that our algorithms are efficient, scalable and lossless.

JMLR Journal 2020 Journal Article

A Unified q-Memorization Framework for Asynchronous Stochastic Optimization

  • Bin Gu
  • Wenhan Xian
  • Zhouyuan Huo
  • Cheng Deng
  • Heng Huang

Asynchronous stochastic algorithms with various variance reduction techniques (such as SVRG, S2GD, SAGA and q-SAGA) are popular in solving large scale learning problems. Recently, Reddi et al. (2015) proposed a unified variance reduction framework (i.e., HSAG) to analyze asynchronous stochastic gradient optimization. However, the HSAG framework cannot incorporate the S2GD technique, and the analysis of the HSAG framework is limited to the SVRG and SAGA techniques on smooth convex optimization. They did not analyze other important variance reduction techniques (e.g., S2GD and q-SAGA) or other important optimization problems (e.g., convex optimization with non-smooth regularization and non-convex optimization with a cardinality constraint). In this paper, we bridge this gap by using a unified q-memorization framework for various variance reduction techniques (including SVRG, S2GD, SAGA, q-SAGA) to analyze asynchronous stochastic algorithms for three important optimization problems. Specifically, based on the q-memorization framework, 1) we propose an asynchronous stochastic gradient hard thresholding algorithm with q-memorization (AsySGHT-qM) for non-convex optimization with a cardinality constraint, and prove that the convergence rate of AsySGHT-qM before reaching the inherent error induced by gradient hard thresholding methods is geometric. 2) We propose an asynchronous stochastic proximal gradient algorithm (AsySPG-qM) for convex optimization with non-smooth regularization, and prove that AsySPG-qM can achieve a linear convergence rate. 3) We propose an asynchronous stochastic gradient descent algorithm (AsySGD-qM) for the general non-convex optimization problem, and prove that AsySGD-qM can achieve a sublinear convergence rate to stationary points. The experimental results on various large-scale datasets confirm the fast convergence of our AsySGHT-qM, AsySPG-qM and AsySGD-qM through concrete realizations of SVRG and SAGA.

AAAI Conference 2020 Conference Paper

Quadruply Stochastic Gradient Method for Large Scale Nonlinear Semi-Supervised Ordinal Regression AUC Optimization

  • Wanli Shi
  • Bin Gu
  • Xiang Li
  • Heng Huang

Semi-supervised ordinal regression (S²OR) problems are ubiquitous in real-world applications, where only a few ordered instances are labeled and massive instances remain unlabeled. Recent research has shown that directly optimizing the concordance index or AUC can impose a better ranking on the data than optimizing the traditional error rate in ordinal regression (OR) problems. In this paper, we propose an unbiased objective function for S²OR AUC optimization based on the ordinal binary decomposition approach. Besides, to handle large-scale kernelized learning problems, we propose a scalable algorithm called QS³ORAO using the doubly stochastic gradients (DSG) framework for functional optimization. Theoretically, we prove that our method can converge to the optimal solution at the rate of O(1/t), where t is the number of iterations for stochastic data sampling. Extensive experimental results on various benchmark and real-world datasets also demonstrate that our method is efficient and effective while retaining similar generalization performance.

AAAI Conference 2020 Conference Paper

Safe Sample Screening for Robust Support Vector Machine

  • Zhou Zhai
  • Bin Gu
  • Xiang Li
  • Heng Huang

Robust support vector machine (RSVM) has been shown to perform remarkably well in improving the generalization performance of the support vector machine in noisy environments. Unfortunately, to handle the non-convexity induced by the ramp loss in RSVM, existing RSVM solvers often adopt the DC programming framework, which is computationally inefficient because it runs multiple outer loops. This hinders the application of RSVM to large-scale problems. Safe sample screening, which allows the exclusion of training samples prior to or early in the training process, is an effective method to greatly reduce computational time. However, existing safe sample screening algorithms are limited to convex optimization problems, while RSVM is a non-convex problem. To address this challenge, in this paper, we propose two safe sample screening rules for RSVM based on the framework of the concave-convex procedure (CCCP). Specifically, we provide one screening rule for the inner solver of CCCP and another rule for propagating screened samples between two successive solvers of CCCP. To the best of our knowledge, this is the first work to apply safe sample screening to a non-convex optimization problem. More importantly, we provide security guarantees for our sample screening rules for RSVM. Experimental results on a variety of benchmark datasets verify that our safe sample screening rules can significantly reduce the computational time.
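
To give a flavor of what a safe screening rule looks like, here is a minimal Python sketch of a generic linear-hinge-loss rule under an assumed bound R on how far the optimal model can be from the current estimate; this is an illustration of the idea, not the paper's CCCP-based rules:

```python
import numpy as np

def safe_screen(X: np.ndarray, y: np.ndarray,
                w_hat: np.ndarray, R: float) -> np.ndarray:
    """Boolean mask of samples provably inactive for the hinge loss.
    If ||w* - w_hat|| <= R, then for any such w*,
    y_i <w*, x_i> >= y_i <w_hat, x_i> - R * ||x_i||.
    When that lower bound exceeds 1, sample i has zero hinge loss
    at the optimum and can be safely removed from training."""
    margins = y * (X @ w_hat)               # current margins
    slack = R * np.linalg.norm(X, axis=1)   # worst-case perturbation
    return margins - slack > 1.0            # True => safe to discard
```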

IJCAI Conference 2019 Conference Paper

Asynchronous Stochastic Frank-Wolfe Algorithms for Non-Convex Optimization

  • Bin Gu
  • Wenhan Xian
  • Heng Huang

Asynchronous parallel stochastic optimization for non-convex problems is becoming increasingly important in machine learning, especially due to the popularity of deep learning. The Frank-Wolfe (a.k.a. conditional gradient) algorithm has regained much interest because of its projection-free property and its ability to handle structured constraints. However, our understanding of asynchronous stochastic Frank-Wolfe algorithms is extremely limited, especially in the non-convex setting. To address this challenging problem, in this paper, we propose an asynchronous stochastic Frank-Wolfe algorithm (AsySFW) and its variance-reduced version (AsySVFW) for solving constrained non-convex optimization problems. More importantly, we prove fast convergence rates for AsySFW and AsySVFW in the non-convex setting. To the best of our knowledge, AsySFW and AsySVFW are the first asynchronous parallel stochastic algorithms with convergence guarantees for solving constrained non-convex optimization problems. The experimental results on real high-dimensional gray-scale images not only confirm the fast convergence of our algorithms, but also show a near-linear speedup on a shared-memory parallel system thanks to the lock-free implementation.
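
As background on the projection-free property mentioned above, here is a minimal Python sketch of one classical (sequential) Frank-Wolfe step over an ℓ₁-ball constraint, where the linear minimization oracle has a closed form; it illustrates the generic method, not the paper's asynchronous variants:

```python
import numpy as np

def fw_step(x: np.ndarray, grad: np.ndarray,
            radius: float, gamma: float) -> np.ndarray:
    """One Frank-Wolfe step on {x : ||x||_1 <= radius}.
    The linear minimization oracle argmin_{||s||_1 <= radius} <grad, s>
    is a signed vertex of the l1-ball: -radius * sign(grad_i*) * e_i*,
    where i* is the coordinate of largest |grad|."""
    i = int(np.argmax(np.abs(grad)))
    s = np.zeros_like(x)
    s[i] = -radius * np.sign(grad[i])
    return x + gamma * (s - x)   # convex combination stays feasible
```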

AAAI Conference 2019 Conference Paper

Faster Gradient-Free Proximal Stochastic Methods for Nonconvex Nonsmooth Optimization

  • Feihu Huang
  • Bin Gu
  • Zhouyuan Huo
  • Songcan Chen
  • Heng Huang

Proximal gradient methods have played an important role in solving many machine learning tasks, especially nonsmooth problems. However, in some machine learning problems such as the bandit model and black-box learning, proximal gradient methods can fail because the explicit gradients of these problems are difficult or infeasible to obtain. Gradient-free (zeroth-order) methods can address these problems because only the objective function values are required in the optimization. Recently, the first zeroth-order proximal stochastic algorithm was proposed to solve nonconvex nonsmooth problems. However, its convergence rate is O(1/√T) for nonconvex problems, which is significantly slower than the best convergence rate O(1/T) of zeroth-order stochastic algorithms, where T is the iteration number. To fill this gap, in this paper, we propose a class of faster zeroth-order proximal stochastic methods with the variance reduction techniques of SVRG and SAGA, denoted ZO-ProxSVRG and ZO-ProxSAGA, respectively. In our theoretical analysis, we address the main challenge that an unbiased estimate of the true gradient is not available in the zeroth-order case, which was required in previous theoretical analyses of both SVRG and SAGA. Moreover, we prove that both ZO-ProxSVRG and ZO-ProxSAGA achieve O(1/T) convergence rates. Finally, the experimental results verify that our algorithms have a faster convergence rate than the existing zeroth-order proximal stochastic algorithm.
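
The core primitive behind such methods is a finite-difference gradient estimator that uses only function evaluations. Here is a minimal Python sketch of the standard two-point random-direction estimator (a common zeroth-order construction, not code from the paper):

```python
import numpy as np

def zo_gradient(f, x: np.ndarray, mu: float = 1e-4) -> np.ndarray:
    """Two-point zeroth-order gradient estimate along a random
    Gaussian direction u:
        g = [f(x + mu*u) - f(x - mu*u)] / (2*mu) * u.
    In expectation this approximates the gradient of a smoothed f,
    using only two function evaluations per call."""
    u = np.random.randn(*x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
```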

IJCAI Conference 2019 Conference Paper

Quadruply Stochastic Gradients for Large Scale Nonlinear Semi-Supervised AUC Optimization

  • Wanli Shi
  • Bin Gu
  • Xiang Li
  • Xiang Geng
  • Heng Huang

Semi-supervised learning is pervasive in real-world applications, where only a few labeled data are available and large amounts of instances remain unlabeled. Since AUC is an important model evaluation metric in classification, directly optimizing AUC in the semi-supervised learning scenario has drawn much attention in the machine learning community. Recently, it has been shown that one can find an unbiased solution to the semi-supervised AUC maximization problem without knowing the class prior distribution. However, this method is hardly scalable for nonlinear classification problems with kernels. To address this problem, in this paper, we propose a novel scalable quadruply stochastic gradient algorithm (QSG-S2AUC) for nonlinear semi-supervised AUC optimization. In each iteration of the stochastic optimization process, our method randomly samples a positive instance, a negative instance, an unlabeled instance, and their random features to compute the gradient, and then updates the model using this quadruply stochastic gradient to approach the optimal solution. More importantly, we prove that QSG-S2AUC converges to the optimal solution at a rate of O(1/t), where t is the iteration number. Extensive experimental results on a variety of benchmark datasets show that QSG-S2AUC is far more efficient than the existing state-of-the-art algorithms for semi-supervised AUC maximization, while retaining similar generalization performance.
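
The "random features" sampled in each iteration refer to random Fourier features, which approximate a shift-invariant kernel with an explicit finite-dimensional map. Here is a minimal Python sketch of the standard construction for the RBF kernel (background illustration, not the paper's algorithm):

```python
import numpy as np

def rff_map(X: np.ndarray, D: int, gamma: float, seed: int = 0) -> np.ndarray:
    """Random Fourier features for the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2): draw W ~ N(0, 2*gamma*I) and
    b ~ Uniform[0, 2*pi]; then z(x) = sqrt(2/D) * cos(x W^T + b)
    satisfies E[z(x) . z(y)] = k(x, y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(D, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)
```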

AAAI Conference 2019 Conference Paper

Scalable and Efficient Pairwise Learning to Achieve Statistical Accuracy

  • Bin Gu
  • Zhouyuan Huo
  • Heng Huang

Pairwise learning is an important topic in the machine learning community, where the loss function involves pairs of samples (e.g., AUC maximization and metric learning). Existing pairwise learning algorithms do not achieve generality, scalability, and efficiency simultaneously. To address these challenging problems, in this paper, we first analyze the relationship between the statistical accuracy and the regularized empirical risk for pairwise losses. Based on this relationship, we propose a scalable and efficient adaptive doubly stochastic gradient algorithm (AdaDSG) for generalized regularized pairwise learning problems. More importantly, we prove that the overall computational cost of AdaDSG is O(n) to achieve statistical accuracy on the full training set of size n, which is, to the best of our knowledge, the best theoretical result for pairwise learning. The experimental results on a variety of real-world datasets not only confirm the effectiveness of our AdaDSG algorithm, but also show that AdaDSG has significantly better scalability and efficiency than the existing pairwise learning algorithms.
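
As a concrete instance of a pairwise loss, here is a minimal Python sketch of a stochastic gradient for the pairwise squared hinge surrogate of AUC on one sampled positive/negative pair (an illustration of the problem class, not the AdaDSG algorithm):

```python
import numpy as np

def auc_pair_grad(w: np.ndarray, x_pos: np.ndarray,
                  x_neg: np.ndarray) -> np.ndarray:
    """Gradient of the pairwise squared hinge loss
    l(w) = max(0, 1 - <w, x_pos - x_neg>)^2, a standard convex
    surrogate for the AUC on a single positive/negative pair."""
    diff = x_pos - x_neg
    margin = 1.0 - w @ diff
    if margin <= 0.0:
        return np.zeros_like(w)   # pair already correctly ranked
    return -2.0 * margin * diff
```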

IJCAI Conference 2019 Conference Paper

Scalable Semi-Supervised SVM via Triply Stochastic Gradients

  • Xiang Geng
  • Bin Gu
  • Xiang Li
  • Wanli Shi
  • Guansheng Zheng
  • Heng Huang

Semi-supervised learning (SSL) plays an increasingly important role in the big data era because a large number of unlabeled samples can be used effectively to improve the performance of the classifier. The semi-supervised support vector machine (S³VM) is one of the most appealing methods for SSL, but scaling up S³VM for kernel learning is still an open problem. Recently, a doubly stochastic gradient (DSG) algorithm was proposed to achieve efficient and scalable training for kernel methods. However, the algorithm and theoretical analysis of DSG are developed under a convexity assumption, which makes them incompetent for non-convex problems such as S³VM. To address this problem, in this paper, we propose a triply stochastic gradient algorithm for S³VM, called TSGS³VM. Specifically, to handle the two types of data instances involved in S³VM, TSGS³VM samples a labeled instance and an unlabeled instance, together with the random features, in each iteration to compute a triply stochastic gradient. We use the approximated gradient to update the solution. More importantly, we establish a new theoretical analysis for TSGS³VM which guarantees that TSGS³VM can converge to a stationary point. Extensive experimental results on a variety of datasets demonstrate that TSGS³VM is much more efficient and scalable than existing S³VM algorithms.

IJCAI Conference 2018 Conference Paper

Accelerated Asynchronous Greedy Coordinate Descent Algorithm for SVMs

  • Bin Gu
  • Yingying Shan
  • Xiang Geng
  • Guansheng Zheng

Support vector machines have played an important role in machine learning over the last two decades. Traditional SVM solvers (e.g., LIBSVM) are not scalable in the current big data era. Recently, a state-of-the-art solver was proposed based on the asynchronous greedy coordinate descent (AsyGCD) algorithm. However, AsyGCD is still not scalable enough, and it is limited to binary classification. To address these issues, in this paper we propose an asynchronous accelerated greedy coordinate descent algorithm (AsyAGCD) for SVMs. Compared with AsyGCD, our AsyAGCD has the following two advantages: 1) AsyAGCD is an accelerated version of AsyGCD because an active set strategy is used; specifically, AsyAGCD converges much faster than AsyGCD in the second half of the iterations. 2) AsyAGCD can handle more SVM formulations (including binary classification and regression SVMs) than AsyGCD. We provide a comparison of the computational complexity of AsyGCD and our AsyAGCD. Experimental results on a variety of datasets and learning applications confirm that our AsyAGCD is much faster than existing SVM solvers (including AsyGCD).
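
For context, greedy coordinate descent differs from cyclic or random coordinate descent by always updating the coordinate with the largest-magnitude gradient. A minimal sequential Python sketch for a smooth objective (illustrating the generic selection rule, not the asynchronous SVM solver):

```python
import numpy as np

def greedy_cd(grad_fn, x: np.ndarray, lipschitz: float,
              iters: int) -> np.ndarray:
    """Greedy coordinate descent: at each step, pick the coordinate
    whose partial derivative is largest in magnitude (the
    Gauss-Southwell rule) and take a step of size 1/L on it."""
    for _ in range(iters):
        g = grad_fn(x)                   # full gradient (for clarity)
        j = int(np.argmax(np.abs(g)))    # Gauss-Southwell selection
        x[j] -= g[j] / lipschitz
    return x
```

In practice, efficient solvers maintain the gradient incrementally rather than recomputing it in full, which is where the active set strategy described above pays off.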

AAAI Conference 2018 Conference Paper

Accelerated Method for Stochastic Composition Optimization With Nonsmooth Regularization

  • Zhouyuan Huo
  • Bin Gu
  • Ji Liu
  • Heng Huang

Stochastic composition optimization has drawn much attention recently and has been successful in many emerging applications in machine learning, statistical analysis, and reinforcement learning. In this paper, we focus on the composition problem with a nonsmooth regularization penalty. Previous works either have slow convergence rates or do not provide a complete convergence analysis for the general problem. In this paper, we tackle these two issues by proposing a new stochastic composition optimization method for composition problems with nonsmooth regularization penalties. In our method, we apply a variance reduction technique to accelerate convergence. To the best of our knowledge, our method admits the fastest convergence rate for stochastic composition optimization: for the strongly convex composition problem, our algorithm is proved to admit linear convergence; for the general composition problem, our algorithm significantly improves the state-of-the-art convergence rate from $O(T^{-1/2})$ to $O((n_1+n_2)^{2/3}T^{-1})$. Finally, we apply our proposed algorithm to portfolio management and policy evaluation in reinforcement learning. Experimental results verify our theoretical analysis.
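
For reference, the finite-sum stochastic composition problem with a nonsmooth penalty that such methods target is commonly written as follows (a standard formulation consistent with the $n_1$, $n_2$ counts in the rate above; the exact problem statement is in the paper):

$$\min_{x \in \mathbb{R}^d} \; F(x) + h(x), \qquad F(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} f_i\Big( \frac{1}{n_2} \sum_{j=1}^{n_2} g_j(x) \Big),$$

where the $f_i$ are the outer functions, the $g_j$ form the inner mapping, and $h$ is the nonsmooth regularizer handled via its proximal operator; the difficulty is that a stochastic estimate of the inner sum makes the naive gradient estimate biased.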

AAAI Conference 2018 Conference Paper

Asynchronous Doubly Stochastic Sparse Kernel Learning

  • Bin Gu
  • Miao Xin
  • Zhouyuan Huo
  • Heng Huang

Kernel methods have achieved tremendous success in the past two decades. In the current big data era, data collection has grown tremendously. However, existing kernel methods are not scalable enough at both the training and predicting steps. To address this challenge, in this paper, we first introduce a general sparse kernel learning formulation based on random feature approximation, where the loss functions are possibly non-convex. Then we propose a new asynchronous parallel doubly stochastic algorithm for large-scale sparse kernel learning (AsyDSSKL). To the best of our knowledge, AsyDSSKL is the first algorithm to combine the techniques of asynchronous parallel computation and doubly stochastic optimization. We also provide a comprehensive convergence guarantee for AsyDSSKL. Importantly, the experimental results on various large-scale real-world datasets show that our AsyDSSKL method has a significant advantage in computational efficiency at the training and predicting steps over existing kernel methods.

IJCAI Conference 2018 Conference Paper

Faster Training Algorithms for Structured Sparsity-Inducing Norm

  • Bin Gu
  • Xingwang Ju
  • Xiang Li
  • Guansheng Zheng

Structured-sparsity regularization is popular for sparse learning because of its flexibility in encoding feature structures. This paper considers a generalized version of structured-sparsity regularization (especially for the $l_1/l_{\infty}$ norm) with arbitrary group overlap. Due to the group overlap, it is time-consuming to solve the associated proximal operator. Although Mairal et al. (2010) proposed a network-flow algorithm to solve the proximal operator, it is still time-consuming, especially in the high-dimensional setting. To address this challenge, in this paper, we develop a more efficient solution for $l_1/l_{\infty}$ group lasso with arbitrary group overlap using an inexact proximal-gradient method. In each iteration, our algorithm only requires an inexact solution to the proximal sub-problem, which can be computed efficiently. On the theoretical side, the proposed algorithm enjoys the same global convergence rate as exact proximal methods. Experiments demonstrate that our algorithm is much more efficient than the network-flow algorithm, while retaining similar generalization performance.

AAAI Conference 2018 Conference Paper

Inexact Proximal Gradient Methods for Non-Convex and Non-Smooth Optimization

  • Bin Gu
  • De Wang
  • Zhouyuan Huo
  • Heng Huang

In machine learning research, proximal gradient methods are popular for solving various optimization problems with non-smooth regularization. Inexact proximal gradient methods are extremely important when exactly solving the proximal operator is time-consuming, or when the proximal operator does not have an analytic solution. However, existing inexact proximal gradient methods only consider convex problems. Knowledge of inexact proximal gradient methods in the non-convex setting is very limited. To address this challenge, in this paper, we first propose three inexact proximal gradient algorithms, including basic and Nesterov's accelerated versions. After that, we provide theoretical analyses of the basic and accelerated versions. The theoretical results show that our inexact proximal gradient algorithms can achieve the same convergence rates as exact proximal gradient algorithms in the non-convex setting. Finally, we show applications of our inexact proximal gradient algorithms to three representative non-convex learning problems. Empirical results confirm the superiority of our new inexact proximal gradient algorithms.
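
As background on the proximal operator being (in)exactly solved, here is a minimal Python sketch of one exact proximal gradient step for ℓ₁ regularization, where the operator has the closed-form soft-thresholding solution; for overlapping-group norms such as those in the paper above, no such closed form exists and the sub-problem must be solved approximately:

```python
import numpy as np

def soft_threshold(v: np.ndarray, t: float) -> np.ndarray:
    """prox_{t*||.||_1}(v) = argmin_x 0.5*||x - v||^2 + t*||x||_1,
    which has the closed form sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_step(x: np.ndarray, grad: np.ndarray,
                   eta: float, lam: float) -> np.ndarray:
    """One proximal gradient step: gradient descent on the smooth
    part, then the proximal operator of the non-smooth penalty."""
    return soft_threshold(x - eta * grad, eta * lam)
```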

NeurIPS Conference 2018 Conference Paper

Training Neural Networks Using Features Replay

  • Zhouyuan Huo
  • Bin Gu
  • Heng Huang

Training a neural network with the backpropagation algorithm requires passing error gradients sequentially through the network. This backward locking prevents us from updating network layers in parallel and fully leveraging computing resources. Recently, several works have tried to decouple and parallelize the backpropagation algorithm. However, all of them suffer from severe accuracy loss or memory explosion when the neural network is deep. To address these challenging issues, we propose a novel parallel-objective formulation of the neural network's objective function. After that, we introduce the features replay algorithm and prove that it is guaranteed to converge to critical points for the non-convex problem under certain conditions. Finally, we apply our method to training deep convolutional neural networks, and the experimental results show that the proposed method achieves faster convergence, lower memory consumption, and better generalization error than compared methods.

IJCAI Conference 2015 Conference Paper

Bi-Parameter Space Partition for Cost-Sensitive SVM

  • Bin Gu
  • Victor S. Sheng
  • Shuo Li

Model selection is an important problem for the cost-sensitive SVM (CS-SVM). Although using the solution path to find globally optimal parameters is a powerful method for model selection, it is a challenge to extend the framework to solve the two regularization parameters of CS-SVM simultaneously. To overcome this challenge, we take three main steps in this paper. (i) A critical-regions-based bi-parameter space partition algorithm is proposed to present all piecewise linearities of CS-SVM. (ii) An invariant-regions-based bi-parameter space partition algorithm is further proposed to compute empirical errors for all parameter pairs. (iii) The globally optimal solutions for K-fold cross validation are computed by superposing K invariant-region-based bi-parameter space partitions into one. These three steps constitute the model selection procedure of CS-SVM, which can find globally optimal parameter pairs in K-fold cross validation. Experimental results on seven normal datasets and four imbalanced datasets show that our proposed method has better generalization ability than various kinds of grid search methods, while requiring less running time.

IJCAI Conference 2015 Conference Paper

Data Sparseness in Linear SVM

  • Xiang Li
  • Huaimin Wang
  • Bin Gu
  • Charles X. Ling

Large sparse datasets are common in many real-world applications. Linear SVM has been shown to be very efficient for classifying such datasets. However, it is still unknown how data sparseness affects its convergence behavior. To study this problem in a systematic manner, we propose a novel approach to generate large, sparse data from real-world datasets, using statistical inference and the data sampling process in the PAC framework. We first study the convergence behavior of linear SVM experimentally and make several observations useful for real-world applications. We then offer theoretical proofs for our observations by studying the Bayes risk and the PAC bound. Our experimental and theoretical results are valuable for learning large sparse datasets with linear SVM.