Author name cluster

Lei Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

38 papers

2 author rows

AAAI Conference 2026 Conference Paper

Backdooring Rationalization

Lingxiao Kong
Jiahui Jiang
Wenchao Xu
Lei Wu

Rationalization model has recently garnered significant attention for enhancing the interpretability of natural language processing by first using a generator to select the most relevant pieces from the text with respect to the label, before passing the text input to the predictor. However, the robustness of the rationalization models is not sufficiently investigated. Specifically, this paper explores the robustness of rationalization models against backdoor attacks, which has been ignored by previous studies. Surprisingly, we find that conventional backdoor attack techniques fail to inject triggers into the rationalization model because its generator can filter out bad triggers. Considering this, we further propose a novel backdoor attack method named as BadRNL designed specially for the rationalization models. The core idea of BadRNL is first to search for the personalized trigger for each specific dataset and then manipulate the rationales and labels to conduct attacks. Besides, BadRNL controls the order of sample learning through poison-priority sampling strategies. Experimental results show that our method can successfully craft the predictions of samples containing triggers while maintaining the performance of the model on clean data.

PDF Details DOI

EAAI Journal 2026 Journal Article

Deep Potential Semantic-aware Hashing for Cross-modal Retrieval

Lei Wu
Qibing Qin
Jiangyan Dai
Lei Huang
Wenfeng Zhang

Details DOI

AAAI Conference 2026 Conference Paper

Introducing Decomposed Causality with Spatiotemporal Object-Centric Representation for Video Classification

Yachong Zhang
Lei Meng
Shuo Xu
Zhuang Qi
Wei Wu
Lei Wu
Xiangxu Meng

Video classification requires event-level representations of objects and their interactions. Existing methods typically rely on data-driven approaches, which either learn such features from whole frames or object-centric visual regions. Therefore, the modeling of spatiotemporal interactions among objects is usually overlooked. To address this issue, this paper presents a Decomposition of Synergistic, Unique, and Redundant Causal Representations Learning (SurdCRL) model for video classification, which introduces a newly-proposed SURD causal theory to model the spatiotemporal features of both object dynamics and their in- and cross-frame interactions. Specifically, SurdCRL employs three modules to model the object-centric spatiotemporal dynamics using distinct types of causal components, where the first module Spatial-Temporal Entity Modeling decouples the frame into object and context entities, and employs a temporal message passing block to capture object state changes over time, generating spatiotemporal features as basic causal variables. Second, the Dual-Path Causal Inference module mitigates confounders among causal variables by front-door and back-door interventions, thus enabling the subsequent causal components to reflect their intrinsic effects. Finally, the Causal Composition and Selection module employs the compositional structure-aware attention to project the causal variables and their high-order interactions into the synergistic, unique, and redundant components. Experiments on two benchmarking datasets verify that SurdCRL better captures event-relevant object-centric representation by decomposing spatiotemporal object interactions into three types of causal components.

PDF Details DOI

YNICL Journal 2025 Journal Article

Consistent frontal-limbic-occipital connections in distinguishing treatment-resistant and non-treatment-resistant schizophrenia

Yijie Zhang
Shuzhan Gao
Chuang Liang
Juan Bustillo
Peter Kochunov
Jessica A. Turner
Vince D. Calhoun
Lei Wu

Details DOI

IROS Conference 2025 Conference Paper

DTactive: A Vision-Based Tactile Sensor with Active Surface

Jikai Xu
Lei Wu
Changyi Lin
Ding Zhao
Huazhe Xu

The development of vision-based tactile sensors has significantly enhanced robots’ perception and manipulation capabilities, especially for tasks requiring contact-rich interactions with objects. In this work, we present DTactive, a novel vision-based tactile sensor with active surfaces. DTactive inherits and modifies the tactile 3D shape reconstruction method of DTact while integrating a mechanical transmission mechanism that facilitates the mobility of its surface. Thanks to this design, the sensor is capable of simultaneously performing tactile perception and in-hand manipulation with surface movement. Leveraging the high-resolution tactile images from the sensor and the magnetic encoder data from the transmission mechanism, we propose a learning-based method to enable precise angular trajectory control during in-hand manipulation. In our experiments, we successfully achieved accurate rolling manipulation within the range of [−180°, 180°] on various objects, with the root mean square error between the desired and actual angular trajectories being less than 12° on nine trained objects and less than 19° on three novel objects. The results demonstrate the potential of DTactive for in-hand object manipulation in terms of effectiveness, robustness and precision.

Details

NeurIPS Conference 2025 Conference Paper

Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Binghui Li
Fengling Chen
Zixun Huang
Lean Wang
Lei Wu

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire $\textit{loss dynamics}$ obey similar laws and, crucially, how the $\textit{learning rate schedule}$ (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel $\textbf{intrinsic-time}$ viewpoint, which captures the training progress more faithfully than iteration count. We then establish a $\textbf{Functional Scaling Law (FSL)}$ that captures the full loss trajectory under arbitrary LRSs, with the schedule’s influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs---constant, exponential decay, and warmup–stable–decay (WSD)---and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0. 1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.

PDF Details

ICML Conference 2025 Conference Paper

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

Jinbo Wang 0003
Mingze Wang
Zhanpeng Zhou
Junchi Yan
Weinan E
Lei Wu

Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work collaboratively. Understanding the disparities and interactions among these blocks is therefore important. In this paper, we uncover a clear sharpness disparity across these blocks, which intriguingly emerges early in training and persists throughout the training process. Building on this insight, we propose a novel Blockwise Learning Rate (LR) strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0. 12B to 1. 1B and datasets including OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al. , 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.

Details

ICML Conference 2024 Conference Paper

Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling

Mingze Wang
Zeping Min
Lei Wu

In this work, we investigate the margin-maximization bias exhibited by gradient-based algorithms in classifying linearly separable data. We present an in-depth analysis of the specific properties of the velocity field associated with (normalized) gradients, focusing on their role in margin maximization. Inspired by this analysis, we propose a novel algorithm called Progressive Rescaling Gradient Descent (PRGD) and show that PRGD can maximize the margin at an exponential rate. This stands in stark contrast to all existing algorithms, which maximize the margin at a slow polynomial rate. Specifically, we identify mild conditions on data distribution under which existing algorithms such as gradient descent (GD) and normalized gradient descent (NGD) provably fail in maximizing the margin efficiently. To validate our theoretical findings, we present both synthetic and real-world experiments. Notably, PRGD also shows promise in enhancing the generalization performance when applied to linearly non-separable datasets and deep neural networks.

Details

ECAI Conference 2024 Conference Paper

CAMAOT: Channel-Aware Multi-Camera Active Object Tracking System

Maolong Yin
Bin Guo 0001
Zhuo Sun 0002
Lei Wu
Zhaotie Hao
Zhiwen Yu 0001

Multi-Camera Active Object Tracking is an attractive technique in the area of intelligent surveillance, where cameras share their observations via the wireless communication to collaboratively track the target. Due to the variability in wireless channel, the dynamic transmission delay between cameras significantly affects the collaboration performance, especially when the tracking is time-sensitive. In this paper, we propose a channel-aware multi-camera active object tracking (CAMAOT) system, to achieve the stable and improved tracking performance. Specifically, a communication decision module is designed in CAMAOT, where the cameras’ communication graph and communication resource allocation adapt to the channels. Our experiments demonstrate that for time-varying channels, CAMAOT has a stable performance improvement over other systems, particularly when the communication resources are limited.

Details

NeurIPS Conference 2024 Conference Paper

Improving Generalization and Convergence by Enhancing Implicit Regularization

Mingze Wang
Jinbo Wang
Haotian He
Zilin Wang
Guanhua Huang
Feiyu Xiong
Zhiyu Li
Weinan E

In this work, we propose an Implicit Regularization Enhancement (IRE) framework to accelerate the discovery of flat solutions in deep learning, thereby improving generalization and convergence. Specifically, IRE decouples the dynamics of flat and sharp directions, which boosts the sharpness reduction along flat directions while maintaining the training stability in sharp directions. We show that IRE can be practically incorporated with *generic base optimizers* without introducing significant computational overload. Experiments show that IRE consistently improves the generalization performance for image classification tasks across a variety of benchmark datasets (CIFAR-10/100, ImageNet) and models (ResNets and ViTs). Surprisingly, IRE also achieves a $2\times$ *speed-up* compared to AdamW in the pre-training of Llama models (of sizes ranging from 60M to 229M) on datasets including Wikitext-103, Minipile, and Openwebtext. Moreover, we provide theoretical guarantees, showing that IRE can substantially accelerate the convergence towards flat minima in Sharpness-aware Minimization (SAM).

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Parameter Symmetry and Noise Equilibrium of Stochastic Gradient Descent

Liu Ziyin
Mingze Wang
Hongchao Li
Lei Wu

Symmetries are prevalent in deep learning and can significantly influence the learning dynamics of neural networks. In this paper, we examine how exponential symmetries -- a broad subclass of continuous symmetries present in the model architecture or loss function -- interplay with stochastic gradient descent (SGD). We first prove that gradient noise creates a systematic motion (a ``Noether flow") of the parameters $\theta$ along the degenerate direction to a unique initialization-independent fixed point $\theta^*$. These points are referred to as the noise equilibria because, at these points, noise contributions from different directions are balanced and aligned. Then, we show that the balance and alignment of gradient noise can serve as a novel alternative mechanism for explaining important phenomena such as progressive sharpening/flattening and representation formation within neural networks and have practical implications for understanding techniques like representation normalization and warmup.

PDF Details DOI

YNIMG Journal 2024 Journal Article

Searching Reproducible Brain Features using NeuroMark: Templates for Different Age Populations and Imaging Modalities

Zening Fu
Ishaan Batta
Lei Wu
Anees Abrol
Oktay Agcaoglu
Mustafa S Salman
Yuhui Du
Armin Iraji

Details DOI

YNIMG Journal 2024 Journal Article

Self-supervised multimodal learning for group inferences from MRI data: Discovering disorder-relevant brain regions and multimodal links

Alex Fedorov
Eloy Geenjaar
Lei Wu
Tristan Sylvain
Thomas P. DeRamus
Margaux Luck
Maria Misiura
Girish Mittapalle

Details DOI

ICML Conference 2024 Conference Paper

Why Do You Grok? A Theoretical Analysis on Grokking Modular Addition

Mohamad Amin Mohamadi
ZhiYuan Li
Lei Wu
Danica J. Sutherland

We present a theoretical explanation of the “grokking” phenomenon (Power et al. , 2022), where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, so that the “kernel regime” approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that one-hidden-layer quadratic networks that achieve zero training loss with bounded $\ell_\infty$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $\ell_\infty$ regularization. We further provide empirical evidence that these networks leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.

Details

EAAI Journal 2023 Journal Article

A new data fusion driven-sparse representation learning method for bearing intelligent diagnosis in small and unbalanced samples

Yike Zhao
Xin Zhang
Jiaxu Wang
Lei Wu
Zhiwen Liu
Lei Wang

Details DOI

IJCAI Conference 2023 Conference Paper

Compositional Zero-Shot Artistic Font Synthesis

Xiang Li
Lei Wu
Changshuo Wang
Lei Meng
Xiangxu Meng

Recently, many researchers have made remarkable achievements in the field of artistic font synthesis, with impressive glyph style and effect style in the results. However, due to less exploration in style disentanglement, it is difficult for existing methods to envision a kind of unseen style (glyph-effect) compositions of artistic font, and thus can only learn the seen style compositions. To solve this problem, we propose a novel compositional zero-shot artistic font synthesis gan (CAFS-GAN), which allows the synthesis of unseen style compositions by exploring the visual independence and joint compatibility of encoding semantics between glyph and effect. Specifically, we propose two contrast-based style encoders to achieve style disentanglement due to glyph and effect intertwining in the image. Meanwhile, to preserve more glyph and effect detail, we propose a generator based on hierarchical dual styles AdaIN to reorganize content-styles representations from structure to texture gradually. Extensive experiments demonstrate the superiority of our model in generating high-quality artistic font images with unseen style compositions against other state-of-the-art methods. The source code and data is available at moonlight03. github. io/CAFS-GAN/.

PDF Details DOI

AAMAS Conference 2023 Conference Paper

Learning to Self-Reconfigure for Freeform Modular Robots via Altruism Multi-Agent Reinforcement Learning

Lei Wu
Bin Guo
Qiuyun Zhang
Zhuo Sun
Jieyi Zhang
Zhiwen Yu

Modular robots can change between different configurations to adapt to complex and dynamic environments. Therefore, performing accurate and efficient changes to modular robot system, known as the self-reconfiguration problem, is essential. Existing reconfiguration algorithms are based on discrete motion primitives. However, freeform modular robots are connected without alignment and their motion space is continuous, making existing reconfiguration methods infeasible. In this work, we design a parallel distributed self-reconfiguration algorithm based on multi-agent reinforcement learning for freeform modular robots. We introduce a collaboration mechanism into the reinforcement learning to avoid conflicts in continuous action spaces. Simulations show that our algorithm reduces conflicts and improves effectiveness compared to the baselines.

PDF

IJCAI Conference 2023 Conference Paper

Learning to Self-Reconfigure for Freeform Modular Robots via Altruism Proximal Policy Optimization

Lei Wu
Bin Guo
Qiuyun Zhang
Zhuo Sun
Jieyi Zhang
Zhiwen Yu

The advantages of modular robot systems stem from their ability to change between different configurations, enabling them to adapt to complex and dynamic real-world environments. Then, how to perform the accurate and efficient change of the modular robot system, i. e. , the self-reconfiguration problem, is essential. Existing reconfiguration algorithms are based on discrete motion primitives and are suitable for lattice-type modular robots. The modules of freeform modular robots are connected without alignment, and the motion space is continuous. It renders existing reconfiguration methods infeasible. In this paper, we design a parallel distributed self-reconfiguration algorithm for freeform modular robots based on multi-agent reinforcement learning to realize the automatic design of conflict-free reconfiguration controllers in continuous action spaces. To avoid conflicts, we incorporate a collaborative mechanism into reinforcement learning. Furthermore, we design the distributed termination criteria to achieve timely termination in the presence of limited communication and local observability. When compared to the baselines, simulations show that the proposed method improves efficiency and congruence, and module movement demonstrates altruism.

PDF Details DOI

IJCAI Conference 2023 Conference Paper

RZCR: Zero-shot Character Recognition via Radical-based Reasoning

Xiaolei Diao
Daqian Shi
Hao Tang
Qiang Shen
Yanzeng Li
Lei Wu
Hao Xu

The long-tail effect is a common issue that limits the performance of deep learning models on real-world datasets. Character image datasets are also affected by such unbalanced data distribution due to differences in character usage frequency. Thus, current character recognition methods are limited when applied in the real world, especially for the categories in the tail that lack training samples, e. g. , uncommon characters. In this paper, we propose a zero-shot character recognition framework via radical-based reasoning, called RZCR, to improve the recognition performance of few-sample character categories in the tail. Specifically, we exploit radicals, the graphical units of characters, by decomposing and reconstructing characters according to orthography. RZCR consists of a visual semantic fusion-based radical information extractor (RIE) and a knowledge graph character reasoner (KGR). RIE aims to recognize candidate radicals and their possible structural relations from character images in parallel. The results are then fed into KGR to recognize the target character by reasoning with a knowledge graph. We validate our method on multiple datasets, and RZCR shows promising experimental results, especially on few-sample character datasets.

PDF Details DOI

ICML Conference 2023 Conference Paper

The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent

Lei Wu
Weijie J. Su

In this paper, we study the implicit regularization of stochastic gradient descent (SGD) through the lens of dynamical stability (Wu et al. , 2018). We start by revising existing stability analyses of SGD, showing how the Frobenius norm and trace of Hessian relate to different notions of stability. Notably, if a global minimum is linearly stable for SGD, then the trace of Hessian must be less than or equal to $2/\eta$, where $\eta$ denotes the learning rate. By contrast, for gradient descent (GD), the stability imposes a similar constraint but only on the largest eigenvalue of Hessian. We then turn to analyze the generalization properties of these stable minima, focusing specifically on two-layer ReLU networks and diagonal linear networks. Notably, we establish the equivalence between these metrics of sharpness and certain parameter norms for the two models, which allows us to show that the stable minima of SGD provably generalize well. By contrast, the stability-induced regularization of GD is provably too weak to ensure satisfactory generalization. This discrepancy provides an explanation of why SGD often generalizes better than GD. Note that the learning rate (LR) plays a pivotal role in the strength of stability-induced regularization. As the LR increases, the regularization effect becomes more pronounced, elucidating why SGD with a larger LR consistently demonstrates superior generalization capabilities. Additionally, numerical experiments are provided to support our theoretical findings.

Details

NeurIPS Conference 2023 Conference Paper

Theoretical Analysis of the Inductive Biases in Deep Convolutional Networks

Zihao Wang
Lei Wu

In this paper, we provide a theoretical analysis of the inductive biases in convolutional neural networks (CNNs). We start by examining the universality of CNNs, i. e. , the ability to approximate any continuous functions. We prove that a depth of $\mathcal{O}(\log d)$ suffices for deep CNNs to achieve this universality, where $d$ in the input dimension. Additionally, we establish that learning sparse functions with CNNs requires only $\widetilde{\mathcal{O}}(\log^2d)$ samples, indicating that deep CNNs can efficiently capture {\em long-range} sparse correlations. These results are made possible through a novel combination of the multichanneling and downsampling when increasing the network depth. We also delve into the distinct roles of weight sharing and locality in CNNs. To this end, we compare the performance of CNNs, locally-connected networks (LCNs), and fully-connected networks (FCNs) on a simple regression task, where LCNs can be viewed as CNNs without weight sharing. On the one hand, we prove that LCNs require ${\Omega}(d)$ samples while CNNs need only $\widetilde{\mathcal{O}}(\log^2d)$ samples, highlighting the critical role of weight sharing. On the other hand, we prove that FCNs require $\Omega(d^2)$ samples, whereas LCNs need only $\widetilde{\mathcal{O}}(d)$ samples, underscoring the importance of locality. These provable separations quantify the difference between the two biases, and the major observation behind our proof is that weight sharing and locality break different symmetries in the learning process.

PDF Details

JMLR Journal 2022 Journal Article

A spectral-based analysis of the separation between two-layer neural networks and linear methods

Lei Wu
Jihao Long

We propose a spectral-based approach to analyze how two-layer neural networks separate from linear methods in terms of approximating high-dimensional functions. We show that quantifying this separation can be reduced to estimating the Kolmogorov width of two-layer neural networks, and the latter can be further characterized by using the spectrum of an associated kernel. Different from previous work, our approach allows obtaining upper bounds, lower bounds, and identifying explicit hard functions in a united manner. We provide a systematic study of how the choice of activation functions affects the separation, in particular the dependence on the input dimension. Specifically, for nonsmooth activation functions, we extend known results to more activation functions with sharper bounds. As concrete examples, we prove that any single neuron can instantiate the separation between neural networks and random feature models. For smooth activation functions, one surprising finding is that the separation is negligible unless the norms of inner-layer weights are polynomially large with respect to the input dimension. By contrast, the separation for nonsmooth activation functions is independent of the norms of inner-layer weights. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

PDF Details

NeurIPS Conference 2022 Conference Paper

The alignment property of SGD noise and how it helps select flat minima: A stability analysis

Lei Wu
Mingze Wang
Weijie Su

The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we provide an explanation of this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al. , 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F, B, \eta$ denote the Frobenius norm of Hessian at $\theta^*$, batch size, and learning rate, respectively. Otherwise, SGD will escape from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the sharpness---as measured by the Frobenius norm of the Hessian---is bounded \emph{independently} of the model size and sample size. The key to obtaining these results is exploiting the particular structure of SGD noise: The noise concentrates in sharp directions of local landscape and the magnitude is proportional to loss value. This alignment property of SGD noise provably holds for linear networks and random feature models (RFMs), and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are also justified by extensive experiments on CIFAR-10 dataset.

PDF Details

EAAI Journal 2021 Journal Article

NHACR: A novel heuristic approach for 2D rectangle packing area minimization problem with central rectangle

Lei Wu
Xinming Li
Chao Liu
Wensheng Xiao

Details DOI

YNIMG Journal 2021 Journal Article

Tracking spatial dynamics of functional connectivity during a task

Lei Wu
Arvind Caprihan
Vince Calhoun

Details DOI

EAAI Journal 2020 Journal Article

A novel bat algorithm with double mutation operators and its application to low-velocity impact localization problem

Qi Liu
Jindong Li
Lei Wu
Fengde Wang
Wensheng Xiao

Details DOI

AAAI Conference 2020 Conference Paper

Multi-Question Learning for Visual Question Answering

Chenyi Lei
Lei Wu
Dong Liu
Zhao Li
Guoxin Wang
Haihong Tang
Houqiang Li

Visual Question Answering (VQA) raises a great challenge for computer vision and natural language processing communities. Most of the existing approaches consider videoquestion pairs individually during training. However, we observe that there are usually multiple (either sequentially generated or not) questions for the target video in a VQA task, and the questions themselves have abundant semantic relations. To explore these relations, we propose a new paradigm for VQA termed Multi-Question Learning (MQL). Inspired by the multi-task learning, MQL learns from multiple questions jointly together with their corresponding answers for a target video sequence. The learned representations of videoquestion pairs are then more general to be transferred for new questions. We further propose an effective VQA framework and design a training procedure for MQL, where the speciﬁcally designed attention network models the relation between input video and corresponding questions, enabling multiple video-question pairs to be co-trained. Experimental results on public datasets show the favorable performance of the proposed MQL-VQA framework compared to state-of-the-arts.

PDF Details

NeurIPS Conference 2019 Conference Paper

Global Convergence of Gradient Descent for Deep Linear Residual Networks

Lei Wu
Qingcan Wang
Chao Ma

We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. It is motivated by avoiding stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an $\varepsilon$-optimal point in $O\left( L^3 \log(1/\varepsilon) \right)$ iterations, which scales polynomially with the network depth $L$. Our result and the $\exp(\Omega(L))$ convergence time for the standard initialization (Xavier or near-identity) \cite{shamir2018exponential} together demonstrate the importance of the residual structure and the initialization in the optimization for deep linear neural networks, especially when $L$ is large.

PDF Details

ICML Conference 2019 Conference Paper

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Zhanxing Zhu
Jingfeng Wu
Bing Yu
Lei Wu
Jinwen Ma

Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has raised lots of concerns recently. Along this line, we study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. Through investigating this general optimization dynamics, we analyze the behavior of SGD on escaping from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima through measuring the alignment of noise covariance and the curvature of loss function. Based on this indicator, two conditions are established to show which type of noise structure is superior to isotropic noise in term of escaping efficiency. We further show that the anisotropic noise in SGD satisfies the two conditions, and thus helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well. We systematically design various experiments to verify the benefits of the anisotropic noise, compared with full gradient descent plus isotropic diffusion (i. e. Langevin dynamics).

Details

YNIMG Journal 2018 Journal Article

An approach to directly link ICA and seed-based functional connectivity: Application to schizophrenia

Lei Wu
Arvind Caprihan
Juan Bustillo
Andrew Mayer
Vince Calhoun

Details DOI

ICRA Conference 2018 Conference Paper

Controlling a Non-Holonomic Mobile Manipulator in a Constrained Floor Space

Mustafa Mashali
Lei Wu
Redwan Alqasemi
Rajiv V. Dubey

Robotic manipulators that are attached to mobile platforms are often used in workspaces that require the end-effector to mobilize beyond the manipulator's limited reach, such as in warehouse shelf stacking and similar applications. However, such assistive robots fall short of completing tasks that require the end-effector to be situated in a specific configuration at a critical time during the task. Traditionally, users control the mobile base to situate the arm such that the task can be completed through continuous operation. This requires an experienced operator who can predict the needed end-effector workspace, and can operate the base accordingly to maximize the likelihood of a successful task while avoiding any floor obstacles. In this work, we propose a straightforward control method that provides sufficient freedom to the end-effector to complete a task that is bound by time-dependent constraints. This is achieved by relaxing the time constraints on the mobile base trajectory in a floor space obstructed by obstacles. The trajectory of the platform is determined by sensor-assisted obstacle avoidance algorithm such that a single degree of freedom mobility can be represented through a safe obstacle-free time-independent path. The proposed control method is implemented in simulation and on physical hardware built in our labs. The simulation included a 5-DoF redundant Planar Mobile Manipulator (PMM). The hardware implementation and testing utilized a 9-DoF redundant mobile manipulator. The implementation results demonstrate the effectiveness of the control method in adjusting the mobile platform motion along its allowed obstacle-free path to enable the end-effector to follow its trajectory for task completion that would otherwise fail to complete when conventional control methods are used.

Details

EAAI Journal 2018 Journal Article

Heuristic algorithm for RPAMP with central rectangle and its application to solve oil–gas treatment facility layout problem

Lei Wu
Qi Liu
Fengde Wang
Wensheng Xiao
Yaowen Yang

Details DOI

NeurIPS Conference 2018 Conference Paper

How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective

Lei Wu
Chao Ma
Weinan E

The question of which global minima are accessible by a stochastic gradient decent (SGD) algorithm with specific learning rate and batch size is studied from the perspective of dynamical stability. The concept of non-uniformity is introduced, which, together with sharpness, characterizes the stability property of a global minimum and hence the accessibility of a particular SGD algorithm to that global minimum. In particular, this analysis shows that learning rate and batch size play different roles in minima selection. Extensive empirical results seem to correlate well with the theoretical findings and provide further support to these claims.

PDF Details

EAAI Journal 2017 Journal Article

An improved heuristic algorithm for 2D rectangle packing area minimization problems with central rectangles

Lei Wu
Xue Tian
Jixu Zhang
Qi Liu
Wensheng Xiao
Yaowen Yang

Details DOI

YNIMG Journal 2013 Journal Article

The spatiospectral characterization of brain networks: Fusing concurrent EEG spectra and fMRI maps

David A. Bridwell
Lei Wu
Tom Eichele
Vince D. Calhoun

Details DOI

TIST Journal 2011 Journal Article

Distance metric learning from uncertain side information for automated photo tagging

Lei Wu
Steven C.H. Hoi
Rong Jin
Jianke Zhu
Nenghai Yu

Automated photo tagging is an important technique for many intelligent multimedia information systems, for example, smart photo management system and intelligent digital media library. To attack the challenge, several machine learning techniques have been developed and applied for automated photo tagging. For example, supervised learning techniques have been applied to automated photo tagging by training statistical classifiers from a collection of manually labeled examples. Although the existing approaches work well for small testbeds with relatively small number of annotation words, due to the long-standing challenge of object recognition, they often perform poorly in large-scale problems. Another limitation of the existing approaches is that they require a set of high-quality labeled data, which is not only expensive to collect but also time consuming. In this article, we investigate a social image based annotation scheme by exploiting implicit side information that is available for a large number of social photos from the social web sites. The key challenge of our intelligent annotation scheme is how to learn an effective distance metric based on implicit side information (visual or textual) of social photos. To this end, we present a novel “Probabilistic Distance Metric Learning” (PDML) framework, which can learn optimized metrics by effectively exploiting the implicit side information vastly available on the social web. We apply the proposed technique to photo annotation tasks based on a large social image testbed with over 1 million tagged photos crawled from a social photo sharing portal. Encouraging results show that the proposed technique is effective and promising for social photo based annotation tasks.

Details DOI

YNIMG Journal 2010 Journal Article

Reactivity of hemodynamic responses and functional connectivity to different states of alpha synchrony: A concurrent EEG-fMRI study

Lei Wu
Tom Eichele
Vince D. Calhoun

Details DOI

NeurIPS Conference 2009 Conference Paper

Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

Lei Wu
Rong Jin
Steven Hoi
Jianke Zhu
Nenghai Yu

Learning distance functions with side information plays a key role in many machine learning and data mining applications. Conventional approaches often assume a Mahalanobis distance function. These approaches are limited in two aspects: (i) they are computationally expensive (even infeasible) for high dimensional data because the size of the metric is in the square of dimensionality; (ii) they assume a fixed metric for the entire input space and therefore are unable to handle heterogeneous data. In this paper, we propose a novel scheme that learns nonlinear Bregman distance functions from side information using a non-parametric approach that is similar to support vector machines. The proposed scheme avoids the assumption of fixed metric because its local distance metric is implicitly derived from the Hessian matrix of a convex function that is used to generate the Bregman distance function. We present an efficient learning algorithm for the proposed scheme for distance function learning. The extensive experiments with semi-supervised clustering show the proposed technique (i) outperforms the state-of-the-art approaches for distance function learning, and (ii) is computationally efficient for high dimensional data.

PDF Details