Arrow Research search

Author name cluster

Zhichao Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers (19)

AAAI Conference 2026 Conference Paper

Walking Further: Semantic-Aware Multimodal Gait Recognition Under Long-Range Conditions

  • Zhiyang Lu
  • Wen Jiang
  • Tianren Wu
  • Zhichao Wang
  • Changwang Zhang
  • Siqi Shen
  • Ming Cheng

Gait recognition is an emerging biometric technology that enables non-intrusive and hard-to-spoof human identification. However, most existing methods are confined to short-range, unimodal settings and fail to generalize to long-range and cross-distance scenarios under real-world conditions. To address this gap, we present LRGait, the first LiDAR-Camera multimodal benchmark designed for robust long-range gait recognition across diverse outdoor distances and environments. We further propose EMGaitNet, an end-to-end framework tailored for long-range multimodal gait recognition. To bridge the modality gap between RGB images and point clouds, we introduce a semantic-guided fusion pipeline. A CLIP-based Semantic Mining (SeMi) module first extracts human body-part-aware semantic cues, which are then employed to align 2D and 3D features via a Semantic-Guided Alignment (SGA) module within a unified embedding space. A Symmetric Cross-Attention Fusion (SCAF) module hierarchically integrates visual contours and 3D geometric features, and a Spatio-Temporal (ST) module captures global gait dynamics. Extensive experiments on various gait datasets validate the effectiveness of our method.
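
For intuition about the SCAF module described above, here is a minimal sketch of a symmetric cross-attention block in which RGB tokens and point-cloud tokens attend to each other. The module name, single-block design, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a symmetric cross-attention fusion block, in the spirit
# of the SCAF module above. Names, head count, and dimensions are assumptions.
import torch
import torch.nn as nn

class SymmetricCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # RGB tokens attend to point-cloud tokens, and vice versa.
        self.rgb_to_pc = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pc_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_pc = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, pc_tokens):
        # rgb_tokens: (B, N_rgb, dim); pc_tokens: (B, N_pc, dim)
        rgb_fused, _ = self.rgb_to_pc(rgb_tokens, pc_tokens, pc_tokens)
        pc_fused, _ = self.pc_to_rgb(pc_tokens, rgb_tokens, rgb_tokens)
        rgb_out = self.norm_rgb(rgb_tokens + rgb_fused)  # residual + norm
        pc_out = self.norm_pc(pc_tokens + pc_fused)
        return rgb_out, pc_out

# Toy usage: 2 samples, 16 RGB tokens and 32 point tokens of width 64.
block = SymmetricCrossAttention(dim=64)
rgb, pc = torch.randn(2, 16, 64), torch.randn(2, 32, 64)
rgb_out, pc_out = block(rgb, pc)
```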

AAAI Conference 2025 Conference Paper

Federated Unlearning with Gradient Descent and Conflict Mitigation

  • Zibin Pan
  • Zhichao Wang
  • Chi Li
  • Kaiyan Zheng
  • Boqi Wang
  • Xiaoying Tang
  • Junhua Zhao

Federated Learning (FL) has received much attention in recent years. However, although clients are not required to share their data in FL, the global model itself can implicitly memorize clients' local data. It is therefore necessary to effectively remove the target client's data from the FL global model to reduce the risk of privacy leakage and implement "the right to be forgotten". Federated Unlearning (FU) has been considered a promising solution for removing data without full retraining, but the model utility easily suffers a significant reduction during unlearning due to gradient conflicts. Furthermore, when post-training is conducted to recover the model utility, the model is prone to move back and revert what has already been unlearned. To address these issues, we propose Federated Unlearning with Orthogonal Steepest Descent (FedOSD). We first design an unlearning cross-entropy loss to overcome the convergence issue of gradient ascent. A steepest descent direction for unlearning is then calculated under the constraint of being non-conflicting with the other clients' gradients and closest to the target client's gradient. This makes it possible to unlearn efficiently while mitigating the reduction in model utility. After unlearning, we recover the model utility in a way that preserves what has been unlearned. Finally, extensive experiments in several FL scenarios verify that FedOSD outperforms SOTA FU algorithms in terms of both unlearning and model utility.
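
As a rough illustration of the non-conflicting descent idea in this abstract, the sketch below removes from a target client's gradient any components that conflict with the remaining clients' gradients via cyclic half-space projections. This is a simplified PCGrad-style stand-in, not the FedOSD derivation; all names and sizes are hypothetical.

```python
# A minimal sketch of a "non-conflicting" descent direction: keep the
# projected direction close to the target client's gradient while removing
# components with negative inner product against the remaining clients.
import numpy as np

def nonconflicting_direction(g_target, other_grads, sweeps=5):
    d = g_target.copy()
    for _ in range(sweeps):                 # cyclic projection onto half-spaces
        for g in other_grads:
            dot = d @ g
            if dot < 0.0:                   # conflict detected
                d -= (dot / (g @ g)) * g    # drop the conflicting component
    return d

rng = np.random.default_rng(0)
g_u = rng.normal(size=20)                   # target client's unlearning gradient
others = [rng.normal(size=20) for _ in range(5)]
d = nonconflicting_direction(g_u, others)
print(min(float(d @ g) for g in others))    # typically >= 0 after a few sweeps
```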

NeurIPS Conference 2025 Conference Paper

Generalization Bound of Gradient Flow through Training Trajectory and Data-dependent Kernel

  • Yilan Chen
  • Zhichao Wang
  • Wei Huang
  • Andi Han
  • Taiji Suzuki
  • Arya Mazumdar

Gradient-based optimization methods have shown remarkable empirical success, yet their theoretical generalization properties remain only partially understood. In this paper, we establish a generalization bound for gradient flow that aligns with the classical Rademacher complexity bounds for kernel methods—specifically those based on the RKHS norm and kernel trace—through a data-dependent kernel called the loss path kernel (LPK). Unlike static kernels such as NTK, the LPK captures the entire training trajectory, adapting to both data and optimization dynamics, leading to tighter and more informative generalization guarantees. Moreover, the bound highlights how the norm of the training loss gradients along the optimization trajectory influences the final generalization performance. The key technical ingredients in our proof combine stability analysis of gradient flow with uniform convergence via Rademacher complexity. Our bound recovers existing kernel regression bounds for overparameterized neural networks and shows the feature learning capability of neural networks compared to kernel methods. Numerical experiments on real-world datasets validate that our bounds correlate well with the true generalization gap.
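
To make the loss path kernel concrete: it accumulates inner products of per-example loss gradients along the training trajectory. The sketch below approximates the trajectory integral by a sum over SGD checkpoints on a toy model; the discretization, the tiny model, and all sizes are assumptions for illustration only.

```python
# A minimal sketch of the loss path kernel (LPK): accumulate per-example
# loss-gradient inner products across training checkpoints.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
X, y = torch.randn(8, 5), torch.randn(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def loss_grad(x, target):
    """Flattened gradient of the squared loss at one example."""
    loss = (model(x.unsqueeze(0)) - target).pow(2).mean()
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

K = torch.zeros(8, 8)            # LPK accumulated over the trajectory
for step in range(20):           # Riemann-sum stand-in for the path integral
    G = torch.stack([loss_grad(X[i], y[i]) for i in range(8)])
    K += G @ G.T                 # <grad_i, grad_j> at this checkpoint
    opt.zero_grad()              # one full-batch SGD step
    ((model(X) - y).pow(2).mean()).backward()
    opt.step()

# The kernel trace and spectrum are the quantities that enter the bound.
print(K.shape, torch.linalg.eigvalsh(K)[-1])
```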

JBHI Journal 2025 Journal Article

Hypergraph-based Audio-Visual Fusion for Obstructive Sleep Apnea Severity Estimation During Wakefulness

  • Biao Xue
  • Yanting Shao
  • Zhichao Wang
  • Chang-Hong Fu
  • Xiaohua Zhu
  • Heng Zhao
  • Jing Xu
  • Hong Hong

Obstructive sleep apnea (OSA) is associated with psychophysiological impairments, and recent studies have shown the feasibility of using speech and craniofacial images during wakefulness for severity estimation. However, the inherent limitations of unimodal data constrain the performance of current methods. To address this, we propose a novel hypergraph-based multimodal fusion framework (HMFusion) that integrates psychophysiological information from audio-visual data. Specifically, we employ long short-term memory (LSTM)-based encoders to extract modality-specific temporal dynamics from pre-trained audio-visual embeddings and remote photoplethysmography (rPPG)-derived heart rate sequences. A hypergraph neural network is then utilized to capture critical cross-modal interactions for OSA severity estimation. Evaluation on a dataset of 159 participants from a clinical sleep center demonstrates that the proposed model achieves areas under the receiver operating characteristic curve (AUCs) of 88.26%, 86.07%, and 85.29%, with corresponding F1-scores of 92.91%, 85.50%, and 85.30% at Apnea-Hypopnea Index (AHI) thresholds of 5, 15, and 30 events/hour, respectively, outperforming state-of-the-art approaches. This study highlights the potential of psychophysiological data in enhancing OSA severity estimation during wakefulness, offering new avenues for clinical research in this field.
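
For a concrete picture of the hypergraph step, the sketch below applies one HGNN-style hypergraph convolution, the kind of operation that pools node features through shared hyperedges. The incidence matrix, normalization, and dimensions are toy assumptions, not the paper's architecture.

```python
# One HGNN-style hypergraph convolution step on toy node features.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_edges, d_in, d_out = 6, 3, 4, 2
H = rng.integers(0, 2, size=(n_nodes, n_edges)).astype(float)  # incidence matrix
H[H.sum(axis=1) == 0, 0] = 1.0        # every node joins at least one hyperedge
H[0, H.sum(axis=0) == 0] = 1.0        # every hyperedge has at least one node
X = rng.normal(size=(n_nodes, d_in))  # node features (e.g., fused AV embeddings)
Theta = rng.normal(size=(d_in, d_out))

Dv = np.diag(1.0 / np.sqrt(H.sum(axis=1)))  # node-degree normalization
De = np.diag(1.0 / H.sum(axis=0))           # hyperedge-degree normalization
X_next = np.tanh(Dv @ H @ De @ H.T @ Dv @ X @ Theta)
print(X_next.shape)  # (6, 2): messages pooled through shared hyperedges
```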

JBHI Journal 2025 Journal Article

Knowledge Guided Articulatory and Spectrum Information Fusion for Obstructive Sleep Apnea Severity Estimation

  • Biao Xue
  • Zhichao Wang
  • Yanting Shao
  • Xiaohua Zhu
  • Heng Zhao
  • Chang-Hong Fu
  • Jing Xu
  • Ning Ding

Numerous studies have demonstrated that speech analysis during wakefulness is a non-invasive and convenient method for obstructive sleep apnea (OSA) screening. However, the inherent differences in upper airway structure and function between wakefulness and sleep limit the effectiveness of OSA assessments based on the vowels and phonemes employed in existing studies. To address this challenge, we propose the design of controlled articulations that more accurately simulate upper airway obstruction during sleep, offering a more comprehensive reflection of the pathological changes in upper airway anatomy and function in individuals with suspected OSA. Specifically, we constructed a Mandarin Chinese controlled articulation dataset, consisting of speech recordings from 301 male adult participants who underwent polysomnography (PSG) monitoring at a sleep center. Drawing on domain knowledge, we thoroughly investigated articulations associated with upper airway collapse, including vowels, pharyngeals, and nasals, and identified interpretable optimal articulations using SHapley Additive Explanations (SHAP). Furthermore, we introduced a dual-stream fusion model, PTF-Net, which employs the Paralinguistic Acoustic Feature stream (PAF-Stream) to extract the physical attributes of speech and the Transfer Learning-based Spectrogram Feature stream (TLE-Stream) to capture the nonlinear features of upper airway dynamics. The Swin Transformer is utilized to integrate both local and global information from various articulations. Experimental results demonstrate that the knowledge-guided PTF-Net outperforms existing methods in OSA severity assessment, improving the Area Under the Curve (AUC) by 5.1% and the Unweighted Average Recall (UAR) by 5.8%. In addition, we revealed that the proposed deep embedding of controlled articulation could differentiate between the types of obstruction sites identified by drug-induced sleep endoscopy (DISE), suggesting its potential as a novel digital biomarker for upper airway assessment in OSA patients. This study enhances the understanding of speech-based OSA screening and paves the way for its broad clinical application.

ICML Conference 2025 Conference Paper

Models of Heavy-Tailed Mechanistic Universality

  • Liam Hodgkinson
  • Zhichao Wang
  • Michael W. Mahoney

Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by the observation that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In particular, the prevalence of heavy-tailed spectral densities in Jacobians, Hessians, and weight matrices has led to the introduction of the concept of heavy-tailed mechanistic universality (HT-MU). Multiple lines of empirical evidence suggest a robust correlation between heavy-tailed metrics and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random matrix models, the high-temperature Marchenko-Pastur (HTMP) ensemble, to explore attributes that give rise to heavy-tailed behavior in trained neural networks. Under this model, spectral densities with power laws on (upper and lower) tails arise through a combination of three independent factors (complex correlation structures in the data; reduced temperatures during training; and reduced eigenvector entropy), appearing as an implicit bias in the model structure, and they can be controlled with an "eigenvalue repulsion" parameter. Implications of our model for other appearances of heavy tails, including neural scaling laws, optimizer trajectories, and the five-plus-one phases of neural network training, are discussed.
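
One common way to quantify the heavy-tailed spectra this abstract refers to is to fit a power-law tail index to the eigenvalues of W^T W. The sketch below uses a Hill estimator on a synthetic heavy-tailed matrix; the estimator and cutoff choices are assumptions, not tied to the HTMP model itself.

```python
# Hill estimate of the power-law tail index of a weight spectrum.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_t(df=3, size=(500, 300))        # heavy-tailed stand-in weights
eigs = np.sort(np.linalg.eigvalsh(W.T @ W / 500))
k = 50                                           # number of tail eigenvalues
hill = 1.0 / np.mean(np.log(eigs[-k:] / eigs[-k - 1]))
print(f"estimated tail exponent: {hill:.2f}")    # smaller = heavier tail
```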

JBHI Journal 2025 Journal Article

Multi-Scale Group Agent Attention-Based Graph Convolutional Decoding Networks for 2D Medical Image Segmentation

  • Zhichao Wang
  • Lin Guo
  • Shuchang Zhao
  • Shiqing Zhang
  • Xiaoming Zhao
  • Jiangxiong Fang
  • Guoyu Wang
  • Hongsheng Lu

Automated medical image segmentation plays a crucial role in assisting doctors in diagnosing diseases. Feature decoding is a critical yet challenging issue for medical image segmentation. To address this issue, this work proposes a novel feature decoding network, called multi-scale group agent attention-based graph convolutional decoding networks (MSGAA-GCDN), to learn local-global features in graph structures for 2D medical image segmentation. The proposed MSGAA-GCDN combines a graph convolutional network (GCN) and a lightweight multi-scale group agent attention (MSGAA) mechanism to represent features globally and locally within a graph structure. Moreover, in the skip connections, a simple yet efficient attention-based upsampling convolution fusion (AUCF) module is designed to enhance encoder-decoder feature fusion in both channel and spatial dimensions. Extensive experiments are conducted on three typical medical image segmentation tasks, namely Synapse abdominal multi-organ, cardiac organ, and polyp lesion segmentation. Experimental results demonstrate that the proposed MSGAA-GCDN outperforms the state-of-the-art methods, and the designed MSGAA is a lightweight yet effective attention architecture. The proposed MSGAA-GCDN can easily be used as a plug-and-play decoder cascaded with other encoders for general medical image segmentation tasks.
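
The "group agent attention" in MSGAA builds on agent attention, where a small set of agent tokens mediates between queries and keys so that no full N x N attention map is formed. Below is a minimal single-head sketch of that underlying mechanism; the pooling choice and shapes are assumptions, not the paper's module.

```python
# Single-head agent attention: two cheap softmax attentions through a small
# set of pooled agent tokens instead of one N x N attention map.
import torch
import torch.nn.functional as F

B, N, A, d = 2, 196, 16, 32          # batch, tokens, agents, channel width
Q, K, V = (torch.randn(B, N, d) for _ in range(3))
agents = F.adaptive_avg_pool1d(Q.transpose(1, 2), A).transpose(1, 2)  # (B, A, d)

agent_values = F.softmax(agents @ K.transpose(1, 2) / d**0.5, dim=-1) @ V    # (B, A, d)
out = F.softmax(Q @ agents.transpose(1, 2) / d**0.5, dim=-1) @ agent_values  # (B, N, d)
print(out.shape)
```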

NeurIPS Conference 2025 Conference Paper

UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss

  • Zhichao Wang
  • Xinhai Chen
  • Qinglin Wang
  • Xiang Gao
  • Qingyang Zhang
  • Menghan Jia
  • Xiang Zhang
  • Jie Liu

Partial differential equations (PDEs) form the mathematical foundation for modeling physical systems in science and engineering, where numerical solutions demand rigorous accuracy-efficiency tradeoffs. Mesh movement techniques address this challenge by dynamically relocating mesh nodes to rapidly-varying regions, enhancing both simulation accuracy and computational efficiency. However, traditional approaches suffer from high computational complexity and geometric inflexibility, limiting their applicability, and existing supervised learning-based approaches face challenges in zero-shot generalization across diverse PDEs and mesh topologies. In this paper, we present an $\textbf{U}$nsupervised and $\textbf{G}$eneralizable $\textbf{M}$esh $\textbf{M}$ovement $\textbf{N}$etwork (UGM2N). We first introduce unsupervised mesh adaptation through localized geometric feature learning, eliminating the dependency on pre-adapted meshes. We then develop a physics-constrained loss function, M-Uniform loss, that enforces mesh equidistribution at the nodal level. Experimental results demonstrate that the proposed network exhibits equation-agnostic generalization and geometric independence in efficient mesh adaptation. It demonstrates consistent superiority over existing methods, including robust performance across diverse PDEs and mesh geometries, scalability to multi-scale resolutions and guaranteed error reduction without mesh tangling.
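
The M-Uniform loss enforces equidistribution: on an M-uniform mesh, every cell carries the same monitor-weighted volume. The 1D sketch below penalizes deviation from that property and is differentiable in the node positions; the 1D setting and the specific penalty form are simplifications of the paper's nodal loss, written under my own assumptions.

```python
# A 1D differentiable equidistribution penalty as a stand-in for the
# M-Uniform loss: all cells should carry equal monitor-weighted length.
import torch

def m_uniform_loss(nodes: torch.Tensor, monitor) -> torch.Tensor:
    """nodes: sorted 1D mesh nodes; monitor: density function m(x) > 0."""
    mids = 0.5 * (nodes[1:] + nodes[:-1])
    cell_mass = monitor(mids) * (nodes[1:] - nodes[:-1])  # ~ integral of m per cell
    return ((cell_mass - cell_mass.mean()) ** 2).mean()   # equidistribution penalty

nodes = torch.linspace(0.0, 1.0, 21, requires_grad=True)
monitor = lambda x: 1.0 + 50.0 * torch.exp(-200.0 * (x - 0.5) ** 2)  # sharp feature
loss = m_uniform_loss(nodes, monitor)
loss.backward()  # gradients pull nodes toward the steep region at x = 0.5
print(loss.item())
```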

JMLR Journal 2025 Journal Article

Universality of Kernel Random Matrices and Kernel Regression in the Quadratic Regime

  • Parthe Pandit
  • Zhichao Wang
  • Yizhe Zhu

Kernel ridge regression (KRR) is a popular class of machine learning models that has become an important tool for understanding deep learning. Much of the focus thus far has been on studying the proportional asymptotic regime, $n \asymp d$, where $n$ is the number of training samples and $d$ is the dimension of the dataset. In the proportional regime, under certain conditions on the data distribution, the kernel random matrix involved in KRR exhibits behavior akin to that of a linear kernel. In this work, we extend the study of kernel regression to the quadratic asymptotic regime, where $n \asymp d^2$. In this regime, we demonstrate that a broad class of inner-product kernels exhibits behavior similar to a quadratic kernel. Specifically, we establish an operator norm approximation bound for the difference between the original kernel random matrix and a quadratic kernel random matrix with additional correction terms compared to the Taylor expansion of the kernel functions. The approximation works for general data distributions under a Gaussian-moment-matching assumption with a covariance structure. This new approximation is utilized to obtain a limiting spectral distribution of the original kernel matrix and characterize the precise asymptotic training and test errors for KRR in the quadratic regime when $n/d^2$ converges to a non-zero constant. The generalization errors are obtained for (i) a random teacher model and (ii) a deterministic teacher model whose weights are perfectly aligned with the covariance of the data. Under the random teacher model setting, we also verify that the generalized cross-validation (GCV) estimator can consistently estimate the generalization error in the quadratic regime for anisotropic data. Our proof techniques combine moment methods, Wick's formula, orthogonal polynomials, and resolvent analysis of random matrices with correlated entries.
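
Schematically, the quadratic-regime approximation replaces a smooth inner-product kernel by a degree-2 truncation of its Taylor expansion; the paper's result includes precise correction terms and operator-norm error bounds that are omitted in this sketch:

```latex
% Schematic only: centering conventions and the correction terms
% established in the paper are omitted.
K_{ij} \;=\; f\!\left(\frac{\langle x_i, x_j\rangle}{d}\right)
\;\approx\; f(0) \;+\; f'(0)\,\frac{\langle x_i, x_j\rangle}{d}
\;+\; \frac{f''(0)}{2}\,\frac{\langle x_i, x_j\rangle^{2}}{d^{2}},
\qquad i \neq j,\quad n \asymp d^{2}.
```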

ICLR Conference 2024 Conference Paper

Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models

  • Andrew Engel
  • Zhichao Wang
  • Natalie Frank
  • Ioana Dumitriu
  • Sutanay Choudhury
  • Anand D. Sarwate
  • Tony Chiang

A recent trend in explainable AI research has focused on surrogate modeling, where neural networks are approximated by simpler ML algorithms such as kernel machines. A second trend has been to utilize kernel functions in various explain-by-example or data attribution tasks. In this work, we combine these two trends to analyze approximate empirical neural tangent kernels (eNTK) for data attribution. Approximation is critical for eNTK analysis due to the high computational cost of computing the eNTK. We define new approximate eNTKs and perform novel analysis of how well the resulting kernel machine surrogate models correlate with the underlying neural network. We introduce two new random projection variants of the approximate eNTK which allow users to tune the time and memory complexity of their calculation. We conclude that kernel machines using the approximate neural tangent kernel as the kernel function are effective surrogate models, with the introduced trace NTK the most consistent performer.
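
As a sketch of the random-projection idea: per-example parameter gradients are compressed with a Gaussian sketch before taking inner products, so the resulting Gram matrix approximates an empirical NTK at a fraction of the cost. The scalarization of the logits and all sizes below are illustrative assumptions that only loosely mirror one of the paper's variants.

```python
# Random-projection approximation to an empirical NTK Gram matrix.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
X = torch.randn(6, 10)
n_params = sum(p.numel() for p in model.parameters())
P = torch.randn(256, n_params) / 256**0.5       # JL-style projection matrix

def sketched_grad(x):
    out = model(x.unsqueeze(0)).sum()           # scalarize the logits (assumption)
    grads = torch.autograd.grad(out, model.parameters())
    return P @ torch.cat([g.reshape(-1) for g in grads])

G = torch.stack([sketched_grad(X[i]) for i in range(6)])
K_approx = G @ G.T                              # approximate eNTK Gram matrix
print(K_approx.shape)
```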

ICML Conference 2024 Conference Paper

Optimal Exact Recovery in Semi-Supervised Learning: A Study of Spectral Methods and Graph Convolutional Networks

  • Haixiao Wang
  • Zhichao Wang

We delve into the challenge of semi-supervised node classification on the Contextual Stochastic Block Model (CSBM) dataset. Here, nodes from the two-cluster Stochastic Block Model (SBM) are coupled with feature vectors, which are derived from a Gaussian Mixture Model (GMM) that corresponds to their respective node labels. With only a subset of the CSBM node labels accessible for training, our primary objective becomes the accurate classification of the remaining nodes. In the transductive learning setting, we pinpoint, for the first time, the information-theoretic threshold for the exact recovery of all test nodes in CSBM. Concurrently, we design an optimal spectral estimator inspired by Principal Component Analysis (PCA) that uses the training labels and essential data from both the adjacency matrix and the feature vectors. We also evaluate the efficacy of graph ridge regression and Graph Convolutional Networks (GCN) on this synthetic dataset. Our findings underscore that graph ridge regression and GCN possess the ability to achieve the information threshold of exact recovery in a manner akin to the optimal estimator when using optimally weighted self-loops. This highlights the potential role of feature learning in augmenting the proficiency of GCN, especially in the realm of semi-supervised learning.
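
For readers unfamiliar with the CSBM, the sketch below samples a small instance: a two-cluster SBM adjacency paired with Gaussian-mixture features whose means are tied to the node labels, plus a revealed training subset. All parameter values are toy choices and do not reflect the theoretically interesting scalings in the paper.

```python
# Sample a toy Contextual Stochastic Block Model (CSBM) instance.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
y = rng.choice([-1, 1], size=n)                   # cluster labels
p_in, p_out = 0.10, 0.02                          # within/between edge probabilities
probs = np.where(np.equal.outer(y, y), p_in, p_out)
U = np.triu(rng.random((n, n)) < probs, k=1).astype(float)
A = U + U.T                                       # symmetric adjacency, no self-loops

mu = rng.normal(size=d) / np.sqrt(d)              # GMM mean tied to the labels
X = np.outer(y, mu) + rng.normal(size=(n, d)) / np.sqrt(d)

train_mask = rng.random(n) < 0.1                  # 10% of labels revealed
print(A.shape, X.shape, int(train_mask.sum()), "training labels")
```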

NeurIPS Conference 2023 Conference Paper

Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective

  • Jimmy Ba
  • Murat A. Erdogdu
  • Taiji Suzuki
  • Zhichao Wang
  • Denny Wu

We consider the learning of a single-index target function $f_*: \mathbb{R}^d\to\mathbb{R}$ under spiked covariance data: $$f_*(\boldsymbol{x}) = \textstyle\sigma_*(\frac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x}, \boldsymbol{\mu}\rangle), ~~ \boldsymbol{x}\overset{\small\mathrm{i.i.d.}}{\sim}\mathcal{N}(0, \boldsymbol{I_d} + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top), ~~ \theta\asymp d^{\beta} \text{ for } \beta\in[0, 1), $$ where the link function $\sigma_*: \mathbb{R}\to\mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and it depends on the projection of input $\boldsymbol{x}$ onto the spike (signal) direction $\boldsymbol{\mu}\in\mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge: $n, d\to\infty, n/d\to\psi\in(0, \infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., the strength of the low-dimensional component) be, in order for $(i)$ kernel methods, $(ii)$ neural networks optimized by gradient descent, to learn $f_*$? We show that for kernel ridge regression, $\beta\ge 1-\frac{1}{p}$ is both sufficient and necessary. Whereas for two-layer neural networks trained with gradient descent, $\beta>1-\frac{1}{k}$ suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structures in the data. Further, since $k\le p$ by definition, neural networks can adapt to such structures more effectively.

NeurIPS Conference 2023 Conference Paper

Spectral Evolution and Invariance in Linear-width Neural Networks

  • Zhichao Wang
  • Andrew Engel
  • Anand D Sarwate
  • Ioana Dumitriu
  • Tony Chiang

We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to the network width. Empirically, we show that the spectra of the weight matrices in this high-dimensional regime are invariant when trained by gradient descent with small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy-tailed behavior. Simple examples are provided to explain when heavy tails can lead to better generalization. We exhibit different spectral properties, such as an invariant bulk, spikes, and heavy-tailed distributions, in a two-layer neural network under different training strategies, and relate them to feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.
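
The kind of spectrum monitoring advocated above can be done in a few lines: record the empirical spectral distribution of a layer's weights before and after training and compare the bulk. The sketch below does this for a toy linear-width network with a small constant learning rate, where the bulk is expected to barely move; the sizes, data, and step count are assumptions.

```python
# Track the empirical spectral distribution (ESD) of first-layer weights
# across SGD training in a toy linear-width network.
import torch

torch.manual_seed(0)
n, d, width = 512, 256, 256                     # linear-width regime: n ~ width
X, y = torch.randn(n, d), torch.randn(n, 1)
model = torch.nn.Sequential(torch.nn.Linear(d, width), torch.nn.ReLU(),
                            torch.nn.Linear(width, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)  # small constant learning rate

def esd(W):  # eigenvalues of W W^T / d, whose bulk is being tracked
    return torch.linalg.eigvalsh(W @ W.T / W.shape[1])

before = esd(model[0].weight.detach().clone())
for _ in range(200):
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(X), y).backward()
    opt.step()
after = esd(model[0].weight.detach())
# With a small learning rate the bulk barely moves (the invariance above).
print((before - after).abs().max().item())
```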

NeurIPS Conference 2022 Conference Paper

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

  • Jimmy Ba
  • Murat A. Erdogdu
  • Taiji Suzuki
  • Zhichao Wang
  • Denny Wu
  • Greg Yang

We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) = \frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}^\top\boldsymbol{x})$, where $\boldsymbol{W}\in\mathbb{R}^{d\times N}, \boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i)-y_i)^2$. In the proportional asymptotic limit where $n, d, N\to\infty$ at the same rate, and an idealized student-teacher setting where the teacher $f^*$ is a single-index model, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on $\boldsymbol{W}$ with learning rate $\eta$. We consider two scalings of the first step learning rate $\eta$. For small $\eta$, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random features model, but cannot defeat the best linear model on the input. Whereas for sufficiently large $\eta$, we prove that for certain $f^*$, the same ridge estimator on trained features can go beyond this ``linear regime'' and outperform a wide range of (fixed) kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.
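
The following numerical sketch mirrors this setup on a toy scale: one full-batch gradient step on $\boldsymbol{W}$ (with the second layer frozen, a simplification), followed by ridge regression on the resulting first-layer features. The scalings and hyperparameters are illustrative only and do not reproduce the paper's precise learning-rate regimes.

```python
# One gradient step on the first layer, then ridge regression on the
# resulting conjugate-kernel features, compared against random features.
import numpy as np

rng = np.random.default_rng(0)
n, d, N, eta, ridge = 400, 200, 200, 2.0, 1e-2
X = rng.normal(size=(n, d))
beta = rng.normal(size=d); beta /= np.linalg.norm(beta)
y = np.tanh(X @ beta)                          # single-index teacher

W = rng.normal(size=(d, N)) / np.sqrt(d)       # first layer
a = rng.normal(size=N)                         # second layer, frozen here

def feats(Wk):
    return np.tanh(X @ Wk)                     # first-layer feature map

# one full-batch gradient step of the MSE loss with respect to W
pred = feats(W) @ a / np.sqrt(N)
delta = np.outer(pred - y, a / np.sqrt(N)) * (1.0 - feats(W) ** 2)  # tanh'
W1 = W - eta * (2.0 / n) * (X.T @ delta)

def ridge_mse(F):                              # ridge fit on given features
    w = np.linalg.solve(F.T @ F + ridge * np.eye(N), F.T @ y)
    return float(np.mean((F @ w - y) ** 2))

print("random features   MSE:", round(ridge_mse(feats(W)), 4))
print("one-step features MSE:", round(ridge_mse(feats(W1)), 4))
```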

NeurIPS Conference 2020 Conference Paper

Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks

  • Zhou Fan
  • Zhichao Wang

We study the eigenvalue distributions of the Conjugate Kernel and Neural Tangent Kernel associated to multi-layer feedforward neural networks. In an asymptotic regime where network width increases linearly in sample size, under random initialization of the weights, and for input samples satisfying a notion of approximate pairwise orthogonality, we show that the eigenvalue distributions of the CK and NTK converge to deterministic limits. The limit for the CK is described by iterating the Marchenko-Pastur map across the hidden layers. The limit for the NTK is equivalent to that of a linear combination of the CK matrices across layers, and may be described by recursive fixed-point equations that extend this Marchenko-Pastur map. We demonstrate the agreement of these asymptotic predictions with the observed spectra for both synthetic and CIFAR-10 training data, and we perform a small simulation to investigate the evolution of these spectra over training.

AAAI Conference 2017 Conference Paper

Riemannian Submanifold Tracking on Low-Rank Algebraic Variety

  • Qian Li
  • Zhichao Wang

Matrix recovery aims to learn a low-rank structure from high-dimensional data, which arises in numerous learning applications. As a popular heuristic for matrix recovery, convex relaxation involves iterative calls to singular value decomposition (SVD). Riemannian optimization-based methods can alleviate the expensive cost of SVD for improved scalability, but they are usually degraded when the rank is unknown. This paper proposes a novel algorithm, RIST, that exploits the algebraic variety of the low-rank manifold for matrix recovery. In particular, RIST utilizes an efficient scheme that automatically estimates the potential rank on the real algebraic variety and tracks the favorable Riemannian submanifold. Moreover, RIST exploits a second-order geometric characterization and achieves provable superlinear convergence, which is superior to the linear convergence of most existing methods. Extensive comparison experiments demonstrate the accuracy and efficiency of the RIST algorithm.