EAAI Journal 2026 Journal Article
A self-adaptive transformer-enhanced physics-informed neural network for railway dynamics system
- Chengjia Han
- Shuai Qu
- Yun Yang
- Maggie Y. Gao
- Liwei Dong
- Fan Yang
- Tao Ma
- Yaowen Yang
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Vision Transformers (ViTs) have gained significant attention and widespread adoption due to their impressive performance in various computer vision tasks. However, in practice, their substantial computational overhead often leads to high inference latency and increased overheads when deployed on resource-constrained edge devices like smartphones, autonomous vehicles, and robots. To address these challenges, Early Exit (EE) has emerged as a promising approach for lightweight inference on edge devices. It accelerates inference and reduces computational overhead by adaptively producing predictions through early exits based on sample complexity. Existing EE methods typically suffer from substantial accuracy decreases in late exits while providing only marginal accuracy improvements to early exits. This paper presents EnViT, an exit-aware structured dropout-enabled self-distillation approach that enhances the performance of early exits without compromising late exits. EnViT leverages structured dropout to enable self-distillation, where the full model serves as the teacher and its own virtual sub-models generated by structured dropout as students. This mechanism effectively distills knowledge from the full model to early exits and avoids performance degradation in late exits by mitigating parameter conflicts across exits during training. Evaluation on five datasets shows that our EnViT achieves accuracy improvements ranging from 0.36% to 7.92% while maintaining competitive speed-up ratios of 1.72x to 2.23x.
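The inference-time idea described above, returning a prediction at the first exit that is confident enough, can be sketched as follows. This is a minimal illustration of confidence-thresholded early exit in general, not EnViT's training procedure; the names `early_exit_predict`, `blocks`, `heads`, and `threshold` are hypothetical, and using the maximum softmax probability as the exit criterion is an assumption.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_predict(x, blocks, heads, threshold=0.9):
    """Run blocks in order and stop at the first sufficiently confident exit.

    blocks: non-empty list of callables, each mapping features to features.
    heads:  list of callables (one per block) mapping features to class logits.
    Returns (predicted_class, exit_index).
    """
    h = x
    for i, (block, head) in enumerate(zip(blocks, heads)):
        h = block(h)
        probs = softmax(head(h))
        # Assumed exit rule: leave as soon as the top probability clears the threshold.
        if probs.max() >= threshold:
            return int(probs.argmax()), i
    # No exit was confident enough, so fall back to the final exit's prediction.
    return int(probs.argmax()), len(blocks) - 1
```

Under this rule, easy samples leave at an early head and skip the remaining blocks, while hard samples pay for the full model.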
ICML Conference 2025 Conference Paper
In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.
ICLR Conference 2025 Conference Paper
We consider a class of conditional forward-backward diffusion models for conditional generative modeling, that is, generating new data given a covariate (or control variable). To formally study the theoretical properties of these conditional generative models, we adopt a statistical framework of distribution regression to characterize the large sample properties of the conditional distribution estimators induced by these conditional forward-backward diffusion models. Here, the conditional distribution of data is assumed to smoothly change over the covariate. In particular, our derived convergence rate is minimax-optimal under the total variation metric within the regimes covered by the existing literature. Additionally, we extend our theory by allowing both the data and the covariate variable to potentially admit a low-dimensional manifold structure. In this scenario, we demonstrate that the conditional forward-backward diffusion model can adapt to both manifold structures, meaning that the derived estimation error bound (under the Wasserstein metric) depends only on the intrinsic dimensionalities of the data and the covariate.
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
KV cache technology, by storing key-value pairs, helps reduce the computational overhead incurred by large language models (LLMs). It facilitates their deployment on resource-constrained edge computing nodes like edge servers. However, as the complexity and size of tasks increase, KV cache usage leads to substantial GPU memory consumption. Existing research has focused on mitigating KV cache memory usage through sequence length reduction, task-specific compression, and dynamic eviction policies. However, these methods are computationally expensive for resource-constrained edge computing nodes. To tackle this challenge, this paper presents Sim-LLM, a novel inference optimization mechanism that leverages task similarity to reduce KV cache memory consumption for LLMs. By caching KVs from processed tasks and reusing them for subsequent similar tasks during inference, Sim-LLM significantly reduces memory consumption while boosting system throughput and increasing maximum batch size, all with minimal accuracy degradation. Evaluated on both A40 and A100 GPUs, Sim-LLM achieves a system throughput improvement of up to 39.40% and a memory reduction of up to 34.65%, compared to state-of-the-art approaches. Our source code is available at https://github.com/CGCL-codes/SimLLM.
ICML Conference 2025 Conference Paper
Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional mean functions involved in the population measure. The test statistic is thoughtfully constructed to ensure that even slowly decaying nonparametric estimation errors do not affect the asymptotic accuracy of the test. Our approach demonstrates strong empirical performance in scenarios with high-dimensional covariates and response variable, can handle multivariate responses, and maintains nontrivial power against local alternatives outside an $n^{-1/2}$ neighborhood of the null hypothesis. We also use numerical simulations and real-world imaging data applications to highlight the efficacy and versatility of our testing procedure.
JMLR Journal 2024 Journal Article
In this paper, we examine the computational complexity of sampling from a Bayesian posterior (or pseudo-posterior) using the Metropolis-adjusted Langevin algorithm (MALA). MALA first employs a discrete-time Langevin SDE to propose a new state, and then adjusts the proposed state using Metropolis-Hastings rejection. Most existing theoretical analyses of MALA rely on the smoothness and strong log-concavity properties of the target distribution, which are often lacking in practical Bayesian problems. Our analysis hinges on statistical large sample theory, which constrains the deviation of the Bayesian posterior from being smooth and log-concave in a very specific way. In particular, we introduce a new technique for bounding the mixing time of a Markov chain with a continuous state space via the $s$-conductance profile, offering improvements over existing techniques in several aspects. By employing this new technique, we establish the optimal parameter dimension dependence of $d^{1/3}$ and condition number dependence of $\kappa$ in the non-asymptotic mixing time upper bound for MALA after the burn-in period, under a standard Bayesian setting where the target posterior distribution is close to a $d$-dimensional Gaussian distribution with a covariance matrix having a condition number $\kappa$. We also prove a matching mixing time lower bound for sampling from a multivariate Gaussian via MALA to complement the upper bound.
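The two MALA steps named in the abstract, a discretized Langevin proposal followed by a Metropolis-Hastings accept/reject, can be sketched as below. This is a textbook-style illustration, not the paper's code; the function name `mala`, its arguments, and the fixed step size are assumptions made for the example.

```python
import numpy as np

def mala(log_density, grad_log_density, x0, step, n_steps, rng):
    """Metropolis-adjusted Langevin sampler (illustrative sketch).

    log_density and grad_log_density evaluate the log target (up to an
    additive constant) and its gradient. Returns (samples, acceptance_rate).
    """
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    accepted = 0

    def log_q(x_to, x_from):
        # Log density, up to a constant, of the Gaussian proposal x_from -> x_to,
        # which has mean x_from + step * grad and covariance 2 * step * I.
        mean = x_from + step * grad_log_density(x_from)
        return -np.sum((x_to - mean) ** 2) / (4.0 * step)

    for t in range(n_steps):
        # Proposal: one Euler-Maruyama step of the Langevin SDE.
        noise = rng.standard_normal(x.size)
        proposal = x + step * grad_log_density(x) + np.sqrt(2.0 * step) * noise
        # Metropolis-Hastings correction for the discretization error.
        log_alpha = (log_density(proposal) + log_q(x, proposal)
                     - log_density(x) - log_q(proposal, x))
        if np.log(rng.uniform()) < log_alpha:
            x = proposal
            accepted += 1
        samples[t] = x
    return samples, accepted / n_steps
```

For a standard Gaussian target one would pass `log_density=lambda x: -0.5 * np.sum(x**2)` and `grad_log_density=lambda x: -x`. How the step size should scale with the dimension $d$ and the condition number $\kappa$ is the subject of the paper's analysis and is not encoded here.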
ICLR Conference 2024 Conference Paper
$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Recently, semidefinite programming (SDP) relaxations have been proposed for solving the $K$-means optimization problem, which enjoy strong statistical optimality guarantees. However, the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. In contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm widely used by machine learning practitioners, but it lacks a solid statistical underpinning and theoretical guarantees. In this paper, we consider an NMF-like algorithm that solves a nonnegative low-rank restriction of the SDP-relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is as simple and scalable as state-of-the-art NMF algorithms while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves significantly smaller mis-clustering errors compared to the existing state-of-the-art while maintaining scalability.
JMLR Journal 2024 Journal Article
Motivated by approximate Bayesian computation using mean-field variational approximation and the computation of equilibrium in multi-species systems with cross-interaction, this paper investigates the composite geodesically convex optimization problem over multiple distributions. The objective functional under consideration is composed of a convex potential energy on a product of Wasserstein spaces and a sum of convex self-interaction and internal energies associated with each distribution. To efficiently solve this problem, we introduce the Wasserstein Proximal Coordinate Gradient (WPCG) algorithms with parallel, sequential, and random update schemes. Under a quadratic growth (QG) condition that is weaker than the usual strong convexity requirement on the objective functional, we show that WPCG converges exponentially fast to the unique global optimum. In the absence of the QG condition, WPCG is still demonstrated to converge to the global optimal solution, albeit at a slower polynomial rate. Numerical results for both motivating examples are consistent with our theoretical findings.
ICML Conference 2023 Conference Paper
Graph coarsening is a technique for solving large-scale graph problems by working on a smaller version of the original graph, and possibly interpolating the results back to the original graph. It has a long history in scientific computing and has recently gained popularity in machine learning, particularly in methods that preserve the graph spectrum. This work studies graph coarsening from a different perspective, developing a theory for preserving graph distances and proposing a method to achieve this. The geometric approach is useful when working with a collection of graphs, such as in graph classification and regression. In this study, we consider a graph as an element on a metric space equipped with the Gromov–Wasserstein (GW) distance, and bound the difference between the distance of two graphs and their coarsened versions. Minimizing this difference can be done using the popular weighted kernel $K$-means method, which improves existing spectrum-preserving methods with the proper choice of the kernel. The study includes a set of experiments to support the theory and method, including approximating the GW distance, preserving the graph spectrum, classifying graphs using spectral information, and performing regression using graph convolutional networks. Code is available at https://github.com/ychen-stat-ml/GW-Graph-Coarsening.
TMLR Journal 2023 Journal Article
Multiple sampling-based methods have been developed for approximating and accelerating node embedding aggregation in graph convolutional networks (GCNs) training. Among them, a layer-wise approach recursively performs importance sampling to select neighbors jointly for existing nodes in each layer. This paper revisits the approach from a matrix approximation perspective, and identifies two issues in the existing layer-wise sampling methods: suboptimal sampling probabilities and estimation biases induced by sampling without replacement. To address these issues, we accordingly propose two remedies: a new principle for constructing sampling probabilities and an efficient debiasing algorithm. The improvements are demonstrated by extensive analyses of estimation variance and experiments on common benchmarks. Code and algorithm implementations are publicly available at https://github.com/ychen-stat-ml/GCN-layer-wise-sampling.
JMLR Journal 2023 Journal Article
Covariate measurement error in nonparametric regression is a common problem in nutritional epidemiology, geostatistics, and other fields. Over the last two decades, this problem has received substantial attention in the frequentist literature. Bayesian approaches for handling measurement error have only been explored recently and are surprisingly successful, although a proper theoretical justification of the asymptotic performance of the estimators is still lacking. By specifying a Gaussian process prior on the regression function and a Dirichlet process Gaussian mixture prior on the unknown distribution of the unobserved covariates, we show that the posterior distributions of the regression function and the unknown covariate density attain optimal rates of contraction adaptively over a range of Hölder classes, up to logarithmic terms. We also develop a novel surrogate prior for approximating the Gaussian process prior that leads to efficient computation and preserves the covariance structure, thereby facilitating easy prior elicitation. We demonstrate the empirical performance of our approach and compare it with competitors in a wide range of simulation experiments and a real data example.
ICML Conference 2023 Conference Paper
Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework to tackle data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves non-convex and high-dimensional objective functions, imposing difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that maximizes a surrogate function minorizing the log-likelihood of observed data in each iteration, which however suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxed $K$-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters and propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroids estimation – a key feature that allows exact recovery under well-separateness of centroids without being trapped by their adversarial configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. Our numeric experiments demonstrate that iLA-SDP can achieve lower mis-clustering errors over several widely used clustering methods including $K$-means, SDP and EM algorithms.
AAAI Conference 2023 Conference Paper
Multi-task learning models based on the temporal smoothness assumption, in which each time point in a sequence corresponds to a prediction task, assume that adjacent tasks are similar to each other. However, the effect of outliers is not taken into account. In this paper, we show that even a single outlier task can destroy the performance of the entire model. To solve this problem, we propose two Robust Temporal Smoothness (RoTS) frameworks. Compared with existing models based on temporal relations, our methods not only capture the temporal smoothness information but also identify outlier tasks, without increasing the computational complexity. Detailed theoretical analyses are presented to evaluate the performance of our methods. Experimental results on synthetic and real-life datasets demonstrate the effectiveness of our frameworks. We also discuss several potential applications and extensions of our RoTS frameworks.
NeurIPS Conference 2022 Conference Paper
Clustering is an important exploratory data analysis technique to group objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a fewer number of groups. In the Euclidean space, centroid-based and distance-based formulations of the $K$-means are equivalent. In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric. Due to non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behaviors of Wasserstein barycenters may make the centroid-based formulation fail to represent the within-cluster data points, while the more direct distance-based $K$-means approach and its semidefinite program (SDP) relaxation are capable of recovering the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP relaxed Wasserstein $K$-means can achieve exact recovery given the clusters are well-separated under the $2$-Wasserstein metric. Our simulation and real data examples also demonstrate that distance-based $K$-means can achieve better classification performance over the standard centroid-based $K$-means for clustering probability distributions and images.
AIIM Journal 2021 Journal Article
NeurIPS Conference 2021 Conference Paper
Transformers are expensive to train due to the quadratic time and space complexity in the self-attention mechanism. On the other hand, although kernel machines suffer from the same computation bottleneck in pairwise dot products, several approximation schemes have been successfully incorporated to considerably reduce their computational cost without sacrificing too much accuracy. In this work, we leverage the computation methods for kernel machines to alleviate the high computational cost and introduce Skyformer, which replaces the softmax structure with a Gaussian kernel to stabilize the model training and adapts the Nyström method to a non-positive semidefinite matrix to accelerate the computation. We further conduct theoretical analysis by showing that the matrix approximation error of our proposed method is small in the spectral norm. Experiments on Long Range Arena benchmark show that the proposed method is sufficient in getting comparable or even better performance than the full self-attention while requiring fewer computation resources.
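As background for the kernel-machine tooling the abstract refers to, the classical Nyström approximation of a Gaussian kernel matrix can be sketched as below. This is the standard symmetric, positive semidefinite case only; Skyformer's adaptation to a non-PSD matrix is not reproduced here. The function names, the unit bandwidth, and passing the landmark points explicitly are assumptions made for the illustration.

```python
import numpy as np

def gaussian_kernel(a, b):
    # Pairwise Gaussian kernel exp(-||a_i - b_j||^2 / 2) between rows of a and b.
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0))

def nystrom_gaussian(x, landmarks):
    """Low-rank Nystrom approximation of gaussian_kernel(x, x).

    With m landmark rows, only an n x m and an m x m kernel block are
    evaluated instead of the full n x n matrix.
    """
    c = gaussian_kernel(x, landmarks)          # n x m cross-kernel block
    w = gaussian_kernel(landmarks, landmarks)  # m x m landmark block
    # K is approximated by C W^+ C^T, with W^+ the pseudo-inverse.
    return c @ np.linalg.pinv(w) @ c.T
```

The approximation is exact when every point is used as a landmark and degrades gracefully as fewer landmarks are kept, which is what makes it attractive when n is large.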
TIST Journal 2019 Journal Article
As social media technologies continue to grow rapidly, the Cyber-Physical-Social System (CPSS) has become a hot topic in many industrial applications. The use of “microblogging” services, such as Twitter, has rapidly become an influential way to share information. While recent studies have revealed that understanding and modelling microblog user behaviour from massive user data in social media is key to the success of many practical applications in CPSS, a key challenge in the literature is that the diversity of geography and culture in social media strongly affects user behaviour and activity. The motivation of this article is to understand differences and similarities between microblogging users from different countries, and to design a Country-Level Micro-Blog User (CLMB) behaviour and activity model for supporting CPSS applications. We propose a CLMB model for analysing microblogging user behaviour and activity across different countries in CPSS applications. The model considers three important characteristics of user behaviour in microblogging data: the content of microblogging messages, the user emotion index, and the user relationship network. We evaluated the CLMB model on a microblog dataset collected from the 16 countries with the largest numbers of representative and active users in the world. Experimental results show that (1) for some countries with small populations and strong cohesiveness, users pay more attention to the social functionalities of microblogging services; (2) for some countries containing mostly large, loose social groups, users use microblogging services as a news dissemination platform; (3) users in countries whose social network structure exhibits reciprocity rather than hierarchy use more linguistic elements to express happiness in microblogging services.
IJCAI Conference 2019 Conference Paper
Regularities analysis for prescriptions is a significant task for traditional Chinese medicine (TCM), both in inheritance of clinical experience and in improvement of clinical quality. Recently, many methods have been proposed for regularities discovery, but this task is challenging due to the quantity, sparsity and free-style of prescriptions. In this paper, we address the specific problem of regularities discovery and propose a graph embedding based framework for regularities discovery for massive prescriptions. We model this task as a relation prediction in which the correlation of two herbs or of herb and symptom are incorporated to characterize the different relationships. Specifically, we first establish a heterogeneous network with herbs and symptoms as its nodes. We develop a bipartite embedding model termed HS2Vec to detect regularities, which explores multiple relations of herb-herb and herb-symptom based on the heterogeneous network. Experiments on four real-world datasets demonstrate that the proposed framework is very effective for regularities discovery.
JBHI Journal 2018 Journal Article
Driven by the automation technologies and health informatics of Industry 4.0, hospitals in China have deployed complete automation systems and platforms for accessing healthcare services. Lacking Internet knowledge, elderly people usually ask third parties to help them obtain healthcare services from the Web or apps. This leads to an unintended situation in which scalpers grab healthcare service bookings by unfair means in order to resell them to the elderly at a much higher price. Moreover, it is hard for physicians to identify the scalpers due to the complexity and the ad hoc, multi-scenario nature of healthcare processes. In this paper, a novel method is proposed for the identification and creation of user groups of scalpers in mobile healthcare services. The approach utilizes and extends state-of-the-art data analysis approaches on the event logs of the mobile system to identify user groups. Based on the user groups, user profiles are extracted by identifying representative event cases from hierarchical user-event clusters. A comprehensive evaluation is conducted on a selected test set from the event logs of a mobile healthcare app. The results show its accuracy and effectiveness in scalper detection in the mobile healthcare app. Further, a complete case study is deployed in a real-world hospital to ensure its utility, efficacy, and reliability.
TAAS Journal 2007 Journal Article
In grid workflow systems, a checkpoint selection strategy is responsible for selecting checkpoints for conducting temporal verification at the runtime execution stage. Existing representative checkpoint selection strategies often select some unnecessary checkpoints and omit some necessary ones because they cannot adapt to the dynamics and uncertainty of runtime activity completion duration. In this article, based on the dynamics and uncertainty of runtime activity completion duration, we develop a novel checkpoint selection strategy that can adaptively select not only necessary, but also sufficient checkpoints. Specifically, we introduce a new concept of minimum time redundancy as a key reference parameter for checkpoint selection. An important feature of minimum time redundancy is that it can adapt to the dynamics and uncertainty of runtime activity completion duration. We develop a method on how to achieve minimum time redundancy dynamically along grid workflow execution and investigate its relationships with temporal consistency. Based on the method and the relationships, we present our strategy and rigorously prove its necessity and sufficiency. The simulation evaluation further demonstrates experimentally such necessity and sufficiency and its significant improvement on checkpoint selection over other representative strategies.