Arrow Research search

Author name cluster

Shujian Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers

21

TMLR Journal 2026 Journal Article

Continuous Treatment Effect Estimation with Cauchy-Schwarz Divergence Information Bottleneck

  • Louk van Remmerden
  • Shiqin Tang
  • Shujian Yu

Estimating conditional average treatment effects (CATE) for continuous and multivariate treatments remains a fundamental yet underexplored problem in causal inference, as most existing methods are confined to binary treatment settings. In this paper, we make two key theoretical contributions. First, we derive a novel counterfactual error bound based on the Cauchy–Schwarz (CS) divergence, which is provably tighter than prior bounds derived from the Kullback–Leibler (KL) divergence. Second, we strengthen this bound by integrating the Information Bottleneck principle, introducing a compression regularization on latent representations to enhance generalization. Building on these insights, we propose a new neural framework that operationalizes our theory. Extensive experiments on three benchmarks show that our method consistently outperforms state-of-the-art baselines and remains robust under biased treatment assignments.

ICML Conference 2025 Conference Paper

Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders

  • Rogelio Andrade Mancisidor
  • Robert Jenssen
  • Shujian Yu
  • Michael Kampffmeyer

Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). Current methods, namely the product and mixture of experts, aggregate single-modality distributions under a simplifying independence assumption, which is overoptimistic. This research introduces a novel methodology for aggregating single-modality distributions by exploiting the principle of consensus of dependent experts (CoDE), which circumvents the aforementioned assumption. Utilizing the CoDE method, we propose a novel ELBO that approximates the joint likelihood of the multimodal data by learning the contribution of each subset of modalities. The resulting CoDE-VAE model demonstrates better performance in terms of balancing the trade-off between generative coherence and generative quality, as well as generating more precise log-likelihood estimations. CoDE-VAE further minimizes the generative quality gap as the number of modalities increases. In certain cases, it reaches a generative quality similar to that of unimodal VAEs, which is a desirable property that is lacking in most current methods. Finally, the classification accuracy achieved by CoDE-VAE is comparable to that of state-of-the-art multimodal VAE models.

UAI Conference 2025 Conference Paper

InfoDPCCA: Information-Theoretic Dynamic Probabilistic Canonical Correlation Analysis

  • Shiqin Tang
  • Shujian Yu

Extracting meaningful latent representations from high-dimensional sequential data is a crucial challenge in machine learning, with applications spanning natural science and engineering. We introduce InfoDPCCA, a dynamic probabilistic Canonical Correlation Analysis (CCA) framework designed to model two interdependent sequences of observations. InfoDPCCA leverages a novel information-theoretic objective to extract a shared latent representation that captures the mutual structure between the data streams and balances representation compression and predictive sufficiency while also learning separate latent components that encode information specific to each sequence. Unlike prior dynamic CCA models, such as DPCCA, our approach explicitly enforces the shared latent space to encode only the mutual information between the sequences, improving interpretability and robustness. We further introduce a two-step training scheme to bridge the gap between information-theoretic representation learning and generative modeling, along with a residual connection mechanism to enhance training stability. Through experiments on synthetic and medical fMRI data, we demonstrate that InfoDPCCA excels as a tool for representation learning. Code of InfoDPCCA is available at https://github.com/marcusstang/InfoDPCCA.

TMLR Journal 2025 Journal Article

Learning Task-Aware Abstract Representations for Meta-Reinforcement Learning

  • Louk van Remmerden
  • Zhao Yang
  • Shujian Yu
  • Mark Hoogendoorn
  • Vincent Francois-Lavet

A central challenge in meta-reinforcement learning (meta-RL) is enabling agents trained on a set of environments to generalize to new, related tasks without requiring full policy retraining. Existing model-free approaches often rely on context-conditioned policies learned via encoder networks. However, these context encoders are prone to overfitting to the training environments, resulting in poor out-of-sample performance on unseen tasks. To address this issue, we adopt an alternative approach that uses an abstract representation model to learn augmented, task-aware abstract states. We achieve this by introducing a novel architecture that offers greater flexibility than existing recurrent network-based approaches. In addition, we optimize our model with multiple loss terms that encourage predictive, task-aware representations in the abstract state space. Our method simplifies the learning problem and provides a flexible framework that can be readily combined with any off-the-shelf reinforcement learning algorithm. We provide theoretical guarantees alongside empirical results, showing strong generalization performance across classical control and robotic meta-RL benchmarks, on par with state-of-the-art meta-RL methods and significantly better than non-meta RL approaches.

ICLR Conference 2025 Conference Paper

Start Smart: Leveraging Gradients For Enhancing Mask-based XAI Methods

  • Buelent Uendes
  • Shujian Yu
  • Mark Hoogendoorn

Mask-based explanation methods offer a powerful framework for interpreting deep learning model predictions across diverse data modalities, such as images and time series, in which the central idea is to identify an instance-dependent mask that minimizes the performance drop from the resulting masked input. Different objectives for learning such masks have been proposed, all of which, in our view, can be unified under an information-theoretic framework that balances performance degradation of the masked input with the complexity of the resulting masked representation. Typically, these methods initialize the masks either uniformly or as all-ones. In this paper, we argue that an effective mask initialization strategy is as important as the development of novel learning objectives, particularly in light of the significant computational costs associated with existing mask-based explanation methods. To this end, we introduce a new gradient-based initialization technique called StartGrad, which is the first initialization method specifically designed for mask-based post-hoc explainability methods. Compared to commonly used strategies, StartGrad is provably superior at initialization in striking the aforementioned trade-off. Despite its simplicity, our experiments demonstrate that StartGrad enhances the optimization process of various state-of-the-art mask-explanation methods by reaching target metrics faster and, in some cases, boosting their overall performance.

NeurIPS Conference 2024 Conference Paper

BAN: Detecting Backdoors Activated by Adversarial Neuron Noise

  • Xiaoyun Xu
  • Zhuoran Liu
  • Stefanos Koffas
  • Shujian Yu
  • Stjepan Picek

Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient with an average 9.99\% higher detection success rate than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at https://github.com/xiaoyunxxy/ban.

YNIMG Journal 2024 Journal Article

BPI-GNN: Interpretable brain network-based psychiatric diagnosis and subtyping

  • Kaizhong Zheng
  • Shujian Yu
  • Liangjun Chen
  • Lujuan Dang
  • Badong Chen

Converging evidence increasingly suggests that psychiatric disorders, such as major depressive disorder (MDD) and autism spectrum disorder (ASD), are not unitary diseases, but rather heterogeneous syndromes that involve diverse, co-occurring symptoms and divergent responses to treatment. This clinical heterogeneity has hindered the progress of precision diagnosis and treatment effectiveness in psychiatric disorders. In this study, we propose BPI-GNN, a new interpretable graph neural network (GNN) framework for analyzing functional magnetic resonance images (fMRI), by leveraging prototype learning. In addition, we introduce a novel generation process for prototype subgraphs to discover the essential edges of distinct prototypes, and employ total correlation (TC) to ensure the independence of distinct prototype subgraph patterns. BPI-GNN can effectively discriminate psychiatric patients and healthy controls (HC), and identify biologically meaningful subtypes of psychiatric disorders. We evaluate the performance of BPI-GNN against 11 popular brain network classification methods on three psychiatric datasets and observe that our BPI-GNN always achieves the highest diagnosis accuracy. More importantly, we examine differences in clinical symptom profiles and gene expression profiles among the identified subtypes and observe that our identified brain-based subtypes have clinical relevance. BPI-GNN also discovers subtype biomarkers that align with current neuroscientific knowledge.

ICLR Conference 2024 Conference Paper

Cauchy-Schwarz Divergence Information Bottleneck for Regression

  • Shujian Yu
  • Xi Yu
  • Sigurd Løkse
  • Robert Jenssen
  • José C. Príncipe

The information bottleneck (IB) approach is popular for improving the generalization, robustness and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation $\mathbf{t}$ by striking a trade-off between a compression term $I(\mathbf{x};\mathbf{t})$ and a prediction term $I(y;\mathbf{t})$, where $I(\cdot;\cdot)$ refers to the mutual information (MI). For the IB, MI is usually expressed in terms of the Kullback-Leibler (KL) divergence, which in the regression case corresponds to prediction based on a mean squared error (MSE) loss under a Gaussian assumption, with the compression term approximated by variational inference. In this paper, we study the IB principle for the regression problem and develop a new way to parameterize the IB with deep neural networks by exploiting favorable properties of the Cauchy-Schwarz (CS) divergence. By doing so, we move away from MSE-based regression and ease estimation by avoiding variational approximations or distributional assumptions. We investigate the improved generalization ability of our proposed CS-IB and demonstrate strong adversarial robustness guarantees. We demonstrate its superior performance on six real-world regression tasks over other popular deep IB approaches. We additionally observe that the solutions discovered by CS-IB always achieve the best trade-off between prediction accuracy and compression ratio in the information plane. The code is available at \url{https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck}.
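For readers unfamiliar with it, the CS divergence between two densities $p$ and $q$ follows directly from the Cauchy-Schwarz inequality for inner products of square-integrable functions:

```latex
D_{\mathrm{CS}}(p \,\|\, q)
  = -\log \frac{\left( \int p(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x} \right)^{2}}
               {\int p(\mathbf{x})^{2}\, d\mathbf{x} \, \int q(\mathbf{x})^{2}\, d\mathbf{x}}
```

It is nonnegative and vanishes if and only if $p = q$, and each of the three integrals admits a closed-form kernel (Parzen) estimator from samples, which is the property that lets a CS-based objective sidestep variational approximations and Gaussian assumptions.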

UAI Conference 2024 Conference Paper

Domain Adaptation with Cauchy-Schwarz Divergence

  • Wenzhe Yin
  • Shujian Yu
  • Yicong Lin
  • Jie Liu 0043
  • Jan-Jakob Sonke
  • Efstratios Gavves

Domain adaptation aims to use training data from one or multiple source domains to learn a hypothesis that can be generalized to a different, but related, target domain. As such, having a reliable measure for evaluating the discrepancy of both marginal and conditional distributions is crucial. We introduce Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain adaptation (UDA). The CS divergence offers a theoretically tighter generalization error bound than the popular Kullback-Leibler divergence. This holds for the general case of supervised learning, including multi-class classification and regression. Furthermore, we illustrate that the CS divergence enables a simple estimator on the discrepancy of both marginal and conditional distributions between source and target domains in the representation space, without requiring any distributional assumptions. We provide multiple examples to illustrate how the CS divergence can be conveniently used in both distance metric- or adversarial training-based UDA frameworks, resulting in compelling performance. The code of our paper is available at \url{https://github.com/ywzcode/CS-adv}.

ICML Conference 2024 Conference Paper

Jacobian Regularizer-based Neural Granger Causality

  • Wanqi Zhou
  • Shuanghao Bai
  • Shujian Yu
  • Qibin Zhao
  • Badong Chen

With the advancement of neural networks, diverse methods for neural Granger causality have emerged, which demonstrate proficiency in handling complex data and nonlinear relationships. However, the existing framework of neural Granger causality has several limitations. It requires the construction of separate predictive models for each target variable, and the causal relationship is read off the sparsity of the first-layer weights, which makes it difficult to model complex interactions between variables effectively and leads to unsatisfactory estimation accuracy of Granger causality. Moreover, most existing methods cannot capture full-time Granger causality. To address these drawbacks, we propose a Jacobian Regularizer-based Neural Granger Causality (JRNGC) approach, a straightforward yet highly effective method for learning multivariate summary Granger causality and full-time Granger causality by constructing a single model for all target variables. Specifically, our method eliminates the sparsity constraints on weights by leveraging an input-output Jacobian matrix regularizer, which can subsequently be represented as a weighted causal matrix in post-hoc analysis. Extensive experiments show that our proposed approach achieves competitive performance with the state-of-the-art methods for learning summary Granger causality and full-time Granger causality while maintaining lower model complexity and high scalability.
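The Jacobian read-out described in the abstract can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the paper's implementation: a finite-difference Jacobian stands in for automatic differentiation, and the function name `jacobian_causal_matrix` is hypothetical. The idea is simply that entry (i, j) of the input-output Jacobian measures how strongly input variable j drives output variable i, so its magnitude can serve as a weighted causal matrix.

```python
import numpy as np

def jacobian_causal_matrix(f, x, eps=1e-5):
    """Sketch of a Jacobian-based causal read-out for a single
    multivariate predictor f: R^d -> R^d. Returns |J|, where
    J[i, j] = d f_i / d x_j, estimated by forward differences;
    large |J[i, j]| is read as 'variable j influences variable i'."""
    d = x.shape[0]
    J = np.zeros((d, d))
    fx = f(x)
    for j in range(d):
        xp = x.copy()
        xp[j] += eps                      # perturb one input variable
        J[:, j] = (f(xp) - fx) / eps      # finite-difference column
    return np.abs(J)
```

In JRNGC itself the Jacobian is regularized during training so that the resulting matrix is sparse; this sketch only shows the post-hoc read-out step.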

ICLR Conference 2024 Conference Paper

Rethinking Information-theoretic Generalization: Loss Entropy Induced PAC Bounds

  • Yuxin Dong 0003
  • Tieliang Gong
  • Hong Chen 0004
  • Shujian Yu
  • Chen Li 0011

Information-theoretic generalization analysis has achieved astonishing success in characterizing the generalization capabilities of noisy and iterative learning algorithms. However, current advancements are mostly restricted to average-case scenarios and necessitate the stringent bounded loss assumption, leaving a gap with regard to computationally tractable PAC generalization analysis, especially for long-tailed loss distributions. In this paper, we bridge this gap by introducing a novel class of PAC bounds through leveraging loss entropies. These bounds simplify the computation of key information metrics in previous PAC information-theoretic bounds to one-dimensional variables, thereby enhancing computational tractability. Moreover, our data-independent bounds provide novel insights into the generalization behavior of the minimum error entropy criterion, while our data-dependent bounds improve over previous results by alleviating the bounded loss assumption under both leave-one-out and supersample settings. Extensive numerical studies indicate strong correlations between the generalization error and the induced loss entropy, showing that the presented bounds adeptly capture the patterns of the true generalization gap under various learning scenarios.

AAAI Conference 2023 Conference Paper

Causal Recurrent Variational Autoencoder for Medical Time Series Generation

  • Hongming Li
  • Shujian Yu
  • Jose Principe

We propose causal recurrent variational autoencoder (CR-VAE), a novel generative model that is able to learn a Granger causal graph from a multivariate time series x and incorporates the underlying causal mechanism into its data generation process. Distinct from classical recurrent VAEs, our CR-VAE uses a multi-head decoder, in which the p-th head is responsible for generating the p-th dimension of x (i.e., x^p). By imposing a sparsity-inducing penalty on the weights (of the decoder) and encouraging specific sets of weights to be zero, our CR-VAE learns a sparse adjacency matrix that encodes causal relations between all pairs of variables. Thanks to this causal matrix, our decoder strictly obeys the underlying principles of Granger causality, thereby making the data generating process transparent. We develop a two-stage approach to train the overall objective. Empirically, we evaluate the behavior of our model in synthetic data and two real-world human brain datasets involving, respectively, the electroencephalography (EEG) signals and the functional magnetic resonance imaging (fMRI) data. Our model consistently outperforms state-of-the-art time series generative models both qualitatively and quantitatively. Moreover, it also discovers a faithful causal graph with similar or improved accuracy over existing Granger causality-based causal inference methods. Code of CR-VAE is publicly available at https://github.com/hongmingli1995/CR-VAE.
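As a rough illustration of the weight-based read-out (all names and shapes here are hypothetical, not CR-VAE's actual code): once a group-sparsity penalty has driven whole input columns of each decoder head's first-layer weights to zero, the surviving column norms can be thresholded into a Granger-causal adjacency matrix.

```python
import numpy as np

def granger_adjacency(first_layer_weights, threshold=1e-3):
    """Read a Granger-causal adjacency matrix off group-sparse decoder weights.
    first_layer_weights[p] has shape (hidden, num_vars): the first-layer
    weights of the head generating x^p. The norm of column q measures how
    much past values of x^q influence x^p; columns driven to zero by the
    sparsity penalty correspond to absent causal edges q -> p."""
    P = len(first_layer_weights)
    strength = np.zeros((P, P))
    for p, Wp in enumerate(first_layer_weights):
        strength[p] = np.linalg.norm(Wp, axis=0)  # group (column) norms
    return (strength > threshold).astype(int)     # binary adjacency: q -> p
```

The group structure (penalizing whole columns rather than individual weights) is what makes the zero pattern interpretable as absent Granger-causal edges.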

ECAI Conference 2023 Conference Paper

Revisiting the Robustness of the Minimum Error Entropy Criterion: A Transfer Learning Case Study

  • Luis Pedro Silvestrin
  • Shujian Yu
  • Mark Hoogendoorn

Coping with distributional shifts is an important part of transfer learning methods in order to perform well in real-life tasks. However, most of the existing approaches in this area either focus on an ideal scenario in which the data does not contain noise or employ a complicated training paradigm or model design to deal with distributional shifts. In this paper, we revisit the robustness of the minimum error entropy (MEE) criterion, a widely used objective in statistical signal processing for dealing with non-Gaussian noise, and investigate its feasibility and usefulness in real-life transfer learning regression tasks, where distributional shifts are common. Specifically, we put forward a new theoretical result showing the robustness of MEE against covariate shift. We also show that by simply replacing the mean squared error (MSE) loss with the MEE on basic transfer learning algorithms such as fine-tuning and linear probing, we can achieve competitive performance with respect to state-of-the-art transfer learning algorithms. We justify our arguments on both synthetic data and five real-world time-series datasets.

AAAI Conference 2023 Conference Paper

Robust and Fast Measure of Information via Low-Rank Representation

  • Yuxin Dong
  • Tieliang Gong
  • Shujian Yu
  • Hong Chen
  • Chen Li

The matrix-based Rényi's entropy allows us to directly quantify information measures from given data, without explicit estimation of the underlying probability distribution. This intriguing property makes it widely applied in statistical inference and machine learning tasks. However, this information-theoretic quantity is not robust against noise in the data, and is computationally prohibitive in large-scale applications. To address these issues, we propose a novel measure of information, termed low-rank matrix-based Rényi's entropy, based on low-rank representations of infinitely divisible kernel matrices. The proposed entropy functional inherits the ability of the original definition to directly quantify information from data, but enjoys additional advantages including robustness and efficient computation. Specifically, our low-rank variant is more sensitive to informative perturbations induced by changes in underlying distributions, while being insensitive to uninformative ones caused by noise. Moreover, low-rank Rényi's entropy can be efficiently approximated by random projection and Lanczos iteration techniques, reducing the overall complexity from O(n³) to O(n²s) or even O(ns²), where n is the number of data samples and s ≪ n. We conduct large-scale experiments to evaluate the effectiveness of this new information measure, demonstrating superior results compared to matrix-based Rényi's entropy in terms of both performance and computational efficiency.
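The base quantity being accelerated can be sketched in a few lines. This is a naive full-eigendecomposition version, for intuition only; the paper's contribution is precisely to avoid this O(n³) step via low-rank approximation, and the Gaussian kernel and function name are illustrative choices.

```python
import numpy as np

def matrix_renyi_entropy(X, alpha=2.0, sigma=1.0):
    """Matrix-based Renyi's alpha-entropy of a sample X (n x d):
    S_alpha(A) = (1 / (1 - alpha)) * log2( sum_i lambda_i(A)^alpha ),
    where A is the trace-normalized Gram matrix of an infinitely
    divisible kernel (here a Gaussian/RBF kernel with bandwidth sigma)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))     # Gaussian Gram matrix
    A = K / np.trace(K)                      # eigenvalues now sum to 1
    lam = np.linalg.eigvalsh(A)              # O(n^3): the costly step
    lam = np.clip(lam, 0.0, None)            # guard tiny negative round-off
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)
```

For n well-separated points the Gram matrix is nearly the identity and the entropy approaches log2(n) bits, while n identical points give zero entropy; the low-rank variant replaces the eigendecomposition with random projection or Lanczos iterations over an s-dimensional subspace.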

AAAI Conference 2023 Conference Paper

The Analysis of Deep Neural Networks by Information Theory: From Explainability to Generalization

  • Shujian Yu

Despite their great success in many artificial intelligence tasks, deep neural networks (DNNs) still suffer from a few limitations, such as poor generalization behavior for out-of-distribution (OOD) data and the "black-box" nature. Information theory offers fresh insights to solve these challenges. In this short paper, we briefly review the recent developments in this area, and highlight our contributions.

AAAI Conference 2022 Conference Paper

Learning to Transfer with von Neumann Conditional Divergence

  • Ammar Shaker
  • Shujian Yu
  • Daniel Oñoro-Rubio

The similarity of feature representations plays a pivotal role in the success of problems related to domain adaptation. Feature similarity includes both the invariance of marginal distributions and the closeness of conditional distributions given the desired response y (e.g., class labels). Unfortunately, traditional methods typically learn such features without fully taking into consideration the information in y, which in turn may lead to a mismatch of the conditional distributions or the mixing of discriminative structures underlying the data distributions. In this work, we introduce the recently proposed von Neumann conditional divergence to improve the transferability across multiple domains. We show that this new divergence is differentiable and well suited to quantify the functional dependence between features and y. Given multiple source tasks, we integrate this divergence to capture discriminative information in y and design novel learning objectives assuming those source tasks are observed either simultaneously or sequentially. In both scenarios, we obtain favorable performance against state-of-the-art methods in terms of smaller generalization error on new tasks and less catastrophic forgetting on source tasks (in the sequential setup).

UAI Conference 2022 Conference Paper

Principle of relevant information for graph sparsification

  • Shujian Yu
  • Francesco Alesiani
  • Wenzhe Yin
  • Robert Jenssen
  • José C. Príncipe

Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable setting to structured data (i.e., graphs). Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector w. We provide both theoretical and empirical justifications on the validity of our Graph-PRI approach. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques. Code of Graph-PRI is available at https://github.com/SJYuCNEL/PRI-Graphs.

IJCAI Conference 2021 Conference Paper

Information-Theoretic Methods in Deep Neural Networks: Recent Advances and Emerging Opportunities

  • Shujian Yu
  • Luis Sanchez Giraldo
  • Jose Principe

We present a review on the recent advances and emerging opportunities around the theme of analyzing deep neural networks (DNNs) with information-theoretic methods. We first discuss popular information-theoretic quantities and their estimators. We then introduce recent developments on information-theoretic learning principles (e.g., loss functions, regularizers and objectives) and their parameterization with DNNs. We finally briefly review current usages of information-theoretic concepts in a few modern machine learning problems and list a few emerging opportunities.

AAAI Conference 2021 Conference Paper

Measuring Dependence with Matrix-based Entropy Functional

  • Shujian Yu
  • Francesco Alesiani
  • Xi Yu
  • Robert Jenssen
  • Jose Principe

Measuring the dependence of data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea of existing information-theoretic dependence measures into a higher-level perspective via Shearer's inequality. Based on our generalization, we then propose two measures, namely the matrix-based normalized total correlation and the matrix-based normalized dual total correlation, to quantify the dependence of multiple variables in arbitrary dimensional space, without explicit estimation of the underlying data distributions. We show that our measures are differentiable and statistically more powerful than prevalent ones. We also show the impact of our measures in four different machine learning problems, namely the gene regulatory network inference, the robust machine learning under covariate shift and non-Gaussian noises, the subspace outlier detection, and the understanding of the learning dynamics of convolutional neural networks, to demonstrate their utilities, advantages, as well as implications to those problems.

IJCAI Conference 2020 Conference Paper

Measuring the Discrepancy between Conditional Distributions: Methods, Properties and Applications

  • Shujian Yu
  • Ammar Shaker
  • Francesco Alesiani
  • Jose Principe

We propose a simple yet powerful test statistic to quantify the discrepancy between two conditional distributions. The new statistic avoids the explicit estimation of the underlying distributions in high-dimensional space and it operates on the cone of symmetric positive semidefinite (SPS) matrices using the Bregman matrix divergence. Moreover, it inherits the merits of the correntropy function to explicitly incorporate high-order statistics in the data. We present the properties of our new statistic and illustrate its connections to prior art. We finally show the applications of our new statistic on three different machine learning problems, namely the multi-task learning over graphs, the concept drift detection, and the information-theoretic feature selection, to demonstrate its utility and advantage. Code of our statistic is available at https://bit.ly/BregmanCorrentropy.

IJCAI Conference 2018 Conference Paper

Request-and-Reverify: Hierarchical Hypothesis Testing for Concept Drift Detection with Expensive Labels

  • Shujian Yu
  • Xiaoyang Wang
  • José C. Príncipe

One important assumption underlying common classification models is the stationarity of the data. However, in real-world streaming applications, the data concept indicated by the joint distribution of feature and label is not stationary but drifting over time. Concept drift detection aims to detect such drifts and adapt the model so as to mitigate any deterioration in the model's predictive performance. Unfortunately, most existing concept drift detection methods rely on a strong and over-optimistic condition that the true labels are available immediately for all already classified instances. In this paper, a novel Hierarchical Hypothesis Testing framework with Request-and-Reverify strategy is developed to detect concept drifts by requesting labels only when necessary. Two methods, namely Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with Attribute-wise "Goodness-of-fit" (HHT-AG), are proposed respectively under the novel framework. In experiments with benchmark datasets, our methods demonstrate overwhelming advantages over state-of-the-art unsupervised drift detectors. More importantly, our methods even outperform DDM (the widely used supervised drift detector) when we use significantly fewer labels.