Arrow Research search

Author name cluster

Nhat Ho

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

73 papers
2 author rows

Possible papers

73

AAAI Conference 2025 Conference Paper

Attack on Prompt: Backdoor Attack in Prompt-Based Continual Learning

  • Trang Nguyen
  • Anh Tran
  • Nhat Ho

Prompt-based approaches offer a cutting-edge solution to data privacy issues in continual learning, particularly in scenarios involving multiple data suppliers where long-term storage of private user data is prohibited. Despite delivering state-of-the-art performance, its impressive remembering capability can become a double-edged sword, raising security concerns as it might inadvertently retain poisoned knowledge injected during learning from private user data. Following this insight, in this paper, we expose continual learning to a potential threat: backdoor attack, which drives the model to follow a desired adversarial target whenever a specific trigger is present while still performing normally on clean samples. We highlight three critical challenges in executing backdoor attacks on incremental learners and propose corresponding solutions: (1) Transferability: We employ a surrogate dataset and manipulate prompt selection to transfer backdoor knowledge to data from other suppliers; (2) Resiliency: We simulate static and dynamic states of the victim to ensure the backdoor trigger remains robust during intense incremental learning processes; and (3) Authenticity: We apply binary cross-entropy loss as an anti-cheating factor to prevent the backdoor trigger from devolving into adversarial noise. Extensive experiments across various benchmark datasets and continual learners validate our continual backdoor framework, with further ablation studies confirming our contributions' effectiveness.

NeurIPS Conference 2025 Conference Paper

ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models

  • Duy M. H. Nguyen
  • Nghiem Diep
  • Trung Nguyen
  • Hoang-Bao Le
  • Tai Nguyen
  • Anh-Tien Nguyen
  • TrungTin Nguyen
  • Nhat Ho

State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLaVA-Med and BioMedGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce ExGra-Med, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e. g. , LLaMa-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, ExGra-Med matches LLaVA-Med’s performance using just 10\% of pre-training data, achieving a 20. 13\% gain on VQA-RAD and approaching full-data performance. It also outperforms strong baselines like BioMedGPT and RadFM on visual chatbot and zero-shot classification tasks, demonstrating its promise for efficient, high-quality vision-language integration in medical AI.

ICML Conference 2025 Conference Paper

Improving Generalization with Flat Hilbert Bayesian Inference

  • Tuan Truong
  • Quyen Tran
  • Ngoc-Quan Pham
  • Nhat Ho
  • Dinh Q. Phung
  • Trung Le 0001

We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within the reproducing kernel Hilbert spaces. This methodology is supported by a theoretical analysis that extends previous findings on generalization ability from finite-dimensional Euclidean spaces to infinite-dimensional functional spaces. To evaluate the effectiveness of FHBI, we conduct comprehensive comparisons against nine baseline methods on the VTAB-1K benchmark, which encompasses 19 diverse datasets across various domains with diverse semantics. Empirical results demonstrate that FHBI consistently outperforms the baselines by notable margins, highlighting its practical efficacy.

JMLR Journal 2025 Journal Article

Instability, Computational Efficiency and Statistical Accuracy

  • Nhat Ho
  • Koulik Khamaru
  • Raaz Dwivedi
  • Martin J. Wainwright
  • Michael I. Jordan
  • Bin Yu

Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (in)stability when applied to an empirical object based on $n$ samples. Using this framework, we analyze both stable forms of gradient descent and some higher-order and unstable algorithms, including Newton's method and its cubic-regularized variant, as well as the EM algorithm. We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models. We exhibit cases in which an unstable algorithm can achieve the same statistical accuracy as a stable algorithm in exponentially fewer steps---namely, with the number of iterations being reduced from polynomial to logarithmic in sample size $n$. [abs] [ pdf ][ bib ] &copy JMLR 2025. ( edit, beta )

ICML Conference 2025 Conference Paper

Lightspeed Geometric Dataset Distance via Sliced Optimal Transport

  • Khai Nguyen
  • Hai Nguyen
  • Tuan Pham
  • Nhat Ho

We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.

TMLR Journal 2025 Journal Article

MGPATH: A Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot Whole Slide Pathology Classification

  • Anh-Tien Nguyen
  • Duy Minh Ho Nguyen
  • Nghiem Tuong Diep
  • Trung Quoc Nguyen
  • Nhat Ho
  • Jacqueline Michelle Metsch
  • Miriam Cindy Maurer
  • Daniel Sonntag

Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model’s ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at this https://github.com/HauschildLab/MGPATH.

NeurIPS Conference 2025 Conference Paper

On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts

  • Fanqi Yan
  • Huy Nguyen
  • Le Dung
  • Pedram Akbarian
  • Nhat Ho
  • Alessandro Rinaldo

The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for learning downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In the paper, we study the convergence rates of the maximum likelihood estimator of gating and prompt parameters in order to gain insights into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model, in the sense that we make precise by formulating a novel analytic notion of distinguishability. Under distinguishability of the pre-trained and prompt models, we derive minimax optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower due to their dependence on the prompt convergence rate to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.

ICML Conference 2025 Conference Paper

On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

  • Nghiem Tuong Diep
  • Huy Nguyen
  • Chau Nguyen
  • Minh Le
  • Duy Minh Ho Nguyen
  • Daniel Sonntag
  • Mathias Niepert
  • Nhat Ho

LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.

ICML Conference 2025 Conference Paper

RepLoRA: Reparameterizing Low-rank Adaptation via the Perspective of Mixture of Experts

  • Tuan Truong
  • Chau Nguyen
  • Huy Nguyen
  • Minh Le
  • Trung Le 0001
  • Nhat Ho

Low-rank Adaptation (LoRA) has emerged as a powerful and efficient method for fine-tuning large-scale foundation models. Despite its popularity, the theoretical understanding of LoRA has remained underexplored. In this paper, we present a theoretical analysis of LoRA by examining its connection to the Mixture of Experts models. Under this framework, we show that a simple technique, reparameterizing LoRA matrices, can notably accelerate the low-rank matrix estimation process. In particular, we prove that reparameterization can reduce the data needed to achieve a desired estimation error from an exponential to a polynomial scale. Motivated by this insight, we propose Rep arameterized Lo w- R ank A daptation (RepLoRA), incorporating a lightweight MLP to reparameterize the LoRA matrices. Extensive experiments across multiple domains demonstrate that RepLoRA consistently outperforms vanilla LoRA. With limited data, RepLoRA surpasses LoRA by a substantial margin of up to 40. 0% and achieves LoRA’s performance using only 30. 0% of the training data, highlighting the theoretical and empirical robustness of our PEFT method.

ICLR Conference 2025 Conference Paper

Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts

  • Minh Le
  • Chau Nguyen
  • Huy Nguyen
  • Quyen Tran
  • Trung Le 0001
  • Nhat Ho

Prompt-based techniques, such as prompt-tuning and prefix-tuning, have gained prominence for their efficiency in fine-tuning large pre-trained models. Despite their widespread adoption, the theoretical foundations of these methods remain limited. For instance, in prefix-tuning, we observe that a key factor in achieving performance parity with full fine-tuning lies in the reparameterization strategy. However, the theoretical principles underpinning the effectiveness of this approach have yet to be thoroughly examined. Our study demonstrates that reparameterization is not merely an engineering trick but is grounded in deep theoretical foundations. Specifically, we show that the reparameterization strategy implicitly encodes a shared structure between prefix key and value vectors. Building on recent insights into the connection between prefix-tuning and mixture of experts models, we further illustrate that this shared structure significantly improves sample efficiency in parameter estimation compared to non-shared alternatives. The effectiveness of prefix-tuning across diverse tasks is empirically confirmed to be enhanced by the shared structure, through extensive experiments in both visual and language domains. Additionally, we uncover similar structural benefits in prompt-tuning, offering new perspectives on its success. Our findings provide theoretical and empirical contributions, advancing the understanding of prompt-based methods and their underlying mechanisms.

ICLR Conference 2025 Conference Paper

Statistical Advantages of Perturbing Cosine Router in Mixture of Experts

  • Huy Nguyen
  • Pedram Akbarian
  • Huyen Trang Pham
  • Thien Trang Nguyen Vu
  • Shujian Zhang
  • Nhat Ho

The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical success, a comprehensive analysis of the cosine router in MoE has been lacking. Considering the least square estimation of the cosine routing MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$ where $\tau > 0$ is some constant and $n$ is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by the widely used technique in practice to stabilize the cosine router --- simply adding noises to the $\ell^2$-norms in the cosine router, which we refer to as *perturbed cosine router*. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.

ICLR Conference 2025 Conference Paper

Towards Marginal Fairness Sliced Wasserstein Barycenter

  • Khai Nguyen
  • Hai Nguyen
  • Nhat Ho

The Sliced Wasserstein barycenter (SWB) is a widely acknowledged method for efficiently generalizing the averaging operation within probability measure spaces. However, achieving marginal fairness SWB, ensuring approximately equal distances from the barycenter to marginals, remains unexplored. The uniform weighted SWB is not necessarily the optimal choice to obtain the desired marginal fairness barycenter due to the heterogeneous structure of marginals and the non-optimality of the optimization. As the first attempt to tackle the problem, we define the marginal fairness sliced Wasserstein barycenter (MFSWB) as a constrained SWB problem. Due to the computational disadvantages of the formal definition, we propose two hyperparameter-free and computationally tractable surrogate MFSWB problems that implicitly minimize the distances to marginals and encourage marginal fairness at the same time. To further improve the efficiency, we perform slicing distribution selection and obtain the third surrogate definition by introducing a new slicing distribution that focuses more on marginally unfair projecting directions. We discuss the relationship of the three proposed problems and their relationship to sliced multi-marginal Wasserstein distance. Finally, we conduct experiments on finding 3D point-clouds averaging, color harmonization, and training of sliced Wasserstein autoencoder with class-fairness representation to show the favorable performance of the proposed surrogate MFSWB problems.

ICLR Conference 2025 Conference Paper

X-Drive: Cross-modality Consistent Multi-Sensor Data Synthesis for Driving Scenarios

  • Yichen Xie 0002
  • Chenfeng Xu
  • Chensheng Peng
  • Shuqi Zhao
  • Nhat Ho
  • Alexander T. Pham
  • Mingyu Ding
  • Masayoshi Tomizuka

Recent advancements have exploited diffusion models for the synthesis of either LiDAR point clouds or camera image data in driving scenarios. Despite their success in modeling single-modality data marginal distribution, there is an under- exploration in the mutual reliance between different modalities to describe com- plex driving scenes. To fill in this gap, we propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images via a dual- branch latent diffusion model architecture. Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions from the other modality, ensuring better alignment and realism. To further handle the spatial ambiguity during denoising, we design the cross-modality condition module based on epipolar lines to adaptively learn the cross-modality local correspondence. Besides, X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding box, image, and point clouds. Extensive results demonstrate the high-fidelity synthetic results of X-DRIVE for both point clouds and multi-view images, adhering to input conditions while ensuring reliable cross-modality consistency. Our code will be made publicly available at https://github.com/yichen928/X-Drive.

NeurIPS Conference 2024 Conference Paper

A Bayesian Approach for Personalized Federated Learning in Heterogeneous Settings

  • Disha Makhija
  • Joydeep Ghosh
  • Nhat Ho

Federated learning (FL), through its privacy-preserving collaborative learning approach, has significantly empowered decentralized devices. However, constraints in either data and/or computational resources among participating clients introduce several challenges in learning, including the inability to train large model architectures, heightened risks of overfitting, and more. In this work, we present a novel FL framework grounded in Bayesian learning to address these challenges. Our approach involves training personalized Bayesian models at each client tailored to the unique complexities of the clients' datasets and efficiently collaborating across these clients. By leveraging Bayesian neural networks and their uncertainty quantification capabilities, our local training procedure robustly learns from small datasets. And the novel collaboration procedure utilizing priors in the functional (output) space of the networks facilitates collaboration across models of varying sizes, enabling the framework to adapt well in heterogeneous data and computational settings. Furthermore, we present a differentially private version of the algorithm, accompanied by formal differential privacy guarantees that apply without any assumptions on the learning algorithm. Through experiments on popular FL datasets, we demonstrate that our approach outperforms strong baselines in both homogeneous and heterogeneous settings, and under strict privacy constraints.

ICML Conference 2024 Conference Paper

A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts

  • Huy Nguyen
  • Pedram Akbarian
  • TrungTin Nguyen
  • Nhat Ho

Mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the Gaussian MoE model, such analysis under the setting of a classification problem has remained missing in the literature. We close this gap by establishing the convergence rates of density estimation and parameter estimation in the softmax gating multinomial logistic MoE model. Notably, when part of the expert parameters vanish, these rates are shown to be slower than polynomial rates owing to an inherent interaction between the softmax gating and expert functions via partial differential equations. To address this issue, we propose using a novel class of modified softmax gating functions which transform the input before delivering them to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.

NeurIPS Conference 2024 Conference Paper

Bayesian Nonparametrics Meets Data-Driven Distributionally Robust Optimization

  • Nicola Bariletto
  • Nhat Ho

Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights from Bayesian nonparametric (i. e. , Dirichlet process) theory and a recent decision-theoretic model of smooth ambiguity-averse preferences. First, we highlight novel connections with standard regularized empirical risk minimization techniques, among which Ridge and LASSO regressions. Then, we theoretically demonstrate the existence of favorable finite-sample and asymptotic statistical guarantees on the performance of the robust optimization procedure. For practical implementation, we propose and study tractable approximations of the criterion based on well-known Dirichlet process representations. We also show that the smoothness of the criterion naturally leads to standard gradient-based numerical optimization. Finally, we provide insights into the workings of our method by applying it to a variety of tasks based on simulated and real datasets.

ICLR Conference 2024 Conference Paper

Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders

  • Hien Dang 0003
  • Tho Tran Huu
  • Tan Minh Nguyen
  • Nhat Ho

The posterior collapse phenomenon in variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables. As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations as input to the reconstruction process in the decoder. While this phenomenon has been an actively addressed topic related to VAE performance, the theory for posterior collapse remains underdeveloped, especially beyond the standard VAE. In this work, we advance the theoretical understanding of posterior collapse to two important and prevalent yet less studied classes of VAE: conditional VAE and hierarchical VAE. Specifically, via a non-trivial theoretical analysis of linear conditional VAE and hierarchical VAE with two levels of latent, we prove that the cause of posterior collapses in these models includes the correlation between the input and output of the conditional VAE and the effect of learnable encoder variance in the hierarchical VAE. We empirically validate our theoretical findings for linear conditional and hierarchical VAE and demonstrate that these results are also predictive for non-linear cases with extensive experiments.

ICLR Conference 2024 Conference Paper

Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction

  • Thanh-Tung Le
  • Khai Nguyen
  • Shanlin Sun
  • Kun Han
  • Nhat Ho
  • Xiaohui Xie

Mesh deformation plays a pivotal role in many 3D vision tasks including dynamic simulations, rendering, and reconstruction. However, defining an efficient discrepancy between predicted and target meshes remains an open problem. A prevalent approach in current deep learning is the set-based approach which measures the discrepancy between two surfaces by comparing two randomly sampled point-clouds from the two meshes with Chamfer pseudo-distance. Nevertheless, the set-based approach still has limitations such as lacking a theoretical guarantee for choosing the number of points in sampled point-clouds, and the pseudo-metricity and the quadratic complexity of the Chamfer divergence. To address these issues, we propose a novel metric for learning mesh deformation. The metric is defined by sliced Wasserstein distance on meshes represented as probability measures that generalize the set-based approach. By leveraging probability measure space, we gain flexibility in encoding meshes using diverse forms of probability measures, such as continuous, empirical, and discrete measures via \textit{varifold} representation. After having encoded probability measures, we can compare meshes by using the sliced Wasserstein distance which is an effective optimal transport distance with linear computational complexity and can provide a fast statistical rate for approximating the surface of meshes. To the end, we employ a neural ordinary differential equation (ODE) to deform the input surface into the target shape by modeling the trajectories of the points on the surface. Our experiments on cortical surface reconstruction demonstrate that our approach surpasses other competing methods in multiple datasets and metrics.

NeurIPS Conference 2024 Conference Paper

FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion

  • Xing Han
  • Huy Nguyen
  • Carl Harris
  • Nhat Ho
  • Suchi Saria

As machine learning models in critical fields increasingly grapple with multimodal data, they face the dual challenges of handling a wide array of modalities, often incomplete due to missing elements, and the temporal irregularity and sparsity of collected samples. Successfully leveraging this complex data, while overcoming the scarcity of high-quality training samples, is key to improving these models' predictive performance. We introduce ``FuseMoE'', a mixture-of-experts framework incorporated with an innovative gating function. Designed to integrate a diverse number of modalities, FuseMoE is effective in managing scenarios with missing modalities and irregularly sampled data trajectories. Theoretically, our unique gating function contributes to enhanced convergence rates, leading to better performance in multiple downstream tasks. The practical utility of FuseMoE in the real world is validated by a diverse set of challenging prediction tasks.

NeurIPS Conference 2024 Conference Paper

Hierarchical Hybrid Sliced Wasserstein: A Scalable Metric for Heterogeneous Joint Distributions

  • Khai Nguyen
  • Nhat Ho

Sliced Wasserstein (SW) and Generalized Sliced Wasserstein (GSW) have been widely used in applications due to their computational and statistical scalability. However, the SW and the GSW are only defined between distributions supported on a homogeneous domain. This limitation prevents their usage in applications with heterogeneous joint distributions with marginal distributions supported on multiple different domains. Using SW and GSW directly on the joint domains cannot make a meaningful comparison since their homogeneous slicing operator, i. e. , Radon Transform (RT) and Generalized Radon Transform (GRT) are not expressive enough to capture the structure of the joint supports set. To address the issue, we propose two new slicing operators, i. e. , Partial Generalized Radon Transform (PGRT) and Hierarchical Hybrid Radon Transform (HHRT). In greater detail, PGRT is the generalization of Partial Radon Transform (PRT), which transforms a subset of function arguments non-linearly while HHRT is the composition of PRT and multiple domain-specific PGRT on marginal domain arguments. By using HHRT, we extend the SW into Hierarchical Hybrid Sliced Wasserstein (H2SW) distance which is designed specifically for comparing heterogeneous joint distributions. We then discuss the topological, statistical, and computational properties of H2SW. Finally, we demonstrate the favorable performance of H2SW in 3D mesh deformation, deep 3D mesh autoencoders, and datasets comparison.

ICML Conference 2024 Conference Paper

Improving Computational Complexity in Statistical Models with Local Curvature Information

  • Pedram Akbarian
  • Tongzheng Ren
  • Jiacheng Zhuo
  • Sujay Sanghavi
  • Nhat Ho

It is known that when the statistical models are singular, i. e. , the Fisher information matrix at the true parameter is degenerate, the fixed step-size gradient descent algorithm takes polynomial number of steps in terms of the sample size $n$ to converge to a final statistical radius around the true parameter, which can be unsatisfactory for the practical application. To further improve that computational complexity, we consider utilizing the local curvature information for parameter estimation. Even though there is a rich literature in using the local curvature information for optimization, the statistical rate of these methods in statistical models, to the best of our knowledge, has not been studied rigorously. The major challenge of this problem is due to the non-convex nature of sample loss function. To shed light on these problems, we specifically study the normalized gradient descent (NormGD) algorithm, a variant of gradient descent algorithm whose step size is scaled by the maximum eigenvalue of the Hessian matrix of the empirical loss function, and deal with the aforementioned issue with a population-to-sample analysis. When the population loss function is homogeneous, the NormGD iterates reach a final statistical radius around the true parameter after a logarithmic number of iterations in terms of $n$. Therefore, for fixed dimension $d$, the NormGD algorithm achieves the optimal computational complexity $\mathcal{O}(n)$ to reach the final statistical radius, which is cheaper than the complexity $\mathcal{O}(n^{\tau})$ of the fixed step-size gradient descent algorithm for some $\tau > 1$.

ICML Conference 2024 Conference Paper

Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts?

  • Huy Nguyen
  • Pedram Akbarian
  • Nhat Ho

Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to a well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabilize the expert specialization. Nevertheless, while there are previous attempts to theoretically comprehend the sparse MoE, a comprehensive analysis of the dense-to-sparse gating MoE has remained elusive. Therefore, we aim to explore the impacts of the dense-to-sparse gate on the maximum likelihood estimation under the Gaussian MoE in this paper. We demonstrate that due to interactions between the temperature and other model parameters via some partial differential equations, the convergence rates of parameter estimations are slower than any polynomial rates, and could be as slow as $\mathcal{O}(1/\log(n))$, where $n$ denotes the sample size. To address this issue, we propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer to an activation function before delivering them to the softmax function. By imposing linearly independence conditions on the activation function and its derivatives, we show that the parameter estimation rates are significantly improved to polynomial rates. Finally, we conduct a simulation study to empirically validate our theoretical results.

NeurIPS Conference 2024 Conference Paper

Mixture of Experts Meets Prompt-Based Continual Learning

  • Minh Le
  • An Nguyen
  • Huy Nguyen
  • Trang Nguyen
  • Trang Pham
  • Linh Van Ngo
  • Nhat Ho

Exploiting the power of pre-trained models, prompt-based approaches stand out compared to other continual learning solutions in effectively preventing catastrophic forgetting, even with very few learnable parameters and without the need for a memory buffer. While existing prompt-based continual learning methods excel in leveraging prompts for state-of-the-art performance, they often lack a theoretical explanation for the effectiveness of prompting. This paper conducts a theoretical analysis to unravel how prompts bestow such advantages in continual learning, thus offering a new perspective on prompt design. We first show that the attention block of pre-trained models like Vision Transformers inherently encodes a special mixture of experts architecture, characterized by linear experts and quadratic gating score functions. This realization drives us to provide a novel view on prefix tuning, reframing it as the addition of new task-specific experts, thereby inspiring the design of a novel gating mechanism termed Non-linear Residual Gates (NoRGa). Through the incorporation of non-linear activation and residual connection, NoRGa enhances continual learning performance while preserving parameter efficiency. The effectiveness of NoRGa is substantiated both theoretically and empirically across diverse benchmarks and pretraining paradigms. Our code is publicly available at https: //github. com/Minhchuyentoancbn/MoE_PromptCL.

ICML Conference 2024 Conference Paper

Neural Collapse for Cross-entropy Class-Imbalanced Learning with Unconstrained ReLU Features Model

  • Hien Dang 0003
  • Tho Tran Huu
  • Tan Minh Nguyen
  • Nhat Ho

The current paradigm of training deep neural networks for classification tasks includes minimizing the empirical risk, pushing the training loss value towards zero even after the training classification error has vanished. In this terminal phase of training, it has been observed that the last-layer features collapse to their class-means and these class-means converge to the vertices of a simplex Equiangular Tight Frame (ETF). This phenomenon is termed as Neural Collapse ($\mathcal{NC}$). However, this characterization only holds in class-balanced datasets where every class has the same number of training samples. When the training dataset is class-imbalanced, some $\mathcal{NC}$ properties will no longer hold true, for example, the geometry of class-means will skew away from the simplex ETF. In this paper, we generalize $\mathcal{NC}$ to imbalanced regime for cross-entropy loss under the unconstrained ReLU features model. We demonstrate that while the within-class features collapse property still holds in this setting, the class-means will converge to a structure consisting of orthogonal vectors with lengths dependent on the number of training samples. Furthermore, we find that the classifier weights (i. e. , the last-layer linear classifier) are aligned to the scaled and centered class-means, with scaling factors dependent on the number of training samples of each class. This generalizes $\mathcal{NC}$ in the class-balanced setting. We empirically validate our results through experiments on practical architectures and dataset.

ICML Conference 2024 Conference Paper

On Least Square Estimation in Softmax Gating Mixture of Experts

  • Huy Nguyen
  • Nhat Ho
  • Alessandro Rinaldo

Mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous theoretical works have primarily focused on probabilistic MoE models by imposing the impractical assumption that the data are generated from a Gaussian MoE model. In this work, we investigate the performance of the least squares estimators (LSE) under a deterministic MoE model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed forward networks with activation functions $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than those of polynomial experts, which we show to exhibit a surprising slow estimation rate. Our findings have important practical implications for expert selection.

JMLR Journal 2024 Journal Article

On the Computational and Statistical Complexity of Over-parameterized Matrix Sensing

  • Jiacheng Zhuo
  • Jeongyeol Kwon
  • Nhat Ho
  • Constantine Caramanis

We consider solving the low-rank matrix sensing problem with the Factorized Gradient Descent (FGD) method when the specified rank is larger than the true rank. We refer to this as over-parameterized matrix sensing. If the ground truth signal $\mathbf{X}^* \in \mathbb{R}^{d \times d}$ is of rank $r$, but we try to recover it using $\mathbf{F} \mathbf{F}^\top$ where $\mathbf{F} \in \mathbb{R}^{d \times k}$ and $k>r$, the existing statistical analysis either no longer holds or produces a vacuous statistical error upper bound (infinity) due to the flat local curvature of the loss function around the global maxima. By decomposing the factorized matrix $\mathbf{F}$ into separate column spaces to capture the impact of using $k > r$, we show that $\left\| {\mathbf{F}_t \mathbf{F}_t - \mathbf{X}^*} \right\|_F^2$ converges sub-linearly to a statistical error of $\tilde{\mathcal{O}} (k d \sigma^2/n)$ after $\tilde{\mathcal{O}}(\frac{\sigma_{r}}{\sigma}\sqrt{\frac{n}{d}})$ iterations, where $\mathbf{F}_t$ is the output of FGD after $t$ iterations, $\sigma^2$ is the variance of the observation noise, $\sigma_{r}$ is the $r$-th largest eigenvalue of $\mathbf{X}^*$, and $n$ is the number of samples. With a precise characterization of the convergence behavior and the statistical error, our results, therefore, offer a comprehensive picture of the statistical and computational complexity if we solve the over-parameterized matrix sensing problem with FGD. [abs] [ pdf ][ bib ] &copy JMLR 2024. ( edit, beta )

ICLR Conference 2024 Conference Paper

Quasi-Monte Carlo for 3D Sliced Wasserstein

  • Khai Nguyen
  • Nicola Bariletto
  • Nhat Ho

Monte Carlo (MC) integration has been employed as the standard approximation method for the Sliced Wasserstein (SW) distance, whose analytical expression involves an intractable expectation. However, MC integration is not optimal in terms of absolute approximation error. To provide a better class of empirical SW, we propose quasi-sliced Wasserstein (QSW) approximations that rely on Quasi-Monte Carlo (QMC) methods. For a comprehensive investigation of QMC for SW, we focus on the 3D setting, specifically computing the SW between probability measures in three dimensions. In greater detail, we empirically evaluate various methods to construct QMC point sets on the 3D unit-hypersphere, including the Gaussian-based and equal area mappings, generalized spiral points, and optimizing discrepancy energies. Furthermore, to obtain an unbiased estimator for stochastic optimization, we extend QSW to Randomized Quasi-Sliced Wasserstein (RQSW) by introducing randomness in the discussed point sets. Theoretically, we prove the asymptotic convergence of QSW and the unbiasedness of RQSW. Finally, we conduct experiments on various 3D tasks, such as point-cloud comparison, point-cloud interpolation, image style transfer, and training deep point-cloud autoencoders, to demonstrate the favorable performance of the proposed QSW and RQSW variants.

ICLR Conference 2024 Conference Paper

Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation

  • Manh Luong
  • Khai Nguyen
  • Nhat Ho
  • Gholamreza Haffari
  • Dinh Q. Phung
  • Lizhen Qu

The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. In adapting LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned data pairs in training data. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning rich and expressive joint embedding space, which achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap across audio and text embedding, which surpasses both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in training data on the AudioCaps dataset. Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval

NeurIPS Conference 2024 Conference Paper

Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts

  • Huy Nguyen
  • Nhat Ho
  • Alessandro Rinaldo

The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, the softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in current literature. In this paper, we verify theoretically that the sigmoid gating, in fact, enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator under the over-specified case in which the number of fitted experts is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate an identifiability condition for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activation such as $\mathrm{ReLU}$ and $\mathrm{GELU}$ enjoy faster convergence rates under the sigmoid gating than those under softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same error of expert estimation and, therefore, is more sample efficient.

ICLR Conference 2024 Conference Paper

Sliced Wasserstein Estimation with Control Variates

  • Khai Nguyen
  • Nhat Ho

The sliced Wasserstein (SW) distances between two probability measures are defined as the expectation of the Wasserstein distance between two one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate the value of the SW distance. Despite having various variants, there has been no prior work that improves the Monte Carlo estimation scheme for the SW distance in terms of controlling its variance. To bridge the literature on variance reduction and the literature on the SW distance, we propose computationally efficient control variates to reduce the variance of the empirical estimation of the SW distance. The key idea is to first find Gaussian approximations of projected one-dimensional measures, then we utilize the closed-form of the Wasserstein-2 distance between two Gaussian distributions to design the control variates. In particular, we propose using a lower bound and an upper bound of the Wasserstein-2 distance between two fitted Gaussians as two computationally efficient control variates. We empirically show that the proposed control variate estimators can help to reduce the variance considerably when comparing measures over images and point-clouds. Finally, we demonstrate the favorable performance of the proposed control variate estimators in gradient flows to interpolate between two point-clouds and in deep generative modeling on standard image datasets, such as CIFAR10 and CelebA.

ICML Conference 2024 Conference Paper

Sliced Wasserstein with Random-Path Projecting Directions

  • Khai Nguyen
  • Shujian Zhang
  • Tam Le
  • Nhat Ho

Slicing distribution selection has been used as an effective technique to improve the performance of parameter estimators based on minimizing sliced Wasserstein distance in applications. Previous works either utilize expensive optimization to select the slicing distribution or use slicing distributions that require expensive sampling methods. In this work, we propose an optimization-free slicing distribution that provides a fast sampling for the Monte Carlo estimation of expectation. In particular, we introduce the random-path projecting direction (RPD) which is constructed by leveraging the normalized difference between two random vectors following the two input measures. From the RPD, we derive the random-path slicing distribution (RPSD) and two variants of sliced Wasserstein, i. e. , the Random-Path Projection Sliced Wasserstein (RPSW) and the Importance Weighted Random-Path Projection Sliced Wasserstein (IWRPSW). We then discuss the topological, statistical, and computational properties of RPSW and IWRPSW. Finally, we showcase the favorable performance of RPSW and IWRPSW in gradient flow and the training of denoising diffusion generative models on images.

TMLR Journal 2024 Journal Article

Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models

  • Qiujiang Jin
  • Tongzheng Ren
  • Nhat Ho
  • Aryan Mokhtari

The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the final desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings as it is $\mathcal{O}(d^3)$, where $d$ the is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method to solve parameter estimation of the GLMs, which has a per iteration cost of $\mathcal{O}(d^2)$. When the SNR is low, for GLMs with a polynomial link function of degree $p$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-square loss function, and the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of problem parameters and only depends on the degree of the link function $p$. Also, for the empirical loss with $n$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $p$, the iterates of BFGS reach a final statistical radius of $\mathcal{O}((d/n)^{\frac{1}{2p+2}})$ after at most $\log(n/d)$ iterations. This complexity is significantly less than the number required for GD, which scales polynomially with $(n/d)$.

ICLR Conference 2024 Conference Paper

Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

  • Huy Nguyen
  • Pedram Akbarian
  • Fanqi Yan
  • Nhat Ho

Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts $k_{\ast}$ is known, we demonstrate that the convergence rates of density and parameter estimations are both parametric on the sample size. However, when $k_{\ast}$ becomes unknown and the true model is over-specified by a Gaussian mixture of $k$ experts where $k > k_{\ast}$, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slow due to an intrinsic interaction between the softmax gating and expert functions.

ICML Conference 2024 Conference Paper

Structure-Aware E(3)-Invariant Molecular Conformer Aggregation Networks

  • Duy Minh Ho Nguyen
  • Nina Lukashina
  • Tai Nguyen 0008
  • An T. Le 0001
  • TrungTin Nguyen
  • Nhat Ho
  • Jan Peters 0001
  • Daniel Sonntag

A molecule’s 2D representation consists of its atoms, their attributes, and the molecule’s covalent bonds. A 3D (geometric) representation of a molecule is called a conformer and consists of its atom types and Cartesian coordinates. Every conformer has a potential energy, and the lower this energy, the more likely it occurs in nature. Most existing machine learning methods for molecular property prediction consider either 2D molecular graphs or 3D conformer structure representations in isolation. Inspired by recent work on using ensembles of conformers in conjunction with 2D graph representations, we propose E(3)-invariant molecular conformer aggregation networks. The method integrates a molecule’s 2D representation with that of multiple of its conformers. Contrary to prior work, we propose a novel 2D–3D aggregation mechanism based on a differentiable solver for the Fused Gromov-Wasserstein Barycenter problem and the use of an efficient conformer generation method based on distance geometry. We show that the proposed aggregation mechanism is E(3) invariant and propose an efficient GPU implementation. Moreover, we demonstrate that the aggregation mechanism helps to significantly outperform state-of-the-art molecule property prediction methods on established datasets.

ICLR Conference 2023 Conference Paper

A Primal-Dual Framework for Transformers and Neural Networks

  • Tan Minh Nguyen
  • Tam Minh Nguyen
  • Nhat Ho
  • Andrea L. Bertozzi
  • Richard G. Baraniuk
  • Stanley J. Osher

Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.

NeurIPS Conference 2023 Conference Paper

Demystifying Softmax Gating Function in Gaussian Mixture of Experts

  • Huy Nguyen
  • TrungTin Nguyen
  • Nhat Ho

Understanding the parameter estimation of softmax gating Gaussian mixture of experts has remained a long-standing open problem in the literature. It is mainly due to three fundamental theoretical challenges associated with the softmax gating function: (i) the identifiability only up to the translation of parameters; (ii) the intrinsic interaction via partial differential equations between the softmax gating and the expert functions in the Gaussian density; (iii) the complex dependence between the numerator and denominator of the conditional density of softmax gating Gaussian mixture of experts. We resolve these challenges by proposing novel Voronoi loss functions among parameters and establishing the convergence rates of maximum likelihood estimator (MLE) for solving parameter estimation in these models. When the true number of experts is unknown and over-specified, our findings show a connection between the convergence rate of the MLE and a solvability problem of a system of polynomial equations.

NeurIPS Conference 2023 Conference Paper

Designing Robust Transformers using Robust Kernel Density Estimation

  • Xing Han
  • Tongzheng Ren
  • Tan Nguyen
  • Khai Nguyen
  • Joydeep Ghosh
  • Nhat Ho

Transformer-based architectures have recently exhibited remarkable successes across different domains beyond just powering large language models. However, existing approaches typically focus on predictive accuracy and computational cost, largely ignoring certain other practical issues such as robustness to contaminated samples. In this paper, by re-interpreting the self-attention mechanism as a non-parametric kernel density estimator, we adapt classical robust kernel density estimation methods to develop novel classes of transformers that are resistant to adversarial attacks and data contamination. We first propose methods that down-weight outliers in RKHS when computing the self-attention operations. We empirically show that these methods produce improved performance over existing state-of-the-art methods, particularly on image data under adversarial attacks. Then we leverage the median-of-means principle to obtain another efficient approach that results in noticeably enhanced performance and robustness on language modeling and time series classification tasks. Our methods can be combined with existing transformers to augment their robust properties, thus promising to impact a wide variety of applications.

NeurIPS Conference 2023 Conference Paper

Energy-Based Sliced Wasserstein Distance

  • Khai Nguyen
  • Nhat Ho

The sliced Wasserstein (SW) distance has been widely recognized as a statistically effective and computationally efficient metric between two probability measures. A key component of the SW distance is the slicing distribution. There are two existing approaches for choosing this distribution. The first approach is using a fixed prior distribution. The second approach is optimizing for the best distribution which belongs to a parametric family of distributions and can maximize the expected distance. However, both approaches have their limitations. A fixed prior distribution is non-informative in terms of highlighting projecting directions that can discriminate two general probability measures. Doing optimization for the best distribution is often expensive and unstable. Moreover, designing the parametric family of the candidate distribution could be easily misspecified. To address the issues, we propose to design the slicing distribution as an energy-based distribution that is parameter-free and has the density proportional to an energy function of the projected one-dimensional Wasserstein distance. We then derive a novel sliced Wasserstein variant, energy-based sliced Waserstein (EBSW) distance, and investigate its topological, statistical, and computational properties via importance sampling, sampling importance resampling, and Markov Chain methods. Finally, we conduct experiments on point-cloud gradient flow, color transfer, and point-cloud reconstruction to show the favorable performance of the EBSW.

ICLR Conference 2023 Conference Paper

Hierarchical Sliced Wasserstein Distance

  • Khai Nguyen
  • Tongzheng Ren
  • Huy Nguyen
  • Litu Rout
  • Tan Nguyen
  • Nhat Ho

Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite its efficiency in the number of supports, estimating the sliced Wasserstein requires a relatively large number of projections in high-dimensional settings. Therefore, for applications where the number of supports is relatively small compared with the dimension, e.g., several deep learning applications where the mini-batch approaches are utilized, the complexities from matrix multiplication of Radon Transform become the main computational bottleneck. To address this issue, we propose to derive projections by linearly and randomly combining a smaller number of projections which are named bottleneck projections. We explain the usage of these projections by introducing Hierarchical Radon Transform (HRT) which is constructed by applying Radon Transform variants recursively. We then formulate the approach into a new metric between measures, named Hierarchical Sliced Wasserstein (HSW) distance. By proving the injectivity of HRT, we derive the metricity of HSW. Moreover, we investigate the theoretical properties of HSW including its connection to SW variants and its computational and sample complexities. Finally, we compare the computational cost and generative quality of HSW with the conventional SW on the task of deep generative modeling using various benchmark datasets including CIFAR10, CelebA, and Tiny ImageNet.

AAAI Conference 2023 Conference Paper

Joint Self-Supervised Image-Volume Representation Learning with Intra-inter Contrastive Clustering

  • Duy M. H. Nguyen
  • Hoang Nguyen
  • Truong T. N. Mai
  • Tri Cao
  • Binh T. Nguyen
  • Nhat Ho
  • Paul Swoboda
  • Shadi Albarqouni

Collecting large-scale medical datasets with fully annotated samples for training of deep networks is prohibitively expensive, especially for 3D volume data. Recent breakthroughs in self-supervised learning (SSL) offer the ability to overcome the lack of labeled training samples by learning feature representations from unlabeled data. However, most current SSL techniques in the medical field have been designed for either 2D images or 3D volumes. In practice, this restricts the capability to fully leverage unlabeled data from numerous sources, which may include both 2D and 3D data. Additionally, the use of these pre-trained networks is constrained to downstream tasks with compatible data dimensions. In this paper, we propose a novel framework for unsupervised joint learning on 2D and 3D data modalities. Given a set of 2D images or 2D slices extracted from 3D volumes, we construct an SSL task based on a 2D contrastive clustering problem for distinct classes. The 3D volumes are exploited by computing vectored embedding at each slice and then assembling a holistic feature through deformable self-attention mechanisms in Transformer, allowing incorporating long-range dependencies between slices inside 3D volumes. These holistic features are further utilized to define a novel 3D clustering agreement-based SSL task and masking embedding prediction inspired by pre-trained language models. Experiments on downstream tasks, such as 3D brain segmentation, lung nodule detection, 3D heart structures segmentation, and abnormal chest X-ray detection, demonstrate the effectiveness of our joint 2D and 3D SSL approach. We improve plain 2D Deep-ClusterV2 and SwAV by a significant margin and also surpass various modern 2D and 3D SSL approaches.

NeurIPS Conference 2023 Conference Paper

LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching

  • Duy M. H. Nguyen
  • Hoang Nguyen
  • Nghiem Diep
  • Tan Ngoc Pham
  • Tri Cao
  • Binh Nguyen
  • Paul Swoboda
  • Nhat Ho

Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples has remained an open challenge for medical imaging data. While pre-trained networks on ImageNet and vision-language foundation models trained on web-scale data are the prevailing approaches, their effectiveness on medical tasks is limited due to the significant domain shift between natural and medical images. To bridge this gap, we introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1. 3 million medical images from 55 publicly available datasets, covering a large number of organs and modalities such as CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm using a graph-matching formulation. The proposed approach makes three contributions: (i) it integrates prior pair-wise image similarity metrics based on local and global information; (ii) it captures the structural constraints of feature embeddings through a loss function constructed through a combinatorial graph-matching objective, and (iii) it can be trained efficiently end-to-end using modern gradient-estimation techniques for black-box solvers. We thoroughly evaluate the proposed LVM-Med on 15 downstream medical tasks ranging from segmentation and classification to object detection, and both for the in and out-of-distribution settings. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models. For challenging tasks such as Brain Tumor Classification or Diabetic Retinopathy Grading, LVM-Med improves previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.

NeurIPS Conference 2023 Conference Paper

Markovian Sliced Wasserstein Distances: Beyond Independent Projections

  • Khai Nguyen
  • Tongzheng Ren
  • Nhat Ho

Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$), seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of the Max-K-SW cannot be guaranteed in practice due to the non-optimality of the optimization. Moreover, the orthogonality constraint is also computationally expensive and might not be effective. To address the problem, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We discuss various members of the MSW by specifying the Markov structure including the prior distribution, the transition distribution, and the burning and thinning technique. Moreover, we investigate the theoretical properties of MSW including topological properties (metricity, weak convergence, and connection to other distances), statistical properties (sample complexity, and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications such as gradient flows, color transfer, and deep generative modeling to demonstrate the favorable performance of the MSW.

NeurIPS Conference 2023 Conference Paper

Minimax Optimal Rate for Parameter Estimation in Multivariate Deviated Models

  • Dat Do
  • Huy Nguyen
  • Khai Nguyen
  • Nhat Ho

We study the maximum likelihood estimation (MLE) in the multivariate deviated model where the data are generated from the density function $(1-\lambda^{\ast})h_{0}(x)+\lambda^{\ast}f(x|\mu^{\ast}, \Sigma^{\ast})$ in which $h_{0}$ is a known function, $\lambda^{\ast} \in [0, 1]$ and $(\mu^{\ast}, \Sigma^{\ast})$ are unknown parameters to estimate. The main challenges in deriving the convergence rate of the MLE mainly come from two issues: (1) The interaction between the function $h_{0}$ and the density function $f$; (2) The deviated proportion $\lambda^{\ast}$ can go to the extreme points of $[0, 1]$ as the sample size tends to infinity. To address these challenges, we develop the \emph{distinguishability condition} to capture the linear independent relation between the function $h_{0}$ and the density function $f$. We then provide comprehensive convergence rates of the MLE via the vanishing rate of $\lambda^{\ast}$ to zero as well as the distinguishability of two functions $h_{0}$ and $f$.

ICML Conference 2023 Conference Paper

Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data

  • Hien Dang 0003
  • Tho Tran Huu
  • Stanley J. Osher
  • Hung Tran-The
  • Nhat Ho
  • Tan Minh Nguyen

Modern deep neural networks have achieved impressive performance on tasks from image classification to natural language processing. Surprisingly, these complex systems with massive amounts of parameters exhibit the same structural properties in their last-layer features and classifiers across canonical datasets when training until convergence. In particular, it has been observed that the last-layer features collapse to their class-means, and those class-means are the vertices of a simplex Equiangular Tight Frame (ETF). This phenomenon is known as Neural Collapse (NC). Recent papers have theoretically shown that NC emerges in the global minimizers of training problems with the simplified “unconstrained feature model”. In this context, we take a step further and prove the NC occurrences in deep linear networks for the popular mean squared error (MSE) and cross entropy (CE) losses, showing that global solutions exhibit NC properties across the linear layers. Furthermore, we extend our study to imbalanced data for MSE loss and present the first geometric analysis of NC under bias-free setting. Our results demonstrate the convergence of the last-layer features and classifiers to a geometry consisting of orthogonal vectors, whose lengths depend on the amount of data in their corresponding classes. Finally, we empirically validate our theoretical analyses on synthetic and practical network architectures with both balanced and imbalanced scenarios.

ICML Conference 2023 Conference Paper

On Excess Mass Behavior in Gaussian Mixture Models with Orlicz-Wasserstein Distances

  • Aritra Guha
  • Nhat Ho
  • XuanLong Nguyen

Dirichlet Process mixture models (DPMM) in combination with Gaussian kernels have been an important modeling tool for numerous data domains arising from biological, physical, and social sciences. However, this versatility in applications does not extend to strong theoretical guarantees for the underlying parameter estimates, for which only a logarithmic rate is achieved. In this work, we (re)introduce and investigate a metric, named Orlicz-Wasserstein distance, in the study of the Bayesian contraction behavior for the parameters. We show that despite the overall slow convergence guarantees for all the parameters, posterior contraction for parameters happens at almost polynomial rates in outlier regions of the parameter space. Our theoretical results provide new insight in understanding the convergence behavior of parameters arising from various settings of hierarchical Bayesian nonparametric models. In addition, we provide an algorithm to compute the metric by leveraging Sinkhorn divergences and validate our findings through a simulation study.

ICML Conference 2023 Conference Paper

Revisiting Over-smoothing and Over-squashing Using Ollivier-Ricci Curvature

  • Khang Nguyen 0002
  • Nong Minh Hieu
  • Vinh Duc Nguyen
  • Nhat Ho
  • Stanley J. Osher
  • Tan Minh Nguyen

Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.

ICML Conference 2023 Conference Paper

Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction

  • Khai Nguyen
  • Dang Nguyen
  • Nhat Ho

Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for less discriminative projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the “max" projecting directions given two input measures instead of using projected gradient ascent multiple times. Despite being efficient, Max-SW and its amortized version cannot guarantee metricity property due to the sub-optimality of the projected gradient ascent and the amortization gap. Therefore, we propose to replace Max-SW with distributional sliced Wasserstein distance with von Mises-Fisher (vMF) projecting distribution (v-DSW). Since v-DSW is a metric with any non-degenerate vMF distribution, its amortized version can guarantee the metricity when performing amortization. Furthermore, current amortized models are not permutation invariant and symmetric. To address the issue, we design amortized models based on self-attention architecture. In particular, we adopt efficient self-attention architectures to make the computation linear in the number of supports. With the two improvements, we derive self-attention amortized distributional projection optimization and show its appealing performance in point-cloud reconstruction and its downstream applications

NeurIPS Conference 2022 Conference Paper

Amortized Projection Optimization for Sliced Wasserstein Generative Models

  • Khai Nguyen
  • Nhat Ho

Seeking informative projecting directions has been an important task in utilizing sliced Wasserstein distance in applications. However, finding these directions usually requires an iterative optimization procedure over the space of projecting directions, which is computationally expensive. Moreover, the computational issue is even more severe in deep learning applications, where computing the distance between two mini-batch probability measures is repeated several times. This nested-loop has been one of the main challenges that prevent the usage of sliced Wasserstein distances based on good projections in practice. To address this challenge, we propose to utilize the \textit{learning-to-optimize} technique or \textit{amortized optimization} to predict the informative direction of any given two mini-batch probability measures. To the best of our knowledge, this is the first work that bridges amortized optimization and sliced Wasserstein generative models. In particular, we derive linear amortized models, generalized linear amortized models, and non-linear amortized models which are corresponding to three types of novel mini-batch losses, named \emph{amortized sliced Wasserstein}. We demonstrate the favorable performance of the proposed sliced losses in deep generative modeling on standard benchmark datasets.

ICML Conference 2022 Conference Paper

Architecture Agnostic Federated Learning for Neural Networks

  • Disha Makhija
  • Xing Han
  • Nhat Ho
  • Joydeep Ghosh

With growing concerns regarding data privacy and rapid increase in data volume, Federated Learning (FL) has become an important learning paradigm. However, jointly learning a deep neural network model in a FL setting proves to be a non-trivial task because of the complexities associated with the neural networks, such as varied architectures across clients, permutation invariance of the neurons, and presence of non-linear transformations in each layer. This work introduces a novel framework, Federated Heterogeneous Neural Networks (FedHeNN), that allows each client to build a personalised model without enforcing a common architecture across clients. This allows each client to optimize with respect to local data and compute constraints, while still benefiting from the learnings of other (potentially more powerful) clients. The key idea of FedHeNN is to use the instance-level representations obtained from peer clients to guide the simultaneous training on each client. The extensive experimental results demonstrate that the FedHeNN framework is capable of learning better performing models on clients in both the settings of homogeneous and heterogeneous architectures across clients.

NeurIPS Conference 2022 Conference Paper

Beyond black box densities: Parameter learning for the deviated components

  • Dat Do
  • Nhat Ho
  • XuanLong Nguyen

As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density being deviated from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviating mixture model} $(1-\lambda^{*})h_0 + \lambda^{*} (\sum_{i = 1}^{k} p_{i}^{*} f(x|\theta_{i}^{*}))$, where $h_0$ is a known density function, while the deviated proportion $\lambda^{*}$ and latent mixing measure $G_{*} = \sum_{i = 1}^{k} p_{i}^{*} \delta_{\theta_i^{*}}$ associated with the mixture distribution are unknown. Via a novel notion of distinguishability between the known density $h_{0}$ and the deviated mixture distribution, we establish rates of convergence for the maximum likelihood estimates of $\lambda^{*}$ and $G^{*}$ under Wasserstein metric. Simulation studies are carried out to illustrate the theory.

JMLR Journal 2022 Journal Article

Convergence Rates for Gaussian Mixtures of Experts

  • Nhat Ho
  • Chiao-Yu Yang
  • Michael I. Jordan

We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish the convergence rates of the maximum likelihood estimation (MLE) for these models. Our proof technique is based on a novel notion of algebraic independence of the expert functions. Drawing on optimal transport, we establish a connection between the algebraic independence of the expert functions and a certain class of partial differential equations (PDEs) with respect to the parameters. Exploiting this connection allows us to derive convergence rates for parameter estimation. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

ICML Conference 2022 Conference Paper

Entropic Gromov-Wasserstein between Gaussian Distributions

  • Khang Le
  • Dung Q. Le
  • Huy Nguyen
  • Dat Do
  • Tung Pham 0001
  • Nhat Ho

We study the entropic Gromov-Wasserstein and its unbalanced version between (unbalanced) Gaussian distributions with different dimensions. When the metric is the inner product, which we refer to as inner product Gromov-Wasserstein (IGW), we demonstrate that the optimal transportation plans of entropic IGW and its unbalanced variant are (unbalanced) Gaussian distributions. Via an application of von Neumann’s trace inequality, we obtain closed-form expressions for the entropic IGW between these Gaussian distributions. Finally, we consider an entropic inner product Gromov-Wasserstein barycenter of multiple Gaussian distributions. We prove that the barycenter is a Gaussian distribution when the entropic regularization parameter is small. We further derive a closed-form expression for the covariance matrix of the barycenter.

NeurIPS Conference 2022 Conference Paper

FourierFormer: Transformer Meets Generalized Fourier Integral Theorem

  • Tan Nguyen
  • Minh Pham
  • Tam Nguyen
  • Khai Nguyen
  • Stanley Osher
  • Nhat Ho

Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distribution. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels can automatically capture such dependency and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.

ICML Conference 2022 Conference Paper

Improving Mini-batch Optimal Transport via Partial Transportation

  • Khai Nguyen
  • Dang Nguyen 0002
  • The-Anh Vu-Le
  • Tung Pham 0001
  • Nhat Ho

Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite their practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal on the mini-batch level but are partially wrong in the comparison with the optimal transportation plan between the original measures. Motivated by the misspecified mappings issue, we propose a novel mini-batch method by using partial optimal transport (POT) between mini-batch empirical measures, which we refer to as mini-batch partial optimal transport (m-POT). Leveraging the insight from the partial transportation, we explain the source of misspecified mappings from the m-OT and motivate why limiting the amount of transported masses among mini-batches via POT can alleviate the incorrect mappings. Finally, we carry out extensive experiments on various applications such as deep domain adaptation, partial domain adaptation, deep generative model, color transfer, and gradient flow to demonstrate the favorable performance of m-POT compared to current mini-batch methods.

NeurIPS Conference 2022 Conference Paper

Improving Transformer with an Admixture of Attention Heads

  • Tan Nguyen
  • Tam Nguyen
  • Hai Do
  • Khai Nguyen
  • Vishwanath Saragadam
  • Minh Pham
  • Khuong Duy Nguyen
  • Nhat Ho

Transformers with multi-head self-attention have achieved remarkable success in sequence modeling and beyond. However, they suffer from high computational and memory complexities for computing the attention matrix at each head. Recently, it has been shown that those attention matrices lie on a low-dimensional manifold and, thus, are redundant. We propose the Transformer with a Finite Admixture of Shared Heads (FiSHformers), a novel class of efficient and flexible transformers that allow the sharing of attention matrices between attention heads. At the core of FiSHformer is a novel finite admixture model of shared heads (FiSH) that samples attention matrices from a set of global attention matrices. The number of global attention matrices is much smaller than the number of local attention matrices generated. FiSHformers directly learn these global attention matrices rather than the local ones as in other transformers, thus significantly improving the computational and memory efficiency of the model. We empirically verify the advantages of the FiSHformer over the baseline transformers in a wide range of practical applications including language modeling, machine translation, and image classification. On the WikiText-103, IWSLT'14 De-En and WMT'14 En-De, FiSHformers use much fewer floating-point operations per second (FLOPs), memory, and parameters compared to the baseline transformers.

ICML Conference 2022 Conference Paper

Improving Transformers with Probabilistic Attention Keys

  • Tam Minh Nguyen
  • Tan Minh Nguyen
  • Dung D. Le
  • Duy Khuong Nguyen
  • Viet-Anh Tran
  • Richard G. Baraniuk
  • Nhat Ho
  • Stanley J. Osher

Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.

JMLR Journal 2022 Journal Article

On the Complexity of Approximating Multimarginal Optimal Transport

  • Tianyi Lin
  • Nhat Ho
  • Marco Cuturi
  • Michael I. Jordan

We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{\mathcal{O}}(n^{3m})$. We then propose two simple and deterministic algorithms for approximating the MOT problem. The first algorithm, which we refer to as multimarginal Sinkhorn algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{\mathcal{O}}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first near-linear time complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as accelerated multimarginal Sinkhorn algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{\mathcal{O}}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm (Tupitsa et al., 2020) in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver Gurobi. Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

JMLR Journal 2022 Journal Article

On the Efficiency of Entropic Regularized Algorithms for Optimal Transport

  • Tianyi Lin
  • Nhat Ho
  • Michael I. Jordan

We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as Greenkhorn, from $\tilde{O}(n^2\varepsilon^{-3})$ to $\tilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by Altschuler et al. (2017). Second, we propose a new algorithm, which we refer to as APDAMD and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm (Dvurechensky et al., 2018) with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\tilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$ in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\tilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\tilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a deterministic accelerated variant of Sinkhorn via appeal to estimated sequence and prove the complexity bound of $\tilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM) (Guminov et al., 2021) in terms of $n$. Finally, we conduct the experiments on synthetic and real data and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

ICML Conference 2022 Conference Paper

On Transportation of Mini-batches: A Hierarchical Approach

  • Khai Nguyen
  • Dang Nguyen 0002
  • Quoc Dinh Nguyen
  • Tung Pham 0001
  • Hung Hai Bui
  • Dinh Q. Phung
  • Trung Le 0001
  • Nhat Ho

Mini-batch optimal transport (m-OT) has been successfully used in practical applications that involve probability measures with a very high number of supports. The m-OT solves several smaller optimal transport problems and then returns the average of their costs and transportation plans. Despite its scalability advantage, the m-OT does not consider the relationship between mini-batches which leads to undesirable estimation. Moreover, the m-OT does not approximate a proper metric between probability measures since the identity property is not satisfied. To address these problems, we propose a novel mini-batch scheme for optimal transport, named Batch of Mini-batches Optimal Transport (BoMb-OT), that finds the optimal coupling between mini-batches and it can be seen as an approximation to a well-defined distance on the space of probability measures. Furthermore, we show that the m-OT is a limit of the entropic regularized version of the BoMb-OT when the regularized parameter goes to infinity. Finally, we carry out experiments on various applications including deep generative models, deep domain adaptation, approximate Bayesian computation, color transfer, and gradient flow to show that the BoMb-OT can be widely applied and performs well in various applications.

ICML Conference 2022 Conference Paper

Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models

  • Tudor A. Manole
  • Nhat Ho

We revisit the classical problem of deriving convergence rates for the maximum likelihood estimator (MLE) in finite mixture models. The Wasserstein distance has become a standard loss function for the analysis of parameter estimation in these models, due in part to its ability to circumvent label switching and to accurately characterize the behaviour of fitted mixture components with vanishing weights. However, the Wasserstein distance is only able to capture the worst-case convergence rate among the remaining fitted mixture components. We demonstrate that when the log-likelihood function is penalized to discourage vanishing mixing weights, stronger loss functions can be derived to resolve this shortcoming of the Wasserstein distance. These new loss functions accurately capture the heterogeneity in convergence rates of fitted mixture components, and we use them to sharpen existing pointwise and uniform convergence rates in various classes of mixture models. In particular, these results imply that a subset of the components of the penalized MLE typically converge significantly faster than could have been anticipated from past work. We further show that some of these conclusions extend to the traditional MLE. Our theoretical findings are supported by a simulation study to illustrate these improved convergence rates.

NeurIPS Conference 2022 Conference Paper

Revisiting Sliced Wasserstein on Images: From Vectorization to Convolution

  • Khai Nguyen
  • Nhat Ho

The conventional sliced Wasserstein is defined between two probability measures that have realizations as \textit{vectors}. When comparing two probability measures over images, practitioners first need to vectorize images and then project them to one-dimensional space by using matrix multiplication between the sample matrix and the projection matrix. After that, the sliced Wasserstein is evaluated by averaging the two corresponding one-dimensional projected probability measures. However, this approach has two limitations. The first limitation is that the spatial structure of images is not captured efficiently by the vectorization step; therefore, the later slicing process becomes harder to gather the discrepancy information. The second limitation is memory inefficiency since each slicing direction is a vector that has the same dimension as the images. To address these limitations, we propose novel slicing methods for sliced Wasserstein between probability measures over images that are based on the convolution operators. We derive \emph{convolution sliced Wasserstein} (CSW) and its variants via incorporating stride, dilation, and non-linear activation function into the convolution operators. We investigate the metricity of CSW as well as its sample complexity, its computational complexity, and its connection to conventional sliced Wasserstein distances. Finally, we demonstrate the favorable performance of CSW over the conventional sliced Wasserstein in comparing probability measures over images and in training deep generative modeling on images.

NeurIPS Conference 2022 Conference Paper

Stochastic Multiple Target Sampling Gradient Descent

  • Hoang Phan
  • Ngoc Tran
  • Trung Le
  • Toan Tran
  • Nhat Ho
  • Dinh Phung

Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can be viewed as a probabilistic version of this single-objective optimization problem. A natural question then arises: ``Can we derive a probabilistic version of the multi-objective optimization? ''. To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions. Specifically, our MT-SGD conducts a flow of intermediate distributions gradually orienting to multiple target distributions, which allows the sampled particles to move to the joint high-likelihood region of the target distributions. Interestingly, the asymptotic analysis shows that our approach reduces exactly to the multiple-gradient descent algorithm for multi-objective optimization, as expected. Finally, we conduct comprehensive experiments to demonstrate the merit of our approach to multi-task learning.

ICLR Conference 2021 Conference Paper

Distributional Sliced-Wasserstein and Applications to Generative Modeling

  • Khai Nguyen
  • Nhat Ho
  • Tung Pham 0001
  • Hung Hai Bui

Sliced-Wasserstein distance (SW) and its variant, Max Sliced-Wasserstein distance (Max-SW), have been used widely in the recent years due to their fast computation and scalability even when the probability measures lie in a very high dimensional space. However, SW requires many unnecessary projection samples to approximate its value while Max-SW only uses the most important projection, which ignores the information of other useful directions. In order to account for these weaknesses, we propose a novel distance, named Distributional Sliced-Wasserstein distance (DSW), that finds an optimal distribution over projections that can balance between exploring distinctive projecting directions and the informativeness of projections themselves. We show that the DSW is a generalization of Max-SW, and it can be computed efficiently by searching for the optimal push-forward measure over a set of probability measures over the unit sphere satisfying certain regularizing constraints that favor distinct directions. Finally, we conduct extensive experiments with large-scale datasets to demonstrate the favorable performances of the proposed distances over the previous sliced-based distances in generative modeling applications.

ICLR Conference 2021 Conference Paper

Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein

  • Khai Nguyen
  • Son Nguyen
  • Nhat Ho
  • Tung Pham 0001
  • Hung Hai Bui

Relational regularized autoencoder (RAE) is a framework to learn the distribution of data by minimizing a reconstruction loss together with a relational regularization on the prior of latent space. A recent attempt to reduce the inner discrepancy between the prior and aggregated posterior distributions is to incorporate sliced fused Gromov-Wasserstein (SFG) between these distributions. That approach has a weakness since it treats every slicing direction similarly, meanwhile several directions are not useful for the discriminative task. To improve the discrepancy and consequently the relational regularization, we propose a new relational discrepancy, named spherical sliced fused Gromov Wasserstein (SSFG), that can find an important area of projections characterized by a von Mises-Fisher distribution. Then, we introduce two variants of SSFG to improve its performance. The first variant, named mixture spherical sliced fused Gromov Wasserstein (MSSFG), replaces the vMF distribution by a mixture of von Mises-Fisher distributions to capture multiple important areas of directions that are far from each other. The second variant, named power spherical sliced fused Gromov Wasserstein (PSSFG), replaces the vMF distribution by a power spherical distribution to improve the sampling time of the vMF distribution in high dimension settings. We then apply the new discrepancies to the RAE framework to achieve its new variants. Finally, we conduct extensive experiments to show that the new autoencoders have favorable performance in learning latent manifold structure, image generation, and reconstruction.

ICML Conference 2021 Conference Paper

LAMDA: Label Matching Deep Domain Adaptation

  • Trung Le 0001
  • Tuan Nguyen 0004
  • Nhat Ho
  • Hung Hai Bui
  • Dinh Q. Phung

Deep domain adaptation (DDA) approaches have recently been shown to perform better than their shallow rivals with better modeling capacity on complex domains (e. g. , image, structural data, and sequential data). The underlying idea is to learn domain invariant representations on a latent space that can bridge the gap between source and target domains. Several theoretical studies have established insightful understanding and the benefit of learning domain invariant features; however, they are usually limited to the case where there is no label shift, hence hindering its applicability. In this paper, we propose and study a new challenging setting that allows us to use a Wasserstein distance (WS) to not only quantify the data shift but also to define the label shift directly. We further develop a theory to demonstrate that minimizing the WS of the data shift leads to closing the gap between the source and target data distributions on the latent space (e. g. , an intermediate layer of a deep net), while still being able to quantify the label shift with respect to this latent space. Interestingly, our theory can consequently explain certain drawbacks of learning domain invariant features on the latent space. Finally, grounded on the results and guidance of our developed theory, we propose the Label Matching Deep Domain Adaptation (LAMDA) approach that outperforms baselines on real-world datasets for DA problems.

JMLR Journal 2021 Journal Article

On efficient multilevel Clustering via Wasserstein distances

  • Viet Huynh
  • Nhat Ho
  • Nhan Dam
  • XuanLong Nguyen
  • Mikhail Yurochkin
  • Hung Bui
  • Dinh Phung

We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a potentially large hierarchically structured corpus of data. Our method involves a joint optimization formulation over several spaces of discrete probability measures, which are endowed with Wasserstein distance metrics. We propose several variants of this problem, which admit fast optimization algorithms, by exploiting the connection to the problem of finding Wasserstein barycenters. Consistency properties are established for the estimates of both local and global clusters. Finally, experimental results with both synthetic and real data are presented to demonstrate the flexibility and scalability of the proposed approach. [abs] [ pdf ][ bib ] &copy JMLR 2021. ( edit, beta )

NeurIPS Conference 2021 Conference Paper

On Robust Optimal Transport: Computational Complexity and Barycenter Computation

  • Khang Le
  • Huy Nguyen
  • Quang M Nguyen
  • Tung Pham
  • Hung Bui
  • Nhat Ho

We consider robust variants of the standard optimal transport, named robust optimal transport, where marginal constraints are relaxed via Kullback-Leibler divergence. We show that Sinkhorn-based algorithms can approximate the optimal cost of robust optimal transport in $\widetilde{\mathcal{O}}(\frac{n^2}{\varepsilon})$ time, in which $n$ is the number of supports of the probability distributions and $\varepsilon$ is the desired error. Furthermore, we investigate a fixed-support robust barycenter problem between $m$ discrete probability distributions with at most $n$ number of supports and develop an approximating algorithm based on iterative Bregman projections (IBP). For the specific case $m = 2$, we show that this algorithm can approximate the optimal barycenter value in $\widetilde{\mathcal{O}}(\frac{mn^2}{\varepsilon})$ time, thus being better than the previous complexity $\widetilde{\mathcal{O}}(\frac{mn^2}{\varepsilon^2})$ of the IBP algorithm for approximating the Wasserstein barycenter.

NeurIPS Conference 2021 Conference Paper

Structured Dropout Variational Inference for Bayesian Neural Networks

  • Son Nguyen
  • Duong Nguyen
  • Khai Nguyen
  • Khoat Than
  • Hung Bui
  • Nhat Ho

Approximate inference in Bayesian deep networks exhibits a dilemma of how to yield high fidelity posterior approximations while maintaining computational efficiency and scalability. We tackle this challenge by introducing a novel variational structured approximation inspired by the Bayesian interpretation of Dropout regularization. Concretely, we focus on the inflexibility of the factorized structure in Dropout posterior and then propose an improved method called Variational Structured Dropout (VSD). VSD employs an orthogonal transformation to learn a structured representation on the variational Gaussian noise with plausible complexity, and consequently induces statistical dependencies in the approximate posterior. Theoretically, VSD successfully addresses the pathologies of previous Variational Dropout methods and thus offers a standard Bayesian justification. We further show that VSD induces an adaptive regularization term with several desirable properties which contribute to better generalization. Finally, we conduct extensive experiments on standard benchmarks to demonstrate the effectiveness of VSD over state-of-the-art variational methods on predictive accuracy, uncertainty estimation, and out-of-distribution detection.

NeurIPS Conference 2020 Conference Paper

Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm

  • Tianyi Lin
  • Nhat Ho
  • Xi Chen
  • Marco Cuturi
  • Michael Jordan

We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists in computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We show first that the constraint matrix arising from the standard linear programming (LP) representation of the FS-WBP is \textit{not totally unimodular} when $m \geq 3$ and $n \geq 3$. This result resolves an open question pertaining to the relationship between the FS-WBP and the minimum-cost flow (MCF) problem since it proves that the FS-WBP in the standard LP form is not an MCF problem when $m \geq 3$ and $n \geq 3$. We also develop a provably fast \textit{deterministic} variant of the celebrated iterative Bregman projection (IBP) algorithm, named \textsc{FastIBP}, with a complexity bound of $\tilde{O}(mn^{7/3}\varepsilon^{-4/3})$, where $\varepsilon \in (0, 1)$ is the desired tolerance. This complexity bound is better than the best known complexity bound of $\tilde{O}(mn^2\varepsilon^{-2})$ for the IBP algorithm in terms of $\varepsilon$, and that of $\tilde{O}(mn^{5/2}\varepsilon^{-1})$ from accelerated alternating minimization algorithm or accelerated primal-dual adaptive gradient algorithm in terms of $n$. Finally, we conduct extensive experiments with both synthetic data and real images and demonstrate the favorable performance of the \textsc{FastIBP} algorithm in practice.

ICML Conference 2020 Conference Paper

On Unbalanced Optimal Transport: An Analysis of Sinkhorn Algorithm

  • Khiem Pham
  • Khang Le
  • Nhat Ho
  • Tung Pham 0001
  • Hung Hai Bui

We provide a computational complexity analysis for the Sinkhorn algorithm that solves the entropic regularized Unbalanced Optimal Transport (UOT) problem between two measures of possibly different masses with at most $n$ components. We show that the complexity of the Sinkhorn algorithm for finding an $\varepsilon$-approximate solution to the UOT problem is of order $\widetilde{\mathcal{O}}(n^2/ \varepsilon)$. To the best of our knowledge, this complexity is better than the best known complexity upper bound of the Sinkhorn algorithm for solving the Optimal Transport (OT) problem, which is of order $\widetilde{\mathcal{O}}(n^2/\varepsilon^2)$. Our proof technique is based on the geometric convergence rate of the Sinkhorn updates to the optimal dual solution of the entropic regularized UOT problem and scaling properties of the primal solution. It is also different from the proof technique used to establish the complexity of the Sinkhorn algorithm for approximating the OT problem since the UOT solution does not need to meet the marginal constraints of the measures.

NeurIPS Conference 2020 Conference Paper

Projection Robust Wasserstein Distance and Riemannian Optimization

  • Tianyi Lin
  • Chenyou Fan
  • Nhat Ho
  • Marco Cuturi
  • Michael Jordan

Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high-dimensions. However, it is ruled out for practical application because the optimization model is essentially non-convex and non-smooth which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and lack of nonsmoothness, and even despite some hardness results proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP \textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantee on their complexity bound (one in the appendix), and demonstrate their effectiveness and efficiency by conducing extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.

ICML Conference 2019 Conference Paper

On Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms

  • Tianyi Lin
  • Nhat Ho
  • Michael I. Jordan

We provide theoretical analyses for two algorithms that solve the regularized optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. We show that a greedy variant of the classical Sinkhorn algorithm, known as the Greenkhorn algorithm, can be improved to $\bigOtil\left(n^2/\varepsilon^2\right)$, improving on the best known complexity bound of $\bigOtil\left(n^2/\varepsilon^3\right)$. This matches the best known complexity bound for the Sinkhorn algorithm and helps explain why the Greenkhorn algorithm outperforms the Sinkhorn algorithm in practice. Our proof technique is based on a primal-dual formulation and provide a tight upper bound for the dual solution, leading to a class of adaptive primal-dual accelerated mirror descent (APDAMD) algorithms. We prove that the complexity of these algorithms is $\bigOtil\left(n^2\sqrt{\gamma}/\varepsilon\right)$ in which $\gamma \in (0, n]$ refers to some constants in the Bregman divergence. Experimental results on synthetic and real datasets demonstrate the favorable performance of the Greenkhorn and APDAMD algorithms in practice.

ICML Conference 2017 Conference Paper

Multilevel Clustering via Wasserstein Means

  • Nhat Ho
  • XuanLong Nguyen
  • Mikhail Yurochkin
  • Hung Hai Bui
  • Viet Huynh
  • Dinh Q. Phung

We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a potentially large hierarchically structured corpus of data. Our method involves a joint optimization formulation over several spaces of discrete probability measures, which are endowed with Wasserstein distance metrics. We propose a number of variants of this problem, which admit fast optimization algorithms, by exploiting the connection to the problem of finding Wasserstein barycenters. Consistency properties are established for the estimates of both local and global clusters. Finally, experiment results with both synthetic and real data are presented to demonstrate the flexibility and scalability of the proposed approach.