Arrow Research

Author name cluster

Samuel Kaski

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

75 papers
2 author rows

Possible papers (75)

AAAI Conference 2026 Conference Paper

More than Irrational: Modeling Belief-Biased Agents

  • Yifan Zhu
  • Sammie Katt
  • Samuel Kaski

Despite the explosive growth of AI and the technologies built upon it, predicting and inferring the sub-optimal behavior of users or human collaborators remains a critical challenge. In many cases, such behaviors are not a result of irrationality, but rather a rational decision made given inherent cognitive bounds and biased beliefs about the world. In this paper, we formally introduce a class of computational-rational (CR) user models for cognitively-bounded agents acting optimally under biased beliefs. The key novelty lies in explicitly modeling how a bounded memory process leads to a dynamically inconsistent and biased belief state and, consequently, sub-optimal sequential decision-making. We address the challenge of identifying the latent user-specific bound and inferring biased belief states from passive observations on the fly. We argue that for our formalized CR model family with an explicit and parameterized cognitive process, this challenge is tractable. To support our claim, we propose an efficient online inference method based on nested particle filtering that simultaneously tracks the user's latent belief state and estimates the unknown cognitive bound from a stream of observed actions. We validate our approach in a representative navigation task using memory decay as an example of a cognitive bound. With simulations, we show that (1) our CR model generates intuitively plausible behaviors corresponding to different levels of memory capacity, and (2) our inference method accurately and efficiently recovers the ground-truth cognitive bounds from limited observations (fewer than 100 steps). We further demonstrate how this approach provides a principled foundation for developing adaptive AI assistants, enabling assistance that accounts for the user's memory limitations.
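
For intuition, here is a minimal sketch of recovering a memory-decay bound from observed actions. It is not the paper's method: the toy task (scalar target, Gaussian noise, linear decay rule) and all parameter choices are illustrative assumptions, and because the toy belief is deterministic given the decay parameter, plain importance weighting over bound particles stands in for the paper's nested particle filter.

```python
import numpy as np

rng = np.random.default_rng(0)

def belief_update(b, obs, lam):
    # lam near 1: the old belief persists (strong memory);
    # lam near 0: the belief tracks only the newest observation
    return lam * b + (1.0 - lam) * obs

def simulate_user(target, lam, T=100, obs_sd=0.5, act_sd=0.2):
    b, obs_log, act_log = 0.0, [], []
    for _ in range(T):
        obs = target + obs_sd * rng.normal()
        b = belief_update(b, obs, lam)
        obs_log.append(obs)
        act_log.append(b + act_sd * rng.normal())  # user acts on their belief
    return np.array(obs_log), np.array(act_log)

def infer_memory_bound(obs_log, act_log, n_particles=500, act_sd=0.2):
    lam = rng.uniform(0.0, 1.0, n_particles)  # particles over the latent bound
    b = np.zeros(n_particles)                 # one belief track per particle
    logw = np.zeros(n_particles)
    for obs, act in zip(obs_log, act_log):
        b = belief_update(b, obs, lam)
        logw += -0.5 * ((act - b) / act_sd) ** 2  # Gaussian action likelihood
    w = np.exp(logw - logw.max())
    return float(np.sum(w * lam) / w.sum())

obs_log, act_log = simulate_user(target=2.0, lam=0.8)
print("estimated memory parameter:", infer_memory_bound(obs_log, act_log))
```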

TMLR Journal 2026 Journal Article

Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning

  • Yingxiao Huo
  • Satya Prakash Dash
  • Radu Stoican
  • Samuel Kaski
  • Mingfei Sun

Natural gradients have long been studied in deep reinforcement learning due to their fast convergence properties and covariant weight updates. However, computing natural gradients requires inverting the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation of the full inverse-FIM. We show theoretically that, under certain conditions, the rank-1 approximation of the inverse-FIM converges faster than policy gradients and, under some conditions, enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard trust-region and actor-critic baselines.
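
The abstract does not specify how the rank-1 inverse is formed; the sketch below only illustrates why such a structure is cheap, using a damped rank-1 Fisher surrogate F̂ = λI + uuᵀ inverted in closed form via Sherman-Morrison. The curvature direction `u` and the damping value are assumptions for illustration, not the paper's construction.

```python
import numpy as np

def rank1_natural_gradient(grad, u, damping=1e-2):
    """Apply the inverse of a damped rank-1 Fisher surrogate,
    F_hat = damping * I + u u^T, to a policy gradient.
    Sherman-Morrison gives, in O(d) time and memory:
      (aI + uu^T)^{-1} g = g/a - u (u^T g) / (a (a + u^T u))
    """
    a = damping
    return grad / a - u * (u @ grad) / (a * (a + u @ u))

d = 6
g = np.random.randn(d)   # stochastic policy gradient
u = np.random.randn(d)   # assumed rank-1 curvature direction
print(rank1_natural_gradient(g, u))
```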

NeurIPS Conference 2025 Conference Paper

A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

  • Anjie Liu
  • Jianhong Wang
  • Samuel Kaski
  • Jun Wang
  • Mengyue Yang

Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when global guidance from a human on the whole multi-agent system is impractical in large-scale MARL. On the other hand, designing external mechanisms (e.g., intrinsic rewards and human feedback) to coordinate agents mostly relies on empirical studies, lacking an easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce the concept of MARL interaction paradigms (orthogonal to MARL learning paradigms), using MAIDs to analyze and visualize both unguided self-organization and global guidance mechanisms in MARL. Then, we design a new MARL interaction paradigm, referred to as the targeted intervention paradigm, which is applied to only a single targeted agent, mitigating the problem of global guidance. In implementation, we introduce a causal inference technique—referred to as Pre-Strategy Intervention (PSI)—to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an MARL interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention and verify the results of the relevance graph analysis.

NeurIPS Conference 2025 Conference Paper

ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition

  • Daolang Huang
  • Xinyi Wen
  • Ayush Bharti
  • Samuel Kaski
  • Luigi Acerbi

Many critical applications, from autonomous scientific discovery to personalized medicine, demand systems that can both strategically acquire the most informative data and instantaneously perform inference based upon it. While amortized methods for Bayesian inference and experimental design offer part of the solution, neither approach is optimal in the most general and challenging task, where new data needs to be collected for instant inference. To tackle this issue, we introduce the Amortized Active Learning and Inference Engine (ALINE), a unified framework for amortized Bayesian inference and active data acquisition. ALINE leverages a transformer architecture trained via reinforcement learning with a reward based on self-estimated information gain provided by its own integrated inference component. This allows it to strategically query informative data points while simultaneously refining its predictions. Moreover, ALINE can selectively direct its querying strategy towards specific subsets of model parameters or designated predictive tasks, optimizing for posterior estimation, data prediction, or a mixture thereof. Empirical results on regression-based active learning, classical Bayesian experimental design benchmarks, and a psychometric model with selectively targeted parameters demonstrate that ALINE delivers both instant and accurate inference along with efficient selection of informative points.

ICRA Conference 2025 Conference Paper

DroneDiffusion: Robust Quadrotor Dynamics Learning with Diffusion Models

  • Avirup Das
  • Rishabh Dev Yadav
  • Sihao Sun
  • Mingfei Sun 0001
  • Samuel Kaski
  • Wei Pan 0004

An inherent fragility of quadrotor systems stems from model inaccuracies and external disturbances. These factors hinder performance and compromise the stability of the system, making precise control challenging. Existing model-based approaches either make deterministic assumptions, utilize Gaussian-based representations of uncertainty, or rely on nominal models, all of which often fall short in capturing the complex, multimodal nature of real-world dynamics. This work introduces DroneDiffusion, a novel framework that leverages conditional diffusion models to learn quadrotor dynamics, formulated as a sequence generation task. DroneDiffusion achieves superior generalization to unseen, complex scenarios by capturing the temporal nature of uncertainties and mitigating error propagation. We integrate the learned dynamics with an adaptive controller for trajectory tracking with stability guarantees. Extensive experiments in both simulation and real-world flights demonstrate the robustness of the framework across a range of scenarios, including unfamiliar flight paths and varying payloads, velocities, and wind disturbances. Project page: https://sites.google.com/view/dronediffusion.

ICLR Conference 2025 Conference Paper

Generalization and Distributed Learning of GFlowNets

  • Tiago Silva
  • Amauri H. Souza Jr.
  • Omar Rivasplata
  • Vikas K. Garg 0001
  • Samuel Kaski
  • Diego Mesquita

Conventional wisdom attributes the success of Generative Flow Networks (GFlowNets) to their ability to exploit the compositional structure of the sample space for learning generalizable flow functions (Bengio et al., 2021). Despite the abundance of empirical evidence, formalizing this belief with verifiable non-vacuous statistical guarantees has remained elusive. We address this issue with the first data-dependent generalization bounds for GFlowNets. We also elucidate the negative impact of the state space size on the generalization performance of these models via Azuma-Hoeffding-type oracle PAC-Bayesian inequalities. We leverage our theoretical insights to design a novel distributed learning algorithm for GFlowNets, which we call *Subgraph Asynchronous Learning* (SAL). In a nutshell, SAL utilizes a divide-and-conquer strategy: multiple GFlowNets are trained in parallel on smaller subnetworks of the flow network, and then aggregated with an additional GFlowNet that allocates appropriate flow to each subnetwork. Our experiments with synthetic and real-world problems demonstrate the benefits of SAL over centralized training in terms of mode coverage and distribution matching.

ICLR Conference 2025 Conference Paper

PABBO: Preferential Amortized Black-Box Optimization

  • Xinyu Zhang
  • Daolang Huang
  • Samuel Kaski
  • Julien Martinelli

Preferential Bayesian Optimization (PBO) is a sample-efficient method to learn latent user utilities from preferential feedback over a pair of designs. It relies on a statistical surrogate model for the latent function, usually a Gaussian process, and an acquisition strategy to select the next candidate pair to get user feedback on. Due to the non-conjugacy of the associated likelihood, every PBO step requires a significant amount of computations with various approximate inference techniques. This computational overhead is incompatible with the way humans interact with computers, hindering the use of PBO in real-world cases. Building on the recent advances of amortized BO, we propose to circumvent this issue by fully amortizing PBO, meta-learning both the surrogate and the acquisition function. Our method comprises a novel transformer neural process architecture, trained using reinforcement learning and tailored auxiliary losses. On a benchmark composed of synthetic and real-world datasets, our method is several orders of magnitude faster than the usual Gaussian process-based strategies and often outperforms them in accuracy.

UAI Conference 2025 Conference Paper

Privacy-Preserving Neural Processes for Probabilistic User Modeling

  • Amir Sonee
  • Haripriya Harikumar
  • Alex Hämäläinen
  • Lukas Prediger
  • Samuel Kaski

Uncertainty-aware user modeling is crucial for designing AI systems that adapt to users in real-time while addressing privacy concerns. This paper proposes a novel framework for privacy-preserving probabilistic user modeling that integrates uncertainty quantification and differential privacy (DP). Building on neural processes (NPs), a scalable latent variable probabilistic model, we enable meta-learning for user behaviour prediction under privacy constraints. By employing differentially private stochastic gradient descent (DP-SGD), our method achieves rigorous privacy guarantees while preserving predictive accuracy. Unlike prior work, which primarily addresses privacy-preserving learning for convex or smooth functions, we establish theoretical guarantees for non-convex objectives, focusing on the utility-privacy trade-offs inherent in uncertainty-aware models. Through extensive experiments, we demonstrate that our approach achieves competitive accuracy under stringent privacy budgets. Our results showcase the potential of privacy-preserving probabilistic user models to enable trustworthy AI systems in real-world interactive applications.
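
The privacy mechanism the abstract names, DP-SGD, is standard: clip each per-example gradient, add Gaussian noise calibrated to the clipping bound, then average. A minimal sketch with flattened per-example gradients (hyperparameters are illustrative only):

```python
import numpy as np

def dp_sgd_update(params, per_example_grads, lr=0.1, clip=1.0, sigma=1.0):
    """One DP-SGD step: clip each per-example gradient to L2 norm `clip`,
    sum, add Gaussian noise scaled by sigma * clip, average, then step."""
    clipped = [g / max(1.0, np.linalg.norm(g) / clip) for g in per_example_grads]
    noisy_sum = np.sum(clipped, axis=0) + sigma * clip * np.random.randn(*params.shape)
    return params - lr * noisy_sum / len(per_example_grads)

theta = np.zeros(4)
grads = [np.random.randn(4) for _ in range(32)]   # stand-in per-example grads
theta = dp_sgd_update(theta, grads)
print(theta)
```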

UAI Conference 2025 Conference Paper

Proxy-informed Bayesian transfer learning with unknown sources

  • Sabina J. Sloman
  • Julien Martinelli
  • Samuel Kaski

Generalization outside the scope of one’s training data requires leveraging prior knowledge about the effects that transfer, and the effects that don’t, between different data sources. Transfer learning is a framework for specifying and refining this knowledge about sets of source (training) and target (prediction) data. A challenging open problem is addressing the empirical phenomenon of negative transfer, whereby the transfer learner performs worse on the target data after taking the source data into account than before. We first introduce a Bayesian perspective on negative transfer, and then a method to address it. The key insight from our formulation is that negative transfer can stem from misspecified prior information about non-transferable causes of the source data. Our proposed method, proxy-informed robust method for probabilistic transfer learning (PROMPT), does not require prior knowledge of the source data (the data sources may be "unknown"). PROMPT is thus applicable when differences between tasks are unobserved, such as in the presence of latent confounders. Moreover, the learner need not have access to observations in the target task (may not have the ability to "fine-tune"), and instead makes use of proxy (indirect) information. Our theoretical results show that the threat of negative transfer does not depend on the informativeness of the proxy information, highlighting the usefulness of PROMPT in cases where only noisy indirect information, such as human feedback, is available.

NeurIPS Conference 2025 Conference Paper

Robust and Computation-Aware Gaussian Processes

  • Marshal Sinaga
  • Julien Martinelli
  • Samuel Kaski

Gaussian processes (GPs) are widely used for regression and optimization tasks such as Bayesian optimization (BO) due to their expressiveness and principled uncertainty estimates. However, in settings with large datasets corrupted by outliers, standard GPs and their sparse approximations struggle with computational tractability and robustness. We introduce Robust Computation-aware Gaussian Process (RCaGP), a novel GP model that jointly addresses these challenges by combining a principled treatment of approximation-induced uncertainty with robust generalized Bayesian updating. The key insight is that robustness and approximation-awareness are not orthogonal but intertwined: approximations can exacerbate the impact of outliers, and mitigating one without the other is insufficient. Unlike previous work that focuses narrowly on either robustness or approximation quality, RCaGP combines both in a principled and scalable framework, thus effectively managing both outliers and computational uncertainties introduced by approximations such as low-rank matrix multiplications. Our model ensures more conservative and reliable uncertainty estimates, a property we rigorously demonstrate. Additionally, we establish a robustness property and show that the mean function is key to preserving it, motivating a tailored model selection scheme for robust mean functions. Empirical results confirm that solving these challenges jointly leads to superior performance across both clean and outlier-contaminated settings, both on regression and high-throughput Bayesian optimization benchmarks.

TMLR Journal 2025 Journal Article

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

  • Minttu Alakuijala
  • Reginald McLean
  • Isaac Woungang
  • Nariman Farsad
  • Samuel Kaski
  • Pekka Marttinen
  • Kai Yuan

Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only, despite a significant domain gap. Using in-domain data but in a challenging task generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are either trained with binary classification, use static images, or do not leverage the temporal information present in video data.
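
As a rough illustration of a temporal ranking objective over video frames, one common form scores ordered frame pairs with a logistic ranking loss so that later frames of a successful trajectory score higher. This is a generic sketch; the paper's exact objective and architecture may differ.

```python
import torch

def temporal_ranking_loss(scores):
    """Encourage the critic's score to increase over time within a
    successful trajectory: for every frame pair (t1 < t2), penalise
    scores[t1] >= scores[t2] with a logistic ranking loss.

    scores: (T,) tensor of critic scores for T time-ordered frames.
    """
    t = scores.shape[0]
    i, j = torch.triu_indices(t, t, offset=1)  # all index pairs with i < j
    return -torch.nn.functional.logsigmoid(scores[j] - scores[i]).mean()

print(temporal_ranking_loss(torch.tensor([0.1, 0.5, 0.4, 0.9])))
```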

ICLR Conference 2025 Conference Paper

When do GFlowNets learn the right distribution?

  • Tiago da Silva
  • Rodrigo Barreto Alves
  • Eliezer de Souza da Silva
  • Amauri H. Souza Jr.
  • Vikas K. Garg 0001
  • Samuel Kaski
  • Diego Mesquita

Generative Flow Networks (GFlowNets) are an emerging class of sampling methods for distributions over discrete and compositional objects, e.g., graphs. In spite of their remarkable success in problems such as drug discovery and phylogenetic inference, the question of when and whether GFlowNets learn to sample from the target distribution remains underexplored. To tackle this issue, we first assess the extent to which a violation of the detailed balance of the underlying flow network might hamper the correctness of GFlowNet's sampling distribution. In particular, we demonstrate that the impact of an imbalanced edge on the model's accuracy is influenced by the total amount of flow passing through it and, as a consequence, is unevenly distributed across the network. We also argue that, depending on the parameterization, imbalance may be inevitable. In this regard, we consider the problem of sampling from distributions over graphs with GFlowNets parameterized by graph neural networks (GNNs) and show that the representation limits of GNNs delineate which distributions these GFlowNets can approximate. Lastly, we address these limitations by proposing a theoretically sound and computationally tractable metric for assessing GFlowNets, experimentally showing it is a better proxy for correctness than popular evaluation protocols.

NeurIPS Conference 2024 Conference Paper

Amortized Bayesian Experimental Design for Decision-Making

  • Daolang Huang
  • Yujia Guo
  • Luigi Acerbi
  • Samuel Kaski

Many critical decisions, such as personalized medical diagnoses and product pricing, are made based on insights gained from designing, observing, and analyzing a series of experiments. This highlights the crucial role of experimental design, which goes beyond merely collecting information on system parameters as in traditional Bayesian experimental design (BED), but also plays a key part in facilitating downstream decision-making. Most recent BED methods use an amortized policy network to rapidly design experiments. However, the information gathered through these methods is suboptimal for down-the-line decision-making, as the experiments are not inherently designed with downstream objectives in mind. In this paper, we present an amortized decision-aware BED framework that prioritizes maximizing downstream decision utility. We introduce a novel architecture, the Transformer Neural Decision Process (TNDP), capable of instantly proposing the next experimental design, whilst inferring the downstream decision, thus effectively amortizing both tasks within a unified workflow. We demonstrate the performance of our method across several tasks, showing that it can deliver informative designs and facilitate accurate decision-making.

UAI Conference 2024 Conference Paper

Bayesian Active Learning in the Presence of Nuisance Parameters

  • Sabina J. Sloman
  • Ayush Bharti
  • Julien Martinelli
  • Samuel Kaski

In many settings, such as scientific inference, optimization, and transfer learning, the learner has a well-defined objective, which can be treated as estimation of a target parameter, and no intrinsic interest in characterizing the entire data-generating process. Usually, the learner must also contend with additional sources of uncertainty or variables: nuisance parameters. Bayesian active learning, or sequential optimal experimental design, can straightforwardly accommodate the presence of nuisance parameters, and so is a natural active learning framework for such problems. However, the introduction of nuisance parameters can lead to bias in the Bayesian learner’s estimate of the target parameters, a phenomenon we refer to as negative interference. We characterize the threat of negative interference and how it fundamentally changes the nature of the Bayesian active learner’s task. We show that the extent of negative interference can be extremely large, and that accurate estimation of the nuisance parameters is critical to reducing it. The Bayesian active learner is confronted with a dilemma: whether to spend a finite acquisition budget in pursuit of estimation of the target or of the nuisance parameters. Our setting encompasses Bayesian transfer learning as a special case, and our results shed light on the phenomenon of negative transfer between learning environments.

ICML Conference 2024 Conference Paper

Embarrassingly Parallel GFlowNets

  • Tiago da Silva
  • Luiz Max Carvalho
  • Amauri H. Souza Jr.
  • Samuel Kaski
  • Diego Mesquita

GFlowNets are a promising alternative to MCMC sampling for discrete compositional random variables. Training GFlowNets requires repeated evaluations of the unnormalized target distribution, or reward function. However, for large-scale posterior sampling, this may be prohibitive since it incurs traversing the data several times. Moreover, if the data are distributed across clients, employing standard GFlowNets leads to intensive client-server communication. To alleviate both these issues, we propose embarrassingly parallel GFlowNet (EP-GFlowNet). EP-GFlowNet is a provably correct divide-and-conquer method to sample from product distributions of the form $R(\cdot) \propto R_1(\cdot) \cdots R_N(\cdot)$ — e.g., in parallel or federated Bayes, where each $R_n$ is a local posterior defined on a data partition. First, in parallel, we train a local GFlowNet targeting each $R_n$ and send the resulting models to the server. Then, the server learns a global GFlowNet by enforcing our newly proposed aggregating balance condition, requiring a single communication step. Importantly, EP-GFlowNets can also be applied to multi-objective optimization and model reuse. Our experiments illustrate the effectiveness of EP-GFlowNets on multiple tasks, including parallel Bayesian phylogenetics, multi-objective multiset and sequence generation, and federated Bayesian structure learning.

NeurIPS Conference 2024 Conference Paper

Improving robustness to corruptions with multiplicative weight perturbations

  • Trung Trinh
  • Markus Heinonen
  • Luigi Acerbi
  • Samuel Kaski

Deep neural networks (DNNs) excel on clean images but struggle with corrupted ones. Incorporating specific corruptions into the data augmentation pipeline can improve robustness to those corruptions but may harm performance on clean images and other types of distortion. In this paper, we introduce an alternative approach that improves the robustness of DNNs to a wide range of corruptions without compromising accuracy on clean images. We first demonstrate that input perturbations can be mimicked by multiplicative perturbations in the weight space. Leveraging this, we propose Data Augmentation via Multiplicative Perturbation (DAMP), a training method that optimizes DNNs under random multiplicative weight perturbations. We also examine the recently proposed Adaptive Sharpness-Aware Minimization (ASAM) and show that it optimizes DNNs under adversarial multiplicative weight perturbations. Experiments on image classification datasets (CIFAR-10/100, TinyImageNet and ImageNet) and neural network architectures (ResNet50, ViT-S/16, ViT-B/16) show that DAMP enhances model generalization performance in the presence of corruptions across different settings. Notably, DAMP is able to train a ViT-S/16 on ImageNet from scratch, reaching a top-1 error of 23.7%, which is comparable to ResNet50 without extensive data augmentations.
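
A minimal sketch of a DAMP-style training step, assuming elementwise Gaussian multiplicative noise ξ ~ N(1, σ²) on the weights (the noise distribution and σ here are illustrative assumptions, not the paper's exact recipe):

```python
import torch
from torch import nn
from torch.func import functional_call

def damp_step(model, loss_fn, x, y, opt, sigma=0.2):
    """One step of training under random multiplicative weight
    perturbations: evaluate the loss with perturbed weights, but
    backpropagate to (and update) the clean weights."""
    params = dict(model.named_parameters())
    noisy = {k: v * (1 + sigma * torch.randn_like(v)) for k, v in params.items()}
    loss = loss_fn(functional_call(model, noisy, (x,)), y)
    opt.zero_grad()
    loss.backward()   # gradients flow through xi to the unperturbed weights
    opt.step()
    return loss.item()

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
print(damp_step(model, nn.CrossEntropyLoss(), x, y, opt))
```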

ICLR Conference 2024 Conference Paper

Input-gradient space particle inference for neural network ensembles

  • Trung Q. Trinh
  • Markus Heinonen
  • Luigi Acerbi
  • Samuel Kaski

Deep Ensembles (DEs) demonstrate improved accuracy, calibration and robustness to perturbations over single neural networks partly due to their functional diversity. Particle-based variational inference (ParVI) methods enhance diversity by formalizing a repulsion term based on a network similarity kernel. However, weight-space repulsion is inefficient due to over-parameterization, while direct function-space repulsion has been found to produce little improvement over DEs. To sidestep these difficulties, we propose First-order Repulsive Deep Ensemble (FoRDE), an ensemble learning method based on ParVI, which performs repulsion in the space of first-order input gradients. As input gradients uniquely characterize a function up to translation and are much smaller in dimension than the weights, this method guarantees that ensemble members are functionally different. Intuitively, diversifying the input gradients encourages each network to learn different features, which is expected to improve the robustness of an ensemble. Experiments on image classification datasets and transfer learning tasks show that FoRDE significantly outperforms the gold-standard DEs and other ensemble methods in accuracy and calibration under covariate shift due to input perturbations.

UAI Conference 2024 Conference Paper

Learning relevant contextual variables within Bayesian optimization

  • Julien Martinelli
  • Ayush Bharti
  • Armi Tiihonen
  • S. T. John
  • Louis Filstroff
  • Sabina J. Sloman
  • Patrick Rinke
  • Samuel Kaski

Contextual Bayesian Optimization (CBO) efficiently optimizes black-box functions with respect to design variables, while simultaneously integrating _contextual_ information regarding the environment, such as experimental conditions. However, the relevance of contextual variables is not necessarily known beforehand. Moreover, contextual variables can sometimes be optimized themselves at additional cost, a setting overlooked by current CBO algorithms. Cost-sensitive CBO would simply include optimizable contextual variables as part of the design variables based on their cost. Instead, we adaptively select a subset of contextual variables to include in the optimization, based on the trade-off between their _relevance_ and the additional cost incurred by optimizing them compared to leaving them to be determined by the environment. We learn the relevance of contextual variables by sensitivity analysis of the posterior surrogate model while minimizing the cost of optimization by leveraging recent developments on early stopping for BO. We empirically evaluate our proposed Sensitivity-Analysis-Driven Contextual BO (_SADCBO_) method against alternatives on both synthetic and real-world experiments, together with extensive ablation studies, and demonstrate a consistent improvement across examples.

ICML Conference 2024 Conference Paper

Open Ad Hoc Teamwork with Cooperative Game Theory

  • Jianhong Wang
  • Yang Li 0116
  • Yuan Zhang 0027
  • Wei Pan 0004
  • Samuel Kaski

Ad hoc teamwork poses a challenging problem, requiring the design of an agent to collaborate with teammates without prior coordination or joint training. Open ad hoc teamwork (OAHT) further complicates this challenge by considering environments with a changing number of teammates, referred to as open teams. One promising practical solution to this problem is leveraging the generalizability of graph neural networks to handle an unrestricted number of agents with various agent-types, named graph-based policy learning (GPL). However, its joint Q-value representation over a coordination graph lacks convincing explanations. In this paper, we establish a new theory to understand the representation of the joint Q-value for OAHT and its learning paradigm, through the lens of cooperative game theory. Building on our theory, we propose a novel algorithm named CIAO, based on GPL’s framework, with additional provable implementation tricks that can facilitate learning. Demos of the experimental results are available at https://sites.google.com/view/ciao2024, and the code of the experiments is published at https://github.com/hsvgbkhgbv/CIAO.

NeurIPS Conference 2024 Conference Paper

Preference Learning of Latent Decision Utilities with a Human-like Model of Preferential Choice

  • Sebastiaan De Peuter
  • Shibei Zhu
  • Yujia Guo
  • Andrew Howes
  • Samuel Kaski

Preference learning methods make use of models of human choice in order to infer the latent utilities that underlie human behavior. However, accurate modeling of human choice behavior is challenging due to a range of context effects that arise from how humans contrast and evaluate options. Cognitive science has proposed several models that capture these intricacies but, due to their intractable nature, work on preference learning has, in practice, had to rely on tractable but simplified variants of the well-known Bradley-Terry model. In this paper, we take one state-of-the-art intractable cognitive model and propose a tractable surrogate that is suitable for deployment in preference learning. We then introduce a mechanism for fitting the surrogate to human data and extend it to account for data that cannot be explained by the original cognitive model. We demonstrate on large-scale human data that this model produces significantly better inferences on static and actively elicited data than existing Bradley-Terry variants. We further show in simulation that when using this model for preference learning, we can significantly improve utility in a range of real-world tasks.

TMLR Journal 2024 Journal Article

Targeted Active Learning for Bayesian Decision-Making

  • Louis Filstroff
  • Iiris Sundin
  • Petrus Mikkola
  • Aleksei Tiulpin
  • Juuso Kylmäoja
  • Samuel Kaski

Active learning is usually applied to acquire labels of informative data points in supervised learning, to maximize accuracy in a sample-efficient way. However, maximizing the supervised learning accuracy is not the end goal when the results are used for decision-making, for example in personalized medicine or economics. We argue that when acquiring samples sequentially, the common practice of separating learning and decision-making is sub-optimal, and we introduce an active learning strategy that takes the down-the-line decision problem into account. Specifically, we adopt a Bayesian experimental design approach, in which the proposed acquisition criterion maximizes the expected information gain on the posterior distribution of the optimal decision. We compare our targeted active learning strategy to existing alternatives on both simulated and real data and show improved performance in decision-making accuracy.
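
Schematically, the acquisition the abstract describes targets information about the optimal decision d* rather than about all model parameters; in expected-information-gain form (the notation below is assumed for illustration, with D the data acquired so far):

```latex
% Decision-targeted acquisition (schematic): pick the next point x that
% maximally reduces expected entropy of the optimal-decision posterior.
\begin{align}
x^\star = \arg\max_x \;
  \mathrm{H}\!\left[p(d^\ast \mid \mathcal{D})\right]
  - \mathbb{E}_{y \sim p(y \mid x, \mathcal{D})}\,
    \mathrm{H}\!\left[p\big(d^\ast \mid \mathcal{D} \cup \{(x, y)\}\big)\right]
\end{align}
```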

NeurIPS Conference 2024 Conference Paper

TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series

  • Alexander Nikitin
  • Letizia Iannucci
  • Samuel Kaski

Time series data are essential in a wide range of machine learning (ML) applications. However, temporal data are often scarce or highly sensitive, limiting data sharing and the use of data-intensive ML methods. A possible solution to this problem is the generation of synthetic datasets that resemble real data. In this work, we introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling and evaluation of synthetic time series datasets. TSGM includes a broad repertoire of machine learning methods: generative models, probabilistic and simulation-based approaches, and augmentation techniques. The framework enables users to evaluate the quality of the produced data from different angles: similarity, downstream effectiveness, predictive consistency, diversity, fairness, and privacy. TSGM is extensible and user-friendly, which allows researchers to rapidly implement their own methods and compare them in a shareable environment. The framework has been tested on open datasets and in production and proved to be beneficial in both cases. Code: https://github.com/AlexanderVNikitin/tsgm

AAMAS Conference 2024 Conference Paper

Uncoupled Learning of Differential Stackelberg Equilibria with Commitments

  • Robert Loftin
  • Mustafa Mert Çelikok
  • Herke van Hoof
  • Samuel Kaski
  • Frans A. Oliehoek

In multi-agent problems requiring a high degree of cooperation, success often depends on the ability of the agents to adapt to each other’s behavior. A natural solution concept in such settings is the Stackelberg equilibrium, in which the “leader” agent selects the strategy that maximizes its own payoff given that the “follower” agent will choose their best response to this strategy. Recent work has extended this solution concept to two-player differentiable games, such as those arising from multi-agent deep reinforcement learning, in the form of the differential Stackelberg equilibrium. While this previous work has presented learning dynamics which converge to such equilibria, these dynamics are “coupled” in the sense that the learning updates for the leader’s strategy require some information about the follower’s payoff function. As such, these methods cannot be applied to truly decentralised multi-agent settings, particularly ad hoc cooperation, where each agent only has access to its own payoff function. In this work we present “uncoupled” learning dynamics based on zeroth-order gradient estimators, in which each agent’s strategy update depends only on their observations of the other’s behavior. We analyze the convergence of these dynamics in general-sum games, and prove that they converge to differential Stackelberg equilibria under the same conditions as previous coupled methods. Furthermore, we present an online mechanism by which symmetric learners can negotiate leader-follower roles. We conclude with a discussion of the implications of our work for multi-agent reinforcement learning and ad hoc collaboration more generally.
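
A standard two-point zeroth-order gradient estimator of the kind such uncoupled dynamics can build on, needing only payoff queries from the agent's own function (the paper's exact smoothing scheme may differ):

```python
import numpy as np

def zeroth_order_grad(payoff, theta, delta=1e-2, n_samples=32, rng=None):
    """Two-point zeroth-order estimate of grad payoff(theta):
        g ~ E_u [ d/(2*delta) * (f(theta + delta u) - f(theta - delta u)) * u ],
    with u uniform on the unit sphere. Requires only function evaluations,
    so an agent never needs the other agent's payoff."""
    rng = rng or np.random.default_rng()
    d, g = theta.size, np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        g += (payoff(theta + delta * u) - payoff(theta - delta * u)) / (2 * delta) * d * u
    return g / n_samples

# true gradient of -||t||^2 at ones is [-2, -2, -2]
print(zeroth_order_grad(lambda t: -np.sum(t ** 2), np.ones(3)))
```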

NeurIPS Conference 2023 Conference Paper

Compositional Sculpting of Iterative Generative Processes

  • Timur Garipov
  • Sebastiaan De Peuter
  • Ge Yang
  • Vikas Garg
  • Samuel Kaski
  • Tommi Jaakkola

High training costs of generative models and the need to fine-tune them for specific tasks have created a strong interest in model reuse and composition. A key challenge in composing iterative generative processes, such as GFlowNets and diffusion models, is that to realize the desired target distribution, all steps of the generative process need to be coordinated, and satisfy delicate balance conditions. In this work, we propose Compositional Sculpting: a general approach for defining compositions of iterative generative processes. We then introduce a method for sampling from these compositions built on classifier guidance. We showcase ways to accomplish compositional sculpting in both GFlowNets and diffusion models. We highlight two binary operations: the *harmonic mean* $(p_1 \otimes p_2)$ and the *contrast* $(p_1 \,◑\, p_2)$ between pairs, and the generalization of these operations to multiple component distributions. We offer empirical results on image and molecular generation tasks. Project codebase: https://github.com/timgaripov/compositional-sculpting.

UAI Conference 2023 Conference Paper

Differentiable user models

  • Alex Hämäläinen
  • Mustafa Mert Çelikok
  • Samuel Kaski

Probabilistic user modeling is essential for building machine learning systems in the ubiquitous cases with humans in the loop. However, modern advanced user models, often designed as cognitive behavior simulators, are incompatible with modern machine learning pipelines and computationally prohibitive for most practical applications. We address this problem by introducing widely-applicable differentiable surrogates for bypassing this computational bottleneck; the surrogates enable computationally efficient inference with modern cognitive models. We show experimentally that modeling capabilities comparable to the only available solution, existing likelihood-free inference methods, are achievable with a computational cost suitable for online applications. Finally, we demonstrate how AI-assistants can now use cognitive models for online interaction in a menu-search task, which has so far required hours of computation during interaction.

TMLR Journal 2023 Journal Article

DPVIm: Differentially Private Variational Inference Improved

  • Joonas Jälkö
  • Lukas Prediger
  • Antti Honkela
  • Samuel Kaski

Differentially private (DP) release of multidimensional statistics typically considers an aggregate sensitivity, e.g. the vector norm of a high-dimensional vector. However, different dimensions of that vector might have widely different magnitudes and therefore DP perturbation disproportionately affects the signal across dimensions. We observe this problem in the gradient release of the DP-SGD algorithm when using it for variational inference (VI), where it manifests in poor convergence as well as high variance in outputs for certain variational parameters, and make the following contributions: (i) We mathematically isolate the cause for the difference in magnitudes between gradient parts corresponding to different variational parameters. Using this as prior knowledge we establish a link between the gradients of the variational parameters, and propose an efficient yet simple fix for the problem to obtain a less noisy gradient estimator, which we call \emph{aligned} gradients. This approach allows us to obtain the updates for the covariance parameter of a Gaussian posterior approximation without a privacy cost. We compare this to alternative approaches for scaling the gradients using analytically derived preconditioning, e.g. natural gradients. (ii) We suggest using iterate averaging over the DP parameter traces recovered during the training, to reduce the DP-induced noise in parameter estimates at no additional cost in privacy. Finally, (iii) to accurately capture the additional uncertainty DP introduces to the model parameters, we infer the DP-induced noise from the parameter traces and include that in the learned posteriors to make them \emph{noise aware}. We demonstrate the efficacy of our proposed improvements through various experiments on real data.

IROS Conference 2023 Conference Paper

Imitation-Guided Multimodal Policy Generation from Behaviourally Diverse Demonstrations

  • Shibei Zhu
  • Rituraj Kaushik
  • Samuel Kaski
  • Ville Kyrki

Learning policies from multiple demonstrators is often difficult because different individuals perform the same task differently due to hidden factors such as preferences. In the context of policy learning, this leads to multimodal policies. Existing policy learning methods often converge to a single solution mode, failing to capture the diversity in the solution space. In this paper, we introduce an imitation-guided reinforcement learning framework to solve the multimodal policy learning problem from a limited number of state-only demonstrations. Then, we propose LfBD (Learning from Behaviourally diverse Demonstration), an algorithm that builds a parameterised solution space to capture the variability in the behaviour space defined by demonstrations. To this end, we construct this space with a projection function based on the state density distributions of the demonstrations. Our goal is not only to learn how to solve the task as the human demonstrator does, but also to extrapolate beyond the provided demonstrations. In addition, we show that with our method, we can perform a post-hoc policy search in the built solution space to recover policies that satisfy specific constraints or to find a policy that matches a given (state-only) behaviour.

NeurIPS Conference 2023 Conference Paper

Learning Robust Statistics for Simulation-based Inference under Model Misspecification

  • Daolang Huang
  • Ayush Bharti
  • Amauri Souza
  • Luigi Acerbi
  • Samuel Kaski

Simulation-based inference (SBI) methods such as approximate Bayesian computation (ABC), synthetic likelihood, and neural posterior estimation (NPE) rely on simulating statistics to infer parameters of intractable likelihood models. However, such methods are known to yield untrustworthy and misleading inference outcomes under model misspecification, thus hindering their widespread applicability. In this work, we propose the first general approach to handle model misspecification that works across different classes of SBI methods. Leveraging the fact that the choice of statistics determines the degree of misspecification in SBI, we introduce a regularized loss function that penalizes those statistics that increase the mismatch between the data and the model. Taking NPE and ABC as use cases, we demonstrate the superior performance of our method on high-dimensional time-series models that are artificially misspecified. We also apply our method to real data from the field of radio propagation where the model is known to be misspecified. We show empirically that the method yields robust inference in misspecified scenarios, whilst still being accurate when the model is well-specified.

ICML Conference 2023 Conference Paper

Optimally-weighted Estimators of the Maximum Mean Discrepancy for Likelihood-Free Inference

  • Ayush Bharti
  • Masha Naslidnyk
  • Oscar Key
  • Samuel Kaski
  • François-Xavier Briol

Likelihood-free inference methods typically make use of a distance between simulated and real data. A common example is the maximum mean discrepancy (MMD), which has previously been used for approximate Bayesian computation, minimum distance estimation, generalised Bayesian inference, and within the nonparametric learning framework. The MMD is commonly estimated at a root-$m$ rate, where $m$ is the number of simulated samples. This can lead to significant computational challenges since a large $m$ is required to obtain an accurate estimate, which is crucial for parameter estimation. In this paper, we propose a novel estimator for the MMD with significantly improved sample complexity. The estimator is particularly well suited for computationally expensive smooth simulators with low- to mid-dimensional inputs. This claim is supported through both theoretical results and an extensive simulation study on benchmark simulators.
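
For reference, the standard unbiased MMD² estimator between observed data x₁,…,xₙ and m simulator draws y₁,…,yₘ places uniform weights on the simulated samples; the paper's contribution is to improve sample complexity by choosing those weights optimally instead (the uniform form shown here is the textbook baseline, not the proposed estimator):

```latex
% Standard unbiased estimator of MMD^2 with kernel k and uniform weights:
\begin{align}
\widehat{\mathrm{MMD}}^2
  = \frac{1}{m(m-1)} \sum_{i \neq i'} k(y_i, y_{i'})
  + \frac{1}{n(n-1)} \sum_{j \neq j'} k(x_j, x_{j'})
  - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(y_i, x_j)
\end{align}
```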

NeurIPS Conference 2023 Conference Paper

Practical Equivariances via Relational Conditional Neural Processes

  • Daolang Huang
  • Manuel Haussmann
  • Ulpu Remes
  • ST John
  • Grégoire Clarté
  • Kevin Luck
  • Samuel Kaski
  • Luigi Acerbi

Conditional Neural Processes (CNPs) are a class of metalearning models popular for combining the runtime efficiency of amortized inference with reliable uncertainty quantification. Many relevant machine learning tasks, such as in spatio-temporal modeling, Bayesian Optimization and continuous control, inherently contain equivariances – for example to translation – which the model can exploit for maximal performance. However, prior attempts to include equivariances in CNPs do not scale effectively beyond two input dimensions. In this work, we propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model. Our proposed method extends the applicability and impact of equivariant neural processes to higher dimensions. We empirically demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.

AAAI Conference 2023 Conference Paper

Teaching to Learn: Sequential Teaching of Learners with Internal States

  • Mustafa Mert Çelikok
  • Pierre-Alexandre Murena
  • Samuel Kaski

In sequential machine teaching, a teacher’s objective is to provide the optimal sequence of inputs to sequential learners in order to guide them towards the best model. However, this teaching objective considers a restricted class of learners with fixed inductive biases. In this paper, we extend the machine teaching framework to learners that can improve their inductive biases, represented as latent internal states, in order to generalize to new datasets. We introduce a novel framework in which learners’ inductive biases may change with the teaching interaction, which affects the learning performance in future tasks. In order to teach such learners, we propose a multi-objective control approach that takes the future performance of the learner after teaching into account. This framework provides tools for modelling learners with internal states, humans and meta-learning algorithms alike. Furthermore, we distinguish manipulative teaching, which can be done by effectively hiding data and also used for indoctrination, from teaching to learn which aims to help the learner become better at learning from new datasets in the absence of a teacher. Our empirical results demonstrate that our framework is able to reduce the number of required tasks for online meta-learning, and increases independent learning performance of simulated human users in future tasks.

AAAI Conference 2023 Conference Paper

Zero-Shot Assistance in Sequential Decision Problems

  • Sebastiaan De Peuter
  • Samuel Kaski

We consider the problem of creating assistants that can help agents solve new sequential decision problems, assuming the agent is not able to specify the reward function explicitly to the assistant. Instead of acting in place of the agent as in current automation-based approaches, we give the assistant an advisory role and keep the agent in the loop as the main decision maker. The difficulty is that we must account for potential biases of the agent which may cause it to seemingly irrationally reject advice. To do this we introduce a novel formalization of assistance that models these biases, allowing the assistant to infer and adapt to them. We then introduce a new method for planning the assistant's actions which can scale to large decision making problems. We show experimentally that our approach adapts to these agent biases, and results in higher cumulative reward for the agent than automation-based alternatives. Lastly, we show that an approach combining advice and automation outperforms advice alone at the cost of losing some safety guarantees.

ICML Conference 2022 Conference Paper

Approximate Bayesian Computation with Domain Expert in the Loop

  • Ayush Bharti
  • Louis Filstroff
  • Samuel Kaski

Approximate Bayesian computation (ABC) is a popular likelihood-free inference method for models with intractable likelihood functions. As ABC methods usually rely on comparing summary statistics of observed and simulated data, the choice of the statistics is crucial. This choice involves a trade-off between loss of information and dimensionality reduction, and is often determined based on domain knowledge. However, handcrafting and selecting suitable statistics is a laborious task involving multiple trial-and-error steps. In this work, we introduce an active learning method for ABC statistics selection which reduces the domain expert’s work considerably. By involving the experts, we are able to handle misspecified models, unlike the existing dimension reduction methods. Moreover, empirical results show better posterior estimates than with existing methods, when the simulation budget is limited.

AAMAS Conference 2022 Conference Paper

Best-Response Bayesian Reinforcement Learning with Bayes-adaptive POMDPs for Centaurs

  • Mustafa Mert Çelikok
  • Frans A. Oliehoek
  • Samuel Kaski

Centaurs are half-human, half-AI decision-makers where the AI’s goal is to complement the human. To do so, the AI must be able to recognize the goals and constraints of the human and have the means to help them. We present a novel formulation of the interaction between the human and the AI as a sequential game where the agents are modelled using Bayesian best-response models. We show that in this case the AI’s problem of helping bounded-rational humans make better decisions reduces to a Bayes-adaptive POMDP. In our simulated experiments, we consider an instantiation of our framework for humans who are subjectively optimistic about the AI’s future behaviour. Our results show that when equipped with a model of the human, the AI can infer the human’s bounds and nudge them towards better decisions. We discuss ways in which the machine can learn to improve upon its own limitations as well with the help of the human. We identify a novel trade-off for centaurs in partially observable tasks: for the AI’s actions to be acceptable to the human, the machine must make sure their beliefs are sufficiently aligned, but aligning beliefs might be costly. We present a preliminary theoretical analysis of this trade-off and its dependence on task structure.

NeurIPS Conference 2022 Conference Paper

Deconfounded Representation Similarity for Comparison of Neural Networks

  • Tianyu Cui
  • Yogesh Kumar
  • Pekka Marttinen
  • Samuel Kaski

Similarity metrics such as representational similarity analysis (RSA) and centered kernel alignment (CKA) have been used to understand neural networks by comparing their layer-wise representations. However, these metrics are confounded by the population structure of data items in the input space, leading to inconsistent conclusions about the \emph{functional} similarity between neural networks, such as spuriously high similarity of completely random neural networks and inconsistent domain relations in transfer learning. We introduce a simple and generally applicable fix to adjust for the confounder with covariate adjustment regression, which improves the ability of CKA and RSA to reveal functional similarity and also retains the intuitive invariance properties of the original similarity measures. We show that deconfounding the similarity metrics increases the resolution of detecting functionally similar neural networks across domains. Moreover, in real-world applications, deconfounding improves the consistency between CKA and domain similarity in transfer learning, and increases the correlation between CKA and model out-of-distribution accuracy similarity.
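
A simplified stand-in for the adjustment: residualize both representation matrices on a confounder design by least squares, then compare the residuals with linear CKA. The confounder encoding and the toy data below are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def linear_cka(a, b):
    """Linear CKA between representation matrices a (n x d1) and b (n x d2)."""
    a = a - a.mean(0)
    b = b - b.mean(0)
    hsic = np.linalg.norm(b.T @ a, "fro") ** 2
    return hsic / (np.linalg.norm(a.T @ a, "fro") * np.linalg.norm(b.T @ b, "fro"))

def residualize(a, c):
    beta, *_ = np.linalg.lstsq(c, a, rcond=None)
    return a - c @ beta

def deconfounded_cka(a, b, conf):
    c = np.column_stack([np.ones(len(conf)), conf])  # intercept + confounders
    return linear_cka(residualize(a, c), residualize(b, c))

n = 200
conf = np.random.randn(n, 2)  # shared population structure (the confounder)
a = conf @ np.random.randn(2, 10) + 0.1 * np.random.randn(n, 10)
b = conf @ np.random.randn(2, 10) + 0.1 * np.random.randn(n, 10)
print(linear_cka(a, b), deconfounded_cka(a, b, conf))  # high vs. near zero
```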

NeurIPS Conference 2022 Conference Paper

Modular Flows: Differential Molecular Generation

  • Yogesh Verma
  • Samuel Kaski
  • Markus Heinonen
  • Vikas Garg

Generating new molecules is fundamental to advancing critical applications such as drug discovery and material synthesis. Flows can generate molecules effectively by inverting the encoding process, however, existing flow models either require artifactual dequantization or specific node/edge orderings, lack desiderata such as permutation invariance, or induce discrepancy between encoding and decoding steps that necessitates post hoc validity correction. Inspired by graph PDEs, we circumvent these issues with novel continuous normalizing E(3)-equivariant flows, based on a system of coupled node ODEs, that repeatedly reconcile locally toward globally aligned densities. Our models can be cast as message passing temporal networks, and result in superlative density estimation and molecular generation. In particular, our generated samples achieve state of the art on both the standard QM9 and ZINC250K benchmarks.

NeurIPS Conference 2022 Conference Paper

Provably expressive temporal graph networks

  • Amauri Souza
  • Diego Mesquita
  • Samuel Kaski
  • Vikas Garg

Temporal graph networks (TGNs) have gained prominence as models for embedding dynamic interactions, but little is known about their theoretical underpinnings. We establish fundamental results about the representational power and limits of the two main categories of TGNs: those that aggregate temporal walks (WA-TGNs), and those that augment local message passing with recurrent memory modules (MP-TGNs). Specifically, novel constructions reveal the inadequacy of MP-TGNs and WA-TGNs, proving that neither category subsumes the other. We extend the 1-WL (Weisfeiler-Leman) test to temporal graphs, and show that the most powerful MP-TGNs should use injective updates, as in this case they become as expressive as the temporal WL. Also, we show that sufficiently deep MP-TGNs cannot benefit from memory, and MP/WA-TGNs fail to compute graph properties such as girth. These theoretical insights lead us to PINT --- a novel architecture that leverages injective temporal message passing and relative positional features. Importantly, PINT is provably more expressive than both MP-TGNs and WA-TGNs. PINT significantly outperforms existing TGNs on several real-world benchmarks.

ICML Conference 2022 Conference Paper

Tackling covariate shift with node-based Bayesian neural networks

  • Trung Q. Trinh
  • Markus Heinonen
  • Luigi Acerbi
  • Samuel Kaski

Bayesian neural networks (BNNs) promise improved generalization under covariate shift by providing principled probabilistic representations of epistemic uncertainty. However, weight-based BNNs often struggle with high computational complexity of large-scale architectures and datasets. Node-based BNNs have recently been introduced as scalable alternatives, which induce epistemic uncertainty by multiplying each hidden node with latent random variables, while learning a point-estimate of the weights. In this paper, we interpret these latent noise variables as implicit representations of simple and domain-agnostic data perturbations during training, producing BNNs that perform well under covariate shift due to input corruptions. We observe that the diversity of the implicit corruptions depends on the entropy of the latent variables, and propose a straightforward approach to increase the entropy of these variables during training. We evaluate the method on out-of-distribution image classification benchmarks, and show improved uncertainty estimation of node-based BNNs under covariate shift due to input perturbations. As a side effect, the method also provides robustness against noisy training labels.

UAI Conference 2022 Conference Paper

Variational multiple shooting for Bayesian ODEs with Gaussian processes

  • Pashupati Hegde
  • Çagatay Yildiz
  • Harri Lähdesmäki
  • Samuel Kaski
  • Markus Heinonen

Recent machine learning advances have proposed black-box estimation of \textit{unknown continuous-time system dynamics} directly from data. However, earlier works are based on approximative solutions or point estimates. We propose a novel Bayesian nonparametric model that uses Gaussian processes to infer posteriors of unknown ODE systems directly from data. We derive sparse variational inference with decoupled functional sampling to represent vector field posteriors. We also introduce a probabilistic shooting augmentation to enable efficient inference from arbitrarily long trajectories. The method demonstrates the benefit of computing vector field posteriors, with predictive uncertainty scores outperforming alternative methods on multiple ODE learning tasks.

NeurIPS Conference 2021 Conference Paper

De-randomizing MCMC dynamics with the diffusion Stein operator

  • Zheyang Shen
  • Markus Heinonen
  • Samuel Kaski

Approximate Bayesian inference estimates descriptors of an intractable target distribution - in essence, an optimization problem within a family of distributions. For example, Langevin dynamics (LD) extracts asymptotically exact samples from a diffusion process because the time evolution of its marginal distributions constitutes a curve that minimizes the KL-divergence via steepest descent in the Wasserstein space. Parallel to LD, Stein variational gradient descent (SVGD) similarly minimizes the KL, albeit endowed with a novel Stein-Wasserstein distance, by deterministically transporting a set of particle samples, thus de-randomizing the stochastic diffusion process. We propose de-randomized kernel-based particle samplers for all diffusion-based samplers, known collectively as MCMC dynamics. Following previous work in interpreting MCMC dynamics, we equip the Stein-Wasserstein space with a fiber-Riemannian Poisson structure, with the capacity of characterizing a fiber-gradient Hamiltonian flow that simulates MCMC dynamics. Such dynamics discretizes into generalized SVGD (GSVGD), a Stein-type deterministic particle sampler, with particle updates coinciding with applying the diffusion Stein operator to a kernel function. We demonstrate empirically that GSVGD can de-randomize complex MCMC dynamics, which combine the advantages of auxiliary momentum variables and Riemannian structure, while maintaining the high sample quality from an interacting particle system.
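
For orientation, vanilla SVGD, the special case that GSVGD generalizes, moves each particle along a kernel-weighted average of the particles' score functions plus a repulsion term from the kernel gradient. A minimal sketch with an RBF kernel (step size and bandwidth are illustrative):

```python
import numpy as np

def svgd_step(x, grad_logp, step=0.1, bandwidth=1.0):
    """One vanilla SVGD update.
    x: (n, d) particle positions; grad_logp: map (n, d) -> (n, d) scores."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # pairwise x_i - x_j, (n, n, d)
    k = np.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))  # RBF kernel (n, n)
    grad_k = -diff / bandwidth ** 2 * k[..., None]  # grad of k wrt its first arg
    phi = (k @ grad_logp(x) + grad_k.sum(0)) / n    # driving force + repulsion
    return x + step * phi

# toy usage: particles drift toward a standard normal target
x = np.random.randn(100, 2) + 5.0
for _ in range(200):
    x = svgd_step(x, lambda z: -z)   # score of N(0, I)
print(x.mean(0))
```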

ICML Conference 2021 Conference Paper

Differentially Private Bayesian Inference for Generalized Linear Models

  • Tejas Kulkarni
  • Joonas Jälkö
  • Antti Koskela
  • Samuel Kaski
  • Antti Honkela

Generalized linear models (GLMs) such as logistic regression are among the most widely used tools in a data analyst's repertoire, and are often applied to sensitive datasets. A large body of prior work investigating GLMs under differential privacy (DP) constraints provides only private point estimates of the regression coefficients and cannot quantify parameter uncertainty. In this work, with logistic and Poisson regression as running examples, we introduce a generic noise-aware DP Bayesian inference method for a GLM at hand, given a noisy sum of summary statistics. Quantifying uncertainty allows us to determine which of the regression coefficients are statistically significantly different from zero. We provide a previously unknown tight privacy analysis and experimentally demonstrate that the posteriors obtained from our model, while adhering to strong privacy guarantees, are close to the non-private posteriors.
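
A schematic numpy sketch of the ingredient the abstract builds on: releasing a noisy sum of clipped per-record summary statistics under the Gaussian mechanism. The clipping bound, the choice of statistics, and the classical noise calibration are illustrative; the paper's noise-aware posterior inference on top of this release is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy logistic-regression data: features X (n, d) and labels y in {0, 1}.
n, d = 500, 3
X = rng.normal(size=(n, d))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.logistic(size=n) > 0).astype(float)

# Per-record summary statistics t_i = y_i * x_i, clipped in L2 norm so that
# one record changes the sum by at most C.
C = 1.0
t = y[:, None] * X
norms = np.maximum(np.linalg.norm(t, axis=1, keepdims=True), 1e-12)
t = t * np.minimum(1.0, C / norms)

# Gaussian mechanism with the classical calibration (valid for epsilon < 1).
epsilon, delta = 0.5, 1e-5
sigma = C * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
noisy_sum = t.sum(axis=0) + rng.normal(scale=sigma, size=d)
print(noisy_sum)   # released statistic; noise-aware inference must model sigma
```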

UAI Conference 2021 Conference Paper

Federated stochastic gradient Langevin dynamics

  • Khaoula el Mekkaoui
  • Diego Mesquita
  • Paul Blomstedt
  • Samuel Kaski

Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale posterior sampling. Although we can easily extend SGLD to distributed settings, it suffers from two issues when applied to federated non-IID data. First, the variance of these estimates increases significantly. Second, delaying communication causes the Markov chains to diverge from the true posterior even for very simple models. To alleviate both these problems, we propose conducive gradients, a simple mechanism that combines local likelihood approximations to correct gradient updates. Notably, conducive gradients are easy to compute, and since we only calculate the approximations once, they incur negligible overhead. We apply conducive gradients to distributed stochastic gradient Langevin dynamics (DSGLD) and call the resulting method “federated stochastic gradient Langevin dynamics” (FSGLD). We demonstrate that our approach can handle delayed communication rounds, converging to the target posterior in cases where DSGLD fails. We also show that FSGLD outperforms DSGLD for non-IID federated data with experiments on metric learning and neural networks.
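
A toy numpy sketch of one plausible reading of the mechanism: each client keeps a cheap surrogate of its local likelihood, and a client's SGLD update adds the gradients of the other clients' surrogates as the conducive correction. The Gaussian-mean model, flat prior, and surrogate form are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy federated setting: S clients hold strongly non-IID Gaussian-mean data.
S, n_s = 4, 50
client_data = [rng.normal(loc=m, size=n_s) for m in (-3.0, -1.0, 1.0, 3.0)]

# Cheap local surrogates: a Gaussian approximation of each local likelihood,
# summarized by (local mean, number of unit-variance observations).
surrogates = [(np.mean(ds), n_s) for ds in client_data]

def grad_surrogate(theta, j):
    m, prec = surrogates[j]
    return prec * (m - theta)                    # d/dtheta log q_j(theta)

def sgld_step(theta, s, eps=1e-3, batch=10):
    mb = rng.choice(client_data[s], size=batch, replace=False)
    local = (n_s / batch) * np.sum(mb - theta)   # scaled minibatch gradient
    # Conducive correction: surrogate gradients of the *other* clients,
    # standing in for their data between communication rounds (flat prior).
    conducive = sum(grad_surrogate(theta, j) for j in range(S) if j != s)
    return theta + 0.5 * eps * (local + conducive) + np.sqrt(eps) * rng.normal()

theta = 0.0
for t in range(5000):
    theta = sgld_step(theta, s=t % S)            # clients update in turn
print(theta)   # chain should hover near the global posterior mean (about 0)
```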

ECAI Conference 2020 Conference Paper

Learning Global Pairwise Interactions with Bayesian Neural Networks

  • Tianyu Cui
  • Pekka Marttinen
  • Samuel Kaski

Estimating global pairwise interaction effects, i.e., the difference between the joint effect and the sum of marginal effects of two input features, with uncertainty properly quantified, is centrally important in science applications. We propose a non-parametric probabilistic method for detecting interaction effects of unknown form. First, the relationship between the features and the output is modelled using a Bayesian neural network, capable of representing complex interactions and principled uncertainty. Second, interaction effects and their uncertainty are estimated from the trained model. For the second step, we propose an intuitive global interaction measure: Bayesian Group Expected Hessian (GEH), which aggregates information of local interactions as captured by the Hessian. GEH provides a natural trade-off between type I and type II errors and, moreover, comes with theoretical guarantees ensuring that the estimated interaction effects and their uncertainty can be improved by training a more accurate BNN. The method empirically outperforms available non-probabilistic alternatives on simulated and real-world data. Finally, we demonstrate its ability to detect interpretable interactions between higher-level features (at deeper layers of the neural network).
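
To make the measure concrete, the following numpy sketch scores pairwise interactions by averaging absolute input-Hessian cross-terms of a fitted model over data points; the finite-difference Hessian and the toy function standing in for a trained BNN's predictive mean are illustrative, and the Bayesian averaging over posterior samples is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(x[0]) * x[1] + x[2] ** 2    # stand-in for a trained model

def hessian_fd(f, x, h=1e-4):
    """Finite-difference Hessian of a scalar function at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

# Aggregate absolute local cross-derivatives over the data, akin in spirit
# to a group expected-Hessian score (posterior averaging omitted).
X = rng.normal(size=(200, 3))
score = np.mean([np.abs(hessian_fd(f, x)) for x in X], axis=0)
print(np.round(score, 2))   # the (0, 1) entry flags the x0-x1 interaction
```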

ICML Conference 2020 Conference Paper

Projective Preferential Bayesian Optimization

  • Petrus Mikkola
  • Milica Todorovic
  • Jari Järvi
  • Patrick Rinke
  • Samuel Kaski

Bayesian optimization is an effective method for finding extrema of a black-box function. We propose a new type of Bayesian optimization for learning user preferences in high-dimensional spaces. The central assumption is that the underlying objective function cannot be evaluated directly, but instead a minimizer along a projection can be queried, which we call a projective preferential query. The form of the query allows for feedback that is natural for a human to give, and which enables interaction. This is demonstrated in a user experiment in which the user feedback comes in the form of optimal position and orientation of a molecule adsorbing to a surface. We demonstrate that our framework is able to find a global minimum of a high-dimensional black-box function, which is an infeasible task for existing preferential Bayesian optimization frameworks that are based on pairwise comparisons.
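
A minimal sketch of the query type only, not the full Bayesian optimization loop: the simulated "user" returns the minimizer of the hidden objective along a one-dimensional projection. The quadratic objective and the scipy scalar optimizer standing in for human feedback are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
f = lambda x: np.sum((x - np.array([1.0, -2.0, 0.5])) ** 2)  # hidden objective

def projective_query(x_ref, direction):
    """Simulated user feedback: minimizer of f along x_ref + alpha * direction."""
    direction = direction / np.linalg.norm(direction)
    res = minimize_scalar(lambda a: f(x_ref + a * direction))
    return x_ref + res.x * direction

x = np.zeros(3)
for _ in range(20):                  # sweep random one-dimensional projections
    x = projective_query(x, rng.normal(size=3))
print(np.round(x, 3))                # approaches the minimizer (1, -2, 0.5)
```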

NeurIPS Conference 2020 Conference Paper

Rethinking pooling in graph neural networks

  • Diego Mesquita
  • Amauri Souza
  • Samuel Kaski

Graph pooling is a central component of a myriad of graph neural network (GNN) architectures. As an inheritance from traditional CNNs, most approaches formulate graph pooling as a cluster assignment problem, extending the idea of local patches in regular grids to graphs. Despite the wide adherence to this design choice, no work has rigorously evaluated its influence on the success of GNNs. In this paper, we build upon representative GNNs and introduce variants that challenge the need for locality-preserving representations, either using randomization or clustering on the complement graph. Strikingly, our experiments demonstrate that using these variants does not result in any decrease in performance. To understand this phenomenon, we study the interplay between convolutional layers and the subsequent pooling ones. We show that the convolutions play a leading role in the learned representations. Contrary to common belief, local pooling is not responsible for the success of GNNs on relevant and widely-used benchmarks.
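
For concreteness, a numpy sketch of the kind of variant tested: cluster-assignment pooling with a random, locality-ignoring assignment. The coarsening algebra X' = SᵀX, A' = SᵀAS is the standard cluster-pooling form; the random assignment is the illustrative non-local variant.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k = 12, 5, 3                   # nodes, feature dimension, pooled clusters

A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                          # symmetric adjacency, no self-loops
X = rng.normal(size=(n, d))          # node features after some conv layers

# Random, locality-ignoring assignment matrix S: each node to one cluster.
S = np.zeros((n, k))
S[np.arange(n), rng.integers(0, k, size=n)] = 1.0

# Standard cluster-pooling algebra: pooled features and coarsened adjacency.
X_pooled = S.T @ X                   # (k, d)
A_pooled = S.T @ A @ S               # (k, k)
print(X_pooled.shape, A_pooled.shape)
```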

AAAI Conference 2020 Conference Paper

Scalable Probabilistic Matrix Factorization with Graph-Based Priors

  • Jonathan Strahl
  • Jaakko Peltonen
  • Hiroshi Mamitsuka
  • Samuel Kaski

In matrix factorization, available graph side-information may not be well suited to the matrix completion problem, having edges that disagree with the latent-feature relations learnt from the incomplete data matrix. We show that removing these contested edges improves prediction accuracy and scalability. We identify the contested edges through a highly efficient graphical lasso approximation. The identification and removal of contested edges adds no computational complexity to state-of-the-art graph-regularized matrix factorization, remaining linear with respect to the number of non-zeros. The computational load even decreases in proportion to the number of edges removed. Formulating a probabilistic generative model and using expectation maximization to extend graph-regularized alternating least squares (GRALS) guarantees convergence. Rich simulated experiments illustrate the desired properties of the resulting algorithm. On real-data experiments we demonstrate improved prediction accuracy with fewer graph edges (empirical evidence that graph side-information is often inaccurate). A 300,000-dimensional graph with three million edges (Yahoo music side-information) can be analyzed in under ten minutes on a standard laptop computer, demonstrating the efficiency of our graph update.

ICML Conference 2019 Conference Paper

Active Learning for Decision-Making from Imbalanced Observational Data

  • Iiris Sundin
  • Peter Schulam 0001
  • Eero Siivola
  • Aki Vehtari
  • Suchi Saria
  • Samuel Kaski

Machine learning can help personalized decision support by learning models to predict individual treatment effects (ITE). This work studies the reliability of prediction-based decision-making in a task of deciding which action $a$ to take for a target unit after observing its covariates $\tilde{x}$ and predicted outcomes $\hat{p}(\tilde{y} \mid \tilde{x}, a)$. An example case is personalized medicine and the decision of which treatment to give to a patient. A common problem when learning these models from observational data is imbalance, that is, a difference between the treated and control covariate distributions, which is known to increase the upper bound of the expected ITE estimation error. We propose to assess decision-making reliability by estimating the ITE model's Type S error rate, the probability of the model inferring the wrong sign for the treatment effect. Furthermore, we use the estimated reliability as a criterion for active learning, in order to collect new (possibly expensive) observations, instead of making a forced choice based on unreliable predictions. We demonstrate the effectiveness of this decision-making aware active learning in two decision-making tasks: in simulated data with binary outcomes and in a medical dataset with synthetic and continuous treatment outcomes.
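
Concretely, given posterior draws of the individual treatment effect for one unit, the Type S error rate can be estimated as the posterior probability that the effect's sign differs from the sign the decision would use; a minimal numpy sketch with synthetic draws standing in for a fitted model's posterior.

```python
import numpy as np

rng = np.random.default_rng(7)

# Posterior draws of the ITE tau for one unit, standing in for samples from
# a fitted outcome model p(y | x, a).
tau_draws = rng.normal(loc=0.3, scale=0.5, size=4000)

decision_sign = np.sign(np.mean(tau_draws))             # sign the decision uses
type_s = np.mean(np.sign(tau_draws) != decision_sign)   # P(true sign differs)
print(type_s)   # a high rate flags an unreliable decision: query more data
```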

UAI Conference 2019 Conference Paper

Embarrassingly Parallel MCMC using Deep Invertible Transformations

  • Diego Mesquita
  • Paul Blomstedt
  • Samuel Kaski

While MCMC methods have become a main work-horse for Bayesian inference, scaling them to large distributed datasets is still a challenge. Embarrassingly parallel MCMC strategies take a divide-and-conquer stance to achieve this by writing the target posterior as a product of subposteriors, running MCMC for each of them in parallel and subsequently combining the results. The challenge then lies in devising efficient aggregation strategies. Current strategies trade off approximation quality against the costs of communication and computation. In this work, we introduce a novel method that addresses these issues simultaneously. Our key insight is to introduce a deep invertible transformation to approximate each of the subposteriors. These approximations can be made accurate even for complex distributions and serve as intermediate representations, keeping the total communication cost limited. Moreover, they enable us to sample from the product of the subposteriors using an efficient and stable importance sampling scheme. We demonstrate that the approach outperforms available state-of-the-art methods in a range of challenging scenarios, including high-dimensional and heterogeneous subposteriors.
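
A stripped-down numpy illustration of the aggregation step, with Gaussian densities standing in for the deep invertible transformations: the combined target is the product of the fitted subposterior approximations, sampled by importance sampling with an equal-weight mixture of the approximations as the proposal. All distributions and sizes are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

# Fitted approximations of S subposteriors: (mean, std) per shard, with
# Gaussians standing in for the deep invertible transformations.
params = [(-0.5, 1.0), (0.3, 0.8), (0.1, 1.2)]
S = len(params)

def log_product(x):                  # unnormalized combined posterior
    return sum(norm.logpdf(x, m, s) for m, s in params)

# Proposal: equal-weight mixture of the shard approximations.
comp = rng.integers(0, S, size=20000)
x = np.array([rng.normal(*params[c]) for c in comp])
log_mix = np.logaddexp.reduce(
    np.stack([norm.logpdf(x, m, s) for m, s in params]), axis=0) - np.log(S)

logw = log_product(x) - log_mix      # stable importance weights in log space
w = np.exp(logw - logw.max())
w /= w.sum()
print(np.sum(w * x))                 # self-normalized posterior-mean estimate
```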

IJCAI Conference 2019 Conference Paper

Human-in-the-loop Active Covariance Learning for Improving Prediction in Small Data Sets

  • Homayun Afrabandpey
  • Tomi Peltola
  • Samuel Kaski

Learning predictive models from small high-dimensional data sets is a key problem in high-dimensional statistics. Expert knowledge elicitation can help, and a strong line of work focuses on directly eliciting informative prior distributions for parameters. This either requires considerable statistical expertise or is laborious, as the emphasis has been on accuracy and not on efficiency of the process. Another line of work queries about importance of features one at a time, assuming them to be independent and hence missing covariance information. In contrast, we propose eliciting expert knowledge about pairwise feature similarities, to borrow statistical strength in the predictions, and using sequential decision making techniques to minimize the effort of the expert. Empirical results demonstrate improvement in predictive performance on both simulated and real data, in high-dimensional linear regression tasks, where we learn the covariance structure with a Gaussian process, based on sequential elicitation.

NeurIPS Conference 2019 Conference Paper

Machine Teaching of Active Sequential Learners

  • Tomi Peltola
  • Mustafa Mert Çelikok
  • Pedram Daee
  • Samuel Kaski

Machine teaching addresses the problem of finding the best training data that can guide a learning algorithm to a target model with minimal effort. In conventional settings, a teacher provides data that are consistent with the true data distribution. However, for sequential learners which actively choose their queries, such as multi-armed bandits and active learners, the teacher can only provide responses to the learner's queries, not design the full data. In this setting, consistent teachers can be sub-optimal for finite horizons. We formulate this sequential teaching problem, which current techniques in machine teaching do not address, as a Markov decision process, with the dynamics nesting a model of the learner and the actions being the teacher's responses. Furthermore, we address the complementary problem of learning from a teacher that plans: to recognise the teaching intent of the responses, the learner is endowed with a model of the teacher. We test the formulation with multi-armed bandit learners in simulated experiments and a user study. The results show that learning is improved by (i) planning the teaching and (ii) the learner having a model of the teacher. The approach gives tools for taking into account the strategic (planning) behaviour of users of interactive intelligent systems, such as recommendation engines, by considering them as boundedly optimal teachers.

RLDM Conference 2019 Conference Abstract

Modelling User’s Theory of AI’s Mind in Interactive Intelligent Systems

  • Mustafa Mert Çelikok
  • Tomi Peltola
  • Samuel Kaski

Multi-armed bandits provide a sample- and computationally efficient approach to developing assisting agents for interactive systems. Yet, they cannot capture strategic behaviour of an intelligent user, be it human or artificial, who forms a mental model of the system. We propose a new probabilistic multi-agent model that endows bandits with a theory of mind: the system has a model of the user having a model of the system. This is implemented as a nested bandit–Markov decision process–bandit model. We show that inference in the model reduces to probabilistic inverse reinforcement learning. Results show improved performance in simulations and in a user experiment. The improvements when users can form accurate mental models that the system can capture imply that predictability of the interactive intelligent system is important not only for the user experience but also for the design of the system's statistical models.

IJCAI Conference 2019 Conference Paper

Scalable Bayesian Non-linear Matrix Completion

  • Xiangju Qin
  • Paul Blomstedt
  • Samuel Kaski

Matrix completion aims to predict missing elements in a partially observed data matrix which in typical applications, such as collaborative filtering, is large and extremely sparsely observed. A standard solution is matrix factorization, which predicts unobserved entries as linear combinations of latent variables. We generalize to non-linear combinations in massive-scale matrices. Bayesian approaches have been proven beneficial in linear matrix completion, but not applied in the more general non-linear case, due to limited scalability. We introduce a Bayesian non-linear matrix completion algorithm, which is based on a recent Bayesian formulation of Gaussian process latent variable models. To solve the challenges regarding scalability and computation, we propose a data-parallel distributed computational approach with a restricted communication scheme. We evaluate our method on challenging out-of-matrix prediction tasks using both simulated and real-world data.

JMLR Journal 2018 Journal Article

ELFI: Engine for Likelihood-Free Inference

  • Jarno Lintusaari
  • Henri Vuollekoski
  • Antti Kangasrääsiö
  • Kusti Skytén
  • Marko Järvenpää
  • Pekka Marttinen
  • Michael U. Gutmann
  • Aki Vehtari

Engine for Likelihood-Free Inference (ELFI) is a Python software library for performing likelihood-free inference (LFI). ELFI provides a convenient syntax for arranging components in LFI, such as priors, simulators, summaries or distances, into a network called an ELFI graph. The components can be implemented in a wide variety of languages. The stand-alone ELFI graph can be used with any of the available inference methods without modifications. A central method implemented in ELFI is Bayesian Optimization for Likelihood-Free Inference (BOLFI), which has recently been shown to accelerate likelihood-free inference by up to several orders of magnitude by surrogate-modelling the distance. ELFI also has inbuilt support for storing output data for reuse and analysis, and supports parallelization of computation from multiple cores up to a cluster environment. ELFI is designed to be extensible and provides interfaces for widening its functionality. This makes adding new inference methods to ELFI straightforward and automatically compatible with the inbuilt features.
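
A hedged usage sketch following the library's documented quickstart pattern (uniform priors, a Gaussian toy simulator, a mean summary, and ABC rejection sampling); exact signatures may differ across ELFI versions.

```python
import numpy as np
import scipy.stats as ss
import elfi

def simulator(mu, sigma, batch_size=1, random_state=None):
    # Vectorized toy simulator: 30 draws from N(mu, sigma) per batch item.
    mu, sigma = np.atleast_1d(mu), np.atleast_1d(sigma)
    return ss.norm.rvs(mu[:, None], sigma[:, None],
                       size=(batch_size, 30), random_state=random_state)

y_obs = simulator(np.array([1.0]), np.array([2.0]))     # "observed" data

mu = elfi.Prior('uniform', -2, 4)
sigma = elfi.Prior('uniform', 1, 4)
sim = elfi.Simulator(simulator, mu, sigma, observed=y_obs)
S1 = elfi.Summary(lambda y: np.mean(y, axis=1), sim)
d = elfi.Distance('euclidean', S1)

rej = elfi.Rejection(d, batch_size=1000)                # ABC rejection sampler
result = rej.sample(1000, quantile=0.01)                # keep best 1 percent
print(result)
# BOLFI, mentioned above, is set up similarly around the distance node.
```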

UAI Conference 2018 Conference Paper

Variational zero-inflated Gaussian processes with sparse kernels

  • Pashupati Hegde
  • Markus Heinonen
  • Samuel Kaski

Zero-inflated datasets, which have an excess of zero outputs, are commonly encountered in problems such as climate or rare event modelling. Conventional machine learning approaches tend to overestimate the non-zeros leading to poor performance. We propose a novel model family of zero-inflated Gaussian processes (ZiGP) for such zero-inflated datasets, produced by sparse kernels through learning a latent probit Gaussian process that can zero out kernel rows and columns whenever the signal is absent. The ZiGPs are particularly useful for making the powerful Gaussian process networks more interpretable. We introduce sparse GP networks where variable-order latent modelling is achieved through sparse mixing signals. We derive the non-trivial stochastic variational inference tractably for scalable learning of the sparse kernels in both models. The novel output-sparse approach improves both prediction of zero-inflated data and interpretability of latent mixing models.

NeurIPS Conference 2017 Conference Paper

Differentially private Bayesian learning on distributed data

  • Mikko Heikkilä
  • Eemil Lagerspetz
  • Samuel Kaski
  • Kana Shimizu
  • Sasu Tarkoma
  • Antti Honkela

Many applications of machine learning, for example in health care, would benefit from methods that can guarantee privacy of data subjects. Differential privacy (DP) has become established as a standard for protecting learning results. The standard DP algorithms require a single trusted party to have access to the entire data, which is a clear weakness, or add prohibitive amounts of noise. We consider DP Bayesian learning in a distributed setting, where each party only holds a single sample or a few samples of the data. We propose a learning strategy based on a secure multi-party sum function for aggregating summaries from data holders and the Gaussian mechanism for DP. Our method builds on an asymptotically optimal and practically efficient DP Bayesian inference with rapidly diminishing extra cost.
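
A numpy sketch of the aggregation idea: each data holder adds a small share of Gaussian noise to its clipped summary before a secure sum (simulated here by a plain sum), so that the aggregate satisfies the Gaussian mechanism without any single trusted party. The clipping bound and total noise scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)

M = 100                                  # data-holding parties
values = rng.normal(loc=0.5, size=M)     # one scalar summary per party
values = np.clip(values, -1.0, 1.0)      # bound each party's contribution

sigma_total = 2.0                        # noise the Gaussian mechanism needs
# Each party adds N(0, sigma_total^2 / M) noise locally; the sum of the M
# shares is N(0, sigma_total^2), so no single trusted aggregator is needed.
noisy = values + rng.normal(scale=sigma_total / np.sqrt(M), size=M)

secure_sum = noisy.sum()                 # stands in for the secure MPC sum
print(secure_sum, values.sum())
```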

JMLR Journal 2017 Journal Article

GFA: Exploratory Analysis of Multiple Data Sources with Group Factor Analysis

  • Eemeli Leppäaho
  • Muhammad Ammad-ud-din
  • Samuel Kaski

The R package GFA provides a full pipeline for factor analysis of multiple data sources that are represented as matrices with co-occurring samples. It allows learning dependencies between subsets of the data sources, decomposed into latent factors. The package also implements sparse priors for the factorization, providing interpretable biclusters of the multi-source data.

NeurIPS Conference 2017 Conference Paper

Non-Stationary Spectral Kernels

  • Sami Remes
  • Markus Heinonen
  • Samuel Kaski

We propose non-stationary spectral kernels for Gaussian process regression by modelling the spectral density of a non-stationary kernel function as a mixture of input-dependent Gaussian process frequency density surfaces. We solve the generalised Fourier transform with such a model, and present a family of non-stationary and non-monotonic kernels that can learn input-dependent and potentially long-range, non-monotonic covariances between inputs. We derive efficient inference using model whitening and marginalized posterior, and show with case studies that these kernels are necessary when modelling even rather simple time series, image or geospatial data with non-stationary characteristics.

IJCAI Conference 2016 Conference Paper

A Robust Convex Formulation for Ensemble Clustering

  • Junning Gao
  • Makoto Yamada
  • Samuel Kaski
  • Hiroshi Mamitsuka
  • Shanfeng Zhu

We formulate ensemble clustering as a regularization problem over a nuclear norm and a cluster-wise group norm, and present an efficient optimization algorithm, which we call Robust Convex Ensemble Clustering (RCEC). A key feature of RCEC is that the group-norm regularization allows it to remove anomalous cluster assignments produced by the component clustering methods. Moreover, the proposed method is convex and can find the globally optimal solution. In synthetic-data experiments, we first show that RCEC learns stable cluster assignments from input matrices that include anomalous clusters. We then show that RCEC outperforms state-of-the-art ensemble clustering methods on real-world data sets.

JMLR Journal 2016 Journal Article

Multiple Output Regression with Latent Noise

  • Jussi Gillberg
  • Pekka Marttinen
  • Matti Pirinen
  • Antti J. Kangas
  • Pasi Soininen
  • Mehreen Ali
  • Aki S. Havulinna
  • Marjo-Riitta Järvelin

In high-dimensional data, structured noise caused by observed and unobserved factors affecting multiple target variables simultaneously imposes a serious challenge for modeling by masking the often weak signal. Therefore, (1) explaining away the structured noise in multiple-output regression is of paramount importance. Additionally, (2) assumptions about the correlation structure of the regression weights are needed. We note that both can be formulated in a natural way in a latent variable model, in which both the interesting signal and the noise are mediated through the same latent factors. Under this assumption, the signal model then borrows strength from the noise model by encouraging similar effects on correlated targets. We introduce a hyperparameter for the latent signal-to-noise ratio which turns out to be important for modelling weak signals, and an ordered infinite-dimensional shrinkage prior that resolves the rotational unidentifiability in reduced-rank regression models. Simulations and prediction experiments with metabolite, gene expression, fMRI measurement, and macroeconomic time series data show that our model equals or exceeds the state-of-the-art performance and, in particular, outperforms the standard approach of assuming independent noise and signal models.

AAAI Conference 2014 Conference Paper

Optimal Neighborhood Preserving Visualization by Maximum Satisfiability

  • Kerstin Bunte
  • Matti Järvisalo
  • Jeremias Berg
  • Petri Myllymäki
  • Jaakko Peltonen
  • Samuel Kaski

We present a novel approach to low-dimensional neighbor embedding for visualization, based on formulating an information-retrieval-based neighborhood preservation cost function as maximum satisfiability on a discretized output display. The method has a rigorous interpretation as optimal visualization based on the cost function. Unlike previous low-dimensional neighbor embedding methods, our formulation is guaranteed to yield globally optimal visualizations, and does so reasonably fast. Unlike previous manifold learning methods yielding global optima of their cost functions, our cost function and method are designed for low-dimensional visualization, where evaluation and minimization of visualization errors are crucial. Our method performs well in experiments, yielding clean embeddings of datasets where a state-of-the-art comparison method yields poor arrangements. In a real-world case study for semi-supervised WLAN signal mapping in buildings we outperform state-of-the-art methods.

ICML Conference 2014 Conference Paper

Optimization Equivalence of Divergences Improves Neighbor Embedding

  • Zhirong Yang
  • Jaakko Peltonen
  • Samuel Kaski

Visualization methods that arrange data objects in 2D or 3D layouts have followed two main schools, methods oriented for graph layout and methods oriented for vectorial embedding. We show the two previously separate approaches are tied by an optimization equivalence, making it possible to relate methods from the two approaches and to build new methods that take the best of both worlds. In detail, we prove a theorem of optimization equivalences between beta- and gamma-, as well as alpha- and Rényi-divergences, through a connection scalar. Through the equivalences we represent several nonlinear dimensionality reduction and graph drawing methods in a generalized stochastic neighbor embedding setting, where information divergences are minimized between similarities in input and output spaces, and the optimal connection scalar provides a natural choice for the tradeoff between attractive and repulsive forces. We give two examples of developing new visualization methods through the equivalences: 1) we develop weighted symmetric stochastic neighbor embedding (ws-SNE) from Elastic Embedding and analyze its benefits: good performance for both vectorial and network data; in experiments ws-SNE has good performance across data sets of different types, whereas comparison methods fail for some of the data sets; 2) we develop a gamma-divergence version of a PolyLog layout method; the new method is scale invariant in the output space and makes it possible to efficiently use large-scale smoothed neighborhoods.

JMLR Journal 2013 Journal Article

Bayesian Canonical Correlation Analysis

  • Arto Klami
  • Seppo Virtanen
  • Samuel Kaski

Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment.

ICML Conference 2013 Conference Paper

Kernelized Bayesian Matrix Factorization

  • Mehmet Gönen
  • Suleiman A. Khan
  • Samuel Kaski

We extend kernelized matrix factorization with a fully Bayesian treatment and with an ability to work with multiple side information sources expressed as different kernels. Kernel functions have been introduced to matrix factorization to integrate side information about the rows and columns (e.g., objects and users in recommender systems), which is necessary for making out-of-matrix (i.e., cold start) predictions. We discuss specifically bipartite graph inference, where the output matrix is binary, but extensions to more general matrices are straightforward. We extend the state of the art in two key aspects: (i) a fully conjugate probabilistic formulation of the kernelized matrix factorization problem enables an efficient variational approximation, whereas fully Bayesian treatments are not computationally feasible in the earlier approaches; (ii) multiple side information sources are included, treated as different kernels in multiple kernel learning, which additionally reveals which side information sources are informative. Our method outperforms alternatives in predicting drug-protein interactions on two data sets. We then show that our framework can also be used for solving multilabel learning problems by considering samples and labels as the two domains on which matrix factorization operates. Our algorithm obtains the lowest Hamming loss values on 10 out of 14 multilabel classification data sets compared to five state-of-the-art multilabel learning algorithms.

ICML Conference 2013 Conference Paper

Scalable Optimization of Neighbor Embedding for Visualization

  • Zhirong Yang
  • Jaakko Peltonen
  • Samuel Kaski

Neighbor embedding (NE) methods have found their use in data visualization but are limited in big data analysis tasks due to their O(n^2) complexity for n data samples. We demonstrate that the obvious approach of subsampling produces inferior results and propose a generic approximated optimization technique that reduces the NE optimization cost to O(n log n). The technique is based on realizing that in visualization the embedding space is necessarily very low-dimensional (2D or 3D), and hence efficient approximations developed for n-body force calculations can be applied. In gradient-based NE algorithms the gradient for an individual point decomposes into “forces” exerted by the other points. The contributions of close-by points need to be computed individually but far-away points can be approximated by their “center of mass”, rapidly computable by applying a recursive decomposition of the visualization space into quadrants. The new algorithm brings a significant speed-up for medium-size data, and brings “big data” within reach of visualization.
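
A compact sketch of the far-field approximation described above: a quadtree over 2D points and a force evaluation that replaces sufficiently distant cells by their centers of mass (the usual theta criterion). The 1/r^2 repulsion is a generic stand-in, not a specific neighbor-embedding gradient.

```python
import numpy as np

class Quad:
    """Quadtree node over 2D points, storing count and center of mass."""
    def __init__(self, pts, center, half):
        self.n, self.half = len(pts), half
        self.com = pts.mean(axis=0)
        self.children = []
        if self.n > 1 and half > 1e-6:   # split occupied cells recursively
            for dx in (-1, 1):
                for dy in (-1, 1):
                    mask = ((pts[:, 0] >= center[0]) == (dx > 0)) & \
                           ((pts[:, 1] >= center[1]) == (dy > 0))
                    if mask.any():
                        c = center + (half / 2) * np.array([dx, dy])
                        self.children.append(Quad(pts[mask], c, half / 2))

def force(node, x, theta=0.5):
    """Approximate sum_y (x - y) / ||x - y||^3 (a generic repulsive force)."""
    d = x - node.com
    r = np.linalg.norm(d)
    if r == 0.0:
        return np.zeros(2)
    if not node.children or (2 * node.half) / r < theta:
        return node.n * d / r ** 3       # far cell: use its center of mass
    return sum(force(ch, x, theta) for ch in node.children)

rng = np.random.default_rng(10)
P = rng.random((500, 2))
root = Quad(P, center=np.array([0.5, 0.5]), half=0.5)
print(force(root, np.array([0.25, 0.25])))   # ~O(log n) per query, not O(n)
```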

UAI Conference 2010 Conference Paper

Bayesian exponential family projections for coupled data sources

  • Arto Klami
  • Seppo Virtanen
  • Samuel Kaski

Exponential family extensions of principal component analysis (EPCA) have received a considerable amount of attention in recent years, demonstrating the growing need for basic modeling tools that do not assume the squared loss or Gaussian distribution. We extend the EPCA model toolbox by presenting the first exponential family multi-view learning methods of the partial least squares and canonical correlation analysis, based on a unified representation of EPCA as matrix factorization of the natural parameters of exponential family. The models are based on a new family of priors that are generally usable for all such factorizations. We also introduce new inference strategies, and demonstrate how the methods outperform earlier ones when the Gaussianity assumption does not hold.

JMLR Journal 2010 Journal Article

Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

  • Jarkko Venna
  • Jaakko Peltonen
  • Kristian Nybo
  • Helena Aidos
  • Samuel Kaski

Nonlinear dimensionality reduction methods are often used to visualize high-dimensional data, although the existing methods have been designed for other related tasks such as manifold learning. It has been difficult to assess the quality of visualizations since the task has not been well-defined. We give a rigorous definition for a specific visualization task, resulting in quantifiable goodness measures and new visualization methods. The task is information retrieval given the visualization: to find similar data based on the similarities shown on the display. The fundamental tradeoff between precision and recall of information retrieval can then be quantified in visualizations as well. The user needs to give the relative cost of missing similar points vs. retrieving dissimilar points, after which the total cost can be measured. We then introduce a new method, NeRV (neighbor retrieval visualizer), which produces an optimal visualization by minimizing the cost. We further derive a variant for supervised visualization; class information is taken rigorously into account when computing the similarity relationships. We show empirically that the unsupervised version outperforms existing unsupervised dimensionality reduction methods in the visualization task, and the supervised version outperforms existing supervised methods.

ICML Conference 2007 Conference Paper

Local dependent components

  • Arto Klami
  • Samuel Kaski

We introduce a mixture of probabilistic canonical correlation analyzers model for analyzing local correlations, or more generally mutual statistical dependencies, in co-occurring data pairs. The model extends the traditional canonical correlation analysis and its probabilistic interpretation in three main ways. First, a full Bayesian treatment enables analysis of small samples (large p, small n; a crucial problem in bioinformatics, for instance), and rigorous estimation of the degree of dependency and independency. Second, the mixture formulation generalizes the method from global linearity to the more reasonable assumption of different kinds of dependencies for different kinds of data. As a third novel extension, the method decomposes the variation in the data into shared and data set-specific components.

UAI Conference 2005 Conference Paper

Two-Way Latent Grouping Model for User Preference Prediction

  • Eerika Savia
  • Kai Puolamäki
  • Janne Sinkkonen
  • Samuel Kaski

We introduce a novel latent grouping model for predicting the relevance of a new document to a user. The model assumes a latent group structure for both users and documents. We compare the model against a state-of-the-art method, the User Rating Profile model, where only users have a latent group structure. We estimate both models by Gibbs sampling. The new method predicts relevance more accurately for new documents that have few known ratings. The reason is that generalization over documents then becomes necessary, and hence the two-way grouping is profitable.