Arrow Research search

Author name cluster

Michael Hersche

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

JBHI Journal 2026 Journal Article

A Composable Channel-Adaptive Architecture for Seizure Classification

  • Francesco S. Carzaniga
  • Michael Hersche
  • Kaspar A. Schindler
  • Abbas Rahimi

Multi-variate time-series are one of the primary data modalities involved in large classes of problems, where deep learning models represent the state-of-the-art solution. In the healthcare domain, electrophysiological data, such as intracranial electroencephalography (iEEG), is used to perform a variety of tasks. However, iEEG models require that the number of channels be fixed, while iEEG setups in clinics are highly personalized and thus vary considerably from one subject to the next. To address this concern, we propose a channel-adaptive (CA) architecture that seamlessly functions on any multi-variate signal with an arbitrary number of channels. Each CA-model can be pre-trained on a large corpus of iEEG recordings from multiple heterogeneous subjects, and then fine-tuned to each subject using equal or lower amounts of data compared to existing state-of-the-art models, and in only 1/5 of the time. We evaluate our CA-models on a seizure detection task both on a short-term (~15 hours) and a long-term (~2600 hours) dataset. In particular, our CA-EEGWaveNet — based on EEGWaveNet — is trained on a single seizure of the tested subject, while the baseline EEGWaveNet is trained on all but one. CA-EEGWaveNet surpasses the baseline in median F1-score (0.78 vs 0.76). Similarly, CA-EEGNet — based on EEGNet — also surpasses its baseline (0.79 vs 0.74). Overall, we show that the CA architecture is a drop-in replacement for existing seizure classification models, bringing better characteristics and performance across the board.
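
To make the channel-adaptive idea concrete, here is a minimal PyTorch sketch of one way such an architecture can accept an arbitrary channel count: a shared encoder processes each channel independently and a permutation-invariant pooling merges the per-channel embeddings. The module names, layer sizes, and pooling choice are illustrative assumptions, not the paper's actual CA architecture.

```python
# Hypothetical channel-adaptive (CA) wrapper: one shared encoder is applied to
# every channel independently, and the per-channel embeddings are aggregated
# with a permutation-invariant mean, so the classifier never depends on a
# fixed channel count. Names and sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class ChannelAdaptiveClassifier(nn.Module):
    def __init__(self, emb_dim=64, n_classes=2):
        super().__init__()
        # Shared 1D-conv encoder, reused for each channel (weight tying).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, emb_dim, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, x):
        # x: (batch, channels, time), with an arbitrary number of channels
        b, c, t = x.shape
        z = self.encoder(x.reshape(b * c, 1, t))   # encode each channel alone
        z = z.reshape(b, c, -1).mean(dim=1)        # permutation-invariant pool
        return self.head(z)

model = ChannelAdaptiveClassifier()
print(model(torch.randn(2, 37, 1024)).shape)  # works for any channel count
```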

NAI Journal 2026 Journal Article

Towards Learning to Reason: Comparing LLMs With Neuro-Symbolic on Arithmetic Relations in Abstract Reasoning

  • Michael Hersche
  • Giacomo Camposampiero
  • Roger Wattenhofer
  • Abu Sebastian
  • Abbas Rahimi

This work compares large language models (LLMs) and neuro-symbolic approaches in solving Raven’s progressive matrices (RPMs), a visual abstract reasoning test that involves the understanding of mathematical rules such as progression or arithmetic addition. Providing the visual attributes directly as textual prompts, which assumes an oracle visual perception module, allows us to measure the model’s abstract reasoning capability in isolation. Despite being given such compositionally structured representations from the oracle visual perception, and despite advanced prompting techniques, both GPT-4 and Llama-3 70B cannot achieve perfect accuracy on the center constellation of the I-RAVEN dataset. Our analysis reveals that the root cause lies in the LLMs’ weakness in understanding and executing arithmetic rules. As a potential remedy, we analyze the Abductive Rule Learner with Context-awareness (ARLC), a neuro-symbolic approach that learns to reason with vector-symbolic architectures. Here, concepts are represented with distributed vectors such that dot products between encoded vectors define a similarity kernel, and element-wise vector operations perform addition/subtraction on the encoded values. We find that ARLC achieves almost perfect accuracy on the center constellation of I-RAVEN, demonstrating a high fidelity in arithmetic rules. To stress the length generalization capabilities, we extend the RPM tests to larger matrices (3×10 instead of the typical 3×3) and larger dynamic ranges of the attribute values (from 10 up to 1000). We find that the LLMs’ accuracy on arithmetic rules drops below 10%, especially as the dynamic range expands, while ARLC maintains a high accuracy by emulating symbolic computations on top of distributed representations.
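
The vector-symbolic mechanism the abstract describes (dot products as a similarity kernel, element-wise operations as arithmetic on encoded values) can be illustrated with fractional power encoding, a standard VSA technique. The sketch below shows only this representation, not ARLC's full abduction pipeline; the dimension and seed are arbitrary.

```python
# Minimal numpy sketch of fractional power encoding: a value k is encoded as
# the element-wise k-th power of a random phasor vector, so element-wise
# multiplication of codes adds the underlying values, and the dot product
# acts as a similarity kernel. Illustrative only; not ARLC itself.
import numpy as np

rng = np.random.default_rng(0)
d = 2048
base = np.exp(1j * rng.uniform(-np.pi, np.pi, d))  # random unit phasors

def encode(k):
    return base ** k                      # element-wise exponentiation

def sim(u, v):
    return np.real(np.vdot(u, v)) / d     # normalized similarity kernel

# Element-wise product of codes performs addition on the encoded values:
print(sim(encode(3) * encode(4), encode(7)))            # ~1.0 (3 + 4 = 7)
# Conjugation inverts a code, so it performs subtraction:
print(sim(encode(7) * np.conj(encode(4)), encode(3)))   # ~1.0 (7 - 4 = 3)
print(sim(encode(7), encode(5)))                        # ~0.0 for a mismatch
```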

NeSy Conference 2025 Conference Paper

Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?

  • Giacomo Camposampiero
  • Michael Hersche
  • Roger Wattenhofer
  • Abu Sebastian
  • Abbas Rahimi

This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI’s o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven’s progressive matrices. We benchmark with the I-RAVEN dataset and its extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and larger ranges of the attribute values. To assess the influence of visual uncertainties on these symbolic analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes an oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the correct answer of the puzzles, and 2) we smooth the distributions of the input attributes’ values. We observe a sharp decline in OpenAI’s o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0%—approaching random chance—on the more challenging I-RAVEN-X, which increases input length and range and emulates perceptual uncertainty. This drop occurred despite o3-mini spending 3.4× more reasoning tokens. A similar trend is observed for DeepSeek R1: from 80.6% to 23.2%. On the other hand, ARLC, a neuro-symbolic probabilistic abductive model that achieves state-of-the-art performance on I-RAVEN, can reason robustly under all these out-of-distribution tests, with only a modest accuracy reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
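
A minimal sketch of the two-fold uncertainty simulation described above: confounding attributes are sampled at random, and a crisp attribute value is replaced by a smoothed distribution over neighboring values. The Gaussian-shaped smoothing and the attribute layout are illustrative assumptions, not the exact I-RAVEN-X construction.

```python
# Hedged sketch of the two-fold strategy: (1) append confounding attributes
# drawn at random, and (2) replace a certain (one-hot) attribute value with a
# smoothed probability distribution over nearby values. The smoothing scheme
# and attribute names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def smooth_value(true_value, n_values, spread=1.0):
    """Turn a crisp attribute value into a softmax-smoothed distribution."""
    values = np.arange(n_values)
    logits = -((values - true_value) ** 2) / (2 * spread ** 2)
    p = np.exp(logits)
    return p / p.sum()

def panel_with_uncertainty(true_size, n_values=10, n_confounders=3):
    return {
        "size": smooth_value(true_size, n_values),               # smoothed attribute
        "confounders": rng.integers(0, n_values, n_confounders), # random, uninformative
    }

print(panel_with_uncertainty(true_size=4))
```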

NAI Journal 2025 Journal Article

Factorizers for distributed sparse block codes

  • Michael Hersche
  • Aleksandar Terzić
  • Geethan Karunaratne
  • Jovin Langenegger
  • Angéline Pouget
  • Giovanni Cherubini
  • Luca Benini
  • Abu Sebastian

Distributed sparse block codes (SBCs) exhibit compact representations for encoding and manipulating symbolic data structures using fixed-width vectors. One major challenge, however, is to disentangle, or factorize, the distributed representation of data structures into their constituent elements without having to search through all possible combinations. This factorization becomes more challenging when SBC vectors are noisy due to perceptual uncertainty and approximations made by modern neural networks to generate the query SBC vectors. To address these challenges, we first propose a fast and highly accurate method for factorizing a more flexible, and hence generalized, form of SBCs, dubbed GSBCs. Our iterative factorizer introduces a threshold-based nonlinear activation, conditional random sampling, and an ℓ∞-based similarity metric. Its random sampling mechanism, in combination with the search in superposition, allows us to analytically determine the expected number of decoding iterations, which matches the empirical observations up to the GSBC’s bundling capacity. Second, the proposed factorizer maintains a high accuracy when queried by noisy product vectors generated using deep convolutional neural networks (CNNs). This facilitates its application in replacing the large fully connected layer (FCL) in CNNs, whereby C trainable class vectors, or attribute combinations, can be implicitly represented by our factorizer having F-factor codebooks, each with $C^{1/F}$ fixed codevectors. We provide a methodology to flexibly integrate our factorizer in the classification layer of CNNs with a novel loss function. With this integration, the convolutional layers can generate a noisy product vector that our factorizer can still decode, whereby the decoded factors can have different interpretations based on downstream tasks. We demonstrate the feasibility of our method on four deep CNN architectures over the CIFAR-100, ImageNet-1K, and RAVEN datasets. In all use cases, the number of parameters and operations is notably reduced compared to the FCL.
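
For intuition about the iterative factorizer, here is a classic dense (bipolar) resonator-style loop that performs the search in superposition: each factor estimate is refined by unbinding the current estimates of the other factors and cleaning up against its codebook. The paper's GSBC factorizer additionally uses a threshold-based activation, conditional random sampling, and an ℓ∞ similarity; this sketch shows only the shared core idea.

```python
# Simplified bipolar resonator-style factorizer: estimate each factor by
# unbinding the others from the bound query, project onto the codebook, and
# iterate. A conceptual sketch, not the paper's GSBC factorizer.
import numpy as np

rng = np.random.default_rng(1)
d, n_codes, F = 1024, 20, 3
codebooks = [rng.choice([-1, 1], size=(n_codes, d)) for _ in range(F)]

truth = [rng.integers(n_codes) for _ in range(F)]
product = np.prod([codebooks[f][truth[f]] for f in range(F)], axis=0)  # bound query

# Initialize each factor estimate as the superposition of its whole codebook.
est = [np.sign(cb.sum(axis=0) + 1e-9) for cb in codebooks]
for _ in range(50):
    for f in range(F):
        others = np.prod([est[g] for g in range(F) if g != f], axis=0)
        unbound = product * others           # unbind: x * x = 1 for bipolar codes
        est[f] = np.sign(codebooks[f].T @ (codebooks[f] @ unbound))  # cleanup

decoded = [int(np.argmax(cb @ e)) for cb, e in zip(codebooks, est)]
print(decoded == truth)  # True once the resonator converges
```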

AAAI Conference 2025 Conference Paper

On the Expressiveness and Length Generalization of Selective State Space Models on Regular Languages

  • Aleksandar Terzic
  • Michael Hersche
  • Giacomo Camposampiero
  • Thomas Hofmann
  • Abu Sebastian
  • Abbas Rahimi

Selective state-space models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks, their formal expressiveness and length generalization properties remain underexplored. In this work, we provide insight into the workings of selective SSMs by analyzing their expressiveness and length generalization performance on regular language tasks, i.e., finite-state automaton (FSA) emulation. We address certain limitations of modern SSM-based architectures by introducing the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization on a set of various regular language tasks using a single layer. It utilizes a dictionary of dense transition matrices, a softmax selection mechanism that creates a convex combination of dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. We then proceed to evaluate variants of diagonal selective SSMs by considering their empirical performance on commutative and non-commutative automata. We explain the experimental results with theoretical considerations.
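
A minimal numpy sketch of the SD-SSM recurrence as the abstract describes it: a softmax over the input yields a convex combination of dense dictionary matrices at each time step, and the readout is layer normalization followed by a linear map. The one-layer selection network, the input map, and all dimensions are illustrative assumptions.

```python
# Sketch of one SD-SSM-style step: softmax selection over a dictionary of
# dense transition matrices, a linear input map, and a layernorm + linear
# readout. Dimensions and the selection network are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_dict, n_state, n_in, n_out = 4, 8, 5, 3
A = rng.normal(size=(n_dict, n_state, n_state)) / np.sqrt(n_state)  # dictionary
W_sel = rng.normal(size=(n_in, n_dict))   # selection network (one linear layer)
B = rng.normal(size=(n_in, n_state))      # input map
C = rng.normal(size=(n_state, n_out))     # linear readout

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

h = np.zeros(n_state)
for x_t in rng.normal(size=(10, n_in)):    # a toy length-10 input sequence
    alpha = softmax(x_t @ W_sel)           # convex combination weights
    A_t = np.tensordot(alpha, A, axes=1)   # input-dependent dense transition
    h = A_t @ h + x_t @ B
    y_t = layer_norm(h) @ C                # layernorm followed by a linear map
print(y_t)
```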

NeSy Conference 2025 Conference Paper

Practical Lessons on Vector-Symbolic Architectures in Deep Learning-Inspired Environments

  • Francesco S. Carzaniga
  • Michael Hersche
  • Kaspar Schindler
  • Abbas Rahimi

Neural networks have shown unprecedented capabilities, rivaling human performance in many tasks. However, current neural architectures are not capable of symbolic manipulation, which is thought to be a hallmark of human intelligence. Vector-symbolic architectures (VSAs) promise to bring this ability through simple vector manipulation, highly amenable to current and emerging hardware and software stacks built for their neural counterparts. Integrating the two models into the paradigm of neuro-vector-symbolic architectures may achieve even more human-like performance. However, despite ongoing efforts, there are no clear guidelines on the deployment of VSAs in deep learning-based training situations. In this work, we aim to begin providing such guidelines by offering four practical lessons we have observed through the analysis of many VSA models and implementations. We provide thorough benchmarks and results that corroborate these lessons. First, we observe that multiply-add-permute (MAP) and Hadamard linear binding (HLB) are up to 3–4× faster than holographic reduced representations (HRR), even when the latter is equipped with optimized FFT-based convolutions. Second, we propose further speed improvements by replacing similarity search with a linear readout, with no effect on retrieval. Third, we analyze the retrieval performance of MAP, HRR, and HLB in a noise-free and a noisy scenario to simulate processing by a neural network, and show that they are equivalent. Finally, we implement a hierarchical multi-level composition scheme, with notable benefits to the flexibility of integrating VSAs inside existing neural architectures. Overall, we show that these four lessons lead to faster and more effective deployment of VSAs.
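
The speed gap reported in the first lesson comes down to the cost of the binding operator. Below is a plain numpy comparison of the two: MAP binds bipolar vectors by element-wise multiplication (O(d)), while HRR binds real vectors by circular convolution, computed here with the usual FFT trick (O(d log d)). The vectors and the involution-based HRR inverse follow standard VSA conventions rather than the paper's exact benchmark code.

```python
# MAP vs HRR binding in plain numpy. MAP: element-wise product of bipolar
# vectors, exactly self-inverse. HRR: circular convolution via FFT, with the
# standard involution as an approximate inverse.
import numpy as np

rng = np.random.default_rng(0)
d = 4096
a = rng.choice([-1.0, 1.0], d)   # MAP uses bipolar vectors
b = rng.choice([-1.0, 1.0], d)

map_bound = a * b                # MAP binding: O(d)
map_unbound = map_bound * b      # self-inverse: recovers a exactly

x = rng.normal(0, 1 / np.sqrt(d), d)   # HRR uses real Gaussian vectors
y = rng.normal(0, 1 / np.sqrt(d), d)
hrr_bound = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(y), n=d)  # O(d log d)
y_inv = np.roll(y[::-1], 1)            # approximate inverse (involution)
hrr_unbound = np.fft.irfft(np.fft.rfft(hrr_bound) * np.fft.rfft(y_inv), n=d)

print(np.allclose(map_unbound, a))     # True: exact recovery
cos = np.dot(hrr_unbound, x) / (np.linalg.norm(hrr_unbound) * np.linalg.norm(x))
print(cos)                             # high, but noisy: approximate recovery
```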

NeurIPS Conference 2025 Conference Paper

Scalable Evaluation and Neural Models for Compositional Generalization

  • Giacomo Camposampiero
  • Pietro Barbiero
  • Michael Hersche
  • Roger Wattenhofer
  • Abbas Rahimi

Compositional generalization—a key open challenge in modern machine learning—requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them with such biases compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation of the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts.

NeurIPS Conference 2025 Conference Paper

Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models

  • Aleksandar Terzic
  • Nicolas Menet
  • Michael Hersche
  • Thomas Hofmann
  • Abbas Rahimi

Modern state-space models (SSMs) often utilize structured transition matrices which enable efficient computation but pose restrictions on the model’s expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost, even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with provably optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). As a result, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multivariate time-series classification, it outperforms neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded into sets of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model.
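
To see why the PD parametrization keeps the recurrence as cheap as a diagonal SSM, consider the sketch below: applying A = PD to the state reduces to an element-wise scale followed by an index gather, i.e. O(N) work. How P and D are produced from the input is the model's learned machinery and is replaced here by random draws.

```python
# The PD transition: a column one-hot matrix P times a complex diagonal D.
# Applying it to the state is just a scale plus a routing step, so the cost
# is O(N), like a diagonal SSM. P and D are drawn at random here; in the
# model they are produced from the input.
import numpy as np

rng = np.random.default_rng(0)
N = 6
perm = rng.integers(0, N, size=N)               # P[perm[j], j] = 1 (column one-hot)
d = np.exp(1j * rng.uniform(-np.pi, np.pi, N))  # complex-valued diagonal

P = np.zeros((N, N))
P[perm, np.arange(N)] = 1.0
A = P @ np.diag(d)                              # dense view, for checking only

h = rng.normal(size=N) + 1j * rng.normal(size=N)

dense_step = A @ h                              # O(N^2) if materialized
cheap_step = np.zeros(N, dtype=complex)         # O(N): scale, then route by P
np.add.at(cheap_step, perm, d * h)              # column j sends d[j]*h[j] to row perm[j]

print(np.allclose(dense_step, cheap_step))      # True: same transition, linear cost
```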

ICLR Conference 2025 Conference Paper

The Case for Cleaner Biosignals: High-fidelity Neural Compressor Enables Transfer from Cleaner iEEG to Noisier EEG

  • Francesco S. Carzaniga
  • Gary Tom Hoppeler
  • Michael Hersche
  • Kaspar Schindler
  • Abbas Rahimi

All data modalities are not created equal, even when the signal they measure comes from the same source. In the case of the brain, two of the most important data modalities are the scalp electroencephalogram (EEG) and the intracranial electroencephalogram (iEEG). iEEG benefits from a higher signal-to-noise ratio (SNR), as it measures the electrical activity directly in the brain, while EEG is noisier and has lower spatial and temporal resolution. Nonetheless, both EEG and iEEG are important sources of data for human neurology, from healthcare to brain–machine interfaces. They are used by human experts, supported by deep learning (DL) models, to accomplish a variety of tasks, such as seizure detection and motor imagery classification. Although the differences between EEG and iEEG are well understood by human experts, the performance of DL models across these two modalities remains under-explored. To help characterize the importance of clean data for the performance of DL models, we propose BrainCodec, a high-fidelity EEG and iEEG neural compressor. We find that training BrainCodec on iEEG and then transferring to EEG yields higher reconstruction quality than training on EEG directly. In addition, we find that training BrainCodec on both EEG and iEEG improves fidelity when reconstructing EEG. Our work indicates that data sources with a higher SNR, such as iEEG, also provide better performance across the board in the medical time-series domain. This finding is consistent with reports from natural language processing, where clean data sources appear to have an outsized effect on the overall performance of DL models. BrainCodec also achieves up to 64× compression on iEEG and EEG without a notable decrease in quality, markedly surpassing current state-of-the-art compression models both in final compression ratio and in reconstruction fidelity. We also evaluate the fidelity of the compressed signals objectively on a seizure detection and a motor imagery task performed by standard DL models. Here, we find that BrainCodec achieves a reconstruction fidelity high enough to ensure no performance degradation on the downstream tasks. Finally, we collect the subjective assessment of an expert neurologist, who confirms the high reconstruction quality of BrainCodec in a realistic scenario. The code is available at https://github.com/IBM/eeg-ieeg-brain-compressor.
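
As a purely illustrative aid (not BrainCodec's architecture), the sketch below shows the generic anatomy of a neural biosignal compressor and where a 64× figure can come from: a strided convolutional encoder downsamples the signal 32×, and the latent is stored at 8 bits instead of a 16-bit ADC sample, giving 32 × 16/8 = 64.

```python
# Toy codec skeleton: strided conv encoder, a crude 8-bit latent quantizer,
# and a transposed-conv decoder. Only the compression-ratio arithmetic is the
# point; the real model's layers, quantizer, and training are different.
import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(  # total temporal downsampling: 4 * 4 * 2 = 32
            nn.Conv1d(1, 16, 8, stride=4, padding=2), nn.ELU(),
            nn.Conv1d(16, 16, 8, stride=4, padding=2), nn.ELU(),
            nn.Conv1d(16, 1, 4, stride=2, padding=1),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(1, 16, 4, stride=2, padding=1), nn.ELU(),
            nn.ConvTranspose1d(16, 16, 8, stride=4, padding=2), nn.ELU(),
            nn.ConvTranspose1d(16, 1, 8, stride=4, padding=2),
        )

    def forward(self, x):
        z = self.enc(x)
        z_q = torch.round(torch.clamp(z, -1, 1) * 127) / 127  # 8-bit quantizer
        return self.dec(z_q), z_q

x = torch.randn(1, 1, 2048)                     # one EEG channel, 2048 samples
x_hat, z_q = TinyCodec()(x)
ratio = (x.numel() * 16) / (z_q.numel() * 8)    # 16-bit ADC in, 8-bit codes out
print(x_hat.shape, f"compression ratio ~{ratio:.0f}x")
```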

NeurIPS Conference 2024 Conference Paper

Limits of Transformer Language Models on Learning to Compose Algorithms

  • Jonathan Thomm
  • Giacomo Camposampiero
  • Aleksandar Terzic
  • Michael Hersche
  • Bernhard Schölkopf
  • Abbas Rahimi

We analyze the capabilities of Transformer language models in learning compositional discrete tasks. To this end, we evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks that require learning a composition of several discrete sub-tasks. In particular, we measure how well these models can reuse primitives observable in the sub-tasks to learn the composition task. Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample-inefficient: LLaMA requires more data samples than relearning all sub-tasks from scratch to learn the compositional task, and in-context prompting with few samples is unreliable, failing to execute the sub-tasks or to correct errors in multi-round code generation. Further, by leveraging complexity theory, we support these findings with a theoretical analysis focused on the sample inefficiency of gradient descent in memorizing feedforward models. We open-source our code at https://github.com/IBM/limitations-lm-algorithmic-compositional-learning.

NeSy Conference 2024 Conference Paper

Terminating Differentiable Tree Experts

  • Jonathan Thomm
  • Michael Hersche
  • Giacomo Camposampiero
  • Aleksandar Terzic
  • Bernhard Schölkopf
  • Abbas Rahimi

We advance the recently proposed neuro-symbolic Differentiable Tree Machine, which learns tree operations using a combination of transformers and Tensor Product Representations. We investigate the architecture and propose two key components. First, we replace the series of distinct transformer layers used at every step with a mixture of experts. This results in a Differentiable Tree Experts model with a constant number of parameters for any arbitrary number of computation steps, compared to the linear parameter growth of the previous Differentiable Tree Machine. Given this flexibility in the number of steps, we additionally propose a new termination algorithm that gives the model the power to choose how many steps to take automatically. The resulting Terminating Differentiable Tree Experts model learns to predict the number of steps without an oracle, while maintaining the learning capabilities of the model and converging to the optimal number of steps.
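
The two proposed components can be sketched as follows: a single shared mixture of experts is reused at every computation step, so the parameter count is constant in the number of steps, and a learned halting head decides when to stop. The halting rule below is an adaptive-computation-time-style stand-in, not the paper's actual termination algorithm.

```python
# Conceptual sketch: the same expert pool is reused at every step (constant
# parameters for any step count), and a halting head emits a learned stop
# signal. The threshold rule is an illustrative assumption.
import torch
import torch.nn as nn

class SteppedMoE(nn.Module):
    def __init__(self, dim=32, n_experts=4, max_steps=12):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)
        self.halt = nn.Linear(dim, 1)
        self.max_steps = max_steps

    def forward(self, h):
        for step in range(self.max_steps):        # same experts reused each step
            gate = torch.softmax(self.router(h), dim=-1)
            update = sum(gate[:, i:i + 1] * expert(h)
                         for i, expert in enumerate(self.experts))
            h = h + update
            if torch.sigmoid(self.halt(h)).mean() > 0.5:  # learned termination
                break
        return h, step + 1

h, steps = SteppedMoE()(torch.randn(2, 32))
print(h.shape, steps)
```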

NeSy Conference 2024 Conference Paper

Towards Learning Abductive Reasoning Using VSA Distributed Representations

  • Giacomo Camposampiero
  • Michael Hersche
  • Aleksandar Terzic
  • Roger Wattenhofer
  • Abu Sebastian
  • Abbas Rahimi

We introduce the Abductive Rule Learner with Context-awareness (ARLC), a model that solves abstract reasoning tasks based on Learn-VRF. ARLC features a novel and more broadly applicable training objective for abductive reasoning, resulting in better interpretability and higher accuracy when solving Raven’s progressive matrices (RPM). ARLC allows both programming domain knowledge and learning the rules underlying a data distribution. We evaluate ARLC on the I-RAVEN dataset, showcasing state-of-the-art accuracy across both in-distribution and out-of-distribution (unseen attribute-rule pairs) tests. ARLC surpasses neuro-symbolic and connectionist baselines, including large language models, despite having orders of magnitude fewer parameters. We show ARLC’s robustness to post-programming training by incrementally learning from examples on top of programmed knowledge, which only improves its performance and does not result in catastrophic forgetting of the programmed solution. We validate ARLC’s seamless transfer learning from a 2×2 RPM constellation to unseen constellations. Our code is available at https://github.com/IBM/abductive-rule-learner-with-context-awareness.

NeSy Conference 2023 Conference Paper

Decoding Superpositions of Bound Symbols Represented by Distributed Representations

  • Michael Hersche
  • Zuzanna Opala
  • Geethan Karunaratne
  • Abu Sebastian
  • Abbas Rahimi

Vector-symbolic architectures (VSAs) express data structures of arbitrary complexity and perform symbolic computations on them by exploiting high-dimensional distributed representations and associated key operations. VSAs typically use dense random vectors, aka hypervectors, to represent atomic symbols that can be combined into compound symbols by multiplicative binding and additive superposition operators. For instance, a VSA-based neural encoder can bind two atomic symbols, and further superpose a set of such bound symbols—all with distributed vectors of the same dimension. Nevertheless, decoding such an additive-multiplicative vector into the atomic symbols from which it is built is not a trivial task. Recently, a solution based on resonator networks was proposed to iteratively factorize one of the bound symbols. After a factorization is found, it is explained away by subtracting it from the superposition. This explaining away, however, causes noise amplification that limits the number of symbols that can be reliably decoded in large problem sizes. Here, we present novel methods that efficiently decode VSA-based data structures consisting of multiplicative binding and additive superposition of symbols. We expand the purely sequential explaining-away approach by performing multiple decodings in parallel using a dedicated query sampler. Compared to the baseline resonator network, this mix of sequential and parallel decoding retrieves up to 8× more additive components in larger problems, in both synthetic and real-world experiments.
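
The sequential explaining-away loop the abstract builds on can be shown in a few lines: decode the strongest remaining bound symbol, subtract its product vector from the superposition, and repeat. A brute-force argmax stands in for the resonator network here (feasible only because the toy codebooks are tiny), and the paper's parallel decoding with a query sampler is not reproduced.

```python
# Toy sequential explaining away: a superposition of bound symbol pairs is
# decoded one pair at a time, subtracting each decoded product vector before
# the next round. Brute force replaces the resonator network in this sketch.
import numpy as np

rng = np.random.default_rng(2)
d, n, K = 2048, 8, 3
A = rng.choice([-1, 1], (n, d))
B = rng.choice([-1, 1], (n, d))

pairs = [(rng.integers(n), rng.integers(n)) for _ in range(K)]
s = np.sum([A[i] * B[j] for i, j in pairs], axis=0).astype(float)  # superposition

decoded = []
for _ in range(K):
    # Brute-force factorization of the strongest remaining bound symbol.
    scores = np.array([[np.dot(s, A[i] * B[j]) for j in range(n)]
                       for i in range(n)])
    i, j = np.unravel_index(scores.argmax(), scores.shape)
    decoded.append((int(i), int(j)))
    s -= A[i] * B[j]    # explain away; residual noise grows with each step

print(sorted(decoded) == sorted(pairs))
```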

NeurIPS Conference 2023 Conference Paper

MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition

  • Nicolas Menet
  • Michael Hersche
  • Geethan Karunaratne
  • Luca Benini
  • Abu Sebastian
  • Abbas Rahimi

With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures, resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves an approximately 2–4× speedup at an accuracy delta within [+0.68, −3.18]% compared to WideResNet CNNs on CIFAR-10 and CIFAR-100. Similarly, MIMOFormer can handle 2–4 inputs at once while maintaining a high average accuracy within a [−1.07, −3.43]% delta on the Long Range Arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.
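
The variable binding and unbinding that bracket the network can be illustrated in numpy: each input is bound to its own random key, the bound items are superposed into one fixed-width vector, and key-wise unbinding retrieves each item up to crosstalk from the others. The contribution itself (adapting the nonlinear network to process the superposition holistically) is not reproduced in this sketch.

```python
# Binding/superposition/unbinding skeleton behind computation in
# superposition: bipolar keys are self-inverse, so multiplying the bundle by
# a key retrieves that item plus crosstalk noise from the other items.
import numpy as np

rng = np.random.default_rng(0)
d, n_inputs = 4096, 3
keys = rng.choice([-1.0, 1.0], (n_inputs, d))
inputs = rng.normal(size=(n_inputs, d))

bundle = np.sum(keys * inputs, axis=0)    # bind each input, then superpose

for i in range(n_inputs):
    retrieved = keys[i] * bundle          # bipolar keys are self-inverse
    cos = np.dot(retrieved, inputs[i]) / (
        np.linalg.norm(retrieved) * np.linalg.norm(inputs[i]))
    print(f"input {i}: cosine to original = {cos:.2f}")  # well above chance
```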

NeSy Conference 2023 Conference Paper

VSA-based Positional Encoding Can Replace Recurrent Networks in Emergent Symbol Binding

  • Francesco S. Carzaniga
  • Michael Hersche
  • Kaspar Schindler
  • Abbas Rahimi

Variable binding is an open problem in both neuroscience and machine learning, relating to how neural circuits combine multiple features into a single entity. Emergent Symbols through Binding in External Memory is a recent development tackling variable binding with a compelling solution. An emergent symbolic binding network (ESBN) is able to infer abstract rules through indirection using a dual-stack setup—whereby one stack contains variables and the other contains the associated keys—by autonomously learning a relationship between the two. New keys are generated from previous ones by maintaining a strict time-ordering through the use of recurrent networks, in particular LSTMs. It is then a natural question whether such an expensive requirement could be replaced by a more economical alternative. In this work, we explore the viability of replacing LSTMs with simpler multi-layer perceptrons (MLPs) by exploiting the properties of high-dimensional spaces through a bundling-based positional encoding. We show how a combination of vector-symbolic architectures and appropriate activation functions can achieve and surpass the results reported in the ESBN work, highlighting the role that imbuing the latent space with an explicit structure can play for these unconventional symbolic models.
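
One plausible reading of a bundling-based positional encoding is sketched below: each position's code is the sign-thresholded bundle of all random vectors drawn up to that step, so the codes carry a strict time-ordering and nearby positions stay similar, with no recurrent network involved. The paper's exact scheme may differ; this only demonstrates the mechanism.

```python
# Cumulative-bundling positional codes: one fresh random vector is bundled in
# per step, and the running sum is thresholded back to a bipolar code.
# Similarity between codes decays with temporal distance. An illustrative
# assumption about the scheme, not the paper's exact construction.
import numpy as np

rng = np.random.default_rng(0)
d, T = 8192, 10
fresh = rng.choice([-1, 1], (T, d))

codes, acc = [], np.zeros(d)
for t in range(T):
    acc = acc + fresh[t]               # bundle one new random vector per step
    codes.append(np.sign(acc + 0.5))   # threshold back to a bipolar code
codes = np.array(codes)

sims = codes @ codes.T / d             # similarity falls off with distance
print(np.round(sims[0, :5], 2))        # position 0 vs positions 0..4
```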