Arrow Research search

Author name cluster

Thomas Hofmann

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

38 papers
1 author row

Possible papers

38

AAAI Conference 2025 Conference Paper

On the Expressiveness and Length Generalization of Selective State Space Models on Regular Languages

  • Aleksandar Terzic
  • Michael Hersche
  • Giacomo Camposampiero
  • Thomas Hofmann
  • Abu Sebastian
  • Abbas Rahimi

Selective state-space models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks, their formal expressiveness and length generalization properties remain underexplored. In this work, we provide insight into the workings of selective SSMs by analyzing their expressiveness and length generalization performance on regular language tasks, i.e., finite-state automaton (FSA) emulation. We address certain limitations of modern SSM-based architectures by introducing the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization on a set of various regular language tasks using a single layer. It utilizes a dictionary of dense transition matrices, a softmax selection mechanism that creates a convex combination of dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. We then proceed to evaluate variants of diagonal selective SSMs by considering their empirical performance on commutative and non-commutative automata. We explain the experimental results with theoretical considerations.
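
A minimal numpy sketch of the recurrence described above, under illustrative assumptions (the selection network, shapes, and initialization are placeholders, not the authors' configuration): a softmax over a dictionary of dense transition matrices produces a convex combination at each time step, and the readout is layer normalization followed by a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def layer_norm(h, eps=1e-5):
    return (h - h.mean()) / np.sqrt(h.var() + eps)

d_state, d_in, d_out, n_dict, T = 8, 4, 3, 5, 16
A_dict = rng.normal(scale=0.5, size=(n_dict, d_state, d_state))  # dictionary of dense transitions
W_sel = rng.normal(size=(n_dict, d_in))                           # selection logits from the input
B = rng.normal(size=(d_state, d_in))                              # input map
C = rng.normal(size=(d_out, d_state))                             # linear readout

x = rng.normal(size=(T, d_in))
h = np.zeros(d_state)
outputs = []
for t in range(T):
    alpha = softmax(W_sel @ x[t])               # convex combination weights (sum to 1)
    A_t = np.tensordot(alpha, A_dict, axes=1)   # selected dense transition matrix
    h = A_t @ h + B @ x[t]                      # linear recurrence
    outputs.append(C @ layer_norm(h))           # layer norm followed by a linear map
print(np.stack(outputs).shape)                  # (T, d_out)
```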

NeurIPS Conference 2025 Conference Paper

Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models

  • Aleksandar Terzic
  • Nicolas Menet
  • Michael Hersche
  • Thomas Hofmann
  • Abbas Rahimi

Modern state-space models (SSMs) often utilize structured transition matrices which enable efficient computation but pose restrictions on the model’s expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost, even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with provably optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). As a result, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multivariate time-series classification, it outperforms neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded into sets of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model.
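
A minimal numpy sketch of the transition parametrization described above, assuming a single fixed $(P, D)$ pair rather than the input-dependent version: the transition matrix is a column one-hot matrix $P$ times a complex diagonal matrix $D$, so applying it to the state costs $O(N)$ instead of $O(N^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6

# Column one-hot matrix P: each column has exactly one entry equal to 1.
cols = rng.integers(0, N, size=N)            # row index of the 1 in each column
# Complex diagonal D with entries inside the unit disc (keeps the recurrence stable).
D = rng.uniform(0.5, 1.0, size=N) * np.exp(1j * rng.uniform(0, 2 * np.pi, size=N))

h = rng.normal(size=N) + 0j                  # complex hidden state

# Dense reference: P @ diag(D) applied the O(N^2) way.
P = np.zeros((N, N))
P[cols, np.arange(N)] = 1.0
h_dense = (P * D) @ h

# O(N) application: scale entry j by D[j], then route it to row cols[j].
h_fast = np.zeros(N, dtype=complex)
np.add.at(h_fast, cols, D * h)

assert np.allclose(h_dense, h_fast)
print(h_fast)
```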

NeurIPS Conference 2025 Conference Paper

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

  • Denis Sutter
  • Julian Minder
  • Thomas Hofmann
  • Tiago Pimentel

The concept of causal abstraction has recently been popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that, under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., in an experiment using randomly initialised language models, our alignment maps reach 100\% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.

NeurIPS Conference 2024 Conference Paper

Super Consistency of Neural Network Landscapes and Learning Rate Transfer

  • Lorenzo Noci
  • Alexandru Meterez
  • Thomas Hofmann
  • Antonio Orvieto

Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters --- such as the learning rate --- exhibit transfer from small to very large models. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is consistently similar across very different model sizes. In this work, we study the landscape through the lens of the Hessian, with a focus on its largest eigenvalue (i.e., the sharpness), and find that certain spectral properties under $\mu$P are largely independent of the width and depth of the network along the training trajectory. We name this property *super consistency* of the landscape. On the other hand, we show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales. But what causes these differences in the sharpness dynamics? Through a connection between the Hessian's and the NTK's spectrum, we argue that the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK scaling) of feature learning. We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText.
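
A minimal PyTorch sketch of measuring the sharpness referred to above, i.e. the largest Hessian eigenvalue, via power iteration on Hessian-vector products; the model, data, and iteration counts are placeholders rather than the paper's experimental setup.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)

v = [torch.randn_like(p) for p in params]
for _ in range(50):                                   # power iteration on Hessian-vector products
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    norm = torch.sqrt(sum((h * h).sum() for h in hv))
    v = [h / norm for h in hv]

hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
sharpness = sum((h_i * v_i).sum() for h_i, v_i in zip(hv, v))   # Rayleigh quotient v^T H v
print(float(sharpness))
```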

NeurIPS Conference 2024 Conference Paper

Understanding and Minimising Outlier Features in Transformer Training

  • Bobby He
  • Lorenzo Noci
  • Daniele Paliotta
  • Imanol Schlag
  • Thomas Hofmann

Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known about why OFs emerge during training, nor about how one can minimise them. Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we introduce a novel unnormalised transformer block, the Outlier Protected block, and present a previously unknown benefit of non-diagonal preconditioning optimisers, finding both approaches to significantly reduce OFs and improve quantisation without compromising convergence speed, at scales of up to 7B parameters. Notably, our combination of OP block and non-diagonal preconditioner (SOAP) achieves 14.87 weight-and-activation int8 perplexity (from 14.71 in standard precision), compared to 63.4 int8 perplexity (from 16.00) with a default OF-prone combination of Pre-Norm model and Adam, when quantising OPT-125m models post-training.
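
A minimal PyTorch sketch of one metric of the kind mentioned above: the kurtosis of per-neuron activation norms across a layer's width, which a handful of outlier neurons drives far above the Gaussian baseline. The exact normalisation used in the paper may differ; this version is illustrative.

```python
import torch

def activation_norm_kurtosis(acts: torch.Tensor) -> float:
    """acts: (tokens, width) activations from one layer."""
    neuron_norms = acts.norm(dim=0)          # l2 norm of each neuron's activations over the batch
    z = neuron_norms - neuron_norms.mean()
    m2 = (z ** 2).mean()
    m4 = (z ** 4).mean()
    return float(m4 / (m2 ** 2 + 1e-12))     # kurtosis across the width dimension

acts = torch.randn(4096, 768)
acts[:, :4] *= 20.0                           # inject a handful of outlier features
print(activation_norm_kurtosis(acts))         # far above ~3, the Gaussian baseline
```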

NeurIPS Conference 2023 Conference Paper

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

  • Sotiris Anagnostidis
  • Dario Pavllo
  • Luca Biggio
  • Lorenzo Noci
  • Aurelien Lucchi
  • Thomas Hofmann

Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational requirements during inference. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context at any point across the generation process. By doing so, our approach not only addresses performance concerns but also enhances interpretability, providing valuable insight into the model's decision-making process. Our technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be specified by a sparsity parameter. Notably, our empirical findings demonstrate that we can effectively prune up to 80\% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to $2\times$ increase in inference throughput and even greater memory savings.
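
A hedged PyTorch sketch of the pruning idea described above: a learnable score marks past tokens as droppable, and a sparsity parameter controls how much of the context is removed before attention is computed. The scoring network and the hard thresholding here are illustrative assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

d, T = 64, 32
q = torch.randn(T, d)
k = torch.randn(T, d)
v = torch.randn(T, d)
drop_scorer = torch.nn.Linear(d, 1)            # learnable "is this token droppable?" score
sparsity = 0.8                                 # fraction of the context allowed to be pruned

keep_logit = drop_scorer(k).squeeze(-1)        # one score per (past) token
threshold = torch.quantile(keep_logit, sparsity)
keep = keep_logit >= threshold                 # hard keep/drop decision (as at inference time)
keep = keep.unsqueeze(0) | torch.eye(T, dtype=torch.bool)   # never drop the current token

causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
mask = causal & keep                           # pruned tokens are removed for all later queries
scores = (q @ k.T) / d ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))
out = F.softmax(scores, dim=-1) @ v
print(out.shape)                               # (T, d), with ~80% of the context ignored
```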

NeurIPS Conference 2023 Conference Paper

Scaling MLPs: A Tale of Inductive Bias

  • Gregor Bachmann
  • Sotiris Anagnostidis
  • Thomas Hofmann

In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative "less inductive bias is better", popularized due to transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, as they lack any vision-specific inductive bias. (2) MLPs have almost exclusively been the main protagonist in the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: \textit{Do MLPs reflect the empirical advances exhibited by practical models?} Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both these aspects. We show that the performance of MLPs drastically improves with scale (95% on CIFAR10, 82% on CIFAR100, 58% on ImageNet ReaL), highlighting that lack of inductive bias can indeed be compensated for. We observe that MLPs mimic the behaviour of their modern counterparts faithfully, with some components in the learning setting however exhibiting stronger or unexpected behaviours. Due to their inherent computational efficiency, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.

NeurIPS Conference 2023 Conference Paper

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

  • Lorenzo Noci
  • Chuning Li
  • Mufan Li
  • Bobby He
  • Thomas Hofmann
  • Chris J. Maddison
  • Dan Roy

In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite depth and width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.

TMLR Journal 2022 Journal Article

Boosting Search Engines with Interactive Agents

  • Leonard Adolphs
  • Benjamin Börschinger
  • Christian Buck
  • Michelle Chen Huebscher
  • Massimiliano Ciaramita
  • Lasse Espeholt
  • Thomas Hofmann
  • Yannic Kilcher

This paper presents first successful steps in designing search agents that learn meta-strategies for iterative query refinement in information-seeking tasks. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that learns interactive search strategies from scratch. Our search agents obtain retrieval and answer quality performance comparable to recent neural methods, using only a traditional term-based BM25 ranking function and interpretable discrete reranking and filtering actions.

NeurIPS Conference 2022 Conference Paper

OpenFilter: A Framework to Democratize Research Access to Social Media AR Filters

  • Piera Riccio
  • Bill Psomas
  • Francesco Galati
  • Francisco Escolano
  • Thomas Hofmann
  • Nuria Oliver

Augmented Reality or AR filters on selfies have become very popular on social media platforms for a variety of applications, including marketing, entertainment and aesthetics. Given the wide adoption of AR face filters and the importance of faces in our social structures and relations, there is increased interest by the scientific community to analyze the impact of such filters from a psychological, artistic and sociological perspective. However, there are few quantitative analyses in this area mainly due to a lack of publicly available datasets of facial images with applied AR filters. The proprietary, closed nature of most social media platforms does not allow users, scientists and practitioners to access the code and the details of the available AR face filters. Scraping faces from these platforms to collect data is ethically unacceptable and should, therefore, be avoided in research. In this paper, we present OpenFilter, a flexible framework to apply AR filters available in social media platforms on existing large collections of human faces. Moreover, we share FairBeauty and B-LFW, two beautified versions of the publicly available FairFace and LFW datasets and we outline insights derived from the analysis of these beautified datasets.

NeurIPS Conference 2021 Conference Paper

Analytic Insights into Structure and Rank of Neural Network Hessian Maps

  • Sidak Pal Singh
  • Gregor Bachmann
  • Thomas Hofmann

The Hessian of a neural network captures parameter interactions through second-order derivatives of the loss. It is a fundamental object of study, closely tied to various problems in deep learning, including model design, optimization, and generalization. Most prior work has been empirical, typically focusing on low-rank approximations and heuristics that are blind to the network structure. In contrast, we develop theoretical tools to analyze the range of the Hessian map, which provide us with a precise understanding of its rank deficiency and the structural reasons behind it. This yields exact formulas and tight upper bounds for the Hessian rank of deep linear networks --- allowing for an elegant interpretation in terms of rank deficiency. Moreover, we demonstrate that our bounds remain faithful as an estimate of the numerical Hessian rank, for a larger class of models such as rectified and hyperbolic tangent networks. Further, we investigate the implications of model architecture (e.g., width, depth, bias) on the rank deficiency. Overall, our work provides novel insights into the source and extent of redundancy in overparameterized neural networks.
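
A small PyTorch check in the spirit of the abstract: compute the exact Hessian of a tiny deep linear network and inspect its numerical rank deficiency. The toy architecture, data, and tolerance are chosen only so that the full Hessian fits in memory.

```python
import torch

d_in, d_h, d_out, n = 4, 3, 2, 20
x = torch.randn(n, d_in, dtype=torch.double)
y = torch.randn(n, d_out, dtype=torch.double)
shapes = [(d_h, d_in), (d_out, d_h)]                    # two-layer linear net, no biases
n_params = sum(a * b for a, b in shapes)

def loss_fn(theta):
    W1 = theta[: d_h * d_in].reshape(d_h, d_in)
    W2 = theta[d_h * d_in:].reshape(d_out, d_h)
    pred = x @ W1.T @ W2.T                               # deep linear network
    return ((pred - y) ** 2).mean()

theta = torch.randn(n_params, dtype=torch.double)
H = torch.autograd.functional.hessian(loss_fn, theta)   # exact (n_params x n_params) Hessian
eigs = torch.linalg.eigvalsh(H)
rank = int((eigs.abs() > 1e-10 * eigs.abs().max()).sum())
print(f"parameters: {n_params}, numerical Hessian rank: {rank}")
```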

NeurIPS Conference 2021 Conference Paper

Disentangling the Roles of Curation, Data-Augmentation and the Prior in the Cold Posterior Effect

  • Lorenzo Noci
  • Kevin Roth
  • Gregor Bachmann
  • Sebastian Nowozin
  • Thomas Hofmann

The “cold posterior effect” (CPE) in Bayesian deep learning describes the disturbing observation that the predictive performance of Bayesian neural networks can be significantly improved if the Bayes posterior is artificially sharpened using a temperature parameter T < 1. The CPE is problematic in theory and practice, and since the effect was identified many researchers have proposed hypotheses to explain the phenomenon. However, despite this intensive research effort the effect remains poorly understood. In this work we provide novel and nuanced evidence relevant to existing explanations for the cold posterior effect, disentangling three hypotheses: 1. The dataset curation hypothesis of Aitchison (2020): we show empirically that the CPE does not arise in a real curated data set but can be produced in a controlled experiment with varying curation strength. 2. The data augmentation hypothesis of Izmailov et al. (2021) and Fortuin et al. (2021): we show empirically that data augmentation is sufficient but not necessary for the CPE to be present. 3. The bad prior hypothesis of Wenzel et al. (2020): we use a simple experiment evaluating the relative importance of the prior and the likelihood, strongly linking the CPE to the prior. Our results demonstrate how the CPE can arise in isolation from synthetic curation, data augmentation, and bad priors. Cold posteriors observed “in the wild” are therefore unlikely to arise from a single simple cause; as a result, we do not expect a simple “fix” for cold posteriors.
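
For reference, the tempered (“cold”) posterior behind the CPE is commonly written as $p_T(\theta \mid \mathcal{D}) \propto \exp(-U(\theta)/T)$ with $U(\theta) = -\log p(\mathcal{D} \mid \theta) - \log p(\theta)$, so $T = 1$ recovers the Bayes posterior and $T < 1$ artificially sharpens it.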

NeurIPS Conference 2021 Conference Paper

Precise characterization of the prior predictive distribution of deep ReLU networks

  • Lorenzo Noci
  • Gregor Bachmann
  • Kevin Roth
  • Sebastian Nowozin
  • Thomas Hofmann

Recent works on Bayesian neural networks (BNNs) have highlighted the need to better understand the implications of using Gaussian priors in combination with the compositional structure of the network architecture. Similar in spirit to the kind of analysis that has been developed to devise better initialization schemes for neural networks (cf. He- or Xavier initialization), we derive a precise characterization of the prior predictive distribution of finite-width ReLU networks with Gaussian weights. While theoretical results have been obtained for their heavy-tailedness, the full characterization of the prior predictive distribution (i.e., its density, CDF and moments) remained unknown prior to this work. Our analysis, based on the Meijer-G function, allows us to quantify the influence of architectural choices such as the width or depth of the network on the resulting shape of the prior predictive distribution. We also formally connect our results to previous work in the infinite width setting, demonstrating that the moments of the distribution converge to those of a normal log-normal mixture in the infinite depth limit. Finally, our results provide valuable guidance on prior design, for instance on controlling the predictive variance with depth- and width-informed priors on the weights of the network.

NeurIPS Conference 2020 Conference Paper

Adversarial Training is a Form of Data-dependent Operator Norm Regularization

  • Kevin Roth
  • Yannic Kilcher
  • Thomas Hofmann

We establish a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we prove that $l_p$-norm constrained projected gradient ascent based adversarial training with an $l_q$-norm loss on the logits of clean and perturbed inputs is equivalent to data-dependent (p, q) operator norm regularization. This fundamental connection confirms the long-standing argument that a network’s sensitivity to adversarial examples is tied to its spectral properties and hints at novel ways to robustify and defend against adversarial attacks. We provide extensive empirical evidence on state-of-the-art network architectures to support our theoretical results.
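
A hedged PyTorch sketch of the data-dependent operator norm the result above refers to, shown for the $p = q = 2$ case: the spectral norm of the input-logit Jacobian at a single data point, estimated by power iteration with JVP/VJP calls. The model and data are placeholders.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
x0 = torch.randn(20)

def spectral_norm_of_jacobian(f, x, iters=50):
    """Largest singular value of the Jacobian of f at x, via power iteration."""
    u = torch.randn_like(x)
    u = u / u.norm()
    for _ in range(iters):
        _, Ju = torch.autograd.functional.jvp(f, x, u)      # J u        (forward product)
        _, JtJu = torch.autograd.functional.vjp(f, x, Ju)   # J^T (J u)  (backward product)
        u = JtJu / JtJu.norm()
    _, Ju = torch.autograd.functional.jvp(f, x, u)
    return Ju.norm()

print(float(spectral_norm_of_jacobian(lambda z: model(z), x0)))
```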

NeurIPS Conference 2020 Conference Paper

Batch normalization provably avoids ranks collapse for randomly initialised deep networks

  • Hadi Daneshmand
  • Jonas Kohler
  • Francis Bach
  • Thomas Hofmann
  • Aurelien Lucchi

Randomly initialized neural networks are known to become harder to train with increasing depth, unless architectural enhancements like residual connections and batch normalization are used. We here investigate this phenomenon by revisiting the connection between random initialization in deep networks and spectral instabilities in products of random matrices. Given the rich literature on random matrices, it is not surprising to find that the rank of the intermediate representations in unnormalized networks collapses quickly with depth. In this work we highlight the fact that batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks. Leveraging tools from Markov chain theory, we derive a meaningful lower rank bound in deep linear networks. Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets. Finally, we conduct an extensive set of experiments on real-world data sets, which confirm that rank stability is indeed a crucial condition for training modern-day deep neural architectures.
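
A small numpy illustration of the phenomenon discussed above: in a random deep linear network the numerical rank of the hidden representation drops sharply with depth, while batch-normalising each layer keeps it markedly higher. Width, depth, batch size, and tolerance are toy choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 32, 100, 64

def numerical_rank(H, tol=1e-6):
    s = np.linalg.svd(H, compute_uv=False)
    return int((s > tol * s[0]).sum())

def batch_norm(H):
    return (H - H.mean(axis=0)) / (H.std(axis=0) + 1e-8)

X = rng.normal(size=(batch, width))
H_plain, H_bn = X.copy(), X.copy()
for _ in range(depth):
    W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
    H_plain = H_plain @ W                 # unnormalized deep linear network
    H_bn = batch_norm(H_bn @ W)           # same weights, batch-normalized at every layer

print("rank without BN:", numerical_rank(H_plain))
print("rank with BN:   ", numerical_rank(H_bn))
```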

NeurIPS Conference 2020 Conference Paper

Convolutional Generation of Textured 3D Meshes

  • Dario Pavllo
  • Graham Spinks
  • Thomas Hofmann
  • Marie-Francine Moens
  • Aurelien Lucchi

While recent generative models for 2D images achieve impressive visual results, they clearly lack the ability to perform 3D reasoning. This heavily restricts the degree of control over generated objects as well as the possible applications of such models. In this work, we bridge this gap by leveraging recent advances in differentiable rendering. We design a framework that can generate triangle meshes and associated high-resolution texture maps, using only 2D supervision from single-view natural images. A key contribution of our work is the encoding of the mesh and texture as 2D representations, which are semantically aligned and can be easily modeled by a 2D convolutional GAN. We demonstrate the efficacy of our method on Pascal3D+ Cars and CUB, both in an unconditional setting and in settings where the model is conditioned on class labels, attributes, and text. Finally, we propose an evaluation methodology that assesses the mesh and texture quality separately.

AAAI Conference 2020 Conference Paper

LeDeepChef: Deep Reinforcement Learning Agent for Families of Text-Based Games

  • Leonard Adolphs
  • Thomas Hofmann

While Reinforcement Learning (RL) approaches have led to significant achievements in a variety of areas in recent history, natural language tasks have remained mostly unaffected, due to the compositional and combinatorial nature that makes them notoriously hard to optimize. With the emerging field of Text-Based Games (TBGs), researchers try to bridge this gap. Inspired by the success of RL algorithms on Atari games, the idea is to develop new methods in a restricted game world and then gradually move to more complex environments. Previous work in the area of TBGs has mainly focused on solving individual games. We, however, consider the task of designing an agent that not just succeeds in a single game, but performs well across a whole family of games, sharing the same theme. In this work, we present our deep RL agent—LeDeepChef—that shows generalization capabilities to never-before-seen games of the same family with different environments and task descriptions. The agent participated in Microsoft Research's First TextWorld Problems: A Language and Reinforcement Learning Challenge and outperformed all but one competitor on the final test set. The games from the challenge all share the same theme, namely cooking in a modern house environment, but differ significantly in the arrangement of the rooms, the presented objects, and the specific goal (recipe to cook). To build an agent that achieves high scores across a whole family of games, we use an actor-critic framework and prune the action-space by using ideas from hierarchical reinforcement learning and a specialized module trained on a recipe database.

NeurIPS Conference 2019 Conference Paper

A Domain Agnostic Measure for Monitoring and Evaluating GANs

  • Paulina Grnarova
  • Kfir Y. Levy
  • Aurelien Lucchi
  • Nathanael Perraudin
  • Ian Goodfellow
  • Thomas Hofmann
  • Andreas Krause

Generative Adversarial Networks (GANs) have shown remarkable results in modeling complex distributions, but their evaluation remains an unsettled issue. Evaluations are essential for: (i) relative assessment of different models and (ii) monitoring the progress of a single model throughout training. The latter cannot be determined by simply inspecting the generator and discriminator loss curves as they behave non-intuitively. We leverage the notion of duality gap from game theory to propose a measure that addresses both (i) and (ii) at a low computational cost. Extensive experiments show the effectiveness of this measure to rank different GAN models and capture the typical GAN failure scenarios, including mode collapse and non-convergent behaviours. This evaluation metric also provides meaningful monitoring on the progression of the loss during training. It highly correlates with FID on natural image datasets, and with domain specific scores for text, sound and cosmology data where FID is not directly suitable. In particular, our proposed metric requires no labels or a pretrained classifier, making it domain agnostic.
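
A hedged PyTorch sketch of the duality-gap idea: approximate $\max_{D'} f(G, D')$ and $\min_{G'} f(G', D)$ by taking a few gradient steps on copies of the current discriminator and generator, and report the difference. The objective, optimiser, and step counts are illustrative choices, not the paper's protocol.

```python
import copy
import torch

def gan_objective(G, D, real, z):
    return (torch.log(torch.sigmoid(D(real))).mean()
            + torch.log(1 - torch.sigmoid(D(G(z)))).mean())

def duality_gap(G, D, real, z, steps=20, lr=1e-2):
    # worst-case discriminator against the fixed generator
    D_adv = copy.deepcopy(D)
    opt = torch.optim.SGD(D_adv.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad(); (-gan_objective(G, D_adv, real, z)).backward(); opt.step()
    # best-response generator against the fixed discriminator
    G_adv = copy.deepcopy(G)
    opt = torch.optim.SGD(G_adv.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad(); gan_objective(G_adv, D, real, z).backward(); opt.step()
    with torch.no_grad():
        return (gan_objective(G, D_adv, real, z) - gan_objective(G_adv, D, real, z)).item()

G = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
D = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
real, z = torch.randn(128, 2), torch.randn(128, 8)
print(duality_gap(G, D, real, z))   # >= 0 up to approximation error; ~0 near an equilibrium
```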

NeurIPS Conference 2018 Conference Paper

Deep State Space Models for Unconditional Word Generation

  • Florian Schmidt
  • Thomas Hofmann

Autoregressive feedback is considered a necessity for successful unconditional text generation using stochastic sequence models. However, such feedback is known to introduce systematic biases into the training process and it obscures a principle of generation: committing to global information and forgetting local nuances. We show that a non-autoregressive deep state space model with a clear separation of global and local uncertainty can be built from only two ingredients: An independent noise source and a deterministic transition function. Recent advances on flow-based variational inference can be used to train an evidence lower-bound without resorting to annealing, auxiliary losses or similar measures. The result is a highly interpretable generative model on par with comparable auto-regressive models on the task of word generation.

NeurIPS Conference 2018 Conference Paper

Hyperbolic Neural Networks

  • Octavian Ganea
  • Gary Becigneul
  • Thomas Hofmann

Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeliness properties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, firstly because of the absence of corresponding hyperbolic neural network layers. Here, we bridge this gap in a principled manner by combining the formalism of Möbius gyrovector spaces with the Riemannian geometry of the Poincaré model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools: multinomial logistic regression, feed-forward and recurrent neural networks. This allows us to embed sequential data and perform classification in the hyperbolic space. Empirically, we show that, even if hyperbolic optimization tools are limited, hyperbolic sentence embeddings either outperform or are on par with their Euclidean variants on textual entailment and noisy-prefix recognition tasks.
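
A minimal numpy sketch of the basic Poincaré-ball operations the paper builds on: Möbius addition and the exponential map at the origin, both for curvature $c = 1$. The formulas are the standard ones; how they are wired into network layers is not shown here.

```python
import numpy as np

def mobius_add(x, y):
    """Möbius addition in the Poincaré ball (c = 1)."""
    xy, xx, yy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + yy) * x + (1 - xx) * y
    den = 1 + 2 * xy + xx * yy
    return num / den

def expmap0(v):
    """Exponential map at the origin: moves a tangent vector into the ball."""
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 0 else v

x = expmap0(np.array([0.3, -0.1]))
y = expmap0(np.array([0.05, 0.2]))
z = mobius_add(x, y)
print(z, np.linalg.norm(z) < 1.0)   # the result stays inside the unit ball
```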

NeurIPS Conference 2017 Conference Paper

Stabilizing Training of Generative Adversarial Networks through Regularization

  • Kevin Roth
  • Aurelien Lucchi
  • Sebastian Nowozin
  • Thomas Hofmann

Deep generative models based on Generative Adversarial Networks (GANs) have demonstrated impressive sample quality but in order to work they require a careful choice of architecture, parameter initialization, and selection of hyper-parameters. This fragility is in part due to a dimensional mismatch or non-overlapping support between the model distribution and the data distribution, causing their density ratio and the associated f-divergence to be undefined. We overcome this fundamental limitation and propose a new regularization approach with low computational cost that yields a stable GAN training procedure. We demonstrate the effectiveness of this regularizer across several architectures trained on common benchmark image generation tasks. Our regularization turns GAN models into reliable building blocks for deep learning.

NeurIPS Conference 2016 Conference Paper

Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy

  • Aryan Mokhtari
  • Hadi Daneshmand
  • Aurelien Lucchi
  • Thomas Hofmann
  • Alejandro Ribeiro

We consider empirical risk minimization for large-scale datasets. We introduce Ada Newton as an adaptive algorithm that uses Newton's method with adaptive sample sizes. The main idea of Ada Newton is to increase the size of the training set by a factor larger than one in a way that the minimization variable for the current training set is in the local neighborhood of the optimal argument of the next training set. This allows us to exploit the quadratic convergence property of Newton's method and reach the statistical accuracy of each training set with only one iteration of Newton's method. We show theoretically that we can iteratively increase the sample size while applying single Newton iterations without line search and staying within the statistical accuracy of the regularized empirical risk. In particular, we can double the size of the training set in each iteration when the number of samples is sufficiently large. Numerical experiments on various datasets confirm the possibility of increasing the sample size by a factor of 2 at each iteration, which implies that Ada Newton achieves the statistical accuracy of the full training set with about two passes over the dataset.
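
A hedged numpy sketch of the adaptive-sample-size idea described above, on a toy regularized logistic regression: start from a small subsample, take a single Newton step, then double the sample size so that each solution warm-starts the next, larger problem. The problem, regularisation, and doubling schedule are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 4096, 10, 1e-3
X = rng.normal(size=(N, d))
y = (X @ rng.normal(size=d) + 0.1 * rng.normal(size=N) > 0).astype(float)

def grad_hess(w, Xs, ys):
    p = 1.0 / (1.0 + np.exp(-Xs @ w))               # logistic regression on the subsample
    g = Xs.T @ (p - ys) / len(ys) + lam * w
    S = p * (1 - p)
    H = (Xs.T * S) @ Xs / len(ys) + lam * np.eye(d)
    return g, H

w, n = np.zeros(d), 64
while n <= N:
    g, H = grad_hess(w, X[:n], y[:n])
    w -= np.linalg.solve(H, g)                       # one Newton step per sample size
    print(f"n = {n:5d}, grad norm = {np.linalg.norm(g):.2e}")
    n *= 2                                           # grow the training set geometrically
```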

NeurIPS Conference 2015 Conference Paper

Variance Reduced Stochastic Gradient Descent with Neighbors

  • Thomas Hofmann
  • Aurelien Lucchi
  • Simon Lacoste-Julien
  • Brian McWilliams

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet it is also known to be slow relative to steepest descent. Recently, variance reduction techniques such as SVRG and SAGA have been proposed to overcome this weakness. With asymptotically vanishing variance, a constant step size can be maintained, resulting in geometric convergence rates. However, these methods are either based on occasional computations of full gradients at pivot points (SVRG), or on keeping per data point corrections in memory (SAGA). This has the disadvantage that one cannot employ these methods in a streaming setting and that speed-ups relative to SGD may need a certain number of epochs in order to materialize. This paper investigates a new class of algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points. While not meant to offer advantages in an asymptotic setting, there are significant benefits in the transient optimization phase, in particular in a streaming or single-epoch setting. We investigate this family of algorithms in a thorough analysis and show supporting experimental results. As a side-product we provide a simple and unified proof technique for a broad class of variance reduction algorithms.
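
A hedged, SAGA-style numpy sketch of the neighborhood idea described above: each data point stores a gradient used as a variance-reduction correction, and a fresh gradient also refreshes the memory of that point's nearest neighbors. The least-squares objective and the neighborhood definition are placeholders, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lr = 200, 5, 0.01
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=N)

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]                  # per-example least-squares gradient

# crude "neighborhood": the k nearest points in input space (each point is its own neighbor)
k = 5
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
neighbors = np.argsort(dists, axis=1)[:, :k]

w = np.zeros(d)
memory = np.zeros((N, d))                            # stored gradient per data point
for t in range(5000):
    i = rng.integers(N)
    g = grad_i(w, i)
    w -= lr * (g - memory[i] + memory.mean(axis=0))  # SAGA-style variance-reduced step
    memory[neighbors[i]] = g                         # share the fresh gradient with neighbors
print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```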

NeurIPS Conference 2014 Conference Paper

Communication-Efficient Distributed Dual Coordinate Ascent

  • Martin Jaggi
  • Virginia Smith
  • Martin Takac
  • Jonathan Terhorst
  • Sanjay Krishnan
  • Thomas Hofmann
  • Michael Jordan

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, CoCoA converges to the same .001-accurate solution quality on average 25× as quickly.

IJCAI Conference 2007 Conference Paper

  • Lijuan Cai
  • Thomas Hofmann

Many real-world classification problems involve large numbers of overlapping categories that are arranged in a hierarchy or taxonomy. We propose to incorporate prior knowledge on category taxonomy directly into the learning architecture. We present two concrete multi-label classification methods, a generalized version of Perceptron and a hierarchical multi-label SVM learning. Our method works with arbitrary, not necessarily singly connected taxonomies, and can be applied more generally in settings where categories are characterized by attributes and relations that are not necessarily induced by a taxonomy. Experimental results on WIPO-alpha collection show that our hierarchical methods bring significant performance improvement.

JMLR Journal 2005 Journal Article

Large Margin Methods for Structured and Interdependent Output Variables

  • Ioannis Tsochantaridis
  • Thorsten Joachims
  • Thomas Hofmann
  • Yasemin Altun

Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.

NeurIPS Conference 2004 Conference Paper

Semi-supervised Learning on Directed Graphs

  • Dengyong Zhou
  • Thomas Hofmann
  • Bernhard Schölkopf

Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach.

IJCAI Conference 2003 Conference Paper

Hierarchical Semantic Classification: Word Sense Disambiguation with World Knowledge

  • Massimiliano Ciaramita
  • Thomas Hofmann
  • Mark Johnson

We present a learning architecture for lexical semantic classification problems that supplements task-specific training data with background data encoding general "world knowledge". The model compiles knowledge contained in a dictionary/ontology into additional training data, and integrates task-specific and background data through a novel hierarchical learning architecture. Experiments on a word sense disambiguation task provide empirical evidence that this "hierarchical classifier" outperforms a state-of-the-art standard "flat" one.

NeurIPS Conference 2003 Conference Paper

Multiple Instance Learning via Disjunctive Programming Boosting

  • Stuart Andrews
  • Thomas Hofmann

Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of patterns instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem.

NeurIPS Conference 2002 Conference Paper

Discriminative Learning for Label Sequences via Boosting

  • Yasemin Altun
  • Thomas Hofmann
  • Mark Johnson

This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Application areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach.

NeurIPS Conference 2002 Conference Paper

Support Vector Machines for Multiple-Instance Learning

  • Stuart Andrews
  • Ioannis Tsochantaridis
  • Thomas Hofmann

This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization.

NeurIPS Conference 2000 Conference Paper

The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity

  • David Cohn
  • Thomas Hofmann

We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. Furthermore, the relationships between topics are mapped out in order to build a predictive model of link content. Among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis.

NeurIPS Conference 1999 Conference Paper

Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization

  • Thomas Hofmann

The project pursued in this paper is to develop from first information-geometric principles a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function - known as the Fisher kernel - is derived. Our approach can be applied for unsupervised and supervised learning problems alike. This in particular covers interesting cases where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.

NeurIPS Conference 1998 Conference Paper

Learning from Dyadic Data

  • Thomas Hofmann
  • Jan Puzicha
  • Michael Jordan

Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework of learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains.

NeurIPS Conference 1997 Conference Paper

Active Data Clustering

  • Thomas Hofmann
  • Joachim Buhmann

Active data clustering is a novel technique for clustering of proximity data which utilizes principles from sequential experiment design in order to interleave data generation and data analysis. The proposed active data sampling strategy is based on the expected value of information, a concept rooted in statistical decision theory. This is considered to be an important step towards the analysis of large-scale data sets, because it offers a way to overcome the inherent data sparseness of proximity data. We present applications to unsupervised texture segmentation in computer vision and information retrieval in document databases.

NeurIPS Conference 1994 Conference Paper

Multidimensional Scaling and Data Clustering

  • Thomas Hofmann
  • Joachim Buhmann

Visualizing and structuring pairwise dissimilarity data are difficult combinatorial optimization problems known as multidimensional scaling or pairwise data clustering. Algorithms for embedding dissimilarity data in a Euclidean space, for clustering these data, and for actively selecting data to support the clustering process are discussed in the maximum entropy framework. Active data selection provides a strategy to discover structure in a data set efficiently with partially unknown data.

NeurIPS Conference 1993 Conference Paper

Central and Pairwise Data Clustering by Competitive Neural Networks

  • Joachim Buhmann
  • Thomas Hofmann

Data clustering amounts to a combinatorial optimization problem to reduce the complexity of a data representation and to increase its precision. Central and pairwise data clustering are studied in the maximum entropy framework. For central clustering we derive a set of reestimation equations and a minimization procedure which yields an optimal number of clusters, their centers and their cluster probabilities. A mean-field approximation for pairwise clustering is used to estimate assignment probabilities. A self-consistent solution to multidimensional scaling and pairwise clustering is derived which yields an optimal embedding and clustering of data points in a d-dimensional Euclidean space.