
Author name cluster

Ehsan Amid

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

ICLR 2025 · Conference Paper

Restructuring Vector Quantization with the Rotation Trick

  • Christopher Fifty
  • Ronald Guenther Junkins
  • Dennis Duan
  • Aniketh Iyengar
  • Jerry Weihong Liu
  • Ehsan Amid
  • Sebastian Thrun
  • Christopher Ré

Vector Quantized Variational AutoEncoders (VQ-VAEs) are designed to compress a continuous input to a discrete latent space and reconstruct it with minimal distortion. They operate by maintaining a set of vectors---often referred to as the codebook---and quantizing each encoder output to the nearest vector in the codebook. However, as vector quantization is non-differentiable, the gradient to the encoder flows _around_ the vector quantization layer rather than _through_ it in a straight-through approximation. This approximation may be undesirable as all information from the vector quantization operation is lost. In this work, we propose a way to propagate gradients through the vector quantization layer of VQ-VAEs. We smoothly transform each encoder output into its corresponding codebook vector via a rotation and rescaling linear transformation that is treated as a constant during backpropagation. As a result, the relative magnitude and angle between encoder output and codebook vector becomes encoded into the gradient as it propagates through the vector quantization layer and back to the encoder. Across 11 different VQ-VAE training paradigms, we find this restructuring improves reconstruction metrics, codebook utilization, and quantization error.
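
As a concrete reading of the transform described above, here is a minimal PyTorch sketch (function names are ours, not the authors' released code): the rotation aligning the encoder output with its codebook vector has a standard closed form, and both it and the rescaling are computed under `torch.no_grad()` so they behave as constants during backpropagation.

```python
import torch

def rotate_to(src, tgt, eps=1e-6):
    """Per-row rotation R with R @ src_hat = tgt_hat, via the closed form
    R = I + 2 b a^T - (a + b)(a + b)^T / (1 + a.b)  (undefined when a = -b)."""
    a = src / (src.norm(dim=-1, keepdim=True) + eps)
    b = tgt / (tgt.norm(dim=-1, keepdim=True) + eps)
    d = src.shape[-1]
    ab = (a * b).sum(-1)[:, None, None]
    I = torch.eye(d, device=src.device).expand(src.shape[0], d, d)
    return I + 2 * b[:, :, None] @ a[:, None, :] \
             - (a + b)[:, :, None] @ (a + b)[:, None, :] / (1 + ab)

def rotation_trick_quantize(e, codebook):
    """Forward value equals the nearest codebook vector; the gradient flows
    through e via a rotation-and-rescale transform treated as a constant."""
    q = codebook[torch.cdist(e, codebook).argmin(dim=1)]
    with torch.no_grad():  # transform is a constant w.r.t. backprop
        R = rotate_to(e, q)
        s = q.norm(dim=-1, keepdim=True) / (e.norm(dim=-1, keepdim=True) + 1e-6)
    return s * torch.einsum('bij,bj->bi', R, e)
```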

ICLR 2024 · Conference Paper

Context-Aware Meta-Learning

  • Christopher Fifty
  • Dennis Duan
  • Ronald Guenther Junkins
  • Ehsan Amid
  • Jure Leskovec
  • Christopher Ré
  • Sebastian Thrun

Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach---without meta-training or fine-tuning---exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks.
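
A minimal sketch of this recasting, assuming a frozen extractor has already produced support and query features; the depth, width, and the learned "unknown label" embedding are illustrative choices, not the paper's exact architecture:

```python
import torch, torch.nn as nn

class CAMLSketch(nn.Module):
    """Meta-learning as sequence modeling (schematic): support embeddings get a
    learned label embedding added, the query gets a learned 'unknown label'
    embedding, a Transformer encoder mixes the sequence, and a linear head
    classifies the query token."""
    def __init__(self, feat_dim=512, n_way=5, depth=4, heads=8):
        super().__init__()
        self.label_emb = nn.Embedding(n_way + 1, feat_dim)  # last index = unknown
        layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(feat_dim, n_way)
        self.n_way = n_way

    def forward(self, support_feats, support_labels, query_feat):
        # support_feats: (N, D) frozen features; query_feat: (1, D)
        sup = support_feats + self.label_emb(support_labels)
        qry = query_feat + self.label_emb.weight[self.n_way].unsqueeze(0)
        seq = torch.cat([sup, qry], dim=0).unsqueeze(0)   # (1, N+1, D)
        out = self.encoder(seq)[0, -1]                    # query token
        return self.head(out)                             # (n_way,) logits
```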

NeurIPS 2024 · Conference Paper

Hyperbolic Embeddings of Supervised Models

  • Richard Nock
  • Ehsan Amid
  • Frank Nielsen
  • Alexander Soen
  • Manfred K. Warmuth

Models of hyperbolic geometry have been successfully used in ML for two main tasks: embedding models in unsupervised learning (e.g., hierarchies) and embedding data. To our knowledge, there are no approaches that provide embeddings for supervised models, even though hyperbolic geometry provides convenient properties for expressing popular hypothesis classes such as decision trees (and ensembles). In this paper, we propose a full-fledged solution to the problem in three independent contributions: the first links the theory of losses for class probability estimation to hyperbolic embeddings in the Poincaré disk model; the second resolves an issue for a clean, unambiguous embedding of (ensembles of) decision trees in this model; the third shows how to smoothly tweak the Poincaré hyperbolic distance to improve its encoding and visualization properties near the border of the disk, a crucial region for our application, while keeping hyperbolicity. This last step is of substantial independent interest, as it is grounded in a generalization of the Leibniz-Newton fundamental theorem of calculus.
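
The quantity the third contribution tweaks is the Poincaré-disk distance; for reference, a minimal NumPy version of the standard (untweaked) distance:

```python
import numpy as np

def poincare_dist(x, y, eps=1e-9):
    """Standard Poincaré-disk distance between points with ||.|| < 1; note how
    the denominator blows up near the border of the disk, the region whose
    encoding properties the paper improves."""
    num = 2.0 * np.sum((x - y) ** 2)
    den = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + num / (den + eps))
```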

AAAI 2024 · Conference Paper

Optimal Transport with Tempered Exponential Measures

  • Ehsan Amid
  • Frank Nielsen
  • Richard Nock
  • Manfred K. Warmuth

In the field of optimal transport, two prominent subfields face each other: (i) unregularized optimal transport, "à la Kantorovich", which leads to extremely sparse plans but with algorithms that scale poorly, and (ii) entropic-regularized optimal transport, "à la Sinkhorn-Cuturi", which gets near-linear approximation algorithms but leads to maximally un-sparse plans. In this paper, we show that an extension of the latter to tempered exponential measures, a generalization of exponential families with indirect measure normalization, gets to a very convenient middle ground, with both very fast approximation algorithms and sparsity, which is under control up to sparsity patterns. In addition, our formulation fits naturally in the unbalanced optimal transport problem setting.
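
For reference, a NumPy sketch of the two ingredients the abstract contrasts: classical entropic Sinkhorn scaling (the t = 1 case) and the tempered exponential whose hard zero is what allows the generalized plans to stay sparse. The paper's actual tempered scaling algorithm is not reproduced here.

```python
import numpy as np

def exp_t(x, t):
    """Tempered exponential exp_t(x) = [1 + (1-t) x]_+^(1/(1-t)); equals exp(x)
    at t = 1. Its hard zero below x = -1/(1-t) is what permits sparse plans,
    unlike the strictly positive entropic case."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def sinkhorn(C, a, b, reg=0.1, iters=200):
    """Classical entropic (t = 1) Sinkhorn scaling for cost C and marginals
    a, b; the paper generalizes this scaling scheme to tempered measures."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan
```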

NeurIPS 2023 · Conference Paper

Boosting with Tempered Exponential Measures

  • Richard Nock
  • Ehsan Amid
  • Manfred Warmuth

One of the most popular ML algorithms, AdaBoost, can be derived from the dual of a relative entropy minimization problem subject to the fact that the positive weights on the examples sum to one. Essentially, harder examples receive higher probabilities. We generalize this setup to the recently introduced *tempered exponential measures* (TEMs), where normalization is enforced on a specific power of the measure and not the measure itself. TEMs are indexed by a parameter $t$ and generalize exponential families ($t=1$). Our algorithm, $t$-AdaBoost, recovers AdaBoost as a special case ($t=1$). We show that $t$-AdaBoost retains AdaBoost's celebrated exponential convergence rate when $t\in [0, 1)$ while allowing a slight improvement of the rate's hidden constant compared to $t=1$. $t$-AdaBoost partially computes on a generalization of classical arithmetic over the reals and brings notable properties like guaranteed bounded leveraging coefficients for $t\in [0, 1)$. From the loss that $t$-AdaBoost minimizes (a generalization of the exponential loss), we show how to derive a new family of *tempered* losses for the induction of domain-partitioning classifiers like decision trees. Crucially, strict properness is ensured for all, while their boosting rates span the full known spectrum. Experiments using $t$-AdaBoost+trees display that significant leverage can be achieved by tuning $t$.
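
The normalization twist is small enough to state in code. A NumPy sketch; the convention that the $(2-t)$-th power of the measure is what sums to one follows the TEM papers but should be read as an assumption of this sketch:

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; recovers np.log at t = 1."""
    return np.log(x) if t == 1.0 else (x ** (1.0 - t) - 1.0) / (1.0 - t)

def tem_normalize(w, t):
    """Normalize a tempered exponential measure: scale w so that the (2 - t)-th
    power sums to one (at t = 1 this is ordinary simplex normalization, the
    AdaBoost case). Exponent convention assumed from the TEM papers."""
    return w / ((w ** (2.0 - t)).sum()) ** (1.0 / (2.0 - t))
```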

ICLR 2023 · Conference Paper

Distributionally Robust Post-hoc Classifiers under Prior Shifts

  • Jiaheng Wei
  • Harikrishna Narasimhan
  • Ehsan Amid
  • Wen-Sheng Chu
  • Yang Liu 0018
  • Abhishek Kumar

The generalization ability of machine learning models degrades significantly when the test distribution shifts away from the training distribution. We investigate the problem of training models that are robust to shifts caused by changes in the distribution of class-priors or group-priors. The presence of skewed training priors can often lead to the models overfitting to spurious features. Unlike existing methods, which optimize for either the worst or the average performance over classes or groups, our work is motivated by the need for finer control over the robustness properties of the model. We present an extremely lightweight post-hoc approach that performs scaling adjustments to predictions from a pre-trained model, with the goal of minimizing a distributionally robust loss around a chosen target distribution. These adjustments are computed by solving a constrained optimization problem on a validation set and applied to the model during test time. Our constrained optimization objective is inspired by a natural notion of robustness to controlled distribution shifts. Our method comes with provable guarantees and empirically makes a strong case for distributionally robust post-hoc classifiers. An empirical implementation is available at https://github.com/weijiaheng/Drops.
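
The post-hoc mechanics are simple to sketch. Below, `adjusted_predict` applies per-class log-weight offsets to frozen logits; `fit_weights` is a plain exponentiated-gradient heuristic in the same spirit as, but not identical to, the paper's constrained objective (see the repository linked above for the real implementation):

```python
import numpy as np

def adjusted_predict(logits, w):
    """Scaling adjustment on a pre-trained model: add per-class log-weights to
    its logits, then argmax. No retraining of the model is needed."""
    return np.argmax(logits + np.log(w), axis=1)

def fit_weights(val_logits, val_labels, n_classes, steps=200, lr=0.5):
    """Exponentiated-gradient heuristic: upweight classes with high validation
    error under the current adjustment. Assumes every class appears at least
    once in the validation set."""
    w = np.ones(n_classes)
    for _ in range(steps):
        pred = adjusted_predict(val_logits, w)
        err = np.array([(pred[val_labels == c] != c).mean()
                        for c in range(n_classes)])
        w *= np.exp(lr * err)
        w /= w.mean()          # keep weights on a fixed scale
    return w
```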

TMLR 2023 · Journal Article

Layerwise Bregman Representation Learning of Neural Networks with Applications to Knowledge Distillation

  • Ehsan Amid
  • Rohan Anil
  • Christopher Fifty
  • Manfred K Warmuth

We propose a new method for layerwise representation learning of a trained neural network that conforms to the non-linearity of the layer's transfer function. In particular, we form a Bregman divergence based on the convex function induced by the layer's transfer function and construct an extension of the original Bregman PCA formulation by incorporating a mean vector and revising the normalization constraint on the principal directions. These modifications allow exporting the learned representation as a fixed layer with a non-linearity. As an application to knowledge distillation, we cast the learning problem for the student network as predicting the compression coefficients of the teacher's representations, which is then passed as the input to the imported layer. Our empirical findings indicate that our approach is substantially more effective for transferring information between networks than typical teacher-student training that uses the teacher's soft labels.
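
The central object is the divergence induced by the transfer function. A NumPy sketch, with softplus/sigmoid as one illustrative convex-function/transfer pairing (our choice of example, not the paper's only case):

```python
import numpy as np

def bregman_div(x, y, F, f):
    """Bregman divergence D_F(x, y) = F(x) - F(y) - <f(y), x - y>, where f = F'
    is the layer's transfer function and F its convex integral."""
    return (F(x) - F(y) - f(y) * (x - y)).sum()

# Illustrative pairing: softplus is the convex integral of the logistic sigmoid.
F = lambda z: np.logaddexp(0.0, z)        # softplus
f = lambda z: 1.0 / (1.0 + np.exp(-z))    # sigmoid = F'
print(bregman_div(np.array([1.0, -0.5]), np.array([0.5, 0.0]), F, f))
```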

ICML 2022 · Conference Paper

Public Data-Assisted Mirror Descent for Private Model Training

  • Ehsan Amid
  • Arun Ganesh
  • Rajiv Mathews
  • Swaroop Ramaswamy
  • Shuang Song 0001
  • Thomas Steinke 0002
  • Vinith Menon Suriyakumar
  • Om Thakkar 0001

In this paper, we revisit the problem of using in-distribution public data to improve the privacy/utility trade-offs for differentially private (DP) model training. (Here, public data refers to auxiliary data sets that have no privacy concerns.) We design a natural variant of DP mirror descent, where the DP gradients of the private/sensitive data act as the linear term, and the loss generated by the public data as the mirror map. We show that, for linear regression with feature vectors drawn from a non-isotropic sub-Gaussian distribution, our algorithm, PDA-DPMD (a variant of mirror descent), provides population risk guarantees that are asymptotically better than the best known guarantees under DP (without having access to public data), when the number of public data samples is sufficiently large. We further show that our algorithm has natural “noise stability” properties that control the variance due to noise added to ensure DP. We demonstrate the efficacy of our algorithm by showing privacy/utility trade-offs on four benchmark datasets (StackOverflow, WikiText-2, CIFAR-10, and EMNIST). We show that our algorithm not only significantly improves over traditional DP-SGD, which does not have access to public data, but to our knowledge is the first to improve over DP-SGD on models that have been pre-trained with public data.
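
A schematic PyTorch reading of one such step (hypothetical helper, not the released implementation): the DP-noised private gradient enters linearly, the public loss supplies the mirror map through its Bregman divergence, and the resulting proximal problem is solved approximately by a few inner gradient steps.

```python
import torch

def pda_mirror_step(w, g_dp, public_loss, lr=0.1, inner_steps=25, inner_lr=0.05):
    """One PDA-DPMD-style step, schematically: minimize
    <g_dp, v> + B_pub(v, w) / lr over v, where g_dp is the clipped,
    Gaussian-noised gradient of the private loss and B_pub is the Bregman
    divergence induced by the public loss (the mirror map)."""
    w0 = w.detach().clone().requires_grad_(True)
    grad_pub_w0 = torch.autograd.grad(public_loss(w0), w0)[0].detach()
    v = w.detach().clone().requires_grad_(True)
    for _ in range(inner_steps):
        # Bregman term up to v-constants: L_pub(v) - <grad L_pub(w), v>
        obj = (g_dp * v).sum() + (public_loss(v) - (grad_pub_w0 * v).sum()) / lr
        gv, = torch.autograd.grad(obj, v)
        with torch.no_grad():
            v -= inner_lr * gv
    return v.detach()
```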

NeurIPS 2021 · Conference Paper

Efficiently Identifying Task Groupings for Multi-Task Learning

  • Chris Fifty
  • Ehsan Amid
  • Zhe Zhao
  • Tianhe Yu
  • Rohan Anil
  • Chelsea Finn

Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naively training all tasks together in one model often degrades performance, and exhaustively searching through combinations of task groupings can be prohibitively expensive. As a result, efficiently identifying the tasks that would benefit from training together remains a challenging design question without a clear solution. In this paper, we suggest an approach to select which tasks should train together in multi-task learning models. Our method determines task groupings in a single run by training all tasks together and quantifying the extent to which one task's gradient would affect another task's loss. On the large-scale Taskonomy computer vision dataset, we find this method can decrease test loss by 10.0% compared to simply training all tasks together while operating 11.6 times faster than a state-of-the-art task grouping method.
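
A schematic version of that lookahead measurement (helper names are ours): the affinity of task i onto task j is the relative drop in task j's loss after a gradient step computed from task i's loss alone.

```python
import torch

def inter_task_affinity(model, losses, lr=1e-2):
    """Schematic lookahead affinity for task grouping. `losses` maps task name
    -> closure returning that task's current loss on the shared parameters;
    affinity[i][j] = 1 - L_j(after step on task i) / L_j(before)."""
    params = [p for p in model.parameters() if p.requires_grad]
    base = {j: fn().item() for j, fn in losses.items()}
    affinity = {}
    for i, fn_i in losses.items():
        grads = torch.autograd.grad(fn_i(), params)
        backup = [p.detach().clone() for p in params]
        with torch.no_grad():                      # lookahead step for task i
            for p, g in zip(params, grads):
                p -= lr * g
        affinity[i] = {j: 1.0 - fn().item() / base[j]
                       for j, fn in losses.items()}
        with torch.no_grad():                      # restore shared parameters
            for p, b in zip(params, backup):
                p.copy_(b)
    return affinity
```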

AAAI 2020 · Conference Paper

An Implicit Form of Krasulina’s k-PCA Update without the Orthonormality Constraint

  • Ehsan Amid
  • Manfred K. Warmuth

We shed new light on the two commonly used updates for the online k-PCA problem, namely Krasulina's and Oja's updates. We show that Krasulina's update corresponds to a projected gradient descent step on the Stiefel manifold of orthonormal k-frames, while Oja's update amounts to a gradient descent step using the unprojected gradient. Following these observations, we derive a more implicit form of Krasulina's k-PCA update, i.e. a version that uses the information of the future gradient as much as possible. Most interestingly, our implicit Krasulina update avoids the costly QR-decomposition step by bypassing the orthonormality constraint. A related update, called Sanger's rule, can be seen as an explicit approximation of our implicit update. We show that the new update in fact corresponds to an online EM step applied to a probabilistic k-PCA model. The probabilistic view of the update allows us to combine multiple models in a distributed setting. We show experimentally that the implicit Krasulina update yields superior convergence while being significantly faster. We also give strong evidence that the new update can benefit from parallelism and is more stable w.r.t. tuning of the learning rate.
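
For orientation, the two classical updates the paper contrasts, in NumPy; note the QR step in Oja's update, which is exactly the cost the implicit Krasulina update is designed to avoid (the implicit update itself is not reproduced here):

```python
import numpy as np

def oja_step(U, x, lr):
    """Oja's k-PCA update: step with the unprojected gradient x x^T U, then QR
    to restore orthonormality of the k-frame."""
    Q, _ = np.linalg.qr(U + lr * np.outer(x, U.T @ x))
    return Q

def krasulina_step(U, x, lr):
    """Krasulina's k-PCA update: projected gradient step on the Stiefel
    manifold; the residual (I - U U^T) x is the projected direction."""
    coef = U.T @ x
    return U + lr * np.outer(x - U @ coef, coef)
```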

UAI 2020 · Conference Paper

Divergence-Based Motivation for Online EM and Combining Hidden Variable Models

  • Ehsan Amid
  • Manfred K. Warmuth

Expectation-Maximization (EM) is a prominent approach for parameter estimation of hidden (aka latent) variable models. Given the full batch of data, EM forms an upper bound on the negative log-likelihood of the model at each iteration and updates the parameters to the minimizer of this bound. We first provide a “model level” interpretation of the EM upper bound as a sum of relative entropy divergences to a set of singleton models induced by the batch of observations. Our alternative motivation unifies the “observation level” and the “model level” view of EM. As a result, we formulate an online version of the EM algorithm by adding an analogous inertia term, which is a relative entropy divergence to the old model. Our motivation is more widely applicable than previous approaches and leads to simple online updates for mixtures of exponential distributions, hidden Markov models, and the first known online update for Kalman filters. Additionally, the finite-sample form of the inertia term lets us derive online updates when there is no closed-form solution. Finally, we extend the analysis to the distributed setting, where we motivate a systematic way of combining multiple hidden variable models. Experimentally, we validate the results on synthetic as well as real-world datasets.
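
A minimal member of the online EM family this view motivates, in the stepwise (Cappé-Moulines) style for a 1-D Gaussian mixture rather than the paper's exact derivation; the inertia term surfaces as the convex combination of old and new expected sufficient statistics:

```python
import numpy as np

def online_em_step(stats, x, eta=0.05):
    """One stepwise online-EM update for a 1-D Gaussian mixture.
    `stats = (s0, s1, s2)`: running per-component expected sufficient
    statistics (responsibility mass, first and second moments); initialize
    e.g. from a few batch EM steps on a warm-up sample."""
    s0, s1, s2 = stats
    pi, mu = s0 / s0.sum(), s1 / s0
    var = s2 / s0 - mu ** 2
    dens = pi * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum()                 # E-step on the new observation
    s0 = (1 - eta) * s0 + eta * r         # M-step = interpolate statistics;
    s1 = (1 - eta) * s1 + eta * r * x     # the (1 - eta) factor is the inertia
    s2 = (1 - eta) * s2 + eta * r * x ** 2
    return s0, s1, s2
```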

NeurIPS 2020 · Conference Paper

Reparameterizing Mirror Descent as Gradient Descent

  • Ehsan Amid
  • Manfred K. K. Warmuth

Most of the recent successful applications of neural networks have been based on training with gradient descent updates. However, for some small networks, other mirror descent updates learn provably more efficiently when the target is sparse. We present a general framework for casting a mirror descent update as a gradient descent update on a different set of parameters. In some cases, the mirror descent reparameterization can be described as training a modified network with standard backpropagation. The reparameterization framework is versatile and covers a wide range of mirror descent updates, even cases where the domain is constrained. Our construction for the reparameterization argument is done for the continuous versions of the updates. Finding general criteria for the discrete versions to closely track their continuous counterparts remains an interesting open problem.
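
One well-known example of such a casting, sketched in NumPy under the abstract's own caveat (the construction holds for the continuous-time updates, and the discrete loop below only tracks it approximately): gradient descent on u with w = u**2 / 4 reproduces the unnormalized exponentiated-gradient flow on w.

```python
import numpy as np

def egu_via_gd(grad_L, u0, lr=1e-3, steps=2000):
    """Gradient descent on u with the reparameterization w = u**2 / 4. In
    continuous time this u-flow induces w' = -w * grad L(w), i.e. the
    unnormalized exponentiated-gradient flow on w."""
    u = u0.astype(float).copy()
    for _ in range(steps):
        w = u ** 2 / 4
        u -= lr * (u / 2) * grad_L(w)   # chain rule: dL/du = (dw/du) dL/dw
    return u ** 2 / 4
```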

NeurIPS 2019 · Conference Paper

Robust Bi-Tempered Logistic Loss Based on Bregman Divergences

  • Ehsan Amid
  • Manfred K. Warmuth
  • Rohan Anil
  • Tomer Koren

We introduce a temperature into the exponential function and replace the softmax output layer of neural networks by a high-temperature generalization. Similarly, the logarithm in the loss we use for training is replaced by a low-temperature logarithm. By tuning the two temperatures, we create loss functions that are non-convex already in the single-layer case. When replacing the last layer of a neural network by our bi-temperature generalization of the logistic loss, training becomes more robust to noise. We visualize the effect of tuning the two temperatures in a simple setting and show the efficacy of our method on large datasets. Our methodology is based on Bregman divergences and is superior to a related two-temperature method that uses the Tsallis divergence.
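
A NumPy sketch of the two-temperature construction; the normalizer of the tempered softmax has no closed form for t != 1, so plain bisection stands in here for the faster fixed-point scheme in the paper:

```python
import numpy as np

def log_t(x, t):
    """Low-temperature logarithm; recovers np.log at t = 1."""
    return np.log(x) if t == 1.0 else (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """High-temperature exponential (inverse of log_t); the clamp only matters
    for t < 1 given the non-positive arguments used below."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def tempered_softmax(a, t, iters=60):
    """p_i = exp_t(a_i - lam), with lam chosen by bisection so sum p_i = 1
    (the sum is monotone decreasing in lam, and lam = max(a) gives sum >= 1)."""
    lo, hi = a.max(), a.max() + 1.0
    while exp_t(a - hi, t).sum() > 1.0:
        hi += 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if exp_t(a - mid, t).sum() > 1.0 else (lo, mid)
    return exp_t(a - (lo + hi) / 2.0, t)

def bi_tempered_loss(a, y, t1=0.7, t2=1.3):
    """Bi-tempered logistic loss of activations a against (soft) labels y."""
    p = tempered_softmax(a, t2)
    return (y * (log_t(y + 1e-12, t1) - log_t(p + 1e-12, t1))).sum() \
           - ((y ** (2.0 - t1) - p ** (2.0 - t1)) / (2.0 - t1)).sum()
```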

ICML 2015 · Conference Paper

Multiview Triplet Embedding: Learning Attributes in Multiple Maps

  • Ehsan Amid
  • Antti Ukkonen

For humans, it is usually easier to make statements about the similarity of objects in relative, rather than absolute, terms. Moreover, subjective comparisons of objects can be based on a number of different and independent attributes. For example, objects can be compared based on their shape, color, etc. In this paper, we consider the problem of uncovering these hidden attributes given a set of relative distance judgments in the form of triplets. The attribute that was used to generate a particular triplet in this set is unknown. Such data occurs, e.g., in crowdsourcing applications where the triplets are collected from a large group of workers. We propose the Multiview Triplet Embedding (MVTE) algorithm that produces a number of low-dimensional maps, each corresponding to one of the hidden attributes. The method can be used to assess how many different attributes were used to create the triplets, as well as to assess the difficulty of a distance comparison task, and find objects that have multiple interpretations in relation to the other objects.
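
The core of the method is how triplets are divided among maps. A schematic NumPy fragment (not the authors' code): each triplet's satisfaction probability per map, in the t-STE style MVTE builds on, is normalized into responsibilities so that a triplet mainly updates the map for the attribute that explains it.

```python
import numpy as np

def triplet_prob(X, i, j, k, alpha=1.0):
    """Student-t satisfaction probability for the triplet 'i is closer to j
    than to k' in map X, as in t-STE."""
    num = (1 + np.sum((X[i] - X[j]) ** 2) / alpha) ** (-(alpha + 1) / 2)
    den = num + (1 + np.sum((X[i] - X[k]) ** 2) / alpha) ** (-(alpha + 1) / 2)
    return num / den

def map_responsibilities(maps, triplet, alpha=1.0):
    """Each map's responsibility for a triplet is its normalized satisfaction
    probability across the list of maps (one map per hidden attribute)."""
    p = np.array([triplet_prob(X, *triplet, alpha=alpha) for X in maps])
    return p / p.sum()
```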