Author name cluster

Andreas Veit

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers

2 author rows

ICLR Conference 2023 Conference Paper

Teacher Guided Training: An Efficient Framework for Knowledge Transfer

Manzil Zaheer
Ankit Singh Rawat
Seungyeon Kim 0001
Chong You
Himanshu Jain
Andreas Veit
Rob Fergus
Sanjiv Kumar

The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models to compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. In this paper, we propose the teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pretrained generative models, while obviating the need to go through a large volume of data. TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain, which typically corresponds to a much lower dimensional manifold than the input space. Furthermore, we can use the teacher to explore input space more efficiently through sampling or gradient-based methods; thus, making TGT especially attractive for limited data or long-tail settings. We formally capture this benefit of proposed data-domain exploration in our generalization bounds. We find that TGT can improve accuracy on several image classification benchmarks as well as a range of text classification and retrieval tasks.

Details

ICLR Conference 2021 Conference Paper

Coping with Label Shift via Distributionally Robust Optimisation

Jingzhao Zhang
Aditya Krishna Menon
Andreas Veit
Srinadh Bhojanapalli
Sanjiv Kumar
Suvrit Sra

The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in multiple test environments. Can one instead learn a single classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present.

Details

ICLR Conference 2021 Conference Paper

Long-tail learning via logit adjustment

Aditya Krishna Menon
Sadeep Jayasumana
Ankit Singh Rawat
Himanshu Jain
Andreas Veit
Sanjiv Kumar

Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels have only a few associated samples. This poses a challenge for generalisation on such labels, and also makes naive learning biased towards dominant labels. In this paper, we present a statistical framework that unifies and generalises several recent proposals to cope with these challenges. Our framework revisits the classic idea of logit adjustment based on the label frequencies, which encourages a large relative margin between logits of rare positive versus dominant negative labels. This yields two techniques for long-tail learning, where such adjustment is either applied post-hoc to a trained model, or enforced in the loss during training. These techniques are statistically grounded, and practically effective on four real-world datasets with long-tailed label distributions.

Details

NeurIPS Conference 2020 Conference Paper

Why are Adaptive Methods Good for Attention Models?

Jingzhao Zhang
Sai Praneeth Karimireddy
Andreas Veit
Seungyeon Kim
Sashank Reddi
Sanjiv Kumar
Suvrit Sra

While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an adaptive coordinate-wise clipping algorithm (ACClip) and demonstrate its superior performance on BERT pretraining and finetuning tasks.

PDF Details

NeurIPS Conference 2016 Conference Paper

Residual Networks Behave Like Ensembles of Relatively Shallow Networks

Andreas Veit
Michael Wilber
Serge Belongie

In this work we propose a novel interpretation of residual networks showing that they can be seen as a collection of many paths of differing length. Moreover, residual networks seem to enable very deep networks by leveraging only the short paths during training. To support this observation, we rewrite residual networks as an explicit collection of paths. Unlike traditional models, paths through residual networks vary in length. Further, a lesion study reveals that these paths show ensemble-like behavior in the sense that they do not strongly depend on each other. Finally, and most surprising, most paths are shorter than one might expect, and only the short paths are needed during training, as longer paths do not contribute any gradient. For example, most of the gradient in a residual network with 110 layers comes from paths that are only 10-34 layers deep. Our results reveal one of the key characteristics that seem to enable the training of very deep networks: Residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks.

PDF Details

AAAI Conference 2013 Conference Paper

Multiagent Coordination for Energy Consumption Scheduling in Consumer Cooperatives

Andreas Veit
Ying Xu
Ronghuo Zheng
Nilanjan Chakraborty
Katia Sycara

A key challenge to create a sustainable and energyefficient society is in making consumer demand adaptive to energy supply, especially renewable supply. In this paper, we propose a partially-centralized organization of consumers, namely, a consumer cooperative for purchasing electricity from the market. We propose a novel multiagent coordination algorithm to shape the energy consumption of the cooperative. In the cooperative, a central coordinator buys the electricity for the whole group and consumers make their own consumption decisions based on their private consumption constraints and preferences. To coordinate individual consumers under incomplete information, we propose an iterative algorithm in which a virtual price signal is sent by the coordinator to induce consumers to shift demand. We prove that our algorithm converges to the central optimal solution. Additionally we analyze the convergence rate of the algorithm via simulations on randomly generated instances. The results indicate scalability with respect to the number of agents and consumption slots.

PDF Details