Generative Pretraining From Pixels

Mark Chen; Alec Radford; Rewon Child; Jeffrey Wu; Heewoo Jun; David Luan; Ilya Sutskever

ICML 2020

Conference paper (accepted) · Artificial Intelligence · Machine Learning

Abstract

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.
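
As a rough illustration of the objective the abstract describes, the sketch below unrolls a tiny image into a 1D sequence and scores it with an autoregressive negative log-likelihood. It is a minimal sketch under stated assumptions, not the authors' implementation: the raster (row-by-row) ordering, the 4-value pixel vocabulary, and the names `flatten_raster`, `autoregressive_nll`, and `uniform_model` are all hypothetical, and a uniform predictor stands in for the Transformer.

```python
import math
from typing import Callable, List, Sequence


def flatten_raster(image: Sequence[Sequence[int]]) -> List[int]:
    """Unroll a 2D grid of pixel values into a 1D sequence, row by row.

    Assumption: raster order. The abstract only says the model does not
    use knowledge of the 2D structure; it does not name an ordering.
    """
    return [pixel for row in image for pixel in row]


def autoregressive_nll(
    sequence: Sequence[int],
    predict_next: Callable[[Sequence[int]], Sequence[float]],
) -> float:
    """Average negative log-likelihood of each pixel given all earlier ones."""
    total = 0.0
    for i, pixel in enumerate(sequence):
        probs = predict_next(sequence[:i])  # the model sees only the prefix
        total -= math.log(probs[pixel])
    return total / len(sequence)


# Hypothetical stand-in for the Transformer: a uniform distribution over
# NUM_VALUES possible pixel values, whatever the prefix is.
NUM_VALUES = 4


def uniform_model(prefix: Sequence[int]) -> List[float]:
    return [1.0 / NUM_VALUES] * NUM_VALUES


image = [[0, 1], [2, 3]]
pixels = flatten_raster(image)
loss = autoregressive_nll(pixels, uniform_model)
```

With the uniform stand-in, `loss` equals `math.log(NUM_VALUES)`. A trained model would lower this value by assigning higher probability to the pixel that actually follows each prefix.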

Context

Venue: International Conference on Machine Learning
Paper id: 667961734327969844