Arrow Research search

Author name cluster

Quoc Le

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers
1 author row

Possible papers

27

AAAI Conference 2021 Conference Paper

AutoDropout: Learning Dropout Patterns to Regularize Deep Networks

  • Hieu Pham
  • Quoc Le

Neural networks are often over-parameterized and hence benefit from aggressive regularization. Conventional regularization methods, such as Dropout or weight decay, do not leverage the structures of the network’s inputs and hidden states. As a result, these methods are less effective than recent methods that leverage the structures, such as SpatialDropout and DropBlock, which randomly drop the values at certain contiguous areas in the hidden states and set them to zero. Although the locations of dropout areas are random, the patterns of SpatialDropout and DropBlock are manually designed and fixed. Here we propose AutoDropout, which automates the process of designing dropout patterns. In our method, a controller learns to generate a dropout pattern at every channel and layer of a target network, such as a ConvNet or a Transformer. The target network is then trained with the dropout pattern, and its resulting validation performance is used as a signal for the controller to learn from. We show that this method works well both for image recognition on CIFAR-10 and ImageNet and for language modeling on Penn Treebank and WikiText-2. The learned dropout patterns also transfer to different tasks and datasets, such as from language modeling on Penn Treebank to English-French translation on WMT 2014. Our code will be available at: https://github.com/google-research/google-research/tree/master/auto_dropout.

NeurIPS Conference 2020 Conference Paper

Evolving Normalization-Activation Layers

  • Hanxiao Liu
  • Andy Brock
  • Karen Simonyan
  • Quoc Le

Normalization layers and activation functions are fundamental components in deep networks and typically co-locate with each other. Here we propose to design them using an automated approach. Instead of designing them separately, we unify them into a single tensor-to-tensor computation graph, and evolve its structure starting from basic mathematical functions. Examples of such mathematical functions are addition, multiplication and statistical moments. The use of low-level mathematical functions, in contrast to the use of high-level modules in mainstream NAS, leads to a highly sparse and large search space which can be challenging for search methods. To address the challenge, we develop efficient rejection protocols to quickly filter out candidate layers that do not work well. We also use multi-objective evolution to optimize each layer's performance across many architectures to prevent overfitting. Our method leads to the discovery of EvoNorms, a set of new normalization-activation layers with novel, and sometimes surprising, structures that go beyond existing design patterns. For example, some EvoNorms do not assume that normalization and activation functions must be applied sequentially, nor need to center the feature maps, nor require explicit activation functions. Our experiments show that EvoNorms not only work well on image classification models including ResNets, MobileNets and EfficientNets, but also transfer well to Mask R-CNN with FPN/SpineNet for instance segmentation and to BigGAN for image synthesis, outperforming BatchNorm- and GroupNorm-based layers in many cases.
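
As a concrete illustration of the kind of layer this search discovers, below is a minimal NumPy sketch of EvoNorm-S0, one of the sample-based layers reported for this work; the group size, epsilon, and data layout here are readability assumptions rather than a faithful reproduction of the released implementation.

```python
import numpy as np

def evonorm_s0(x, gamma, beta, v, groups=8, eps=1e-5):
    """Sketch of EvoNorm-S0: y = x * sigmoid(v * x) / group_std(x) * gamma + beta.

    x: activations with shape (N, H, W, C); gamma, beta, v: per-channel vectors (C,).
    """
    n, h, w, c = x.shape
    # Group the channels and compute a per-group standard deviation.
    xg = x.reshape(n, h, w, groups, c // groups)
    var = xg.var(axis=(1, 2, 4), keepdims=True)              # (N, 1, 1, groups, 1)
    group_std = np.sqrt(var + eps)
    group_std = np.broadcast_to(group_std, xg.shape).reshape(n, h, w, c)
    # Swish-like gating fused with the normalization; no explicit activation afterwards.
    y = x * (1.0 / (1.0 + np.exp(-v * x))) / group_std
    return y * gamma + beta

x = np.random.randn(2, 8, 8, 16).astype(np.float32)
y = evonorm_s0(x, gamma=np.ones(16), beta=np.zeros(16), v=np.ones(16))
print(y.shape)  # (2, 8, 8, 16)
```

Note how the normalization (group standard deviation) and the activation (the sigmoid gate) appear as a single fused expression, which is the "beyond existing design patterns" point the abstract makes.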

NeurIPS Conference 2020 Conference Paper

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

  • Zihang Dai
  • Guokun Lai
  • Yiming Yang
  • Quoc Le

With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level representation, especially for tasks that only require a single-vector representation of the sequence. With this intuition, we propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension.
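
The length-reduction step can be pictured as strided pooling over the sequence of hidden states; the sketch below uses mean pooling with stride 2 as one plausible choice, and the shapes and stride are assumptions rather than the paper's exact configuration.

```python
import numpy as np

def compress_sequence(hidden, stride=2):
    """Mean-pool adjacent hidden states so the sequence length shrinks by `stride`.

    hidden: (batch, seq_len, d_model); returns (batch, ceil(seq_len / stride), d_model).
    """
    b, t, d = hidden.shape
    pad = (-t) % stride
    if pad:  # pad the tail so the length divides evenly
        hidden = np.concatenate([hidden, np.zeros((b, pad, d), hidden.dtype)], axis=1)
    return hidden.reshape(b, -1, stride, d).mean(axis=2)

h = np.random.randn(4, 128, 64).astype(np.float32)
print(compress_sequence(h).shape)  # (4, 64, 64)
```

Every subsequent self-attention layer then operates on the shorter sequence, which is where the quadratic attention cost (and the FLOPs that get re-invested) is saved.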

NeurIPS Conference 2020 Conference Paper

PyGlove: Symbolic Programming for Automated Machine Learning

  • Daiyi Peng
  • Xuanyi Dong
  • Esteban Real
  • Mingxing Tan
  • Yifeng Lu
  • Gabriel Bender
  • Hanxiao Liu
  • Adam Kraft

Neural networks are sensitive to hyper-parameter and architecture choices. Automated Machine Learning (AutoML) is a promising paradigm for automating these choices. Current ML software libraries, however, are quite limited in handling the dynamic interactions among the components of AutoML. For example, efficient NAS algorithms, such as ENAS and DARTS, typically require an implementation coupling between the search space and search algorithm, the two key components in AutoML. Furthermore, implementing a complex search flow, such as searching architectures within a loop of searching hardware configurations, is difficult. To summarize, changing the search space, search algorithm, or search flow in current ML libraries usually requires a significant change in the program logic. In this paper, we introduce a new way of programming AutoML based on symbolic programming. Under this paradigm, ML programs are mutable and can thus be manipulated easily by another program. As a result, AutoML can be reformulated as an automated process of symbolic manipulation. With this formulation, we decouple the triangle of the search algorithm, the search space and the child program. This decoupling makes it easy to change the search space and search algorithm (without and with weight sharing), as well as to add search capabilities to existing code and implement complex search flows. We then introduce PyGlove, a new Python library that implements this paradigm. Through case studies on ImageNet and NAS-Bench-101, we show that with PyGlove users can easily convert a static program into a search space, quickly iterate on the search spaces and search algorithms, and craft complex search flows to achieve better results.
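
To make the idea concrete without relying on PyGlove's actual API, here is a toy, hypothetical sketch of the symbolic-programming pattern the abstract describes: the search space is an ordinary program with mutable choice points, and a search algorithm manipulates those choices from the outside. All names in it (`oneof`, `materialize`, `random_search`) are invented for illustration and are not PyGlove calls.

```python
import random

class oneof:
    """A mutable choice point embedded directly in the program definition."""
    def __init__(self, candidates):
        self.candidates = candidates

# The "child program" is written naturally; choice points mark the search space.
model_spec = {
    "num_layers": oneof([2, 4, 8]),
    "hidden_dim": oneof([128, 256, 512]),
    "activation": oneof(["relu", "swish"]),
}

def materialize(spec, decisions):
    """Replace every choice point with a concrete decision, yielding a static program."""
    return {k: (decisions[k] if isinstance(v, oneof) else v) for k, v in spec.items()}

def random_search(spec, evaluate, num_trials=10):
    """A search algorithm that only sees choice points, never the program's internals."""
    best, best_score = None, float("-inf")
    for _ in range(num_trials):
        decisions = {k: random.choice(v.candidates)
                     for k, v in spec.items() if isinstance(v, oneof)}
        program = materialize(spec, decisions)
        score = evaluate(program)
        if score > best_score:
            best, best_score = program, score
    return best, best_score

# A stand-in for "train the child program and report validation accuracy".
fake_eval = lambda p: p["num_layers"] * p["hidden_dim"] / 4096
print(random_search(model_spec, fake_eval, num_trials=20))
```

Because the search algorithm touches only the choice points, swapping random search for evolution or RL, or nesting one search inside another, does not require rewriting the child program; that decoupling is the point the abstract argues for.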

NeurIPS Conference 2020 Conference Paper

RandAugment: Practical Automated Data Augmentation with a Reduced Search Space

  • Ekin Dogus Cubuk
  • Barret Zoph
  • Jon Shlens
  • Quoc Le

Recent work on automated data augmentation strategies has led to state-of-the-art results in image classification and object detection. An obstacle to a large-scale adoption of these methods is that they require a separate and expensive search phase. A common way to overcome the expense of the search phase was to use a smaller proxy task. However, it was not clear if the optimized hyperparameters found on the proxy task are also optimal for the actual task. In this work, we rethink the process of designing automated data augmentation strategies. We find that while previous work required searching for many augmentation parameters (e.g. magnitude and probability) independently for each augmentation operation, it is sufficient to only search for a single parameter that jointly controls all operations. Hence, we propose a search space that is vastly smaller (e.g. from 10^32 to 10^2 potential candidates). The smaller search space significantly reduces the computational expense of automated data augmentation and permits the removal of a separate proxy task. Despite the simplifications, our method achieves state-of-the-art performance on CIFAR-10, SVHN, and ImageNet. On EfficientNet-B7, we achieve 84.7% accuracy, a 1.0% increase over baseline augmentation and a 0.4% improvement over AutoAugment on the ImageNet dataset. On object detection, the same method used for classification leads to 1.0-1.3% improvement over the baseline augmentation method on COCO. Code is available online.
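
The reduced search space boils down to two integers: how many transformations to apply per image (N) and a single global magnitude (M) shared by all of them. A minimal sketch of that policy, with a made-up transform list standing in for the real image operations, might look like this:

```python
import random

# Placeholder transforms; a real implementation would use image ops
# (rotate, shear, color, ...) each scaled by the shared magnitude.
TRANSFORMS = {
    "identity": lambda img, mag: img,
    "brighten": lambda img, mag: [min(255, p + 8 * mag) for p in img],
    "invert":   lambda img, mag: [255 - p for p in img],
}

def rand_augment(image, n_ops=2, magnitude=9):
    """Apply `n_ops` randomly chosen transforms, all at the same global `magnitude`."""
    for _ in range(n_ops):
        name = random.choice(list(TRANSFORMS))
        image = TRANSFORMS[name](image, magnitude)
    return image

toy_image = [0, 64, 128, 255]       # stand-in for pixel data
print(rand_augment(toy_image, n_ops=2, magnitude=9))
```

Since N and M are the only hyperparameters, they can be tuned with a small grid search directly on the target task, which is why the separate proxy-task search phase can be dropped.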

NeurIPS Conference 2020 Conference Paper

Rethinking Pre-training and Self-training

  • Barret Zoph
  • Golnaz Ghiasi
  • Tsung-Yi Lin
  • Yin Cui
  • Hanxiao Liu
  • Ekin Dogus Cubuk
  • Quoc Le

Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a striking result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled data further diminish the value of pre-training, 2) unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and 3) in the case that pre-training is helpful, self-training improves upon pre-training. For example, on the COCO object detection dataset, pre-training helps when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data. Self-training, on the other hand, shows positive improvements from +1.3 to +3.4 AP across all dataset sizes. In other words, self-training works well in exactly the same setup in which pre-training does not work (using ImageNet to help COCO). On the PASCAL segmentation dataset, which is a much smaller dataset than COCO, though pre-training does help significantly, self-training improves upon the pre-trained model. On COCO object detection, we achieve 53.8 AP, an improvement of +1.7 AP over the strongest SpineNet model. On PASCAL segmentation, we achieve 90.5 mIOU, an improvement of +1.5 mIOU over the previous state-of-the-art result by DeepLabv3+.
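
For readers unfamiliar with the setup, self-training in this context follows a simple teacher-student loop. The sketch below is a generic, hedged outline of that loop (the `train`/`predict` helpers are placeholders, not the paper's code), not the exact recipe used in the experiments.

```python
def self_training(labeled, unlabeled, train, predict, keep_threshold=0.5):
    """Generic self-training loop: a teacher labels unlabeled data, a student retrains.

    `train(examples)` returns a model; `predict(model, x)` returns (label, confidence).
    Both are placeholders for task-specific training and inference.
    """
    teacher = train(labeled)                       # 1. fit a teacher on labeled data
    pseudo_labeled = []
    for x in unlabeled:                            # 2. pseudo-label the unlabeled pool
        label, confidence = predict(teacher, x)
        if confidence >= keep_threshold:           # keep only confident predictions
            pseudo_labeled.append((x, label))
    # 3. retrain a student on human labels plus pseudo-labels
    #    (the study additionally applies strong data augmentation at this step)
    student = train(labeled + pseudo_labeled)
    return student
```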

NeurIPS Conference 2020 Conference Paper

Unsupervised Data Augmentation for Consistency Training

  • Qizhe Xie
  • Zihang Dai
  • Eduard Hovy
  • Thang Luong
  • Quoc Le

Semi-supervised learning lately has shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically that produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 5.43 with only 250 examples. Our method also combines well with transfer learning, e.g., when finetuning from BERT, and yields improvements in high-data regimes such as ImageNet, whether there is only 10% labeled data or a full labeled set with 1.3M extra unlabeled examples. Code is available at https://github.com/google-research/uda.
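
The consistency-training objective the abstract refers to is commonly written as a supervised loss on labeled data plus a divergence between the model's predictions on an unlabeled example and on its augmented version; below is a hedged LaTeX rendering of that generic objective (the weight λ and the choice of KL divergence are typical choices in this literature, not necessarily the exact formulation used here).

```latex
\min_{\theta} \;
\underbrace{\mathbb{E}_{(x, y) \sim \mathcal{D}_{L}}
    \big[ -\log p_{\theta}(y \mid x) \big]}_{\text{supervised cross-entropy}}
\;+\; \lambda \,
\underbrace{\mathbb{E}_{x \sim \mathcal{D}_{U}}\,
    \mathbb{E}_{\hat{x} \sim q(\hat{x} \mid x)}
    \big[ \mathrm{KL}\!\left( p_{\tilde{\theta}}(\cdot \mid x) \,\|\, p_{\theta}(\cdot \mid \hat{x}) \right) \big]}_{\text{consistency on unlabeled data}}
```

Here q(x̂ | x) is the data augmentation (e.g. RandAugment or back-translation) and the tilde denotes a fixed copy of the parameters; the paper's contribution is the observation that making q a strong, learned augmentation is what drives the gains.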

NeurIPS Conference 2019 Conference Paper

CondConv: Conditionally Parameterized Convolutions for Efficient Inference

  • Brandon Yang
  • Gabriel Bender
  • Quoc Le
  • Jiquan Ngiam

Convolutional layers are one of the basic building blocks of modern deep neural networks. One fundamental assumption is that convolutional kernels should be shared for all examples in a dataset. We propose conditionally parameterized convolutions (CondConv), which learn specialized convolutional kernels for each example. Replacing normal convolutions with CondConv enables us to increase the size and capacity of a network, while maintaining efficient inference. We demonstrate that scaling networks with CondConv improves the performance and inference cost trade-off of several existing convolutional neural network architectures on both classification and detection tasks. On ImageNet classification, our CondConv approach applied to EfficientNet-B0 achieves state-of-the-art performance of 78.3% accuracy with only 413M multiply-adds. Code and checkpoints for the CondConv Tensorflow layer and CondConv-EfficientNet models are available at: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/condconv.
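
The core idea is that the kernel applied to each example is an input-dependent mixture of a small set of expert kernels, so the extra capacity costs one small routing computation plus a single convolution at inference time. A minimal NumPy sketch of the per-example kernel mixing (using a 1x1 convolution for brevity; the shapes and the routing function are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # per-expert routing nonlinearity (assumed)

def condconv_1x1(x, experts, route_w):
    """Per-example mixture of expert kernels, then an ordinary 1x1 convolution.

    x:        (N, H, W, C_in) activations
    experts:  (K, C_in, C_out) expert kernels for a 1x1 conv
    route_w:  (C_in, K) routing weights; routing uses globally pooled features
    """
    pooled = x.mean(axis=(1, 2))                    # (N, C_in) global average pool
    routing = sigmoid(pooled @ route_w)             # (N, K) per-example expert weights
    # Mix the experts into one kernel per example, then apply it.
    kernels = np.einsum("nk,kio->nio", routing, experts)   # (N, C_in, C_out)
    return np.einsum("nhwi,nio->nhwo", x, kernels)         # (N, H, W, C_out)

x = np.random.randn(2, 4, 4, 8)
experts = np.random.randn(4, 8, 16) * 0.1
route_w = np.random.randn(8, 4) * 0.1
print(condconv_1x1(x, experts, route_w).shape)   # (2, 4, 4, 16)
```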

NeurIPS Conference 2019 Conference Paper

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

  • Yanping Huang
  • Youlong Cheng
  • Ankur Bapna
  • Orhan Firat
  • Dehao Chen
  • Mia Chen
  • HyoukJoong Lee
  • Jiquan Ngiam

Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other machine learning tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: we train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012; (ii) Multilingual Neural Machine Translation: we train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.
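
The batch-splitting idea is easiest to see as a schedule: the mini-batch is cut into micro-batches, and at each time step every pipeline stage works on a different micro-batch so the stages stay busy. The sketch below only simulates that schedule in a single process (the stages are placeholder functions and the counts are arbitrary), which is enough to see why the speedup approaches linear as the number of micro-batches grows.

```python
import numpy as np

def pipeline_schedule(batch, stages, num_micro_batches=4):
    """Run a list of stage functions over micro-batches in a pipelined order."""
    micro_batches = np.array_split(batch, num_micro_batches)
    outputs = [None] * num_micro_batches
    num_steps = num_micro_batches + len(stages) - 1
    for step in range(num_steps):
        # At time `step`, stage s processes micro-batch (step - s), if it exists.
        for s, stage in enumerate(stages):
            m = step - s
            if 0 <= m < num_micro_batches:
                inp = micro_batches[m] if s == 0 else outputs[m]
                outputs[m] = stage(inp)
                print(f"step {step}: stage {s} <- micro-batch {m}")
    return np.concatenate(outputs)

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]   # placeholder layers
print(pipeline_schedule(np.arange(8, dtype=float), stages))
```

In the real library each stage lives on its own accelerator and the steps run concurrently; the printed schedule shows that after a short fill phase all stages are occupied at every step.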

NeurIPS Conference 2019 Conference Paper

High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks

  • Ruben Villegas
  • Arkanath Pathak
  • Harini Kannan
  • Dumitru Erhan
  • Quoc Le
  • Honglak Lee

Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex inductive biases inside network architectures with highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: finding minimal inductive bias for video prediction while maximizing network capacity. We investigate this question by performing the first large-scale empirical study and demonstrate state-of-the-art performance by learning large models on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling car driving.

NeurIPS Conference 2019 Conference Paper

Mixtape: Breaking the Softmax Bottleneck Efficiently

  • Zhilin Yang
  • Thang Luong
  • Russ Salakhutdinov
  • Quoc Le

The softmax bottleneck has been shown to limit the expressiveness of neural language models. Mixture of Softmaxes (MoS) is an effective approach to address this theoretical limitation, but it is expensive compared to softmax in terms of both memory and time. We propose Mixtape, an output layer that breaks the softmax bottleneck more efficiently with three novel techniques: logit space vector gating, sigmoid tree decomposition, and gate sharing. On four benchmarks including language modeling and machine translation, the Mixtape layer substantially improves the efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance. A network equipped with Mixtape is only 20% to 34% slower than a softmax-based network with 10-30K vocabulary sizes, and outperforms softmax in perplexity and translation quality.
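
For context, the Mixture of Softmaxes that Mixtape is compared against is usually written as a convex combination of K softmax distributions, each with its own context vector; this standard formulation (notation assumed here) is what makes MoS memory- and time-hungry, since it needs K full vocabulary-sized softmaxes per position.

```latex
P(y \mid c) \;=\; \sum_{k=1}^{K} \pi_{c,k}\,
  \frac{\exp\!\big(\mathbf{h}_{c,k}^{\top} \mathbf{w}_{y}\big)}
       {\sum_{y'} \exp\!\big(\mathbf{h}_{c,k}^{\top} \mathbf{w}_{y'}\big)},
\qquad \sum_{k=1}^{K} \pi_{c,k} = 1,\; \pi_{c,k} \ge 0
```

Here c is the context, the h vectors are per-component context representations, and the w vectors are output word embeddings. Mixtape, roughly speaking, moves the mixing into the logit space via vector gating so that only a single vocabulary-sized softmax is computed.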

NeurIPS Conference 2019 Conference Paper

Saccader: Improving Accuracy of Hard Attention Models for Vision

  • Gamaleldin Elsayed
  • Simon Kornblith
  • Quoc Le

Although deep convolutional neural networks achieve state-of-the-art performance across nearly all image classification tasks, their decisions are difficult to interpret. One approach that offers some level of interpretability by design is hard attention, which uses only relevant portions of the image. However, training hard attention models with only class label supervision is challenging, and hard attention has proved difficult to scale to complex datasets. Here, we propose a novel hard attention model, which we term Saccader. Key to Saccader is a pretraining step that requires only class labels and provides initial attention locations for policy gradient optimization. Our best models narrow the gap to common ImageNet baselines, achieving 75% top-1 and 91% top-5 while attending to less than one-third of the image.

NeurIPS Conference 2019 Conference Paper

XLNet: Generalized Autoregressive Pretraining for Language Understanding

  • Zhilin Yang
  • Zihang Dai
  • Yiming Yang
  • Jaime Carbonell
  • Russ Salakhutdinov
  • Quoc Le

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experimental settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
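
The permutation-based objective described in point (1) is usually written as an expectation over factorization orders; in the notation below (standard for this line of work, not copied verbatim from the paper), Z_T is the set of permutations of a length-T sequence and z_t is the t-th element of a sampled permutation z.

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_{T}}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

Because every token eventually appears on the prediction side under some permutation while serving as context under others, the model learns bidirectional dependencies without ever corrupting the input with mask tokens.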

NeurIPS Conference 2018 Conference Paper

DropBlock: A regularization method for convolutional networks

  • Golnaz Ghiasi
  • Tsung-Yi Lin
  • Quoc Le

Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated, so information can still flow through convolutional networks despite dropout. Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing the number of dropped units during training leads to better accuracy and makes the model more robust to hyperparameter choices. Extensive experiments show that DropBlock works better than dropout in regularizing convolutional networks. On ImageNet classification, a ResNet-50 architecture with DropBlock achieves 78.13% accuracy, which is more than a 1.6% improvement over the baseline. On COCO detection, DropBlock improves the Average Precision of RetinaNet from 36.8% to 38.4%.
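
A minimal NumPy sketch of the mechanism: sample a sparse set of block centers, zero out a square region around each, and rescale the surviving activations. The center-sampling rate and the rescaling convention here are simplifications of the paper's exact parameterization.

```python
import numpy as np

def dropblock(x, block_size=3, drop_rate=0.1, rng=np.random.default_rng(0)):
    """Structured dropout on a feature map x of shape (H, W): drop contiguous blocks.

    `drop_rate` is the probability that a position seeds a dropped block, a
    simplification of the paper's gamma derived from the target keep probability.
    """
    h, w = x.shape
    seeds = rng.random((h, w)) < drop_rate          # block centers
    mask = np.ones((h, w))
    half = block_size // 2
    for i, j in zip(*np.nonzero(seeds)):
        mask[max(0, i - half): i + half + 1, max(0, j - half): j + half + 1] = 0.0
    kept = mask.mean()
    return x * mask / max(kept, 1e-8)               # rescale to preserve expected activation

feature_map = np.ones((8, 8))
print(dropblock(feature_map))
```

Dropping whole blocks, rather than independent units, removes spatially correlated information that ordinary dropout would let leak through neighboring activations.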

NeurIPS Conference 2018 Conference Paper

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

  • Chen Liang
  • Mohammad Norouzi
  • Jonathan Berant
  • Quoc Le
  • Ni Lao

We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimates. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make MAPO efficient, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; and (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradients, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WikiTableQuestions benchmark, we improve the state of the art by 2.6%, achieving an accuracy of 46.3%. On the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our source code is available at https://goo.gl/TXBp4e
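
The weighted-sum decomposition of the expected return mentioned above can be written (in notation assumed here, with B the memory buffer of high-reward trajectories) as:

```latex
\mathcal{O}(\theta)
  \;=\; \sum_{a \in \mathcal{B}} \pi_{\theta}(a)\, R(a)
        \;+\; \sum_{a \notin \mathcal{B}} \pi_{\theta}(a)\, R(a)
  \;=\; w_{\mathcal{B}}\, \mathbb{E}_{a \sim \pi^{+}_{\mathcal{B}}}\!\left[ R(a) \right]
        \;+\; (1 - w_{\mathcal{B}})\, \mathbb{E}_{a \sim \pi^{-}_{\theta}}\!\left[ R(a) \right],
\qquad w_{\mathcal{B}} = \sum_{a \in \mathcal{B}} \pi_{\theta}(a)
```

where the two distributions are the policy renormalized inside and outside the buffer. The buffer term can be enumerated exactly or sampled cheaply because the buffer is small, which is where the variance reduction comes from; memory weight clipping keeps w_B from collapsing to zero early in training.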

NeurIPS Conference 2016 Conference Paper

An Online Sequence-to-Sequence Model Using Partial Conditioning

  • Navdeep Jaitly
  • Quoc Le
  • Oriol Vinyals
  • Ilya Sutskever
  • David Sussillo
  • Samy Bengio

Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, the Neural Transducer computes the next-step distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used.

NeurIPS Conference 2015 Conference Paper

Semi-supervised Sequence Learning

  • Andrew Dai
  • Quoc Le

We present two approaches to use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a language model in NLP. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a “pretraining” algorithm for a later supervised sequence learning algorithm. In other words, the parameters obtained from the pretraining step can then be used as a starting point for other supervised training models. In our experiments, we find that long short-term memory recurrent networks become more stable to train and generalize better after being pretrained with the two approaches. With pretraining, we were able to achieve strong performance in many classification tasks, such as text classification with IMDB and DBpedia, or image recognition on CIFAR-10.

NeurIPS Conference 2014 Conference Paper

Sequence to Sequence Learning with Neural Networks

  • Ilya Sutskever
  • Oriol Vinyals
  • Quoc Le

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
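
The encoder-decoder factorization the abstract describes can be written compactly: the encoder LSTM summarizes the input sequence into a fixed-dimensional vector v, and the decoder LSTM models the output distribution one token at a time conditioned on v (notation here is generic rather than lifted from the paper).

```latex
p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)
  \;=\; \prod_{t=1}^{T'} p\!\left(y_t \mid v,\; y_1, \dots, y_{t-1}\right),
\qquad v = \mathrm{LSTM}_{\text{enc}}(x_1, \dots, x_T)
```

Each conditional is a softmax over the vocabulary computed from the decoder LSTM's hidden state, and the whole chain is trained end to end by maximizing the log-likelihood of the target sequences.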

NeurIPS Conference 2012 Conference Paper

Large Scale Distributed Deep Networks

  • Jeffrey Dean
  • Greg Corrado
  • Rajat Monga
  • Kai Chen
  • Matthieu Devin
  • Mark Mao
  • Marc'Aurelio Ranzato
  • Andrew Senior

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 100x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
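
Downpour SGD's core pattern is a shared parameter server plus many asynchronous workers that fetch (possibly stale) parameters, compute gradients on their own data shards, and push updates back without synchronizing with each other. Below is a toy, single-process simulation of that pattern on a quadratic objective; the fetch frequency and the objective are illustrative assumptions, and the real system runs the workers on separate machines against sharded parameter servers.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([3.0, -2.0])          # optimum of the toy objective ||w - target||^2
server_params = np.zeros(2)             # the parameter server's copy of the model

def worker_gradient(local_params):
    """Noisy gradient of the toy objective at the worker's (possibly stale) parameters."""
    return 2.0 * (local_params - target) + rng.normal(scale=0.1, size=2)

num_workers, steps, lr, n_fetch = 4, 200, 0.05, 5
local = [server_params.copy() for _ in range(num_workers)]

for t in range(steps):
    for w in range(num_workers):
        if t % n_fetch == 0:
            local[w] = server_params.copy()        # fetch fresh (soon-to-be-stale) params
        grad = worker_gradient(local[w])
        local[w] -= lr * grad                      # local update between fetches
        server_params -= lr * grad                 # asynchronous push to the server

print("server params:", np.round(server_params, 2), "target:", target)
```

Even with stale reads and unsynchronized writes, the accumulated updates still pull the shared parameters toward the optimum, which is the tolerance to asynchrony the approach relies on.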

NeurIPS Conference 2011 Conference Paper

ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning

  • Quoc Le
  • Alexandre Karpenko
  • Jiquan Ngiam
  • Andrew Ng

Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonormality constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performance on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets.
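
The "ICA with reconstruction cost" objective is commonly written as an L1 sparsity penalty on the filter responses plus a soft reconstruction term that replaces the hard orthonormality constraint; the form below (constants and normalization assumed) is the usual way this trade-off is stated.

```latex
\min_{W} \;
\frac{\lambda}{m} \sum_{i=1}^{m} \left\| W x^{(i)} \right\|_{1}
\;+\;
\frac{1}{m} \sum_{i=1}^{m} \left\| W^{\top} W x^{(i)} - x^{(i)} \right\|_{2}^{2}
```

Because the reconstruction term is an unconstrained penalty rather than the constraint W Wᵀ = I, the filter bank W can be overcomplete and the problem can be handed directly to standard unconstrained optimizers such as L-BFGS, which is the connection to sparse autoencoders the abstract mentions.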

NeurIPS Conference 2010 Conference Paper

Tiled convolutional neural networks

  • Jiquan Ngiam
  • Zhenghao Chen
  • Daniel Chia
  • Pang Koh
  • Quoc Le
  • Andrew Ng

Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hard-coding. We propose tiled convolutional neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets.

NeurIPS Conference 2009 Conference Paper

Measuring Invariances in Deep Networks

  • Ian Goodfellow
  • Honglak Lee
  • Quoc Le
  • Andrew Saxe
  • Andrew Ng

For many computer vision applications, the ideal image feature would be invariant to multiple confounding image properties, such as illumination and viewing angle. Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. However, outside of using these learning algorithms in a classifier, they can sometimes be difficult to evaluate. In this paper, we propose a number of empirical tests that directly measure the degree to which these learned features are invariant to different image transforms. We find that deep autoencoders become invariant to increasingly complex image transformations with depth. This further justifies the use of “deep” vs. “shallower” representations. Our performance metrics agree with existing measures of invariance. Our evaluation metrics can also be used to evaluate future work in unsupervised deep learning, and thus help the development of future algorithms.

NeurIPS Conference 2008 Conference Paper

Tighter Bounds for Structured Estimation

  • Olivier Chapelle
  • Chuong B. Do
  • Choon Teo
  • Quoc Le
  • Alex Smola

Large-margin structured estimation methods work by minimizing a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy.
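
In the binary case, the ramp loss that this work generalizes is simply the hinge loss clipped at 1, which can be written as a difference of two convex terms; this decomposition (stated here for the binary margin m = y f(x), as background rather than the paper's structured version) is what allows existing convex solvers to be reused inside a simple outer loop for the non-convex objective.

```latex
\ell_{\text{ramp}}(m) \;=\; \min\!\bigl(1,\; \max(0,\, 1 - m)\bigr)
\;=\; \underbrace{\max(0,\, 1 - m)}_{\text{hinge (convex)}} \;-\; \underbrace{\max(0,\, -m)}_{\text{convex}}
```

Unlike the hinge upper bound, the ramp loss never exceeds 1, so badly misclassified outliers cannot dominate the objective, which is the "tighter bound" motivation.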

NeurIPS Conference 2007 Conference Paper

Bundle Methods for Machine Learning

  • Quoc Le
  • Alex Smola
  • S. V. N. Vishwanathan

We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach.
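
The core object in bundle methods of this kind is a cutting-plane lower bound on the empirical risk built from subgradients at previous iterates; each step minimizes the regularizer plus this piecewise-linear lower bound. The generic form below is standard for bundle methods applied to regularized risk minimization and uses assumed notation rather than the paper's exact symbols.

```latex
J(w) \;=\; \lambda\, \Omega(w) + R_{\mathrm{emp}}(w),
\qquad
R_{\mathrm{emp}}(w) \;\ge\; \max_{1 \le i \le t}
  \Bigl\{ R_{\mathrm{emp}}(w_i) + \bigl\langle \partial_w R_{\mathrm{emp}}(w_i),\; w - w_i \bigr\rangle \Bigr\}
```

Iteration t+1 then minimizes the regularizer plus this max of linear functions, a small convex problem, and the gap between the bound and J(w) drives the O(1/ε) and O(log(1/ε)) rates quoted in the abstract.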

NeurIPS Conference 2007 Conference Paper

COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

  • Markus Weimer
  • Alexandros Karatzoglou
  • Quoc Le
  • Alex Smola

In this paper, we consider collaborative filtering as a ranking problem. We present a method which uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating. We employ structured output prediction to optimize directly for ranking scores. Experimental results show that our method gives very good ranking scores and scales well on collaborative filtering tasks.

NeurIPS Conference 2006 Conference Paper

Learning to Rank with Nonsmooth Cost Functions

  • Christopher Burges
  • Robert Ragno
  • Quoc Le

The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions.

NeurIPS Conference 2005 Conference Paper

Large-Scale Multiclass Transduction

  • Thomas Gärtner
  • Quoc Le
  • Simon Burton
  • Alex Smola
  • Vishy Vishwanathan

We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint.