Arrow Research search

Author name cluster

Koray Kavukcuoglu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers

30

ICML Conference 2022 Conference Paper

Unified Scaling Laws for Routed Language Models

  • Aidan Clark
  • Diego de Las Casas
  • Aurelia Guy
  • Arthur Mensch
  • Michela Paganini
  • Jordan Hoffmann
  • Bogdan Damoc
  • Blake A. Hechtman

The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.

NeurIPS Conference 2020 Conference Paper

Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning

  • Jean-Bastien Grill
  • Florian Strub
  • Florent Altché
  • Corentin Tallec
  • Pierre Richemond
  • Elena Buchatskaya
  • Carl Doersch
  • Bernardo Avila Pires

We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods intrinsically rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74. 3% top-1 classification accuracy on ImageNet using the standard linear evaluation protocol with a standard ResNet-50 architecture and 79. 6% with a larger ResNet. We also show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks.

ICML Conference 2018 Conference Paper

Efficient Neural Audio Synthesis

  • Nal Kalchbrenner
  • Erich Elsen
  • Karen Simonyan
  • Seb Noury
  • Norman Casagrande
  • Edward Lockhart
  • Florian Stimberg
  • Aäron van den Oord

Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating desired samples. Efficient sampling for this class of models at the cost of little to no loss in quality has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4 times faster than real time on a GPU. Secondly, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds past sparsity levels of more than 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile phone CPU in real time. Finally, we describe a new dependency scheme for sampling that lets us trade a constant number of non-local, distant dependencies for the ability to generate samples in batches. The Batch WaveRNN produces 8 samples per step without loss of quality and offers orthogonal ways of further increasing sampling efficiency.

ICML Conference 2018 Conference Paper

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

  • Lasse Espeholt
  • Hubert Soyer
  • Rémi Munos
  • Karen Simonyan
  • Volodymyr Mnih
  • Tom Ward
  • Yotam Doron
  • Vlad Firoiu

In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al. , 2016)) and Atari57 (all available Atari games in Arcade Learning Environment (Bellemare et al. , 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.

NeurIPS Conference 2018 Conference Paper

Learning to Navigate in Cities Without a Map

  • Piotr Mirowski
  • Matt Grimes
  • Mateusz Malinowski
  • Karl Moritz Hermann
  • Keith Anderson
  • Denis Teplyashin
  • Karen Simonyan
  • Koray Kavukcuoglu

Navigating through unstructured environments is a basic capability of intelligent creatures, and thus is of fundamental interest in the study and development of artificial intelligence. Long-range navigation is a complex cognitive task that relies on developing an internal representation of space, grounded by recognisable landmarks and robust visual processing, that can simultaneously support continuous self-localisation ("I am here") and a representation of the goal ("I am going there"). Building upon recent research that applies deep reinforcement learning to maze navigation problems, we present an end-to-end deep reinforcement learning approach that can be applied on a city scale. Recognising that successful navigation relies on integration of general policies with locale-specific knowledge, we propose a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities. A key contribution of this paper is an interactive navigation environment that uses Google Street View for its photographic content and worldwide coverage. Our baselines demonstrate that deep reinforcement learning agents can learn to navigate in multiple cities and to traverse to target destinations that may be kilometres away. A video summarizing our research and showing the trained agent in diverse city environments as well as on the transfer task is available at: https: //sites. google. com/view/learn-navigate-cities-nips18

ICML Conference 2018 Conference Paper

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

  • Aäron van den Oord
  • Yazhe Li
  • Igor Babuschkin
  • Karen Simonyan
  • Oriol Vinyals
  • Koray Kavukcuoglu
  • George van den Driessche 0002
  • Edward Lockhart

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today’s massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples at more than 20 times faster than real-time, a 1000x speed up relative to the original WaveNet, and capable of serving multiple English and Japanese voices in a production setting.

ICML Conference 2017 Conference Paper

Automated Curriculum Learning for Neural Networks

  • Alex Graves
  • Marc G. Bellemare
  • Jacob Menick
  • Rémi Munos
  • Koray Kavukcuoglu

We introduce a method for automatically selecting the path, or syllabus, that a neural network follows through a curriculum so as to maximise learning efficiency. A measure of the amount that the network learns from each data sample is provided as a reward signal to a nonstationary multi-armed bandit algorithm, which then determines a stochastic syllabus. We consider a range of signals derived from two distinct indicators of learning progress: rate of increase in prediction accuracy, and rate of increase in network complexity. Experimental results for LSTM networks on three curricula demonstrate that our approach can significantly accelerate learning, in some cases halving the time required to attain a satisfactory performance level.

ICML Conference 2017 Conference Paper

Decoupled Neural Interfaces using Synthetic Gradients

  • Max Jaderberg
  • Wojciech Czarnecki 0001
  • Simon Osindero
  • Oriol Vinyals
  • Alex Graves
  • David Silver 0001
  • Koray Kavukcuoglu

Training directed neural networks typically requires forward-propagating data through a computation graph, followed by backpropagating error signal, to produce weight updates. All layers, or more generally, modules, of the network are therefore locked, in the sense that they must wait for the remainder of the network to execute forwards and propagate error backwards before they can be updated. In this work we break this constraint by decoupling modules by introducing a model of the future computation of the network graph. These models predict what the result of the modelled subgraph will produce using only local information. In particular we focus on modelling error gradients: by using the modelled synthetic gradient in place of true backpropagated error gradients we decouple subgraphs, and can update them independently and asynchronously i. e. we realise decoupled neural interfaces. We show results for feed-forward models, where every layer is trained asynchronously, recurrent neural networks (RNNs) where predicting one’s future gradient extends the time over which the RNN can effectively model, and also a hierarchical RNN system with ticking at different timescales. Finally, we demonstrate that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass – amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.

ICML Conference 2017 Conference Paper

FeUdal Networks for Hierarchical Reinforcement Learning

  • Alexander Sasha Vezhnevets
  • Simon Osindero
  • Tom Schaul
  • Nicolas Heess
  • Max Jaderberg
  • David Silver 0001
  • Koray Kavukcuoglu

We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels – allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a slower time scale and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits – in addition to facilitating very long timescale credit assignment it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation.

NeurIPS Conference 2017 Conference Paper

Neural Discrete Representation Learning

  • Aaron van den Oord
  • Oriol Vinyals
  • Koray Kavukcuoglu

Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of ``posterior collapse'' -— where the latents are ignored when they are paired with a powerful autoregressive decoder -— typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.

ICLR Conference 2017 Conference Paper

Reinforcement Learning with Unsupervised Auxiliary Tasks

  • Max Jaderberg
  • Volodymyr Mnih
  • Wojciech Czarnecki 0001
  • Tom Schaul
  • Joel Z. Leibo
  • David Silver 0001
  • Koray Kavukcuoglu

Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880\% expert human performance, and a challenging suite of first-person, three-dimensional \emph{Labyrinth} tasks leading to a mean speedup in learning of 10$\times$ and averaging 87\% expert human performance on Labyrinth.

ICML Conference 2017 Conference Paper

Understanding Synthetic Gradients and Decoupled Neural Interfaces

  • Wojciech Czarnecki 0001
  • Grzegorz Swirszcz
  • Max Jaderberg
  • Simon Osindero
  • Oriol Vinyals
  • Koray Kavukcuoglu

When training neural networks, the use of Synthetic Gradients (SG) allows layers or modules to be trained without update locking – without waiting for a true error gradient to be backpropagated – resulting in Decoupled Neural Interfaces (DNIs). This unlocked ability of being able to update parts of a neural network asynchronously and with only local information was demonstrated to work empirically in Jaderberg et al (2016). However, there has been very little demonstration of what changes DNIs and SGs impose from a functional, representational, and learning dynamics point of view. In this paper, we study DNIs through the use of synthetic gradients on feed-forward networks to better understand their behaviour and elucidate their effect on optimisation. We show that the incorporation of SGs does not affect the representational strength of the learning system for a neural network, and prove the convergence of the learning system for linear and deep linear models. On practical problems we investigate the mechanism by which synthetic gradient estimators approximate the true loss, and, surprisingly, how that leads to drastically different layer-wise representations. Finally, we also expose the relationship of using synthetic gradients to other error approximation techniques and find a unifying language for discussion and comparison.

ICML Conference 2017 Conference Paper

Video Pixel Networks

  • Nal Kalchbrenner
  • Aäron van den Oord
  • Karen Simonyan
  • Ivo Danihelka
  • Oriol Vinyals
  • Alex Graves
  • Koray Kavukcuoglu

We propose a probabilistic video model, the Video Pixel Network (VPN), that estimates the discrete joint distribution of the raw pixel values in a video. The model and the neural architecture reflect the time, space and color structure of video tensors and encode it as a four-dimensional dependency chain. The VPN approaches the best possible performance on the Moving MNIST benchmark, a leap over the previous state of the art, and the generated videos show only minor deviations from the ground truth. The VPN also produces detailed samples on the action-conditional Robotic Pushing benchmark and generalizes to the motion of novel objects.

ICML Conference 2016 Conference Paper

Asynchronous Methods for Deep Reinforcement Learning

  • Volodymyr Mnih
  • Adrià Puigdomènech Badia
  • Mehdi Mirza
  • Alex Graves
  • Timothy P. Lillicrap
  • Tim Harley
  • David Silver 0001
  • Koray Kavukcuoglu

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

NeurIPS Conference 2016 Conference Paper

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

  • S. M. Ali Eslami
  • Nicolas Heess
  • Theophane Weber
  • Yuval Tassa
  • David Szepesvari
  • Koray Kavukcuoglu
  • Geoffrey Hinton

We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps. We use this scheme to learn to perform inference in partially specified 2D models (variable-sized variational auto-encoders) and fully specified 3D models (probabilistic renderers). We show that such models learn to identify multiple objects - counting, locating and classifying the elements of a scene - without any supervision, e. g. , decomposing 3D images with various numbers of objects in a single forward pass of a neural network at unprecedented speed. We further show that the networks produce accurate inferences when compared to supervised counterparts, and that their structure leads to improved generalization.

NeurIPS Conference 2016 Conference Paper

Conditional Image Generation with PixelCNN Decoders

  • Aaron van den Oord
  • Nal Kalchbrenner
  • Lasse Espeholt
  • Koray Kavukcuoglu
  • Oriol Vinyals
  • Alex Graves

This work explores conditional image generation with a new image density model based on the PixelCNN architecture. The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks. When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures. When conditioned on an embedding produced by a convolutional network given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions. We also show that conditional PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally, the gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.

ICML Conference 2016 Conference Paper

Exploiting Cyclic Symmetry in Convolutional Neural Networks

  • Sander Dieleman
  • Jeffrey De Fauw
  • Koray Kavukcuoglu

Many classes of images exhibit rotational symmetry. Convolutional neural networks are sometimes trained using data augmentation to exploit this, but they are still required to learn the rotation equivariance properties from the data. Encoding these properties into the network architecture, as we are already used to doing for translation equivariance by using convolutional layers, could result in a more efficient use of the parameter budget by relieving the model from learning them. We introduce four operations which can be inserted into neural network models as layers, and which can be combined to make these models partially equivariant to rotations. They also enable parameter sharing across different orientations. We evaluate the effect of these architectural modifications on three datasets which exhibit rotational symmetry and demonstrate improved performance with smaller models.

NeurIPS Conference 2016 Conference Paper

Interaction Networks for Learning about Objects, Relations and Physics

  • Peter Battaglia
  • Razvan Pascanu
  • Matthew Lai
  • Danilo Jimenez Rezende
  • Koray Kavukcuoglu

Reasoning about objects, relations, and physics is central to human intelligence, and a key goal of artificial intelligence. Here we introduce the interaction network, a model which can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Our model takes graphs as input, performs object- and relation-centric reasoning in a way that is analogous to a simulation, and is implemented using deep neural networks. We evaluate its ability to reason about several challenging physical domains: n-body problems, rigid-body collision, and non-rigid dynamics. Our results show it can be trained to accurately simulate the physical trajectories of dozens of objects over thousands of time steps, estimate abstract quantities such as energy, and generalize automatically to systems with different numbers and configurations of objects and relations. Our interaction network implementation is the first general-purpose, learnable physics engine, and a powerful general framework for reasoning about object and relations in a wide variety of complex real-world domains.

NeurIPS Conference 2016 Conference Paper

Matching Networks for One Shot Learning

  • Oriol Vinyals
  • Charles Blundell
  • Timothy Lillicrap
  • Koray Kavukcuoglu
  • Daan Wierstra

Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 82. 2% to 87. 8% and from 88% accuracy to 95% accuracy on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.

ICML Conference 2016 Conference Paper

Pixel Recurrent Neural Networks

  • Aäron van den Oord
  • Nal Kalchbrenner
  • Koray Kavukcuoglu

Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.

NeurIPS Conference 2016 Conference Paper

Strategic Attentive Writer for Learning Macro-Actions

  • Alexander Vezhnevets
  • Volodymyr Mnih
  • Simon Osindero
  • Alex Graves
  • Oriol Vinyals
  • John Agapiou
  • Koray Kavukcuoglu

We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner purely by interacting with an environment in reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous sub-sequences by learning for how long the plan can be committed to -- i. e. followed without replaning. Combining these properties, the proposed model, dubbed STRategic Attentive Writer (STRAW) can learn high-level, temporally abstracted macro-actions of varying lengths that are solely learnt from data without any prior information. These macro-actions enable both structured exploration and economic computation. We experimentally demonstrate that STRAW delivers strong improvements on several ATARI games by employing temporally extended planning strategies (e. g. Ms. Pacman and Frostbite). It is at the same time a general algorithm that can be applied on any sequence data. To that end, we also show that when trained on text prediction task, STRAW naturally predicts frequent n-grams (instead of macro-actions), demonstrating the generality of the approach.

NeurIPS Conference 2015 Conference Paper

Natural Neural Networks

  • Guillaume Desjardins
  • Karen Simonyan
  • Razvan Pascanu
  • Koray Kavukcuoglu

We introduce Natural Neural Networks, a novel family of algorithms that speed up convergence by adapting their internal representation during training to improve conditioning of the Fisher matrix. In particular, we show a specific example that employs a simple and efficient reparametrization of the neural network weights by implicitly whitening the representation obtained at each layer, while preserving the feed-forward computation of the network. Such networks can be trained efficiently via the proposed Projected Natural Gradient Descent algorithm (PRONG), which amortizes the cost of these reparametrizations over many parameter updates and is closely related to the Mirror Descent online learning algorithm. We highlight the benefits of our method on both unsupervised and supervised learning tasks, and showcase its scalability by training on the large-scale ImageNet Challenge dataset.

NeurIPS Conference 2015 Conference Paper

Spatial Transformer Networks

  • Max Jaderberg
  • Karen Simonyan
  • Andrew Zisserman
  • Koray Kavukcuoglu

Convolutional Neural Networks define an exceptionallypowerful class of model, but are still limited by the lack of abilityto be spatially invariant to the input data in a computationally and parameterefficient manner. In this work we introduce a new learnable module, theSpatial Transformer, which explicitly allows the spatial manipulation ofdata within the network. This differentiable module can be insertedinto existing convolutional architectures, giving neural networks the ability toactively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the useof spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-artperformance on several benchmarks, and for a numberof classes of transformations.

ICML Conference 2015 Conference Paper

Weight Uncertainty in Neural Network

  • Charles Blundell
  • Julien Cornebise
  • Koray Kavukcuoglu
  • Daan Wierstra

We introduce a new, efficient, principled and backpropagation-compatible algorithm for learning a probability distribution on the weights of a neural network, called Bayes by Backprop. It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. We show that this principled kind of regularisation yields comparable performance to dropout on MNIST classification. We then demonstrate how the learnt uncertainty in the weights can be used to improve generalisation in non-linear regression problems, and how this weight uncertainty can be used to drive the exploration-exploitation trade-off in reinforcement learning.

NeurIPS Conference 2014 Conference Paper

Recurrent Models of Visual Attention

  • Volodymyr Mnih
  • Nicolas Heess
  • Alex Graves
  • Koray Kavukcuoglu

Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.

NeurIPS Conference 2013 Conference Paper

Learning word embeddings efficiently with noise-contrastive estimation

  • Andriy Mnih
  • Koray Kavukcuoglu

Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the art method of Mikolov et al. (2013a). We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.

NeurIPS Conference 2012 Conference Paper

Learning the Dependency Structure of Latent Factors

  • Yunlong He
  • Yanjun Qi
  • Koray Kavukcuoglu
  • Haesun Park

In this paper, we study latent factor models with the dependency structure in the latent space. We propose a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. A novel latent factor model SLFA is then proposed as a matrix factorization problem with a special regularization term that encourages collaborative reconstruction. The main benefit (novelty) of the model is that we can simultaneously learn the lower-dimensional representation for data and model the pairwise relationships between latent factors explicitly. An on-line learning algorithm is devised to make the model feasible for large-scale learning problems. Experimental results on two synthetic data and two real-world data sets demonstrate that pairwise relationships and latent factors learned by our model provide a more structured way of exploring high-dimensional data, and the learned representations achieve the state-of-the-art classification performance.

JMLR Journal 2011 Journal Article

Natural Language Processing (Almost) from Scratch

  • Ronan Collobert
  • Jason Weston
  • Léon Bottou
  • Michael Karlen
  • Koray Kavukcuoglu
  • Pavel Kuksa

We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. [abs] [ pdf ][ bib ] &copy JMLR 2011. ( edit, beta )

NeurIPS Conference 2010 Conference Paper

Learning Convolutional Feature Hierarchies for Visual Recognition

  • Koray Kavukcuoglu
  • Pierre Sermanet
  • Y-Lan Boureau
  • Karol Gregor
  • Michael Mathieu
  • Yann Cun

We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redudancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasi-sparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in multi-stage convolutional network architecture improves performance on a number of visual recognition and detection tasks.