Arrow Research search

Author name cluster

Yann Cun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
1 author row

Possible papers

9

NeurIPS Conference 2011 Conference Paper

Structured sparse coding via lateral inhibition

  • Arthur Szlam
  • Karol Gregor
  • Yann Cun

This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified.
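The pairwise-penalty idea in this abstract can be sketched in a few lines. The following is a hypothetical, much-simplified NumPy illustration (not the authors' algorithm): ISTA-style sparse coding in which each atom's soft-threshold is raised in proportion to the activity of atoms it interacts with, so a positive entry `S[i, j]` makes atoms i and j inhibit each other.

```python
import numpy as np

def lateral_inhibition_sparse_code(x, D, S, lam=0.1, lr=0.05, n_iter=300):
    """Infer a sparse code z for input x under dictionary D (columns = atoms).

    Approximately minimizes 0.5*||x - D z||^2 + lam*||z||_1 plus pairwise
    penalties S[i, j] * |z[i]| * |z[j]|, by raising each atom's shrinkage
    threshold with the activity of its interacting atoms.
    """
    K = D.shape[1]
    z = np.zeros(K)
    for _ in range(n_iter):
        z = z - lr * (D.T @ (D @ z - x))         # gradient step on reconstruction
        thresh = lr * (lam + S @ np.abs(z))      # lateral inhibition raises the threshold
        z = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    return z
```

With `S` identically zero this reduces to plain ISTA; a tree- or grid-structured `S` encodes the structured dictionaries described above.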

NeurIPS Conference 2010 Conference Paper

Learning Convolutional Feature Hierarchies for Visual Recognition

  • Koray Kavukcuoglu
  • Pierre Sermanet
  • Y-Lan Boureau
  • Karol Gregor
  • Michael Mathieu
  • Yann Cun

We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redundancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasi-sparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in a multi-stage convolutional network architecture improves performance on a number of visual recognition and detection tasks.
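The contrast with patch-level coding can be made concrete in one dimension. Below is a toy sketch (my own simplification, not the paper's method): each feature map is convolved with its filter and jointly reconstructs the whole signal, so overlapping regions share code instead of being encoded in isolation.

```python
import numpy as np

def conv_sparse_code_1d(x, filters, lam=0.1, lr=0.02, n_iter=200):
    """Toy 1-D convolutional sparse coding by ISTA.

    Minimizes 0.5*||x - sum_k conv(z[k], filters[k])||^2 + lam*sum|z|
    over the feature maps z, which cover the whole signal jointly.
    """
    K, m = filters.shape
    z = np.zeros((K, len(x) - m + 1))
    for _ in range(n_iter):
        recon = sum(np.convolve(z[k], filters[k]) for k in range(K))
        r = recon - x                                       # residual
        for k in range(K):
            z[k] -= lr * np.correlate(r, filters[k], mode="valid")
        z = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)
    return z
```

The gradient with respect to each feature map is the correlation of the residual with that map's filter, which is what makes neighboring locations trade off against each other rather than encode redundantly.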

NeurIPS Conference 2010 Conference Paper

Regularized estimation of image statistics by Score Matching

  • Durk Kingma
  • Yann Cun

Score Matching is a recently proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problems by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.
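As a toy illustration of the Score Matching criterion itself (not the paper's regularized, automatically differentiated version): for a zero-mean 1-D Gaussian with energy θx²/2, the score is ψ(x) = −θx, and the per-sample objective ½ψ(x)² + ψ′(x) = ½θ²x² − θ can be minimized in closed form, recovering the precision without ever computing a partition function.

```python
import numpy as np

# Score matching for a zero-mean 1-D Gaussian with energy E(x) = theta*x^2/2.
# The score is psi(x) = -theta*x, so the score matching objective per sample,
# 0.5*psi(x)^2 + psi'(x) = 0.5*theta^2*x^2 - theta, averaged over the data and
# minimized over theta, gives theta_hat = 1 / mean(x^2): the precision,
# estimated with no normalizing constant in sight.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=100_000)   # true precision = 1/4
theta_hat = 1.0 / np.mean(x ** 2)
```

The instabilities the abstract mentions appear when the model is richer than this and the data is quantized or scarce, which is what the added regularization term suppresses.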

NeurIPS Conference 2007 Conference Paper

Sparse Feature Learning for Deep Belief Networks

  • Marc'Aurelio Ranzato
  • Y-Lan Boureau
  • Yann Cun

Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g., low dimension, sparsity, etc.). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input variables can be captured.

NeurIPS Conference 2006 Conference Paper

Efficient Learning of Sparse Representations with an Energy-Based Model

  • Marc'Aurelio Ranzato
  • Christopher Poultney
  • Sumit Chopra
  • Yann Cun

We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces "stroke detectors" when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps.
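The inference step described above, finding a code that reconstructs the input while staying close to the encoder's prediction, can be sketched as an energy minimization. This is a simplified illustration under my own assumptions (the sparsifying nonlinearity is omitted for brevity):

```python
import numpy as np

def infer_code(x, W_e, W_d, alpha=1.0, lr=0.05, n_iter=100):
    """Minimize 0.5*||x - W_d z||^2 + 0.5*alpha*||z - W_e x||^2 over z:
    a code that reconstructs the input via the decoder W_d while staying
    as similar as possible to the encoder output W_e x."""
    z = W_e @ x                                  # start from the encoder's prediction
    for _ in range(n_iter):
        grad = W_d.T @ (W_d @ z - x) + alpha * (z - W_e @ x)
        z -= lr * grad
    return z

def energy(x, z, W_e, W_d, alpha=1.0):
    return 0.5 * np.sum((x - W_d @ z) ** 2) + 0.5 * alpha * np.sum((z - W_e @ x) ** 2)
```

The second, parameter-update phase of the EM-like procedure would then adjust `W_e` and `W_d` to lower the energy at the inferred code, which is what eventually makes the fast feed-forward encoder a good substitute for inference.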

NeurIPS Conference 2005 Conference Paper

Off-Road Obstacle Avoidance through End-to-End Learning

  • Urs Muller
  • Jan Ben
  • Eric Cosatto
  • Beat Flepp
  • Yann Cun

We describe a vision-based obstacle avoidance system for off-road mobile robots. The system is trained from end to end to map raw input images to steering angles. It is trained in supervised mode to predict the steering angles provided by a human driver during training runs collected in a wide variety of terrains, weather conditions, lighting conditions, and obstacle types. The robot is a 50 cm off-road truck, with two forward-pointing wireless color cameras. A remote computer processes the video and controls the robot via radio. The learning system is a large 6-layer convolutional network whose input is a single left/right pair of unprocessed low-resolution images. The robot exhibits an excellent ability to detect obstacles and navigate around them in real time at speeds of 2 m/s.

NeurIPS Conference 2004 Conference Paper

Synergistic Face Detection and Pose Estimation with Energy-Based Models

  • Margarita Osadchy
  • Matthew Miller
  • Yann Cun

We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. We test the resulting system, in a single configuration, on three standard data sets (one for frontal pose, one for rotated faces, and one for profiles) and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. We also show experimentally that the system's accuracy on both face detection and pose estimation is improved by training for the two tasks together.
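The paper's key device, a manifold parametrized by pose, is easy to reproduce for a single angle. A minimal sketch following the shifted-cosine scheme (three anchor angles spaced evenly over the circle; the network's 3-vector output is decoded back to an angle with arctangents):

```python
import numpy as np

anchors = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])  # equally spaced anchor angles

def encode_pose(theta):
    """Point on the pose manifold: F_i(theta) = cos(theta - theta_i)."""
    return np.cos(theta - anchors)

def decode_pose(g):
    """Recover the pose angle from a (possibly noisy) 3-D output point."""
    return np.arctan2(g @ np.sin(anchors), g @ np.cos(anchors))
```

The distance from a network output to its projection on this curve would then serve as the face energy, compared against a threshold to make the face/non-face decision.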

NeurIPS Conference 2003 Conference Paper

Large Scale Online Learning

  • Léon Bottou
  • Yann Cun

We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidence is presented.
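The claim can be illustrated in the simplest setting: one pass of stochastic gradient descent over abundant least-squares data, where a single cheap update per example already drives the parameter close to the optimum. This is a hypothetical setup for illustration, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for t, (xi, yi) in enumerate(zip(X, y), start=1):
    lr = 1.0 / (t + 10)                 # decaying step size
    w += lr * (yi - xi @ w) * xi        # one SGD step per example, single pass
```

Each example is touched exactly once, so the cost is linear in the data size; a batch solver that iterates over the full set pays a multiple of that for a statistically negligible improvement once data is plentiful.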

NeurIPS Conference 2002 Conference Paper

Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

  • Lawrence Saul
  • Daniel Lee
  • Charles Isbell
  • Yann Cun

We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.
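The "error signal of a least squares fit" idea can be illustrated in a much-simplified, non-incremental form that is not the authors' algorithm: for a sampled sinusoid the identity x[t+1] + x[t−1] = 2·cos(ω)·x[t] holds exactly, so a least-squares fit of that relation recovers the frequency with no FFTs and no autocorrelation at the pitch period.

```python
import numpy as np

def estimate_frequency(x, fs):
    """Least-squares frequency estimate for a (near-)sinusoidal signal.

    Fits x[t+1] + x[t-1] ~= a * x[t]; for a pure sinusoid of frequency f,
    a = 2*cos(2*pi*f/fs), so f is read off from the fitted coefficient.
    """
    mid = x[1:-1]
    a = ((x[2:] + x[:-2]) @ mid) / (mid @ mid)
    return np.arccos(np.clip(a / 2.0, -1.0, 1.0)) * fs / (2 * np.pi)
```

A real voiced-speech tracker must first isolate a near-sinusoidal component and update estimates sample by sample, which is where the paper's purposeful nonlinearity and incremental machinery come in.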