Summary of Krizhevsky et al.'s 2012 paper


Current approaches to object recognition make essential use of machine learning methods. Simple recognition tasks can be solved quite well with datasets of tens of thousands of images, especially if they are augmented with label-preserving transformations.
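As a quick illustration (not code from the paper), a horizontal flip is a typical label-preserving transformation: mirroring an image left-to-right changes the pixels but not the object's class, so the label can be reused on the transformed copy.

```python
import numpy as np

# A tiny hypothetical 2x3 single-channel "image", for illustration only.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Horizontal flip: reverse the column order. The object class is
# unchanged, so the original label still applies -- the transformation
# is label-preserving.
flipped = image[:, ::-1]

print(flipped.tolist())  # [[3, 2, 1], [6, 5, 4]]
```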

However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so the model should also have lots of prior knowledge to compensate for all the data it doesn't have. Luckily, current GPUs, paired with a highly optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.

The authors wrote a highly optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which they made publicly available. The size of their network made overfitting a significant problem, even with 1.2 million labeled training examples, so they used several effective techniques for preventing overfitting, described in Section 4 of the paper.
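To make the core operation concrete, here is a minimal, naive sketch of the 2D "valid" convolution that the authors' GPU kernels compute (as in most CNN libraries, it is technically a cross-correlation, i.e. the kernel is not flipped). This is illustrative only; the point of the paper's implementation is to do the same arithmetic orders of magnitude faster on a GPU.

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2D cross-correlation of image x with kernel k:
    slide the kernel over every position where it fully overlaps the
    image and take the elementwise-product sum."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
k = np.ones((2, 2))                           # 2x2 sum filter
out = conv2d(x, k)
print(out[0, 0])  # sums the top-left 2x2 patch: 0 + 1 + 4 + 5 = 10.0
```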


They used a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon's Mechanical Turk crowd-sourcing tool. In their architecture, they used the ReLU nonlinearity for the CNN, training it on multiple GPUs. They also used the local response normalization and overlapping pooling techniques. To reduce overfitting, they chose to use the data augmentation and dropout techniques. They trained their models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. They found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model's training error.
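The paper gives the update rule explicitly, and it can be sketched in a few lines (the function name and toy values below are mine, not the authors'):

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD update following the rule stated in the paper:
        v <- momentum * v - weight_decay * lr * w - lr * grad
        w <- w + v
    The weight-decay term pulls the weights themselves toward zero on
    every step, which is why the authors note it is not merely a
    regularizer: it also reduces the model's training error."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v

# Toy example: two weights, zero initial momentum buffer.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
grad = np.array([0.5, -0.5])
w, v = sgd_step(w, v, grad)
print(w)
```

In a real training loop `grad` would be the gradient of the loss averaged over a batch of 128 examples, and `v` would persist across iterations so the momentum term accumulates.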


The authors' results on ILSVRC-2010 are summarized in Table 1 of the paper. Since the ILSVRC-2012 test set labels are not publicly available, they could not report test error rates for all the models they tried.

On ILSVRC-2012, the CNN described in the paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs reduces this to 16.4%.

In the paper's tables, the best results achieved by others are shown in italics. The authors' top-1 and top-5 error rates on the Fall 2009 version of ImageNet are 67.4% and 40.9%, attained by the net described above but with an additional, sixth convolutional layer over the last pooling layer.


Their results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. They still have many orders of magnitude to go in order to match the infero-temporal pathway of the human visual system.

Personal Notes

An interesting article that describes several procedures used in machine learning. It ties together many of the topics we have covered so far at Holberton. I'm hoping to see more examples like this at Holberton School.


