【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

2017 年 12 月 17 日 机器学习研究会

点击上方 “机器学习研究会”可以订阅

摘要

转自：爱可可-爱生活

Convolutional neural networks are fantastic for visual recognition tasks. Good ConvNets are beasts with millions of parameters and many hidden layers. In fact, a bad rule of thumb is: ‘higher the number of hidden layers, better the network’. AlexNet, VGG, Inception, ResNet are some of the popular networks. Why do these networks work so well? How are they designed? Why do they have the structures they have? One wonders. The answer to these questions is not trivial and certainly, can’t be covered in one blog post. However, in this blog, I shall try to discuss some of these questions. Network architecture design is a complicated process and will take a while to learn and even longer to experiment designing on your own. But first, let’s put things in perspective:

Why are ConvNets beating traditional computer vision?

Image classification is the task of classifying a given image into one of the pre-defined categories. Traditional pipeline for image classification involves two modules: viz. feature extraction andclassification.

Feature extraction involves extracting a higher level of information from raw pixel values that can capture the distinction among the categories involved. This feature extraction is done in an unsupervised manner wherein the classes of the image have nothing to do with information extracted from pixels. Some of the traditional and widely used features are GIST, HOG, SIFT, LBP etc. After the feature is extracted, a classification module is trained with the images and their associated labels. A few examples of this module are SVM, Logistic Regression, Random Forest, decision trees etc.

The problem with this pipeline is that feature extraction cannot be tweaked according to the classes and images. So if the chosen feature lacks the representation required to distinguish the categories, the accuracy of the classification model suffers a lot, irrespective of the type of classification strategy employed. A common theme among the state of the art following the traditional pipeline has been, to pick multiple feature extractors and club them inventively to get a better feature. But this involves too many heuristics as well as manual labor to tweak parameters according to the domain to reach a decent level of accuracy. By decent I mean, reaching close to human level accuracy. That’s why it took years to build a good computer vision system(like OCR, face verification, image classifiers, object detectors etc), that can work with a wide variety of data encountered during practical application, using traditional computer vision. We once produced better results using ConvNets for a company(a client of my start-up) in 6 weeks, which took them close to a year to achieve using traditional computer vision.

Another problem with this method is that it is completely different from how we humans learn to recognize things. Just after birth, a child is incapable of perceiving his surroundings, but as he progresses and processes data, he learns to identify things. This is the philosophy behind deep learning, wherein no hard-coded feature extractor is built in. It combines the extraction and classification modules into one integrated system and it learns to extract, by discriminating representations from the images and classify them based on supervised data.

One such system is multilayer perceptrons aka neural networks which are multiple layers of neurons densely connected to each other. A deep vanilla neural network has such a large number of parameters involved that it is impossible to train such a system without overfitting the model due to the lack of a sufficient number of training examples. But with Convolutional Neural Networks(ConvNets), the task of training the whole network from the scratch can be carried out using a large dataset like ImageNet. The reason behind this is, sharing of parameters between the neurons and sparse connections in convolutional layers. It can be seen in this figure 2. In the convolution operation, the neurons in one layer are only locally connected to the input neurons and the set of parameters are shared across the 2-D feature map.

链接：

http://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/

原文链接：

https://m.weibo.cn/1402400261/4186005294906515

“完整内容”请点击【阅读原文】

↓↓↓

登录查看更多