## Capsule Networks解析

2017 年 11 月 12 日 机器学习研究会

## What are Capsule Networks and why do they exist?

The Capsule Network is a new type of neural network architecture conceptualized byGeoffrey Hinton, the motivation behind Capsule Networks is to address some of the short comings of Convolutional Neural Networks (ConvNets), which are listed below:

#### Problem 1: ConvNets are Translation Invariant [1]

What does that even mean? Imagine that we had a model that predicts cats. You show it an image of a cat, it predicts that it’s a cat. You show it the same image, but shifted to the left, it still thinks that it’s a cat without predicting any additional information.

##### Figure 1.0: Translation Invariance

What we want to strive for is translation equivariance. That means that when you show it an image of a cat shifted to the right, it predicts that it’s a cat shifted to the right. You show it the same cat but shifted towards the left, it predicts that its a cat shifted towards the left.

##### Figure 1.1: Translation Equivariance

Why is this a problem? ConvNets are unable to identify the position of one object relative to another, they can only identify if the object exists in a certain region, or not. This results in difficulty correctly identifying objects that have sub-objects that hold positional relationships relative to one another.

For example, a bunch of randomly assembled face parts will look like a face to a ConvNet, because all the key features are there:

https://kndrck.co/posts/capsule_networks_explained/

https://m.weibo.cn/1402400261/4173136671070127

“完整内容”请点击【阅读原文】
↓↓↓

### 相关内容

Explanation：网络。 Publisher：Wiley。 SIT： http://dblp.uni-trier.de/db/journals/networks/

In recent years, the biggest advances in major Computer Vision tasks, such as object recognition, handwritten-digit identification, facial recognition, and many others., have all come through the use of Convolutional Neural Networks (CNNs). Similarly, in the domain of Natural Language Processing, Recurrent Neural Networks (RNNs), and Long Short Term Memory networks (LSTMs) in particular, have been crucial to some of the biggest breakthroughs in performance for tasks such as machine translation, part-of-speech tagging, sentiment analysis, and many others. These individual advances have greatly benefited tasks even at the intersection of NLP and Computer Vision, and inspired by this success, we studied some existing neural image captioning models that have proven to work well. In this work, we study some existing captioning models that provide near state-of-the-art performances, and try to enhance one such model. We also present a simple image captioning model that makes use of a CNN, an LSTM, and the beam search1 algorithm, and study its performance based on various qualitative and quantitative metrics.

Capsule Networks preserve the hierarchical spatial relationships between objects, and thereby bears a potential to surpass the performance of traditional Convolutional Neural Networks (CNNs) in performing tasks like image classification. A large body of work has explored adversarial examples for CNNs, but their effectiveness on Capsule Networks has not yet been well studied. In our work, we perform an analysis to study the vulnerabilities in Capsule Networks to adversarial attacks. These perturbations, added to the test inputs, are small and imperceptible to humans, but can fool the network to mispredict. We propose a greedy algorithm to automatically generate targeted imperceptible adversarial examples in a black-box attack scenario. We show that this kind of attacks, when applied to the German Traffic Sign Recognition Benchmark (GTSRB), mislead Capsule Networks. Moreover, we apply the same kind of adversarial attacks to a 5-layer CNN and a 9-layer CNN, and analyze the outcome, compared to the Capsule Networks to study differences in their behavior.

During the last decade, Convolutional Neural Networks (CNNs) have become the de facto standard for various Computer Vision and Machine Learning operations. CNNs are feed-forward Artificial Neural Networks (ANNs) with alternating convolutional and subsampling layers. Deep 2D CNNs with many hidden layers and millions of parameters have the ability to learn complex objects and patterns providing that they can be trained on a massive size visual database with ground-truth labels. With a proper training, this unique ability makes them the primary tool for various engineering applications for 2D signals such as images and video frames. Yet, this may not be a viable option in numerous applications over 1D signals especially when the training data is scarce or application-specific. To address this issue, 1D CNNs have recently been proposed and immediately achieved the state-of-the-art performance levels in several applications such as personalized biomedical data classification and early diagnosis, structural health monitoring, anomaly detection and identification in power electronics and motor-fault detection. Another major advantage is that a real-time and low-cost hardware implementation is feasible due to the simple and compact configuration of 1D CNNs that perform only 1D convolutions (scalar multiplications and additions). This paper presents a comprehensive review of the general architecture and principals of 1D CNNs along with their major engineering applications, especially focused on the recent progress in this field. Their state-of-the-art performance is highlighted concluding with their unique properties. The benchmark datasets and the principal 1D CNN software used in those applications are also publically shared in a dedicated website.

In this paper, we propose Generative Adversarial Network (GAN) architectures that use Capsule Networks for image-synthesis. Based on the principal of positional-equivariance of features, Capsule Network's ability to encode spatial relationships between the features of the image helps it become a more powerful critic in comparison to Convolutional Neural Networks (CNNs) used in current architectures for image synthesis. Our proposed GAN architectures learn the data manifold much faster and therefore, synthesize visually accurate images in significantly lesser number of training samples and training epochs in comparison to GANs and its variants that use CNNs. Apart from analyzing the quantitative results corresponding the images generated by different architectures, we also explore the reasons for the lower coverage and diversity explored by the GAN architectures that use CNN critics.

Image captioning is an interdisciplinary research problem that stands between computer vision and natural language processing. The task is to generate a textual description of the content of an image. The typical model used for image captioning is an encoder-decoder deep network, where the encoder captures the essence of an image while the decoder is responsible for generating a sentence describing the image. Attention mechanisms can be used to automatically focus the decoder on parts of the image which are relevant to predict the next word. In this paper, we explore different decoders and attentional models popular in neural machine translation, namely attentional recurrent neural networks, self-attentional transformers, and fully-convolutional networks, which represent the current state of the art of neural machine translation. The image captioning module is available as part of SOCKEYE at https://github.com/awslabs/sockeye which tutorial can be found at https://awslabs.github.io/sockeye/image_captioning.html .

A key component to the success of deep learning is the availability of massive amounts of training data. Building and annotating large datasets for solving medical image classification problems is today a bottleneck for many applications. Recently, capsule networks were proposed to deal with shortcomings of Convolutional Neural Networks (ConvNets). In this work, we compare the behavior of capsule networks against ConvNets under typical datasets constraints of medical image analysis, namely, small amounts of annotated data and class-imbalance. We evaluate our experiments on MNIST, Fashion-MNIST and medical (histological and retina images) publicly available datasets. Our results suggest that capsule networks can be trained with less amount of data for the same or better performance and are more robust to an imbalanced class distribution, which makes our approach very promising for the medical imaging community.

In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.

Word Sense Disambiguation (WSD) aims to identify the correct meaning of polysemous words in the particular context. Lexical resources like WordNet which are proved to be of great help for WSD in the knowledge-based methods. However, previous neural networks for WSD always rely on massive labeled data (context), ignoring lexical resources like glosses (sense definitions). In this paper, we integrate the context and glosses of the target word into a unified framework in order to make full use of both labeled data and lexical knowledge. Therefore, we propose GAS: a gloss-augmented WSD neural network which jointly encodes the context and glosses of the target word. GAS models the semantic relationship between the context and the gloss in an improved memory network framework, which breaks the barriers of the previous supervised methods and knowledge-based methods. We further extend the original gloss of word sense via its semantic relations in WordNet to enrich the gloss information. The experimental results show that our model outperforms the state-of-theart systems on several English all-words WSD datasets.

We present Generative Adversarial Capsule Network (CapsuleGAN), a framework that uses capsule networks (CapsNets) instead of the standard convolutional neural networks (CNNs) as discriminators within the generative adversarial network (GAN) setting, while modeling image data. We provide guidelines for designing CapsNet discriminators and the updated GAN objective function, which incorporates the CapsNet margin loss, for training CapsuleGAN models. We show that CapsuleGAN outperforms convolutional-GAN at modeling image data distribution on the MNIST dataset of handwritten digits, evaluated on the generative adversarial metric and at semi-supervised image classification.

14+阅读 · 2018年1月2日

15+阅读 · 2017年12月17日

17+阅读 · 2017年12月17日

9+阅读 · 2017年11月24日

17+阅读 · 2017年11月5日

5+阅读 · 2017年9月21日

24+阅读 · 2017年9月8日

9+阅读 · 2017年9月3日

14+阅读 · 2017年8月15日

46+阅读 · 2020年3月19日

86+阅读 · 2020年3月18日

59+阅读 · 2020年2月29日

35+阅读 · 2019年10月11日

76+阅读 · 2019年10月11日

47+阅读 · 2019年10月10日

50+阅读 · 2019年10月10日

Elaina Tan,Lakshay Sharma
4+阅读 · 2019年7月2日
3+阅读 · 2019年5月24日
Serkan Kiranyaz,Onur Avci,Osama Abdeljaber,Turker Ince,Moncef Gabbouj,Daniel J. Inman
4+阅读 · 2019年5月9日
Han Hu,Zheng Zhang,Zhenda Xie,Stephen Lin
3+阅读 · 2019年4月25日