Interpreting intermediate convolutional layers of CNNs trained on raw speech

This paper presents a technique for interpreting and visualizing intermediate layers in CNNs trained on raw speech data in an unsupervised manner. We show that averaging over feature maps after ReLU activation in each convolutional layer yields interpretable time-series data. The proposed technique thereby enables acoustic analysis of intermediate convolutional layers. To uncover how meaningful representations of speech are encoded in intermediate layers of CNNs, we manipulate individual latent variables to marginal levels outside of the training range. We train and probe internal representations in two models: a bare GAN architecture and a ciwGAN extension, which forces the Generator to output informative data and results in the emergence of linguistically meaningful representations. Interpretation and visualization are performed for three basic acoustic properties of speech: periodic vibration (corresponding to vowels), aperiodic noise vibration (corresponding to fricatives), and silence (corresponding to stops). We also argue that the proposed technique allows acoustic analysis of intermediate layers that parallels the acoustic analysis of human speech data: we can extract F0, intensity, duration, formants, and other acoustic properties from intermediate layers in order to test where and how CNNs encode various types of information. The models are trained on two speech processes of different complexity: the simple presence of [s] and the computationally more complex presence of reduplication (copied material). Observing the causal effect of interpolating a latent variable on the resulting changes in intermediate layers can reveal how individual variables are transformed into spikes in activation in those layers. Using the proposed technique, we can analyze how linguistically meaningful units in speech are encoded in different convolutional layers.
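The averaging step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`average_feature_maps`, `set_latent`), the array layout (channels by time), and the toy dimensions are assumptions made for illustration, and random numbers stand in for the output of a real convolutional layer. The latent value 15.0 is likewise a hypothetical example of a level far outside an assumed training range.

```python
import numpy as np


def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)


def average_feature_maps(pre_activation):
    """Average over feature maps after ReLU.

    pre_activation: array of shape (channels, time), the output of one
    convolutional layer before the non-linearity (assumed layout).
    Returns a 1-D array of length `time`: one value per time step, which
    can then be plotted or analyzed like a waveform.
    """
    return relu(pre_activation).mean(axis=0)


def set_latent(z, index, value):
    """Return a copy of latent vector z with one variable forced to `value`.

    Hypothetical helper: pushing a single variable far outside the range
    seen in training is the kind of manipulation the abstract describes.
    """
    z = np.array(z, dtype=float, copy=True)
    z[index] = value
    return z


# Toy stand-in for one layer: 64 feature maps, 1024 time steps.
rng = np.random.default_rng(0)
layer_output = rng.standard_normal((64, 1024))
series = average_feature_maps(layer_output)

# Toy latent vector with one variable set to a hypothetical marginal level.
z = np.zeros(100)
z_marginal = set_latent(z, index=0, value=15.0)
```

In a real model, `layer_output` would be read from each convolutional layer of the Generator for a given latent vector, and the resulting series from different layers could be compared before and after a latent variable is manipulated.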
