This paper presents a technique for interpreting and visualizing intermediate layers of CNNs trained on raw speech data in an unsupervised manner. We show that averaging over feature maps after ReLU activation in each convolutional layer yields interpretable time-series data, which enables acoustic analysis of the intermediate convolutional layers. To uncover how meaningful representations of speech are encoded in the intermediate layers of CNNs, we manipulate individual latent variables to marginal levels outside of the training range. We train and probe internal representations in two models: a bare WaveGAN architecture and a ciwGAN extension, which forces the Generator to output informative data and results in the emergence of linguistically meaningful representations. Interpretation and visualization are performed for three basic acoustic properties of speech: periodic vibration (corresponding to vowels), aperiodic noise (corresponding to fricatives), and silence (corresponding to stops). We further argue that the proposed technique allows acoustic analysis of intermediate layers that parallels the acoustic analysis of human speech data: F0, intensity, duration, formants, and other acoustic properties can be extracted from intermediate layers to test where and how CNNs encode various types of information. The models are trained on two speech processes of differing complexity: the simple presence of [s] and the computationally more complex presence of reduplication (copied material). Observing the causal relationship between latent-space interpolation and the resulting changes in intermediate layers can reveal how individual variables get transformed into spikes in activation in intermediate layers. Using the proposed technique, we can analyze how linguistically meaningful units in speech are encoded in different convolutional layers.
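The core of the technique is simple enough to sketch in a few lines: at each convolutional layer, apply ReLU and average the feature maps over the channel dimension, producing one interpretable time series per layer. The snippet below is a minimal illustration of this idea, not the authors' released code; the PyTorch framework, the generator variable `G`, the latent dimensionality, and the use of forward hooks on `ConvTranspose1d` blocks are all assumptions made for the sake of the example.

```python
# Minimal sketch (assumptions noted above) of the proposed interpretation
# technique: ReLU followed by a channel-wise average at each conv layer.
import torch
import torch.nn.functional as F

def layer_time_series(feature_map: torch.Tensor) -> torch.Tensor:
    """feature_map: (batch, channels, time) activations of one conv layer.
    Returns a (batch, time) time series suitable for acoustic analysis."""
    return F.relu(feature_map).mean(dim=1)

# Hypothetical usage with forward hooks on a pretrained WaveGAN/ciwGAN-style
# Generator `G` (not defined here):
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = layer_time_series(output.detach())
    return hook

# for name, module in G.named_modules():
#     if isinstance(module, torch.nn.ConvTranspose1d):
#         module.register_forward_hook(make_hook(name))
#
# z = torch.randn(1, 100)   # latent vector (dimensionality assumed)
# z[0, 0] = 15.0            # set one variable to a marginal value
#                           # outside the training range
# _ = G(z)                  # activations[...] now hold per-layer time series
```

After a forward pass, each entry of `activations` can be treated like a waveform-derived signal, so standard acoustic measures (F0, intensity, duration, formant-like spectral peaks) can be computed on it and compared across layers and across latent-variable settings.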