Interpreting intermediate convolutional layers of generative CNNs trained on waveforms

This paper presents a technique to interpret and visualize intermediate layers in generative CNNs trained on raw speech data in an unsupervised manner. We argue that averaging over feature maps after ReLU activation in each transposed convolutional layer yields interpretable time-series data. This technique allows for acoustic analysis of intermediate layers that parallels the acoustic analysis of human speech data: we can extract F0, intensity, duration, formants, and other acoustic properties from intermediate layers in order to test where and how CNNs encode various types of information. We further combine this technique with linear interpolation of a model's latent space to show a causal relationship between individual variables in the latent space and activations in the model's intermediate convolutional layers. In particular, observing how linear interpolation of a single latent variable changes the intermediate layers can reveal how that variable is transformed into spikes in activation in those layers. We train two models and probe their internal representations: a bare WaveGAN architecture and a ciwGAN extension that forces the Generator to output informative data and results in the emergence of linguistically meaningful representations. Interpretation and visualization are performed for three basic acoustic properties of speech: periodic vibration (corresponding to vowels), aperiodic noise (corresponding to fricatives), and silence (corresponding to stops). The proposal also allows testing of higher-level morphophonological alternations such as reduplication (copying). In short, using the proposed technique, we can analyze how linguistically meaningful units in speech are encoded in each convolutional layer of a generative neural network.
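The two operations named in the abstract, averaging over feature maps after ReLU and linearly interpolating a single latent variable, can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the function names (`transposed_conv1d`, `average_feature_maps`, `interpolate_latent`), the random weights, and the toy single-layer "generator" are hypothetical and are not the WaveGAN or ciwGAN code. In particular, the toy layer feeds each latent variable in as one input channel with a single time step, which is far simpler than the architectures the paper trains.

```python
import random


def relu(x):
    """Rectified linear unit for a single float."""
    return x if x > 0.0 else 0.0


def transposed_conv1d(inputs, kernels, stride):
    """Toy 1-D transposed convolution (no padding, no bias).

    inputs:  list of in_channels lists, each of length T.
    kernels: kernels[i][o] is the length-K kernel from input
             channel i to output channel o.
    Returns out_channels lists of length (T - 1) * stride + K.
    """
    t_len = len(inputs[0])
    out_ch = len(kernels[0])
    k_len = len(kernels[0][0])
    out = [[0.0] * ((t_len - 1) * stride + k_len) for _ in range(out_ch)]
    for i, channel in enumerate(inputs):
        for t, value in enumerate(channel):
            for o in range(out_ch):
                for k in range(k_len):
                    out[o][t * stride + k] += value * kernels[i][o][k]
    return out


def average_feature_maps(pre_activations):
    """Apply ReLU, then average across channels at each time step.

    The result is one time series per layer, which can then be
    treated like a waveform for acoustic analysis.
    """
    n_ch = len(pre_activations)
    t_len = len(pre_activations[0])
    return [
        sum(relu(ch[t]) for ch in pre_activations) / n_ch
        for t in range(t_len)
    ]


def interpolate_latent(z, index, values):
    """Return copies of z with z[index] set to each value in turn.

    All other latent variables are held fixed, so any change in the
    averaged activations is attributable to the manipulated variable.
    """
    variants = []
    for v in values:
        z_new = list(z)
        z_new[index] = v
        variants.append(z_new)
    return variants


# Hypothetical toy setup: random weights stand in for a trained layer.
rng = random.Random(0)
latent_dim, out_channels, kernel_len = 4, 3, 5
kernels = [
    [[rng.uniform(-1.0, 1.0) for _ in range(kernel_len)]
     for _ in range(out_channels)]
    for _ in range(latent_dim)
]
z = [rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]

# Sweep latent variable 0 and record the averaged activation series.
series_per_value = {}
for z_variant in interpolate_latent(z, 0, [-2.0, -1.0, 0.0, 1.0, 2.0]):
    layer_input = [[v] for v in z_variant]  # one time step per channel
    pre = transposed_conv1d(layer_input, kernels, stride=2)
    series_per_value[z_variant[0]] = average_feature_maps(pre)
```

Comparing the entries of `series_per_value` across the interpolated values shows how the averaged activation at each time step responds to one latent variable. With a trained model, the same comparison would be made on the real activations of each layer rather than on this toy layer.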
