The objective of this paper is to combine multiple frame-level features into a single utterance-level representation while taking pairwise relationships into account. To this end, we propose a novel graph attentive feature aggregation module that interprets each frame-level feature as a node of a graph. The inter-relationships between all possible pairs of features, which are typically exploited only indirectly, can then be modeled directly using the graph. The module comprises a graph attention layer and a graph pooling layer, followed by a readout operation. The graph attention layer first models the non-Euclidean data manifold between different nodes. The graph pooling layer then discards the less informative nodes according to their significance. Finally, the readout operation combines the remaining nodes into a single representation. We employ two recent systems, SE-ResNet and RawNet2, which differ in input features and architecture, and demonstrate that the proposed feature aggregation module consistently yields a relative improvement of more than 10% over the baseline.
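The three stages described in the abstract (graph attention, graph pooling, readout) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: it assumes a fully connected graph over frames, uses scaled dot-product scores in place of learned attention weights, scores nodes with a hypothetical projection vector `proj` followed by sigmoid gating, and uses mean pooling as the readout. The function names and the `ratio` parameter are illustrative choices, not taken from the paper.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def graph_attention(nodes):
    """Each frame (node) attends to every frame, itself included.

    Assumption: scaled dot-product scores stand in for learned attention.
    """
    dim = len(nodes[0])
    out = []
    for h_i in nodes:
        weights = softmax([dot(h_i, h_j) / math.sqrt(dim) for h_j in nodes])
        out.append([sum(w * h_j[d] for w, h_j in zip(weights, nodes))
                    for d in range(dim)])
    return out


def topk_pool(nodes, proj, ratio):
    """Keep the top `ratio` fraction of nodes ranked by a significance score.

    Assumption: `proj` is a hypothetical learned projection vector; kept
    nodes are gated by the sigmoid of their score.
    """
    norm = math.sqrt(dot(proj, proj))
    scores = [dot(h, proj) / norm for h in nodes]
    k = max(1, math.ceil(ratio * len(nodes)))
    keep = sorted(range(len(nodes)), key=lambda i: scores[i], reverse=True)[:k]
    keep.sort()  # restore the original frame order among the kept nodes
    return [[x / (1.0 + math.exp(-scores[i])) for x in nodes[i]] for i in keep]


def readout(nodes):
    """Combine the remaining nodes into one vector (assumption: mean)."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]


def aggregate(frames, proj, ratio=0.5):
    """Frame-level features -> single utterance-level representation."""
    return readout(topk_pool(graph_attention(frames), proj, ratio))


# Toy usage: 6 frames of 4-dimensional features with random values.
random.seed(0)
frames = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
proj = [random.gauss(0.0, 1.0) for _ in range(4)]
embedding = aggregate(frames, proj, ratio=0.5)
print(len(embedding))
```

In a trained system the attention weights and the projection vector would be learned jointly with the frame-level encoder; the sketch only shows how the data flows from many frame-level vectors to one fixed-size vector.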