The Graph feature fusion technique for speaker recognition based on wav2vec2.0 framework

Zirui Ge,Haiyan Guo,Zhen Yang

The pre-trained wav2vec2.0 model has proved effective for speaker recognition. However, current feature processing methods focus on classical pooling of the output features of the pre-trained wav2vec2.0 model, such as mean pooling and max pooling. These methods treat the features as independent, unrelated units: they ignore the inter-relationships among the features and do not treat them as an overall representation of a speaker. The Gated Recurrent Unit (GRU), a feature fusion method that can also be regarded as a complicated pooling technique, focuses mainly on temporal information and may perform poorly when the main information does not lie along the temporal dimension. In this paper, we investigate the graph neural network (GNN) as a backend processing module for the wav2vec2.0 framework to address these issues. The GNN takes all the output features as graph signal data and extracts the graph structure information relating the features for speaker recognition. Specifically, we first give a simple proof that, in theory, the GNN feature fusion method can outperform mean, max, and random pooling, among other methods. Then, we model the output features of wav2vec2.0 as the vertices of a graph and construct the graph adjacency matrix with a graph attention network (GAT). Finally, we follow the message passing neural network (MPNN) design to define our message function, vertex update function, and readout function, which transform the speaker features into graph features. Experiments show that our method provides a relative improvement over the baseline methods. Code is available at xxx.
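The pipeline described in the abstract (features as vertices, an attention-derived adjacency matrix, then message, vertex update, and readout functions) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (gat_adjacency, gnn_fuse), the tanh vertex update, the mean readout, and the toy feature values are all assumptions made for illustration, and a real system would use wav2vec2.0 frame outputs and learned parameters. The final lines check one easily verified special case: with a uniform adjacency matrix and an update that simply keeps the message, this sketch reduces to mean pooling.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gat_adjacency(H, a, slope=0.2):
    """Row-normalised adjacency from GAT-style scores
    e_ij = LeakyReLU(a . [h_i || h_j]), softmax-normalised over j."""
    def score(hi, hj):
        z = sum(w * v for w, v in zip(a, hi + hj))  # hi + hj concatenates lists
        return z if z > 0 else slope * z
    return [softmax([score(hi, hj) for hj in H]) for hi in H]

def gnn_fuse(X, W, a, update=lambda h, m: math.tanh(h + m)):
    """Fuse T frame-level vectors X into one utterance-level vector.
    The default update (tanh of feature plus message) is an assumption."""
    H = [matvec(W, x) for x in X]   # vertices: projected frame features
    A = gat_adjacency(H, a)         # edges: attention weights
    n, dim = len(H), len(H[0])
    # Message function: attention-weighted sum of neighbour features.
    M = [[sum(A[i][j] * H[j][k] for j in range(n)) for k in range(dim)]
         for i in range(n)]
    # Vertex update function, applied element-wise.
    H_new = [[update(h, m) for h, m in zip(hi, mi)] for hi, mi in zip(H, M)]
    # Readout function (assumed here to be a mean over updated vertices).
    return [sum(h[k] for h in H_new) / n for k in range(dim)]

# Toy input: 3 frames, 2 dimensions (made-up numbers, not real features).
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_attention = [0.0] * 4  # all scores equal, so the adjacency is uniform

# Special case: uniform adjacency + "keep the message" update = mean pooling.
pooled = gnn_fuse(X, identity, zero_attention, update=lambda h, m: m)
mean = [sum(x[k] for x in X) / len(X) for k in range(2)]
print(pooled, mean)
```

Because mean pooling appears here as one particular setting of the attention vector and update function, the graph-based fusion in this sketch is at least as expressive as mean pooling; whether that matches the argument used in the paper's proof is not stated in the abstract.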
