This paper proposes a novel multimodal self-supervised architecture for energy-efficient audio-visual (AV) speech enhancement that integrates Graph Neural Networks with canonical correlation analysis (CCA-GNN). The proposed approach builds on a state-of-the-art CCA-GNN that learns representative embeddings by maximizing the correlation between pairs of augmented views of the same input while decorrelating disconnected features. The key idea of the conventional CCA-GNN is to discard augmentation-variant information and preserve augmentation-invariant information while avoiding the capture of redundant information. Our proposed AV CCA-GNN model extends this framework to a multimodal representation learning setting. Specifically, the model improves contextual AV speech processing by maximizing the canonical correlation between augmented views of the same channel as well as between the audio and visual embeddings. In addition, it introduces a positional node encoding that considers prior-frame sequence distance rather than feature-space distance when computing a node's nearest neighbors, injecting temporal information into the embeddings through the neighborhood's connectivity. Experiments conducted on the benchmark CHiME-3 dataset show that our proposed prior-frame-based AV CCA-GNN ensures better feature learning in the temporal context, leading to more energy-efficient speech reconstruction than state-of-the-art CCA-GNN and multilayer-perceptron baselines.
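To make the correlation objective concrete, the following is a minimal NumPy sketch of the kind of CCA-style self-supervised loss described above, applied between the audio and visual embeddings (or between two augmented views of the same channel). It is an illustrative sketch, not the paper's exact formulation: the function name `cca_style_loss`, the trade-off weight `lam`, and the per-dimension standardization are assumptions made for the example.

```python
import numpy as np

def cca_style_loss(z_a: np.ndarray, z_v: np.ndarray, lam: float = 1e-3) -> float:
    """Soft CCA-style objective between two embedding views.

    z_a, z_v: (N, D) embeddings, e.g. from the audio and visual branches.
    Pulls corresponding feature dimensions toward perfect correlation
    while penalizing correlation between non-matching (disconnected)
    dimensions, i.e. discouraging redundant information.
    """
    n, _ = z_a.shape
    # Standardize each feature dimension (zero mean, unit variance).
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_v = (z_v - z_v.mean(0)) / (z_v.std(0) + 1e-8)
    # Cross-correlation matrix between the two views.
    c = z_a.T @ z_v / n                                   # (D, D)
    invariance = np.sum((1.0 - np.diag(c)) ** 2)          # maximize matching correlations
    redundancy = np.sum((c - np.diag(np.diag(c))) ** 2)   # decorrelate the rest
    return float(invariance + lam * redundancy)
```

Minimizing this quantity for both the intra-channel (augmented-view) pair and the audio-visual pair corresponds to the two correlation terms the model maximizes.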
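The prior-frame positional encoding can likewise be sketched as a graph-construction rule: instead of linking each frame node to its k nearest neighbors in feature space, it is linked to the k frames that precede it in the sequence. The helper below is a hypothetical illustration of that neighbor rule under this reading; the exact connectivity used in the model may differ.

```python
def prior_frame_neighbors(num_frames: int, k: int = 4) -> list[tuple[int, int]]:
    """Build directed graph edges by connecting each frame node to its k
    previous frames (temporal-index distance), rather than to its k nearest
    neighbors in feature space, so sequence order shapes the node connectivity.
    """
    edges = []
    for t in range(num_frames):
        for offset in range(1, k + 1):
            if t - offset >= 0:
                edges.append((t, t - offset))
    return edges

# Example: a 6-frame utterance with k=2 yields edges such as (2, 1), (2, 0), ...
print(prior_frame_neighbors(6, k=2))
```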