Autodecompose: A generative self-supervised model for semantic decomposition

We introduce Autodecompose, a novel self-supervised generative model that decomposes data, without any labels, into two semantically independent properties: the desired property, which captures a specific aspect of the data (e.g., the voice in an audio signal), and the context property, which aggregates all other information (e.g., the content of the audio signal). Autodecompose uses two complementary augmentations: one manipulates the context while preserving the desired property, and the other manipulates the desired property while preserving the context. The augmented variants of the data are encoded by two encoders and reconstructed by a decoder. We prove that one of the encoders embeds the desired property while the other embeds the context property. We apply Autodecompose to audio signals to encode the sound source (human voice) and the content. We pre-trained the model on the YouTube and LibriSpeech datasets and fine-tuned it in a self-supervised manner without exposing the labels. Our results show that, using the sound source encoder of the pre-trained Autodecompose, a linear classifier achieves an F1 score of 97.6% in recognizing the voices of 30 speakers using only 10 seconds of labeled samples, compared to 95.7% for supervised models. Additionally, our experiments show that Autodecompose is robust against overfitting even when a large model is pre-trained on a small dataset: a large Autodecompose model pre-trained from scratch on 60 seconds of audio from 3 speakers achieved an F1 score of over 98.5% in recognizing those three speakers in other, unseen utterances. Finally, we show that the context encoder embeds information about the content of the speech and ignores the sound source information. Our sample code for training the model, as well as examples of using the pre-trained models, is available here: https://github.com/rezabonyadi/autodecompose
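The two-encoder, one-decoder arrangement described above can be illustrated with a toy forward pass. This is only a minimal sketch of one plausible reading of the abstract, not the authors' implementation: the linear layers, the dimensions, the toy data layout (first half of the vector is the desired property, second half is the context), the routing of each augmented variant to a particular encoder, and all function names are assumptions made for illustration. No training loop is included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16-dim toy signal whose first half stands for the
# desired property and whose second half stands for the context.
D, HALF = 16, 8
K_PROP, K_CTX = 4, 4

# Assumed linear encoders and decoder (random weights, untrained).
W_prop = rng.normal(scale=0.1, size=(D, K_PROP))
W_ctx = rng.normal(scale=0.1, size=(D, K_CTX))
W_dec = rng.normal(scale=0.1, size=(K_PROP + K_CTX, D))


def augment_context(x):
    """Toy augmentation: resample the context half, keep the property half."""
    out = x.copy()
    out[HALF:] = rng.normal(size=HALF)
    return out


def augment_property(x):
    """Toy augmentation: resample the property half, keep the context half."""
    out = x.copy()
    out[:HALF] = rng.normal(size=HALF)
    return out


def forward(x):
    # Assumption: the property encoder sees the variant whose context was
    # changed, so only the desired property is reliably available to it.
    z_prop = augment_context(x) @ W_prop
    # Assumption: the context encoder sees the variant whose desired
    # property was changed, so only the context is reliably available.
    z_ctx = augment_property(x) @ W_ctx
    # A single decoder reconstructs the original input from both codes.
    x_hat = np.concatenate([z_prop, z_ctx]) @ W_dec
    return z_prop, z_ctx, x_hat


def reconstruction_loss(x):
    """Mean squared error between the original input and its reconstruction."""
    _, _, x_hat = forward(x)
    return float(np.mean((x_hat - x) ** 2))


x = rng.normal(size=D)
loss = reconstruction_loss(x)
```

Under these assumptions, minimizing `reconstruction_loss` pushes each encoder to keep only the information that survives its augmentation, which is the intuition behind the decomposition claim. The linear classifier mentioned in the abstract would then be trained on `z_prop` with the encoder frozen.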

