Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use controllable speech synthesis to generate speech with varied styles and content. Surprisingly, we find that the text content of the synthetic data, rather than style, is important for speaker adaptation. These results lead us to propose a data selection strategy for ASR personalization based on speech content.