De-identification of data used for automatic speech recognition modeling is a critical component in protecting privacy, especially in the medical domain. However, simply removing all personally identifiable information (PII) from end-to-end model training data leads to significant performance degradation, particularly for the recognition of names, dates, locations, and words from similar categories. We propose and evaluate a two-step method for partially recovering this loss. First, PII is identified, and each occurrence is replaced with a random word sequence of the same category. Then, the corresponding audio is produced via text-to-speech or by splicing together matching audio fragments extracted from the corpus. These artificial audio/label pairs, together with speaker turns from the original data that contain no PII, are used to train models. We evaluate the performance of this method on in-house data of medical conversations and observe recovery of almost the entire performance degradation in the general word error rate while still maintaining strong diarization performance. Our main focus is the improvement of recall and precision in the recognition of PII-related words. Depending on the PII category, between $50\%$ and $90\%$ of the performance degradation can be recovered using our proposed method.
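The first step of the method, replacing each tagged PII occurrence with a random word sequence of the same category, can be sketched as follows. This is only an illustration under assumed interfaces: the category inventory, replacement pools, and span representation here are hypothetical, not the paper's actual implementation.

```python
import random

# Hypothetical replacement pools per PII category; the paper's actual
# category set and sampling scheme are not specified in the abstract.
REPLACEMENTS = {
    "NAME": ["mary johnson", "david lee"],
    "DATE": ["march third", "july twentieth"],
    "LOCATION": ["springfield", "riverside"],
}

def replace_pii(tokens, spans, rng=random):
    """Replace each tagged PII span with a random same-category word sequence.

    tokens: list of words in the transcript.
    spans:  list of (start, end, category) tuples, end exclusive.
    Returns the de-identified token list.
    """
    out = []
    cursor = 0
    for start, end, category in sorted(spans):
        out.extend(tokens[cursor:start])  # copy words outside PII spans
        out.extend(rng.choice(REPLACEMENTS[category]).split())
        cursor = end
    out.extend(tokens[cursor:])
    return out

# Example: the NAME and DATE spans are replaced by random same-category text.
tokens = "patient john smith visited on may first".split()
spans = [(1, 3, "NAME"), (5, 7, "DATE")]
print(replace_pii(tokens, spans))
```

The resulting text would then be paired with synthetic or spliced audio (step two) to form the artificial training pairs described above.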