Data anonymization is often carried out manually. Automating it would reduce the cost and time the task requires. This paper presents a pipeline that automates the anonymization of French audio data: it takes audio files with their transcriptions and removes the named entities (NEs) spoken in the audio. The pipeline consists of a forced aligner, which aligns the words of a transcript with the speech signal, and a model that performs named entity recognition (NER). The audio segments that correspond to NEs are then replaced with silence. We compared several forced aligners and NER models to find the best ones for our scenario, and evaluated the pipeline on a small hand-annotated dataset, achieving an F1 score of 0.769. This result suggests that automating the task is feasible.
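The final step described in the abstract, replacing the audio spans of named entities with silence, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the forced aligner has already produced per-word start and end times in seconds, and that the NER model has already produced the indices of the words that belong to named entities. The names `AlignedWord` and `silence_entities` are hypothetical, and the audio is modeled as a plain list of samples rather than a real audio file.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class AlignedWord:
    """One transcript word with its time span (hypothetical aligner output)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def silence_entities(
    samples: List[float],
    sample_rate: int,
    words: List[AlignedWord],
    entity_indices: Set[int],
) -> List[float]:
    """Return a copy of `samples` with every named-entity word zeroed out.

    `entity_indices` holds the positions in `words` that an NER model
    (assumed, not shown here) tagged as part of a named entity.
    """
    out = list(samples)  # leave the caller's audio untouched
    for i in sorted(entity_indices):
        word = words[i]
        # Convert the word's time span to sample positions, clamped to the signal.
        first = max(0, round(word.start * sample_rate))
        last = min(len(out), round(word.end * sample_rate))
        for k in range(first, last):
            out[k] = 0.0  # digital silence
    return out
```

For instance, with a transcript aligned as "bonjour" (0.0 to 0.5 s), "Marie" (0.5 to 1.0 s), and "merci" (1.0 to 2.0 s), passing `entity_indices={1}` zeroes only the samples spanning "Marie" and leaves the rest of the signal unchanged. In practice, alignment errors at word boundaries may call for a small padding margin around each span, which this sketch omits.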