We aim to improve spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR to 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training a LM from scratch. The new models (FlauBERT-Oral) are shared with the community and evaluated on three downstream tasks: spoken language understanding, classification of TV shows, and speech syntactic parsing. Results show that FlauBERT-Oral can be beneficial compared to the initial FlauBERT version, demonstrating that, despite its inherently noisy nature, ASR-generated text can be used to build spoken language models.
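Fine-tuning a BERT-style model such as FlauBERT on ASR transcripts relies on the masked language modeling (MLM) objective: a fraction of tokens is hidden and the model is trained to recover them. The following is a minimal, library-free sketch of MLM input preparation; the mask id, the 15% masking rate, and the `-100` ignore label are illustrative conventions (matching common toolkits), not details taken from the paper.

```python
import random

MASK_ID = 0  # hypothetical mask token id; a real tokenizer defines its own


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Prepare an MLM training example from a token id sequence.

    Each token is independently replaced by MASK_ID with probability
    mask_prob. The returned labels hold the original id at masked
    positions and -100 elsewhere, so the loss is computed only on
    masked tokens (a common convention in MLM training code).
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for t in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model
            labels.append(t)         # model must predict the original
        else:
            inputs.append(t)         # token passed through unchanged
            labels.append(-100)      # position ignored by the loss
    return inputs, labels


# Example: mask a toy sequence of 20 token ids.
ids = list(range(1, 21))
inp, lab = mask_tokens(ids)
print(len(inp), len(lab))  # both 20: masking preserves sequence length
```

In practice the same preparation is applied to tokenized ASR transcripts in batches, and the masked inputs are fed to the model while the labels drive the cross-entropy loss at masked positions only.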