Pre-trained language models have established the state of the art on various natural language processing tasks, including dialogue summarization, which allows the reader to quickly access key information from long conversations in meetings, interviews or phone calls. However, such dialogues remain difficult for current models to handle, because the spontaneity of the language produces expressions that are rarely present in the corpora used for pre-training. Moreover, the vast majority of work in this field has focused on English. In this work, we present a study on the summarization of spontaneous oral dialogues in French using several language-specific pre-trained models, BARThez and BelGPT-2, as well as multilingual pre-trained models: mBART, mBARThez, and mT5. Experiments were performed on the DECODA (Call Center) dialogue corpus, where the task is to generate abstractive synopses from call center conversations between a caller and one or several agents, depending on the situation. Results show that the BARThez models offer the best performance, far above the previous state of the art on DECODA. We further discuss the limits of such pre-trained models and the challenges that must be addressed for summarizing spontaneous dialogues.