The segmentation of emails into functional zones (also known as email zoning) is a relevant preprocessing step for most NLP tasks that deal with emails. However, despite the multilingual character of emails and their applications, previous email zoning corpora and systems were developed essentially for English. In this paper, we analyse the existing email zoning corpora and propose a new multilingual benchmark composed of 635 emails in Portuguese, Spanish and French. Moreover, we introduce OKAPI, the first multilingual email segmentation model based on a language-agnostic sentence encoder. Besides generalizing well to unseen languages, our model is competitive on current English benchmarks and reaches new state-of-the-art performance on domain adaptation tasks in English.