We present a fairly large Potential Idiomatic Expression (PIE) dataset for Natural Language Processing (NLP) in English. The difficulties that NLP systems face in tasks such as Machine Translation (MT), word sense disambiguation (WSD) and information retrieval make a labelled idioms dataset with fine-grained classes, such as the one in this work, imperative. To the best of the authors' knowledge, this is the first idioms corpus whose classes go beyond the usual distinction between literal and general idiomatic usage. In particular, the following classes are labelled in the dataset: metaphor, simile, euphemism, parallelism, personification, oxymoron, paradox, hyperbole, irony and literal. We obtain an overall inter-annotator agreement (IAA) score of 88.89% between two independent annotators. Many past efforts have been limited in corpus size and in the classes of samples covered, but this dataset contains over 20,100 samples with almost 1,200 cases of idioms (with their meanings) from 10 classes (or senses). Researchers may also extend the corpus to meet specific needs. The corpus is part-of-speech (PoS) tagged using the NLTK library. Classification experiments on the corpus, performed to obtain a baseline and to compare three common models, including BERT, give good results. We also make the corpus and the relevant code for working with it on NLP tasks publicly available.
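As an illustration of the agreement figure mentioned above, the following is a minimal sketch of how a two-annotator agreement score could be computed. It assumes simple percent agreement; the abstract does not state which IAA measure was used, so this is an assumption rather than the authors' method. The function name `percent_agreement` and the toy labels are hypothetical and are not taken from the corpus.

```python
def percent_agreement(labels_a, labels_b):
    """Return the percentage of items on which two annotators chose the same label.

    Assumption: plain percent agreement, not a chance-corrected measure
    such as Cohen's kappa.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same number of items.")
    if not labels_a:
        raise ValueError("At least one labelled item is required.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Toy labels for illustration only (hypothetical, not real corpus annotations).
annotator_1 = ["metaphor", "simile", "literal", "irony", "euphemism"]
annotator_2 = ["metaphor", "simile", "literal", "irony", "hyperbole"]

score = percent_agreement(annotator_1, annotator_2)
print(f"Agreement: {score:.2f}%")
```

Percent agreement is easy to interpret but does not correct for agreement expected by chance, which can matter when one class (for example, literal or metaphor) dominates the samples.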