DiaLex: A Benchmark for Evaluating Multidialectal Arabic Word Embeddings

Muhammad Abdul-Mageed, Shady Elbassuoni, Jad Doughman, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Yorgo Zoughby, Ahmad Shaher, Iskander Gaba, Ahmed Helal, Mohammed El-Razzaz

From arXiv; WANLP 2021

Word embeddings are a core component of modern natural language processing systems, so the ability to evaluate them thoroughly is vital. We describe DiaLex, a benchmark for intrinsic evaluation of dialectal Arabic word embeddings. DiaLex covers five important Arabic dialects: Algerian, Egyptian, Lebanese, Syrian, and Tunisian. Across these dialects, DiaLex provides a testbank for six syntactic and semantic relations, namely male to female, singular to dual, singular to plural, antonym, comparative, and genitive to past tense. DiaLex thus consists of a collection of word pairs representing each of the six relations in each of the five dialects. To demonstrate the utility of DiaLex, we use it to evaluate a set of existing Arabic word embeddings as well as new ones that we developed. Our benchmark, evaluation code, and new word embedding models will be publicly available.



Word vector representations


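Distributed representations encode language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such embedding methods can be trained directly on large unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: larger context windows generally yield embeddings that reflect topical information, while smaller context windows yield embeddings that reflect a word's function and contextual semantics.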
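The analogy property can be illustrated with a minimal sketch. It assumes the common vector-offset formulation (pick the word whose vector is closest by cosine similarity to b − a + c), which is not necessarily the exact procedure used in DiaLex's evaluation code. The vectors in `toy_vectors` and the helper `solve_analogy` are hypothetical and were invented for illustration; they are not trained embeddings.

```python
import numpy as np

# Toy 2-d vectors, hand-made for illustration only (NOT trained embeddings).
toy_vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.2]),
    "queen": np.array([3.0, 1.2]),
    "apple": np.array([0.0, 3.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Returns the word whose vector is most similar to b - a + c,
    excluding the three query words themselves.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = solve_analogy("man", "woman", "king", toy_vectors)
print(answer)
```

With real embeddings, the same query would be run over the whole vocabulary, and a benchmark of word pairs could be scored by the fraction of analogy questions answered correctly.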
