SAT Based Analogy Evaluation Framework for Persian Word Embeddings

Word embeddings have attracted considerable interest in recent years as an approach to converting words into vectors. A central question is how much of the semantics of the words is carried over into the embedding vectors. This matters because the embedding serves as the basis for downstream NLP applications, and evaluating an application end-to-end just to assess the quality of the embedding model it uses is costly. Word embeddings are generally evaluated through a number of tests, including the analogy test. In this paper we propose a test framework for Persian embedding models. Persian is a low-resource language, and there is no rich semantic benchmark for evaluating word embedding models in this language. The evaluation framework we introduce includes a handcrafted Persian SAT-based analogy dataset, a colloquial test set (specific to Persian), and a benchmark for studying the impact of various parameters on the semantic evaluation task.
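As a rough illustration of what a SAT-style analogy question asks of an embedding model, the sketch below scores a multiple-choice item by comparing the offset vector of the stem pair with the offset vector of each candidate pair. The abstract does not state how the framework scores its questions, so the scoring rule, the function names (`relation`, `answer_question`, `accuracy`), the question layout, and the toy vectors are all assumptions made for illustration; English words stand in for Persian ones.

```python
import numpy as np

# Toy 3-D vectors invented for this sketch; they come from no trained model.
vectors = {
    "hot": [1.0, 0.0, 0.2], "cold": [-1.0, 0.0, 0.2],
    "tall": [0.9, 0.1, -0.5], "short": [-0.9, 0.1, -0.5],
    "cat": [0.1, 1.0, 0.0], "dog": [0.2, 0.9, 0.1],
    "day": [0.3, -0.2, 0.8], "sun": [0.4, -0.1, 0.9],
}
vectors = {word: np.array(vec) for word, vec in vectors.items()}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def relation(pair):
    """Represent the relation of a word pair (a, b) as the offset b - a."""
    a, b = pair
    return vectors[b] - vectors[a]


def answer_question(stem, choices):
    """Return the index of the choice whose offset best matches the stem's."""
    stem_rel = relation(stem)
    scores = [cosine(stem_rel, relation(choice)) for choice in choices]
    return int(np.argmax(scores))


def accuracy(questions):
    """Fraction of questions where the predicted index equals the key."""
    correct = sum(
        answer_question(q["stem"], q["choices"]) == q["answer"]
        for q in questions
    )
    return correct / len(questions)


# One hypothetical question: hot is to cold as ... ?
questions = [
    {
        "stem": ("hot", "cold"),
        "choices": [("cat", "dog"), ("tall", "short"), ("day", "sun")],
        "answer": 1,
    },
]

print(accuracy(questions))
```

Comparing pair offsets is only one plausible scoring rule; a real evaluation would load trained Persian embeddings and would need a policy for out-of-vocabulary words.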

Related content

Word vector representation

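A distributed representation represents language as dense, low-dimensional, continuous vectors. Researchers observed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such embedding methods can all be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: a large context window typically yields embeddings that reflect topical information, while a small context window yields embeddings that reflect a word's function and its contextual semantics.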
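The analogy relation man − woman ≈ king − queen can be checked with simple vector arithmetic: rearranged, woman − man + king should land near queen. The sketch below is a minimal illustration of that offset test; the 2-D vectors are invented for the example and the helper name `solve_analogy` is hypothetical, not taken from any library.

```python
import numpy as np

# Toy 2-D embeddings, invented for illustration (not from a trained model).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
embeddings = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" with the vector offset method.

    Returns the vocabulary word closest (by cosine) to b - a + c,
    excluding the three query words themselves.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = solve_analogy("man", "woman", "king", embeddings)
print(answer)  # prints "queen" for these toy vectors
```

With real embeddings the match is approximate rather than exact, which is why analogy tests usually report accuracy over many such questions.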

[Paper] TENER: Adapting Transformer Encoder for Named Entity Recognition

Zhuanzhi member service

52+ reads · December 28, 2019

[Cross-lingual/cross-domain transfer of NLP models] Transferring NLP models across languages and domains

43+ reads · November 25, 2019

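Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis, Prof. Xiaojun Quan (权小军), School of Data Science and Computer Science, Sun Yat-sen University, 8th National Conference on Social Media Processing (SMP 2019)

19+ reads · October 22, 2019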

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation