More Romanian word embeddings from the RETEROM project

From arXiv. Published in Proceedings of the 13th International Conference on Linguistic Resources and Tools for Processing Romanian Language (ConsILR 2018). The complete proceedings volume is available here: https://profs.info.uaic.ro/~consilr/2019/wp-content/uploads/2019/06/volum-ConsILR-2018-1.pdf

Automatically learned vector representations of words, also known as "word embeddings", are becoming a basic building block for more and more natural language processing algorithms. There are different ways and tools for constructing word embeddings. Most approaches rely on raw text, building the vectors from word occurrences and/or letter n-grams. More elaborate research uses additional linguistic features extracted by text preprocessing. Morphology is well served by vector representations constructed from raw text and letter n-grams. Studies of syntax and semantics may profit more from vector representations constructed with additional features such as the lemma, part of speech, and syntactic or semantic dependents associated with each word. One of the key objectives of the ReTeRom project is the development of advanced technologies for Romanian natural language processing, including morphological, syntactic and semantic analysis of text. As such, we plan to develop a large open-access library of ready-to-use word embedding sets, each set being characterized by different parameters: features used (word forms, letter n-grams, lemmas, POS tags, etc.), vector length, window/context size and frequency thresholds. To this end, the previously created sets of word embeddings (based on word occurrences) on the CoRoLa corpus (Păiș and Tufiș, 2018) are being, and will continue to be, augmented with new representations learned from the same corpus using specific features such as lemmas and parts of speech. Furthermore, to help users understand and explore the vectors, graphical representations will be available through customized interfaces.
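To make the parameters listed in the abstract concrete, here is a minimal sketch of how one embedding set could be described and how its training units could be derived from an annotated corpus. It is not the project's actual pipeline: the names `Token`, `EmbeddingSetConfig`, `training_unit` and `build_training_sentences`, the `lemma_POS` joining convention, and the two hand-annotated toy sentences are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated corpus token (hypothetical structure)."""
    wordform: str
    lemma: str
    pos: str


@dataclass(frozen=True)
class EmbeddingSetConfig:
    """Parameters that characterize one embedding set (hypothetical names)."""
    feature: str        # "wordform", "lemma" or "lemma_pos"
    vector_length: int  # dimensionality handed to the trainer
    window_size: int    # context size handed to the trainer
    min_frequency: int  # units rarer than this are dropped


def training_unit(token: Token, feature: str) -> str:
    """Map a token to the string the embedding model would be trained on."""
    if feature == "wordform":
        return token.wordform.lower()
    if feature == "lemma":
        return token.lemma
    if feature == "lemma_pos":
        return f"{token.lemma}_{token.pos}"
    raise ValueError(f"unknown feature: {feature}")


def build_training_sentences(sentences, config):
    """Convert annotated sentences to unit sequences and apply the frequency threshold."""
    units = [[training_unit(t, config.feature) for t in s] for s in sentences]
    counts = Counter(u for sentence in units for u in sentence)
    return [[u for u in sentence if counts[u] >= config.min_frequency]
            for sentence in units]


# Toy, hand-annotated sentences (assumed annotations, for illustration only).
SENTENCES = [
    [Token("Copiii", "copil", "NOUN"), Token("citesc", "citi", "VERB"),
     Token("cărți", "carte", "NOUN")],
    [Token("Copilul", "copil", "NOUN"), Token("citește", "citi", "VERB"),
     Token("o", "un", "DET"), Token("carte", "carte", "NOUN")],
]

if __name__ == "__main__":
    for feature in ("wordform", "lemma_pos"):
        cfg = EmbeddingSetConfig(feature, vector_length=300, window_size=5,
                                 min_frequency=2)
        print(feature, build_training_sentences(SENTENCES, cfg))
```

In this toy corpus every word form occurs once, so a threshold of 2 removes all of them, while the lemma-based units of the two sentences coincide and survive. The resulting sequences, together with `vector_length` and `window_size`, would then be passed to whichever embedding trainer is used.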

Related content

Word vector representations

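A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers observed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. The methods that learn such embeddings can be trained directly on large unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size. In general, embeddings learned with a large context window reflect more topical information, while embeddings learned with a small context window reflect more of a word's function and its contextual semantics.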
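The analogy relation and the role of the context window can be sketched in a few lines. The two-dimensional vectors below are hand-made so that the offsets line up exactly; real embeddings are learned and only approximately satisfy such relations. The function names `cosine`, `solve_analogy` and `context_pairs` are illustrative, not taken from any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Find d such that a - b ≈ c - d, i.e. the word nearest to c - a + b."""
    target = [vc - va + vb
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def context_pairs(tokens, window_size):
    """List (center, context) pairs seen by a model with the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


# Toy vectors (assumed values): first axis ~ "royalty", second axis ~ "gender".
TOY_VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

if __name__ == "__main__":
    print(solve_analogy(TOY_VECTORS, "man", "woman", "king"))  # queen
    print(context_pairs(["a", "b", "c"], window_size=1))
```

A larger `window_size` makes `context_pairs` pair each word with more distant neighbours, which is one way to see why wide windows tend to capture topical association while narrow windows capture more local, function-like behaviour.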
