comp-syn: Perceptually Grounded Word Embeddings with Color

From arXiv: 9 pages, 3 figures; all code and data available at https://github.com/comp-syn/comp-syn. Forthcoming in the Proceedings of the 28th International Conference on Computational Linguistics.

Popular approaches to natural language processing create word embeddings based on textual co-occurrence patterns, but often ignore embodied, sensory aspects of language. Here, we introduce the Python package comp-syn, which provides grounded word embeddings based on the perceptually uniform color distributions of Google Image search results. We demonstrate that comp-syn significantly enriches models of distributional semantics. In particular, we show that (1) comp-syn predicts human judgments of word concreteness with greater accuracy and in a more interpretable fashion than word2vec using low-dimensional word-color embeddings, and (2) comp-syn performs comparably to word2vec on a metaphorical vs. literal word-pair classification task. comp-syn is open-source on PyPI and is compatible with mainstream machine-learning Python packages. Our package release includes word-color embeddings for over 40,000 English words, each associated with crowd-sourced word concreteness judgments.
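
To make the idea of a word-color embedding concrete, the following is a minimal sketch, not the comp-syn API. It assumes that a word's images are already available as 8-bit sRGB arrays, and it swaps in CIELAB as an approximately perceptually uniform color space, which may differ from the space comp-syn uses. The names `srgb_to_lab` and `color_embedding`, the bin counts, and the histogram ranges are all illustrative assumptions.

```python
import numpy as np

# Linear sRGB (D65) to CIE XYZ conversion matrix.
_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
_D65_WHITE = np.array([0.95047, 1.00000, 1.08883])


def srgb_to_lab(rgb):
    """Convert an (..., 3) array of 8-bit sRGB values to CIELAB."""
    c = np.asarray(rgb, dtype=np.float64) / 255.0
    # Undo the sRGB transfer curve to obtain linear light.
    linear = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    # Convert to XYZ and normalize by the D65 white point.
    xyz = linear @ _RGB_TO_XYZ.T / _D65_WHITE
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    lightness = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([lightness, a, b], axis=-1)


def color_embedding(images, bins=(2, 2, 2)):
    """Pool the pixels of several images into one normalized Lab histogram.

    Hypothetical helper: the bin layout is an assumption chosen for brevity.
    """
    pixels = np.concatenate([np.asarray(img).reshape(-1, 3) for img in images])
    lab = srgb_to_lab(pixels)
    lo = np.array([0.0, -128.0, -128.0])
    hi = np.array([100.0, 128.0, 128.0])
    # Clip so that rounding error cannot push a pixel outside the histogram range.
    lab = np.clip(lab, lo, hi)
    counts, _ = np.histogramdd(lab, bins=bins, range=list(zip(lo, hi)))
    return counts.ravel() / counts.sum()


# Synthetic stand-ins for image search results (solid-color 4x4 images).
red = np.full((4, 4, 3), (255, 0, 0), dtype=np.uint8)
blue = np.full((4, 4, 3), (0, 0, 255), dtype=np.uint8)
emb_red = color_embedding([red])
emb_blue = color_embedding([blue])
```

Each embedding here is an 8-dimensional probability vector, so two words can be compared with any distance defined on distributions. Finer bins give a richer but sparser representation.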

Related content

Word vector representations

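Distributed representations encode language as dense, low-dimensional, continuous vectors. Researchers observed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Methods for learning such embeddings can be trained directly on large unlabeled corpora. The quality of the embeddings also depends heavily on the choice of context window size: larger windows tend to yield embeddings that reflect topical information, while smaller windows yield embeddings that reflect a word's function and its contextual semantics.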
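
As an illustration of what the context window controls, here is a minimal sketch that assumes skip-gram-style training, where each word is paired with its neighbors within a fixed distance. The function name `context_pairs` and the toy sentence are hypothetical and are not taken from any particular library.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the red apple fell".split()
narrow = context_pairs(tokens, window=1)  # immediate neighbors only
wide = context_pairs(tokens, window=3)    # every other word in the sentence
```

With the narrow window, "the" is paired only with "red". With the wide window it is also paired with "apple" and "fell", which is why larger windows pull in more topical co-occurrence.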
