Popular approaches to natural language processing create word embeddings based on textual co-occurrence patterns, but often ignore the embodied, sensory aspects of language. Here, we introduce the Python package comp-syn, which provides grounded word embeddings based on the perceptually uniform color distributions of Google Image search results. We demonstrate that comp-syn significantly enriches models of distributional semantics. In particular, we show that (1) using low-dimensional word-color embeddings, comp-syn predicts human judgments of word concreteness with greater accuracy and in a more interpretable fashion than word2vec, and (2) comp-syn performs comparably to word2vec on a metaphorical vs. literal word-pair classification task. comp-syn is open source on PyPI and is compatible with mainstream machine-learning Python packages. Our package release includes word-color embeddings for over 40,000 English words, each associated with crowd-sourced word concreteness judgments.
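To make the idea of a word-color embedding concrete, the following is a minimal sketch of how a set of images could be reduced to a low-dimensional color distribution. It is not the comp-syn API, which the abstract does not describe: the function names `srgb_to_lab` and `color_embedding`, the 2x2x2 binning, and the use of CIELAB as the (approximately) perceptually uniform color space are all assumptions made for illustration, and the random arrays merely stand in for image search results.

```python
import numpy as np


def srgb_to_lab(rgb):
    """Convert sRGB values in [0, 1] (shape (..., 3)) to CIELAB under D65.

    CIELAB is used here only as an illustrative, approximately
    perceptually uniform space (an assumption, not the package's choice).
    """
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB transfer function to get linear RGB.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> CIE XYZ (standard sRGB/D65 matrix).
    m = np.array([
        [0.4124564, 0.3575761, 0.1804375],
        [0.2126729, 0.7151522, 0.0721750],
        [0.0193339, 0.1191920, 0.9503041],
    ])
    xyz = linear @ m.T
    # Normalize by the D65 reference white.
    t = xyz / np.array([0.95047, 1.0, 1.08883])
    delta = 6.0 / 29.0
    f = np.where(t > delta ** 3, np.cbrt(t), t / (3 * delta ** 2) + 4.0 / 29.0)
    lightness = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([lightness, a, b], axis=-1)


def color_embedding(images, bins=(2, 2, 2)):
    """Hypothetical helper: pool the pixels of several 8-bit RGB images and
    return a normalized histogram over CIELAB as a flat vector."""
    pixels = np.concatenate(
        [np.asarray(img, dtype=float).reshape(-1, 3) / 255.0 for img in images]
    )
    lab = srgb_to_lab(pixels)
    lows = np.array([0.0, -128.0, -128.0])
    highs = np.array([100.0, 128.0, 128.0])
    # Clip so floating-point overshoot (e.g. pure white) stays inside the bins.
    lab = np.clip(lab, lows, highs)
    hist, _ = np.histogramdd(lab, bins=bins, range=list(zip(lows, highs)))
    return (hist / hist.sum()).ravel()


if __name__ == "__main__":
    # Synthetic stand-ins for the images returned for one search term.
    rng = np.random.default_rng(0)
    fake_images = [rng.integers(0, 256, size=(16, 16, 3)) for _ in range(5)]
    print(color_embedding(fake_images))
```

With `bins=(2, 2, 2)` each word would be represented by an 8-dimensional probability vector, which illustrates why such embeddings are easy to inspect: each coordinate corresponds to a named region of color space. Two words could then be compared with any divergence between distributions.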