In contrast to their word- or sentence-level counterparts, character embeddings are still poorly understood. We aim to close this gap with an in-depth study of English character embeddings. For this, we draw on resources from research on grapheme-color synesthesia, a neuropsychological phenomenon in which letters are associated with colors. These resources give us insight into which characters are similar for synesthetes and how characters are organized in color space. Comparing 10 different character embeddings, we ask: How similar are character embeddings to a synesthete's perception of characters? And how similar are character embeddings extracted from different models? We find that LSTM embeddings agree with humans more than transformer embeddings do. Comparing across tasks, grapheme-to-phoneme conversion results in the most human-like character embeddings. Finally, ELMo embeddings differ from both humans and other models.