Recent approaches to text analysis of social media and other corpora rely on word lists to detect topics, measure meaning, or select relevant documents. These lists are often generated by applying computational lexicon expansion methods to small, manually curated sets of root words. Despite the wide use of this approach, we still lack an exhaustive comparative analysis of the performance of lexicon expansion methods and of how they can be improved with additional linguistic data. In this work, we present LEXpander, a method for lexicon expansion that leverages novel data on colexification, i.e., semantic networks connecting words based on shared concepts and translations to other languages. We evaluate LEXpander in a benchmark that includes widely used lexicon expansion methods based on various word embedding models and synonym networks. We find that LEXpander outperforms existing approaches in terms of both precision and the trade-off between precision and recall of the generated word lists in a variety of tests. Our benchmark includes several linguistic categories and sentiment variables in English and German. We also show that the expanded word lists constitute a high-performing text analysis method when applied to various corpora. In this way, LEXpander offers a systematic, automated solution for expanding short lists of words into exhaustive and accurate word lists that closely approximate word lists generated by experts in psychology and linguistics.
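As a rough illustration of the general idea (not the authors' implementation), network-based lexicon expansion and its evaluation against an expert word list might be sketched as follows. The toy network, the function names `expand_lexicon` and `precision_recall_f1`, the `max_steps` parameter, and the reference list are all hypothetical and invented for this example. The abstract does not specify how LEXpander traverses the colexification network, and F1 is used here only as one common way to summarize the precision-recall trade-off.

```python
from collections import deque

# Hypothetical toy network (made-up data): an edge links two words whose
# concepts are expressed by the same word in some other language.
TOY_NETWORK = {
    "sad": {"sorry", "unhappy"},
    "sorry": {"sad", "regret"},
    "unhappy": {"sad", "miserable"},
    "regret": {"sorry"},
    "miserable": {"unhappy", "poor"},
    "poor": {"miserable"},
}


def expand_lexicon(seeds, network, max_steps=1):
    """Return the seeds plus every word within max_steps edges of a seed."""
    expanded = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        word, dist = frontier.popleft()
        if dist == max_steps:
            continue  # do not expand beyond the allowed distance
        for neighbor in network.get(word, ()):
            if neighbor not in expanded:
                expanded.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return expanded


def precision_recall_f1(generated, reference):
    """Compare a generated word list with a reference (expert) list."""
    generated, reference = set(generated), set(reference)
    true_positives = len(generated & reference)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical expert list, used only to exercise the metrics.
    expert_list = {"sad", "unhappy", "miserable", "regret"}
    for steps in (1, 2):
        words = expand_lexicon({"sad"}, TOY_NETWORK, max_steps=steps)
        print(steps, sorted(words), precision_recall_f1(words, expert_list))
```

In this toy setting, allowing more steps adds more words, which tends to raise recall but can lower precision once unrelated neighbors are reached; this is the kind of trade-off the benchmark described above measures.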