Language models are notoriously difficult to evaluate. We release SuperSim, a large-scale similarity and relatedness test set for Swedish built with expert human judgments. The test set is composed of 1,360 word pairs, each independently judged for both relatedness and similarity by five annotators. We evaluate three different models (Word2Vec, fastText, and GloVe), each trained on two separate Swedish datasets, namely the Swedish Gigaword corpus and a Swedish Wikipedia dump, to provide a baseline for future comparison. We release the fully annotated test set, code, baseline models, and data.
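As an illustration of how a word-embedding model might be scored against a test set of this kind, the sketch below computes Spearman's rank correlation between the model's cosine similarities and averaged human ratings. This is an assumption about the evaluation protocol, since the abstract does not name a metric; the function names, the toy vectors, and the toy ratings are all hypothetical and are not taken from SuperSim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, gold):
    """Score a model against gold ratings.

    vectors: dict mapping word -> vector (hypothetical model output).
    gold: list of (word1, word2, mean_rating) tuples.
    Pairs with an out-of-vocabulary word are skipped (an assumed policy).
    Returns (rho, number_of_pairs_scored).
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in gold:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores), len(human_scores)


# Toy data, invented for illustration only.
toy_vectors = {
    "hund": [1.0, 0.1],
    "katt": [0.9, 0.2],
    "bil": [0.1, 1.0],
    "buss": [0.3, 0.8],
}
toy_gold = [
    ("hund", "katt", 8.0),
    ("bil", "buss", 7.5),
    ("hund", "bil", 1.0),
    ("katt", "buss", 2.0),
    ("hund", "saknas", 5.0),  # "saknas" has no vector, so this pair is skipped
]

rho, covered = evaluate(toy_vectors, toy_gold)
print(f"Spearman rho over {covered} covered pairs: {rho:.3f}")
```

Because similarity and relatedness are judged separately in the test set, a routine like this would presumably be run once per rating type, and the number of covered pairs reported alongside the correlation so that models with different vocabularies remain comparable.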