Lexical semantic change detection (identifying shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have become the standard resource for this task. However, because generating them requires significant computational resources, very few diachronic word embeddings have been made available to the scientific community. In this paper we present DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English. DUKweb was created from the JISC UK Web Domain Dataset (1996-2013), a very large archive which collects resources from the Internet Archive that were hosted on domains ending in `.uk'. DUKweb consists of a series of word co-occurrence matrices and two types of word embeddings for each year in the JISC UK Web Domain Dataset. We show the reuse potential of DUKweb and its quality standards via a case study on word meaning change detection.