Over the last few years, neural-network-derived word embeddings have become popular in the natural language processing literature. Studies have mostly focused on the quality and application of word embeddings trained on publicly available corpora such as Wikipedia or other news and social media sources. However, these studies are limited to generic text and thus lack technical and scientific nuances such as domain-specific vocabulary, abbreviations, or scientific formulas, which are common in academic contexts. This research focuses on the performance of word embeddings applied to a large-scale academic corpus. More specifically, we compare the quality and efficiency of trained word embeddings with TF-IDF representations in modeling the content of scientific articles. We use a word2vec skip-gram model trained on the titles and abstracts of about 70 million scientific articles. Furthermore, we have developed a benchmark to evaluate content models in a scientific context. The benchmark is based on a categorization task that matches articles to journals, covering about 1.3 million articles published in 2017. Our results show that content models based on word embeddings perform better for titles (short text), while TF-IDF works better for abstracts (longer text). However, the slight improvement of TF-IDF on longer text comes at the expense of 3.7 times higher memory requirements and up to 184 times longer computation times, which may make it inefficient for online applications. In addition, we have created a two-dimensional visualization of the journals modeled via embeddings to qualitatively inspect the embedding model. This visualization yields useful insights and can be used to find competing journals or gaps in which to propose new journals.
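The two content models compared above can be illustrated with a minimal, self-contained sketch. The embedding vectors below are random stand-ins for trained skip-gram vectors (training word2vec is out of scope here), and the TF-IDF weighting is the standard tf * log(N/df) formula; the documents, dimension, and names are illustrative placeholders, not the paper's corpus or code.

```python
import math
import random
from collections import Counter

docs = [
    "neural word embeddings for scientific text",
    "tfidf weights terms by inverse document frequency",
    "matching articles to journals as a categorization task",
]
tokenized = [d.split() for d in docs]
N = len(docs)

# --- TF-IDF model: one dimension per vocabulary term (sparse in practice) ---
df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
vocab = sorted(df)

def tfidf_vector(toks):
    tf = Counter(toks)
    return [tf[w] / len(toks) * math.log(N / df[w]) for w in vocab]

tfidf_vectors = [tfidf_vector(t) for t in tokenized]

# --- Embedding model: mean of per-word vectors, fixed low dimension ---
DIM = 50
rng = random.Random(0)
emb = {w: [rng.gauss(0, 1) for _ in range(DIM)] for w in vocab}

def embed(toks):
    # Averaging word vectors is a common way to turn word embeddings
    # into a fixed-length document representation.
    return [sum(emb[w][i] for w in toks) / len(toks) for i in range(DIM)]

doc_vectors = [embed(t) for t in tokenized]

# TF-IDF dimensionality grows with the vocabulary, while the embedding
# representation stays at a fixed size -- the memory trade-off the
# abstract quantifies (3.7x higher for TF-IDF on this corpus).
print(len(tfidf_vectors[0]), len(doc_vectors[0]))
```

The vocabulary-sized TF-IDF vector versus the fixed 50-dimensional embedding vector makes the space/quality trade-off concrete: on a 70-million-article corpus the TF-IDF vocabulary is orders of magnitude larger than a typical embedding dimension.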