Beyond bibliometrics, there is interest in characterizing how the number of ideas in scientific papers evolves. A common approach is to analyze publication titles to detect changes in vocabulary over time. On the premise that phrases, or more specifically keyphrases, represent concepts, lexical diversity metrics are applied to phrased versions of the titles. Changes in lexical diversity are then treated as indicators of shifts in, and possibly expansion of, research. Optimizing keyphrase detection is therefore an important part of this process. Rather than relying on a single phrase detection model, we propose using multiple models with the goal of producing a more comprehensive set of keyphrases from the source corpora. Another potential advantage of this approach is that the union and difference of these sets may provide automated techniques for identifying and omitting non-specific phrases. We compare the performance of several phrase detection models, analyze the keyphrase sets output by each, and calculate the lexical diversity of corpus variants incorporating keyphrases from each model, using four common lexical diversity metrics.
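The pipeline described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the titles, the two keyphrase sets (`model_a`, `model_b`), and the helper functions `phrase_title` and `type_token_ratio` are hypothetical, and the type-token ratio is used only as a familiar example of a lexical diversity metric, not necessarily one of the four metrics used in the study.

```python
from typing import Iterable, List, Set


def phrase_title(title: str, keyphrases: Set[str]) -> List[str]:
    """Tokenize a title, merging each known keyphrase into one token.

    Hypothetical helper: keyphrases are lowercase, space-separated strings,
    and the longest match at each position wins.
    """
    words = title.lower().split()
    max_len = max((len(p.split()) for p in keyphrases), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in keyphrases:
                tokens.append(candidate.replace(" ", "_"))
                i += n
                break
        else:
            # No phrase starts here, so keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens


def type_token_ratio(tokens: Iterable[str]) -> float:
    """Distinct tokens divided by total tokens (0.0 for an empty input)."""
    tokens = list(tokens)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented toy data, standing in for a corpus of titles and the
# keyphrase sets produced by two different phrase detection models.
titles = [
    "deep learning for image segmentation",
    "image segmentation with graph cuts",
    "a survey of deep learning",
]
model_a = {"deep learning", "image segmentation", "a survey"}
model_b = {"deep learning", "graph cuts", "survey of"}

combined = model_a | model_b   # union: the more comprehensive set
agreed = model_a & model_b     # phrases both models detected
only_a = model_a - model_b     # candidates to inspect for non-specific phrases
only_b = model_b - model_a

baseline_tokens = [t for title in titles for t in phrase_title(title, set())]
phrased_tokens = [t for title in titles for t in phrase_title(title, combined)]

print("agreed:", sorted(agreed))
print("only in A:", sorted(only_a))
print("only in B:", sorted(only_b))
print("TTR without phrases:", round(type_token_ratio(baseline_tokens), 3))
print("TTR with combined phrases:", round(type_token_ratio(phrased_tokens), 3))
```

In this sketch the set differences surface phrases such as "a survey" and "survey of", which only one model proposes and which are plausibly non-specific. How such phrases are actually filtered is a design decision the sketch does not settle.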