Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. These document representation schemes offer different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview of latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides a means to discover where different concept vocabularies require improvement.
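As an illustration of the evaluation idea, the following minimal sketch ranks representations of one article by how close their inferred topic proportions are to those of the full text. It assumes a topic model has already mapped each representation into the shared topic space. The Hellinger distance is our assumption, since the abstract does not name a similarity measure, and the function names and topic proportions are hypothetical values chosen for illustration, not results.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 means identical)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def rank_representations(full_text_topics, representation_topics):
    """Return (name, distance) pairs, most topically similar to the full text first."""
    scored = [
        (name, hellinger(full_text_topics, theta))
        for name, theta in representation_topics.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical topic proportions over three shared topics (made-up numbers).
full_text = [0.6, 0.3, 0.1]
representations = {
    "title": [0.9, 0.05, 0.05],
    "abstract": [0.6, 0.3, 0.1],
    "keywords": [0.2, 0.2, 0.6],
}

ranking = rank_representations(full_text, representations)
for name, distance in ranking:
    print(f"{name}: {distance:.3f}")
```

A representation with a small distance preserves the article's topical content well. A consistently large distance for one vocabulary would suggest, under these assumptions, where that vocabulary needs improvement.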