Embeddings of words and concepts capture syntactic and semantic regularities of language; however, they have seen limited use as tools to study characteristics of different corpora and how they relate to one another. We introduce TextEssence, an interactive system designed to enable comparative analysis of corpora using embeddings. TextEssence includes visual, neighbor-based, and similarity-based modes of embedding analysis in a lightweight, web-based interface. We further propose a new measure of embedding confidence based on nearest neighborhood overlap, to assist in identifying high-quality embeddings for corpus analysis. A case study on COVID-19 scientific literature illustrates the utility of the system. TextEssence is available from https://github.com/drgriffis/text-essence.
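The confidence measure mentioned above can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption: the sketch treats confidence as the mean pairwise overlap between an item's k nearest neighbors (by cosine similarity) across several embedding replicates, for example models trained on the same corpus with different random seeds. The function names `nearest_neighbors` and `neighborhood_confidence` are hypothetical and are not taken from the TextEssence codebase.

```python
from itertools import combinations

import numpy as np


def nearest_neighbors(emb, idx, k):
    """Return the indices of the k nearest neighbors of row `idx` by cosine similarity.

    Assumes every row of `emb` is non-zero, so normalization is well defined.
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # exclude the query item itself
    return set(np.argsort(-sims)[:k].tolist())


def neighborhood_confidence(replicates, idx, k=5):
    """Mean pairwise fraction of shared k-nearest neighbors across replicates.

    `replicates` is a list of (vocab_size, dim) arrays that share the same
    row order. This is an assumed formulation, not the paper's definition.
    """
    if len(replicates) < 2:
        raise ValueError("need at least two embedding replicates")
    hoods = [nearest_neighbors(e, idx, k) for e in replicates]
    overlaps = [len(a & b) / k for a, b in combinations(hoods, 2)]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    # Synthetic replicates: one base matrix plus small independent noise.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(50, 16))
    replicates = [base + 0.01 * rng.normal(size=base.shape) for _ in range(3)]
    print(neighborhood_confidence(replicates, idx=0, k=5))
```

Under this assumed formulation, a score near 1 means the item's neighborhood is stable across replicates, while a score near 0 suggests the embedding is too unstable to support corpus comparison.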