Traditional representations such as bag-of-words are high-dimensional and sparse, and they ignore word order as well as syntactic and semantic information. Distributed vector representations, or embeddings, map variable-length text to dense fixed-length vectors and capture prior knowledge that can be transferred to downstream tasks. Even though embeddings have become the de facto standard for text representation in deep-learning-based NLP tasks in both the general and clinical domains, there is no survey paper that presents a detailed review of embeddings in clinical natural language processing. In this survey paper, we discuss various medical corpora and their characteristics as well as medical codes, and we present a brief overview and comparison of popular embedding models. We classify clinical embeddings into nine types and discuss each type in detail. We then discuss evaluation methods, followed by possible solutions to various challenges in clinical embeddings. Finally, we conclude with some future directions that will advance research in clinical embeddings.