Recent advances in the healthcare industry have produced an abundance of unstructured data, which makes efficient and accurate information retrieval at scale challenging. Our work offers an all-in-one, scalable solution for extracting and exploring complex information from large collections of research documents, a task that would otherwise be tedious. First, we briefly explain our knowledge synthesis process, which extracts useful information from the unstructured text of research documents. Then, on top of the extracted knowledge, we perform complex information retrieval using three major components: Paragraph Retrieval, Triplet Retrieval from Knowledge Graphs, and Complex Question Answering (QA). These components combine lexical and semantic methods to retrieve paragraphs and triplets, and they support faceted refinement for filtering the search results. Because biomedical queries and documents are complex, the QA system must handle queries that go beyond factoid questions. We evaluate this QA system qualitatively on the COVID-19 Open Research Dataset (CORD-19) to demonstrate its effectiveness and added value.
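The combination of lexical and semantic retrieval with faceted refinement described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`search`, `toy_semantic_vector`), the weighting parameter `alpha`, the `topic` facet, and the toy paragraphs are all hypothetical. In particular, the character-trigram vector is only a stand-in for a real embedding-based similarity, chosen so the example runs with the standard library alone.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_vector(text):
    """Bag-of-words counts: the lexical signal."""
    return Counter(tokens(text))


def toy_semantic_vector(text):
    """ASSUMPTION: character trigrams stand in for a sentence embedding."""
    joined = " ".join(tokens(text))
    return Counter(joined[i:i + 3] for i in range(len(joined) - 2))


def search(query, paragraphs, facets=None, alpha=0.5):
    """Rank paragraphs by a weighted mix of lexical and 'semantic' scores.

    Paragraphs whose facet values do not match every requested facet
    are filtered out before scoring (faceted refinement).
    """
    facets = facets or {}
    q_lex, q_sem = lexical_vector(query), toy_semantic_vector(query)
    results = []
    for p in paragraphs:
        if any(p["facets"].get(k) != v for k, v in facets.items()):
            continue
        lex = cosine(q_lex, lexical_vector(p["text"]))
        sem = cosine(q_sem, toy_semantic_vector(p["text"]))
        results.append((alpha * lex + (1 - alpha) * sem, p["id"]))
    return sorted(results, reverse=True)


# Invented toy paragraphs, for illustration only.
PARAGRAPHS = [
    {"id": "p1", "text": "Drug A was given to patients with fever.",
     "facets": {"topic": "treatment"}},
    {"id": "p2", "text": "Hand washing was recommended to visitors.",
     "facets": {"topic": "prevention"}},
    {"id": "p3", "text": "Drug B was given to patients with cough.",
     "facets": {"topic": "treatment"}},
]

if __name__ == "__main__":
    hits = search("drug given to patients", PARAGRAPHS,
                  facets={"topic": "treatment"})
    for score, pid in hits:
        print(f"{pid}\t{score:.3f}")
```

Filtering on facets before scoring keeps the ranking step cheap. `alpha` controls how much weight the lexical signal gets relative to the semantic one. A real system would presumably replace `toy_semantic_vector` with a learned embedding model.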