While text mining and NLP research has been established for decades, there remain gaps in the literature that reports the use of these techniques in building real-world applications. For example, existing studies typically look at single and sometimes simplified tasks, and do not discuss in depth the data heterogeneity and inconsistency that are common in real-world problems, or their implications for method development. Also, little prior work has focused on the healthcare domain. In this work, we describe an industry project that developed text mining and NLP solutions to mine millions of heterogeneous, multilingual procurement documents in the healthcare sector. We extract structured procurement contract data that is used to power a platform for dynamically assessing supplier risks. Our work makes three contributions. First, we deal with highly heterogeneous, multilingual data and document our approach to tackling these challenges. This approach is mainly based on a method that makes effective use of domain knowledge and generalises to multiple text mining and NLP tasks and languages. Second, by applying this method to mine millions of procurement documents, we develop the first structured procurement contract database that will help facilitate the tendering process. Finally, we discuss lessons learned for practical text mining/NLP development, and make recommendations for future research and practice.