With recent developments in digitisation, an increasing number of documents are available online. Several information extraction tools exist for extracting information from digitised documents. However, identifying a precise answer to a given query is often challenging, especially when the data source containing the relevant information is unknown. The problem becomes more complex when the data sources come in multiple formats, such as PDF, tables, and HTML. In this paper, we propose a novel data extraction system, based on text mining approaches, that discovers relevant and focused information from diverse unstructured data sources. We perform a qualitative analysis to evaluate the proposed system and its suitability and adaptability, using the cotton industry as a case study.