Many analysis and prediction tasks require extracting structured data from unstructured text. However, no annotation scheme or training dataset has been available for training machine learning models to mine structured data from text without relying on special templates and patterns. To address this gap, this paper presents Text2Struct, an end-to-end machine learning pipeline comprising a text annotation scheme, training data processing, and a machine learning implementation. We formulate the mining problem as the extraction of the metrics and units associated with numerals in the text. Text2Struct was trained and evaluated on an annotated text dataset collected from abstracts of medical publications on thrombectomy. In terms of prediction performance, it achieved a Dice coefficient of 0.82 on the test dataset. In a random sample of predictions, most predicted relations between numerals and entities matched the ground-truth annotations well. These results show that Text2Struct is viable for mining structured data from text without special templates or patterns. We anticipate that the pipeline can be further improved by expanding the dataset and investigating other machine learning models. A code demonstration can be found at: https://github.com/zcc861007/CourseProject
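To make the reported metric concrete, the following is a minimal sketch of a Dice coefficient computed over sets of extracted items. The abstract does not state which units the Dice coefficient was computed over, so this sketch assumes sets of (numeral, metric, unit) triples. The function name `dice_coefficient` and the example triples are hypothetical illustrations, not taken from the Text2Struct code or dataset.

```python
def dice_coefficient(predicted, reference):
    """Dice = 2 * |A & B| / (|A| + |B|) for two collections of hashable items."""
    pred_set, ref_set = set(predicted), set(reference)
    if not pred_set and not ref_set:
        # Convention assumed here: two empty sets agree perfectly.
        return 1.0
    overlap = len(pred_set & ref_set)
    return 2.0 * overlap / (len(pred_set) + len(ref_set))


# Made-up (numeral, metric, unit) triples, for illustration only.
reference = {
    ("12", "patients", None),
    ("85", "recanalization rate", "%"),
    ("35", "procedure time", "min"),
}
predicted = {
    ("12", "patients", None),
    ("85", "recanalization rate", "%"),
    ("35", "age", "years"),  # numeral linked to the wrong metric and unit
}

score = dice_coefficient(predicted, reference)
print(f"Dice coefficient: {score:.2f}")
```

In this example two of the three triples on each side agree, so the score is 2 * 2 / (3 + 3), roughly 0.67. For set-valued predictions like these, the Dice coefficient coincides with the F1 score.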