Many analysis and prediction tasks require extracting structured data from unstructured text. To address this need, this paper presents Text2Struct, an end-to-end machine learning pipeline comprising a text annotation scheme, training data processing, and a machine learning implementation. We formulate the mining problem as the extraction of metrics and units associated with numerals in the text. Text2Struct was evaluated on an annotated text dataset collected from abstracts of medical publications on thrombectomy. In terms of prediction performance, a Dice coefficient of 0.82 was achieved on the test dataset. In a random sample of predictions, most predicted relations between numerals and entities matched the ground-truth annotations well. These results show that Text2Struct is viable for mining structured data from text without special templates or patterns. We anticipate that the pipeline can be further improved by expanding the dataset and investigating other machine learning models. A code demonstration can be found at: https://github.com/zcc861007/CourseProject
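For readers unfamiliar with the reported metric, the Dice coefficient between a predicted set A and a ground-truth set B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that formula only. The abstract does not state the granularity at which the score was computed (tokens, entities, or relations), so the set-of-triples representation, the function name `dice_coefficient`, and the example values below are all illustrative assumptions and are not taken from the Text2Struct code or dataset.

```python
from typing import Hashable, Iterable


def dice_coefficient(predicted: Iterable[Hashable], truth: Iterable[Hashable]) -> float:
    """Return 2|A ∩ B| / (|A| + |B|), treating both inputs as sets."""
    pred_set, true_set = set(predicted), set(truth)
    if not pred_set and not true_set:
        # Assumed convention: two empty sets count as a perfect match.
        return 1.0
    overlap = len(pred_set & true_set)
    return 2 * overlap / (len(pred_set) + len(true_set))


# Hypothetical (numeral, metric, unit) triples, invented for illustration only.
truth = {
    ("90", "recanalization rate", "%"),
    ("6", "time window", "hours"),
    ("12", "mortality", "%"),
}
predicted = {
    ("90", "recanalization rate", "%"),
    ("6", "time window", "hours"),
    ("12", "morbidity", "%"),  # wrong metric, so this triple does not match
}

score = dice_coefficient(predicted, truth)
print(f"Dice = {score:.3f}")  # prints "Dice = 0.667"
```

In this toy case two of the three triples agree, giving 2 × 2 / (3 + 3) ≈ 0.667. A score of 0.82, as reported above, indicates substantially higher overlap between predictions and annotations.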