Text2Struct: 文字中采矿结构化数据的机械学习管道 (Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text) - 专知论文

会员服务 ·

0

MINE · Machine Learning · Learning · 数据集 · Unstructured ·

2022 年 12 月 18 日

Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text

翻译：Text2Struct: 文字中采矿结构化数据的机械学习管道

Chaochao Zhou,Bo Yang

Many analysis and prediction tasks require the extraction of structured data from unstructured texts. To solve it, this paper presents an end-to-end machine learning pipeline, Text2Struct, including a text annotation scheme, training data processing, and machine learning implementation. We formulated the mining problems as the extraction of metrics and units associated with numerals in the text. Text2Struct was evaluated on an annotated text dataset collected from abstracts of medical publications regarding thrombectomy. In terms of prediction performance, a dice coefficient of 0.82 was achieved on the test dataset. By random sampling, most predicted relations between numerals and entities were well matched to the ground-truth annotations. These results showed that the Text2Struct is viable for the mining of structured data from text without special templates or patterns. It is anticipated to further improve the pipeline by expanding the dataset and investigating other machine learning models. A code demonstration can be found at: https://github.com/zcc861007/CourseProject

翻译：许多分析和预测任务都要求从非结构化文本中提取结构化数据。为了解决这个问题,本文件提出了一个端到端机器学习管道、Text2Struct(包括文字说明计划)、培训数据处理和机器学习实施。我们把采矿问题作为与文本中数字相关的指标和单位的提取而提出。Text2Struct是根据从有关血压的医学出版物摘要中收集的附加说明的文本数据集进行评估的。在预测性能方面,测试数据集实现了0.82的骰子系数。通过随机抽样,大多数数字和实体之间的预测关系都与地面真相说明完全匹配。这些结果表明,Text2Struct对于从没有特殊模板或模式的文本中挖掘结构化数据是可行的。预期通过扩大数据集和调查其他机器学习模型来进一步改进管道。一个代码演示可见:https://github.com/zcc861007ConurseProject。

1

相关内容

MINE

【图机器学习进展与趋势@ICML2022】Graph Machine Learning @ ICML 2022

【图机器学习进展与趋势@ICML2022】Graph Machine Learning @ ICML 2022

专知会员服务

40+阅读 · 2022年7月25日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

哥伦比亚大学最新《机器学习》课程，Fall-B 2020 (Machine Learning)

专知会员服务

39+阅读 · 2020年11月3日

2020数据工程师成长路线图

专知会员服务

41+阅读 · 2020年9月6日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【干货书】真实机器学习，264页pdf，Real-World Machine Learning

【干货书】真实机器学习，264页pdf，Real-World Machine Learning

专知会员服务

115+阅读 · 2020年4月5日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

ACM TOMM Call for Papers

ACM TOMM Call for Papers

CCF多媒体专委会

2+阅读 · 2022年3月23日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

中国图象图形学学会CSIG

0+阅读 · 2021年11月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

基于局域和传播型表面等离激元长程耦合的新型表面/针尖增强拉曼光谱和传感基底

国家自然科学基金

0+阅读 · 2015年12月31日

肝细胞肝癌中高表达的PRC1基因功能及其受CTCF调控的机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

半导体衬底上FeSe薄膜的外延生长及界面超导

国家自然科学基金

0+阅读 · 2013年12月31日

蛋白激酶LIMK1活性在小鼠卵母细胞染色体分离过程中的作用和分子机制

国家自然科学基金

0+阅读 · 2012年12月31日

抑癌基因MIIP加速EGFR蛋白质降解并抑制肺癌细胞生长的分子机制

国家自然科学基金

0+阅读 · 2012年12月31日

实时安全关键系统的建模、仿真与验证

国家自然科学基金

1+阅读 · 2012年12月31日

角蛋白功能接枝共聚物合成与性能研究

国家自然科学基金

0+阅读 · 2011年12月31日

Ndfip1蛋白抑制神经细胞凋亡的分子机制研究

国家自然科学基金

1+阅读 · 2010年12月31日

OPG/RANKL/RANK系统在关节软骨细胞分化阻滞解除及骨性关节炎发病中的作用

国家自然科学基金

0+阅读 · 2009年12月31日

超支化聚合物/硅的多维、多尺度自组装研究

国家自然科学基金

0+阅读 · 2008年12月31日

Tragic and Comical Networks. Clustering Dramatic Genres According to Structural Properties

Arxiv

0+阅读 · 2023年2月16日

Learning to Count Isomorphisms with Graph Neural Networks

Arxiv

0+阅读 · 2023年2月15日

Adversarial Robustness of Representation Learning for Knowledge Graphs

Arxiv

10+阅读 · 2022年9月30日

A Survey of Learning on Small Data

Arxiv

19+阅读 · 2022年7月29日

Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data

Arxiv

28+阅读 · 2022年6月8日

Active Learning for Domain Adaptation: An Energy-based Approach

Arxiv

13+阅读 · 2021年12月2日

Learning Discrete Structures for Graph Neural Networks

Arxiv

17+阅读 · 2019年3月28日

Learning with Interpretable Structure from RNN

Arxiv

19+阅读 · 2018年10月25日

Multimodal Sentiment Analysis To Explore the Structure of Emotions

Arxiv

19+阅读 · 2018年5月25日

Self-Attention with Relative Position Representations

Arxiv

14+阅读 · 2018年3月6日

VIP会员

文章信息

相关主题

Machine Learning

相关VIP内容

【图机器学习进展与趋势@ICML2022】Graph Machine Learning @ ICML 2022

【图机器学习进展与趋势@ICML2022】Graph Machine Learning @ ICML 2022

专知会员服务

40+阅读 · 2022年7月25日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

哥伦比亚大学最新《机器学习》课程，Fall-B 2020 (Machine Learning)

专知会员服务

39+阅读 · 2020年11月3日

2020数据工程师成长路线图

专知会员服务

41+阅读 · 2020年9月6日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【干货书】真实机器学习，264页pdf，Real-World Machine Learning

【干货书】真实机器学习，264页pdf，Real-World Machine Learning

专知会员服务

115+阅读 · 2020年4月5日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

自动驾驶轨迹规划中的基础模型：进展综述与开放挑战

《用于提升多域战备的大型语言模型辅助场景生成器》报告

【斯坦福博士论文】为人类使用优化 AI 模型

国防领域人工智能规模化应用的理论与实践

相关资讯

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

ACM TOMM Call for Papers

ACM TOMM Call for Papers

CCF多媒体专委会

2+阅读 · 2022年3月23日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

中国图象图形学学会CSIG

0+阅读 · 2021年11月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

相关论文

Tragic and Comical Networks. Clustering Dramatic Genres According to Structural Properties

Arxiv

0+阅读 · 2023年2月16日

Learning to Count Isomorphisms with Graph Neural Networks

Arxiv

0+阅读 · 2023年2月15日

Adversarial Robustness of Representation Learning for Knowledge Graphs

Arxiv

10+阅读 · 2022年9月30日

A Survey of Learning on Small Data

Arxiv

19+阅读 · 2022年7月29日

Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data

Arxiv

28+阅读 · 2022年6月8日

Active Learning for Domain Adaptation: An Energy-based Approach

Arxiv

13+阅读 · 2021年12月2日

Learning Discrete Structures for Graph Neural Networks

Arxiv

17+阅读 · 2019年3月28日

Learning with Interpretable Structure from RNN

Arxiv

19+阅读 · 2018年10月25日

Multimodal Sentiment Analysis To Explore the Structure of Emotions

Arxiv

19+阅读 · 2018年5月25日

Self-Attention with Relative Position Representations

Arxiv

14+阅读 · 2018年3月6日

相关基金

基于局域和传播型表面等离激元长程耦合的新型表面/针尖增强拉曼光谱和传感基底

国家自然科学基金

0+阅读 · 2015年12月31日

肝细胞肝癌中高表达的PRC1基因功能及其受CTCF调控的机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

半导体衬底上FeSe薄膜的外延生长及界面超导

国家自然科学基金

0+阅读 · 2013年12月31日

蛋白激酶LIMK1活性在小鼠卵母细胞染色体分离过程中的作用和分子机制

国家自然科学基金

0+阅读 · 2012年12月31日

抑癌基因MIIP加速EGFR蛋白质降解并抑制肺癌细胞生长的分子机制

国家自然科学基金

0+阅读 · 2012年12月31日

实时安全关键系统的建模、仿真与验证

国家自然科学基金

1+阅读 · 2012年12月31日

角蛋白功能接枝共聚物合成与性能研究

国家自然科学基金

0+阅读 · 2011年12月31日

Ndfip1蛋白抑制神经细胞凋亡的分子机制研究

国家自然科学基金

1+阅读 · 2010年12月31日

OPG/RANKL/RANK系统在关节软骨细胞分化阻滞解除及骨性关节炎发病中的作用

国家自然科学基金

0+阅读 · 2009年12月31日

超支化聚合物/硅的多维、多尺度自组装研究

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员