Generating automatically labeled data for author name disambiguation: An iterative clustering method - 专知论文


Clusters · Pairwise · Clustering methods · Labeling · Entity resolution

February 5, 2021

Generating automatically labeled data for author name disambiguation: An iterative clustering method


Jinseok Kim, Jinmo Kim, Jason Owen-Smith

From arXiv, 25 pages

To train algorithms for supervised author name disambiguation, many studies have relied on hand-labeled truth data that are very laborious to generate. This paper shows that labeled training data can be automatically generated using information features such as email address, coauthor names, and cited references that are available from publication records. For this purpose, high-precision rules for matching name instances on each feature are decided using an external-authority database. Then, selected name instances in target ambiguous data go through the process of pairwise matching based on the rules. Next, they are merged into clusters by a generic entity resolution algorithm. The clustering procedure is repeated over other features until further merging is impossible. Tested on 26,566 instances out of the population of 228K author name instances, this iterative clustering produced accurately labeled data with pairwise F1 = 0.99. The labeled data represented the population data in terms of name ethnicity and co-disambiguating name group size distributions. In addition, trained on the labeled data, machine learning algorithms disambiguated 24K names in test data with performance of pairwise F1 = 0.90 ~ 0.92. Several challenges are discussed for applying this method to resolving author name ambiguity in large-scale scholarly data.
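The procedure described in the abstract (rule-based pairwise matching on one feature, merging into clusters, then repeating over other features until nothing more merges) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the instance records, the `RULES` thresholds, and the function names `iterative_clustering` and `pairwise_f1` are all hypothetical. In particular, the paper derives its matching rules from an external-authority database, whereas the thresholds here are simply assumed. Merged clusters pool their features, which is why a later pass can link instances that no single pair of records could.

```python
from itertools import combinations

# Hypothetical name instances sharing one ambiguous name; all values are made up.
INSTANCES = {
    "a1": {"email": {"jk@x.edu"}, "coauthors": {"lee s", "park h"}, "refs": {"r1", "r2"}},
    "a2": {"email": {"jk@x.edu"}, "coauthors": {"choi m"}, "refs": {"r3", "r4"}},
    "a3": {"email": set(), "coauthors": {"choi m", "park h"}, "refs": {"r9"}},
    "a4": {"email": {"other@y.org"}, "coauthors": {"smith a"}, "refs": {"r7", "r8"}},
    "a5": {"email": set(), "coauthors": {"jones b"}, "refs": {"r7", "r8"}},
}

# Assumed rules: minimum number of shared values needed to call two clusters a match.
RULES = {"email": 1, "coauthors": 2, "refs": 2}


def iterative_clustering(instances, rules=RULES):
    """Merge clusters feature by feature until a full pass produces no merge."""
    # Every instance starts as its own cluster, carrying copies of its feature sets.
    clusters = [
        {"ids": {inst_id}, **{f: set(feats[f]) for f in rules}}
        for inst_id, feats in instances.items()
    ]
    merged_any = True
    while merged_any:
        merged_any = False
        for feature, min_shared in rules.items():
            changed = True
            while changed:
                changed = False
                for a, b in combinations(range(len(clusters)), 2):
                    shared = clusters[a][feature] & clusters[b][feature]
                    if len(shared) >= min_shared:
                        # Pool ids and all features of b into a, then drop b.
                        for key in clusters[a]:
                            clusters[a][key] |= clusters[b][key]
                        del clusters[b]
                        changed = merged_any = True
                        break  # indices are stale after deletion; rescan
    return [sorted(c["ids"]) for c in clusters]


def pairwise_f1(predicted_clusters, truth):
    """Pairwise F1: compare same-cluster instance pairs against same-author pairs."""
    pred_pairs = {frozenset(p) for c in predicted_clusters for p in combinations(c, 2)}
    by_author = {}
    for inst_id, author in truth.items():
        by_author.setdefault(author, []).append(inst_id)
    true_pairs = {frozenset(p) for c in by_author.values() for p in combinations(c, 2)}
    tp = len(pred_pairs & true_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical ground truth used only to score the toy example.
TRUTH = {"a1": "A", "a2": "A", "a3": "A", "a4": "B", "a5": "B"}

clusters = iterative_clustering(INSTANCES)
score = pairwise_f1(clusters, TRUTH)
print(clusters, score)
```

In this toy data, `a3` shares only one coauthor with `a1` and one with `a2`, so it cannot be matched to either record alone. It joins them only after the email rule has merged `a1` and `a2` and their coauthor sets have been pooled, which is the point of iterating over features.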

