This paper presents a new task of predicting the coverage of a text document for relation extraction (RE): does the document contain many relational tuples for a given entity? Coverage predictions are useful in selecting the best documents for knowledge base construction with large input corpora. To study this problem, we present a dataset of 31,366 diverse documents for 520 entities. We analyze the correlation of document coverage with features like length, entity mention frequency, Alexa rank, language complexity and information retrieval scores. Each of these features has only moderate predictive power. We employ methods combining features with statistical models like TF-IDF and language models like BERT. The model combining features and BERT, HERB, achieves an F1 score of up to 46%. We demonstrate the utility of coverage predictions on two use cases: KB construction and claim refutation.
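As a rough illustration of the feature-correlation analysis mentioned in the abstract, the sketch below assumes that coverage is the fraction of an entity's known KB tuples that can be found in a document. That definition, the function names `coverage` and `pearson`, and the toy tuples are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean


def coverage(doc_tuples, kb_tuples):
    """Fraction of the entity's known KB tuples that also appear in the document.

    Assumed definition for illustration only.
    """
    kb = set(kb_tuples)
    if not kb:
        return 0.0
    return len(set(doc_tuples) & kb) / len(kb)


def pearson(xs, ys):
    """Pearson correlation between a feature (e.g. length) and coverage."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / (vx * vy) ** 0.5


# Toy, hypothetical data: four known tuples for one entity "E".
kb = [("E", "r1", "a"), ("E", "r2", "b"), ("E", "r3", "c"), ("E", "r4", "d")]
docs = [
    {"length": 120, "tuples": [("E", "r1", "a")]},
    {"length": 800, "tuples": [("E", "r1", "a"), ("E", "r2", "b"), ("E", "r3", "c")]},
    {"length": 450, "tuples": [("E", "r2", "b"), ("E", "r4", "d")]},
]

lengths = [d["length"] for d in docs]
coverages = [coverage(d["tuples"], kb) for d in docs]
print(coverages, pearson(lengths, coverages))
```

The same `pearson` call could be repeated for each candidate feature (entity mention frequency, retrieval score, and so on) to compare how strongly each one tracks coverage on its own.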