Extending Neural Keyword Extraction with TF-IDF tagset matching

Keyword extraction is the task of identifying the words (or multi-word expressions) that best describe a given document; news portals use such keywords to link articles on similar topics. In this work we develop and evaluate our methods on four novel data sets covering less-represented, morphologically rich languages in the European news media industry (Croatian, Estonian, Latvian and Russian). First, we evaluate two supervised neural transformer-based methods (TNT-KID and BERT+BiLSTM CRF) and compare them to a baseline unsupervised TF-IDF-based approach. Next, we show that by combining the keywords retrieved by both neural transformer-based methods and extending the final set of keywords with an unsupervised TF-IDF-based technique, we can drastically improve the recall of the system, making it suitable for use as a recommendation system in a media house environment.
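The abstract does not spell out how the keyword sets are combined, so the following is only a minimal sketch of one plausible reading: take the union of the two neural outputs, then pad the result with the highest-ranked TF-IDF candidates that also occur in a known tagset. The function name `combine_keywords`, the `target_size` parameter, and the idea that the tagset is a set of lowercase, previously used tags are all assumptions made for illustration, not details from the paper.

```python
def combine_keywords(neural_a, neural_b, tfidf_ranked, tagset, target_size=10):
    """Merge two neural keyword lists, then pad with TF-IDF candidates.

    Hypothetical helper: the merging order, the case-insensitive
    de-duplication, and the target_size cut-off are illustrative assumptions.
    """
    combined = []
    seen = set()

    # Step 1: union of both neural outputs, keeping first-seen order.
    for keyword in list(neural_a) + list(neural_b):
        key = keyword.lower()
        if key not in seen:
            seen.add(key)
            combined.append(keyword)

    # Step 2: extend with TF-IDF candidates (best first) that match the tagset.
    for term in tfidf_ranked:
        if len(combined) >= target_size:
            break
        key = term.lower()
        if key in tagset and key not in seen:
            seen.add(key)
            combined.append(term)

    return combined


keywords = combine_keywords(
    neural_a=["election", "parliament"],
    neural_b=["Election", "vote"],
    tfidf_ranked=["results", "vote", "coalition", "minister"],
    tagset={"coalition", "minister", "vote"},
    target_size=4,
)
print(keywords)
```

Padding only up to a fixed size trades precision for recall in a controlled way: the neural predictions are always kept, and the unsupervised candidates only fill the remaining slots.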

Related content

TF-IDF

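TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases as its frequency across the corpus rises. Variants of TF-IDF weighting are often used by search engines to score or rank the relevance of a document to a user query. Besides TF-IDF, web search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.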
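The weighting described above can be sketched in a few lines of Python. This is one common variant, assumed here for illustration: term frequency is the count divided by document length, and inverse document frequency is the natural log of the number of documents over the number of documents containing the term. The function name `tf_idf` and the toy documents are hypothetical; real systems often add smoothing or normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(N / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of three already-tokenized documents.
docs = [
    ["election", "results", "election", "vote"],
    ["football", "results", "match"],
    ["election", "vote", "parliament"],
]
scores = tf_idf(docs)

# Rank the terms of the first document by weight, highest first.
ranked = sorted(scores[0], key=scores[0].get, reverse=True)
print(ranked)
```

A term that occurs in every document gets a weight of zero under this variant, which is why very common words rarely surface as keywords.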
