Keyword extraction is the task of identifying words (or multi-word expressions) that best describe a given document; in news portals, keywords serve to link articles on similar topics. In this work, we develop and evaluate our methods on four novel data sets covering less-represented, morphologically rich languages in the European news media industry (Croatian, Estonian, Latvian and Russian). First, we evaluate two supervised neural transformer-based methods (TNT-KID and BERT+BiLSTM CRF) and compare them to a baseline unsupervised TF-IDF-based approach. Next, we show that by combining the keywords retrieved by both neural transformer-based methods and extending the final set of keywords with an unsupervised TF-IDF-based technique, we can drastically improve the recall of the system, making it suitable for use as a recommendation system in a media house environment.
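The combination strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two neural models are represented only by their output keyword lists, and the function names (`tfidf_keywords`, `combine_keywords`, `recall`), the `target_size` parameter, the smoothed IDF formula and the simple regex tokenizer are all assumptions made for illustration. In particular, a real system for morphologically rich languages would likely need lemmatization, which is omitted here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive lowercase word tokenizer (assumption: no lemmatization).
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(doc, corpus, top_k=10):
    """Rank the words of `doc` by TF-IDF computed against `corpus`."""
    tokens = tokenize(doc)
    if not tokens:
        return []
    tf = Counter(tokens)
    # Document frequency: in how many corpus documents each word occurs.
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))
    n_docs = len(corpus)
    scores = {
        # Smoothed IDF (an assumed variant, chosen to avoid division by zero).
        word: (count / len(tokens))
        * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def combine_keywords(neural_a, neural_b, tfidf_ranked, target_size=10):
    """Union of both neural outputs, padded with TF-IDF keywords.

    All neural keywords are kept; TF-IDF candidates are appended in rank
    order only until `target_size` keywords have been collected.
    """
    combined, seen = [], set()

    def add(keyword):
        key = keyword.lower()
        if key not in seen:
            seen.add(key)
            combined.append(key)

    for keyword in neural_a:
        add(keyword)
    for keyword in neural_b:
        add(keyword)
    for keyword in tfidf_ranked:
        if len(combined) >= target_size:
            break
        add(keyword)
    return combined


def recall(predicted, gold):
    """Fraction of gold keywords that appear among the predictions."""
    gold_set = {g.lower() for g in gold}
    if not gold_set:
        return 0.0
    return len(gold_set & {p.lower() for p in predicted}) / len(gold_set)
```

Taking the union can only add keywords, so recall never decreases compared with either neural model alone; the cost is a potentially lower precision, which is more tolerable when the output is shown to an editor as a list of suggestions.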