Existing methods for keyphrase extraction require preprocessing to generate candidate phrases or post-processing to transform keywords into keyphrases. In this paper, we propose a novel approach to keyphrase extraction called duration modeling with semi-Markov Conditional Random Fields (DM-SMCRFs). First, based on the properties of semi-Markov chains, DM-SMCRFs can encode segment-level features and sequentially classify the phrases in a sentence as keyphrases or non-keyphrases. Second, by assuming independence between state transition and state duration, DM-SMCRFs model the distribution of the duration (length) of keyphrases to further exploit state duration information, which helps identify the length of a keyphrase. Based on the convexity of the parametric duration feature derived from the duration distribution, we derive a constrained Viterbi algorithm to improve decoding performance in DM-SMCRFs. We thoroughly evaluate DM-SMCRFs on datasets from various domains. The experimental results demonstrate the effectiveness of the proposed model.
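To make the segment-level decoding idea concrete, the following is a minimal sketch of generic semi-Markov Viterbi decoding in which each segment's score is the sum of a segment-level score, a transition score, and a separate duration score, mirroring the assumed independence between state transition and state duration. It is not the constrained Viterbi algorithm of DM-SMCRFs: the cap `max_len` is only a simple stand-in for a length constraint, and every name, weight, and the duration distribution `DUR_PROB` below is a hypothetical choice made for illustration.

```python
import math


def semi_markov_viterbi(n, labels, seg_score, trans_score, dur_score, max_len):
    """Return (best_score, segments) for a sentence of n tokens.

    Each segment is (start, end, label) and covers tokens[start:end].
    A segmentation's score is the sum, over its segments, of a segment-level
    score, a duration score, and a transition score from the previous label.
    """
    if n == 0:
        return 0.0, []
    # best[t][y] = (score, start, prev_label): best segmentation of the first
    # t tokens whose last segment has label y and begins at `start`.
    best = [dict() for _ in range(n + 1)]
    for t in range(1, n + 1):
        for y in labels:
            cands = []
            for d in range(1, min(max_len, t) + 1):  # candidate durations
                s = t - d
                local = seg_score(s, t, y) + dur_score(y, d)
                if s == 0:
                    cands.append((local, s, None))
                else:
                    for yp in labels:
                        score = best[s][yp][0] + trans_score(yp, y) + local
                        cands.append((score, s, yp))
            best[t][y] = max(cands, key=lambda c: c[0])
    # Backtrace from the best final label.
    y = max(labels, key=lambda lab: best[n][lab][0])
    total = best[n][y][0]
    segments, t = [], n
    while t > 0:
        _, s, yp = best[t][y]
        segments.append((s, t, y))
        t, y = s, yp
    segments.reverse()
    return total, segments


# Toy setup (all values hypothetical, chosen only to exercise the decoder).
tokens = ["we", "propose", "semi", "markov", "random", "fields", "here"]
KEY_TOKENS = {"semi", "markov", "random", "fields"}
DUR_PROB = {1: 0.2, 2: 0.3, 3: 0.2, 4: 0.3}  # assumed keyphrase length distribution


def seg_score(s, t, y):
    # Segment-level feature score: reward label/token agreement.
    if y == "KP":
        return sum(1.0 if tok in KEY_TOKENS else -2.0 for tok in tokens[s:t])
    return sum(0.5 if tok not in KEY_TOKENS else -1.0 for tok in tokens[s:t])


def dur_score(y, d):
    # Log-probability of the segment length; only keyphrases are modeled here.
    return math.log(DUR_PROB.get(d, 1e-6)) if y == "KP" else 0.0


def trans_score(prev, cur):
    # Mild penalty for two adjacent segments with the same label.
    return -0.5 if prev == cur else 0.0


total, segments = semi_markov_viterbi(
    len(tokens), ["KP", "O"], seg_score, trans_score, dur_score, max_len=4
)
for s, t, y in segments:
    print(y, tokens[s:t])
```

In this toy run the decoder labels the four-token span "semi markov random fields" as a single keyphrase segment, with no candidate-generation or keyword-merging step. Because the duration term is added independently of the transition term, swapping in a different length distribution only requires changing `dur_score`.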