Extracting frequent words from a collection of texts is common practice in many fields. Extracting phrases, on the other hand, is far less common because of inherent complications, the most significant being double-counting: words or phrases are counted when they appear inside longer phrases that are themselves also counted. Several papers on phrase mining describe solutions to this issue; however, they either require a list of so-called quality phrases to be available to the extraction process, or they require human interaction to identify those quality phrases during the process. We present a method that eliminates double-counting without the need to identify lists of quality phrases. In the context of a set of texts, we define a principal phrase as a phrase that does not cross punctuation marks, does not start with a stop word (with the exception of the stop words "not" and "no"), does not end with a stop word, is frequent within those texts without being double-counted, and is meaningful to the user. Our method identifies such principal phrases without human input and enables their extraction from any set of texts. An R package called phm implements this method.
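To make the definition of a principal phrase concrete, the following Python sketch applies the stated rules to a toy corpus. It is an illustration only and is neither the phm package nor the authors' algorithm. The stop-word list, the function names, the `min_freq` and `max_len` parameters, and the greedy longest-first strategy for avoiding double-counting are all assumptions made for this example. The requirement that a phrase be meaningful to the user cannot be checked mechanically and is left out.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "we", "not", "no"}
# Stop words that a phrase is nevertheless allowed to start with.
ALLOWED_LEADING_STOP_WORDS = {"not", "no"}


def split_segments(text):
    """Split text at punctuation so that no phrase crosses a punctuation mark."""
    segments = re.split(r"[^\w\s'-]+", text.lower())
    return [seg.split() for seg in segments if seg.strip()]


def is_candidate(words):
    """Apply the start and end stop-word rules to a list of words."""
    first, last = words[0], words[-1]
    if first in STOP_WORDS and first not in ALLOWED_LEADING_STOP_WORDS:
        return False
    return last not in STOP_WORDS


def find_occurrences(texts, max_len=3):
    """Map each candidate phrase to its (segment, start, end) occurrences."""
    occurrences = defaultdict(list)
    seg_index = 0
    for text in texts:
        for words in split_segments(text):
            for start in range(len(words)):
                last_end = min(start + max_len, len(words))
                for end in range(start + 1, last_end + 1):
                    span = words[start:end]
                    if is_candidate(span):
                        occurrences[" ".join(span)].append((seg_index, start, end))
            seg_index += 1
    return occurrences


def principal_phrase_counts(texts, min_freq=2, max_len=3):
    """Count frequent phrases, longest first, so that a word position already
    counted as part of a longer phrase is never counted again."""
    occurrences = find_occurrences(texts, max_len)
    frequent = {p: occ for p, occ in occurrences.items() if len(occ) >= min_freq}
    claimed = set()  # (segment, word position) pairs that are already counted
    counts = Counter()
    ordered = sorted(frequent, key=lambda p: (-len(p.split()), p))
    for phrase in ordered:
        taken = set()
        free = 0
        for seg, start, end in frequent[phrase]:
            positions = {(seg, i) for i in range(start, end)}
            if positions & claimed or positions & taken:
                continue  # overlaps a phrase occurrence that was already counted
            taken |= positions
            free += 1
        # Keep the phrase only if it is still frequent after removing overlaps.
        if free >= min_freq:
            claimed |= taken
            counts[phrase] = free
    return dict(counts)


if __name__ == "__main__":
    toy_texts = [
        "Text mining is fun. Text mining is not easy.",
        "We like text mining, and mining data is not easy.",
    ]
    print(principal_phrase_counts(toy_texts))
```

In this toy corpus the single words "text", "mining" and "easy" are frequent on a naive count, but almost all of their occurrences lie inside the longer frequent phrases "text mining" and "not easy", so the sketch reports only the longer phrases. Processing longer phrases first is one simple way to avoid double-counting; the method described in the paper may resolve overlaps differently.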