Extracting frequent words from a collection of texts is a common task in many fields. Useful as a list of frequently occurring words is, more specific information can be obtained from the most frequently occurring phrases. Nevertheless, frequent phrases are rarely extracted because of inherent complications, the most significant being double-counting. Double-counting occurs when a word or phrase is counted each time it appears inside a longer phrase that is itself counted, so the result is dominated by largely meaningless phrases that are frequent only because they occur inside frequent super phrases. Several papers on phrase mining describe solutions to this problem; however, they require either a list of so-called quality phrases to be supplied to the extraction process or human interaction to identify those quality phrases during the process. We present a method that eliminates double-counting through a unique rectification process that does not require lists of quality phrases. In the context of a set of texts, we define a principal phrase as a phrase that does not cross punctuation marks, does not start with a stop word (with the exception of the stop words "not" and "no"), does not end with a stop word, is frequent within those texts without being double-counted, and is meaningful to the user. Our method identifies such principal phrases without human input and enables their extraction from any texts within a reasonable amount of time.
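To make the structural part of the principal-phrase definition and the double-counting problem concrete, the following is a minimal Python sketch. It is not the rectification process of the method presented here, which the abstract does not detail; it only enumerates candidate phrases that satisfy the punctuation and stop-word rules and counts them naively. The stop-word list, the function names (`segments`, `is_candidate`, `candidate_counts`), and the `max_len` parameter are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "not", "no"}
# Stop words that a phrase is still allowed to start with.
LEADING_EXCEPTIONS = {"not", "no"}


def segments(text):
    """Split text at punctuation so that no candidate crosses a punctuation mark."""
    parts = re.split(r"[^\w\s']+", text.lower())
    return [re.findall(r"[a-z0-9']+", part) for part in parts]


def is_candidate(words):
    """Apply the stop-word rules to a non-empty list of words."""
    first, last = words[0], words[-1]
    if first in STOP_WORDS and first not in LEADING_EXCEPTIONS:
        return False
    return last not in STOP_WORDS


def candidate_counts(texts, max_len=4):
    """Naively count every candidate phrase of up to max_len words."""
    counts = Counter()
    for text in texts:
        for words in segments(text):
            for start in range(len(words)):
                for length in range(1, max_len + 1):
                    gram = words[start:start + length]
                    if len(gram) == length and is_candidate(gram):
                        counts[" ".join(gram)] += 1
    return counts


texts = [
    "New York City is large.",
    "I love New York City, not Boston.",
    "New York City never sleeps.",
]
counts = candidate_counts(texts)
for phrase in ("new york city", "new york", "york city", "not boston"):
    print(phrase, counts[phrase])
```

In this toy input, "new york" and "york city" each receive the same count as "new york city" even though they never occur outside it, which is the double-counting that a rectification step has to remove. The phrase "not boston" is accepted because "not" is a permitted leading stop word, while "not" on its own is rejected because it also ends with a stop word.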