Entity extraction is critical to the intelligent development of various domains and to the construction of knowledge agents. However, documents in some specific domains suffer from a category imbalance problem: some categories of entities are common, while others are rare and scattered. This paper proposes using Zipf's law to tackle this problem and to improve the performance of entity extraction from documents. Using two forms of Zipf's law, words in the documents are classified as common or rare; sentences are then classified as common or rare, and the two groups are further processed separately by text generation models. Rare entities in the generated sentences are labeled with human-designed rules and added to the raw dataset as a supplement, which alleviates the category imbalance problem. A case study of extracting entities from technical documents on industrial safety is presented, and experimental results on two datasets show the effectiveness of the proposed method.
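The common/rare split described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the two forms of Zipf's law or any thresholds, so the sketch uses a plain frequency cutoff on a Zipf-style rank-frequency table. The function names, the cutoffs `max_rare_count` and `min_rare_ratio`, the sentence-level rule, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def label_words(sentences, max_rare_count=1):
    """Label every word 'common' or 'rare' by its corpus frequency.

    max_rare_count is a hypothetical cutoff, not a value from the paper.
    """
    counts = Counter(word for sent in sentences for word in sent)
    return {
        word: "rare" if count <= max_rare_count else "common"
        for word, count in counts.items()
    }


def split_sentences(sentences, word_labels, min_rare_ratio=0.3):
    """Split sentences into (common, rare) by their share of rare words.

    The ratio rule and min_rare_ratio are assumptions for this sketch.
    """
    common, rare = [], []
    for sent in sentences:
        if not sent:
            continue  # skip empty sentences to avoid division by zero
        rare_ratio = sum(word_labels[w] == "rare" for w in sent) / len(sent)
        (rare if rare_ratio >= min_rare_ratio else common).append(sent)
    return common, rare


# Toy corpus (invented), already tokenized by whitespace.
corpus = [
    "the valve must be closed".split(),
    "the valve must be checked".split(),
    "the pump must be checked".split(),
    "hydrogen sulfide leak detected".split(),
]

word_labels = label_words(corpus)
common_sentences, rare_sentences = split_sentences(corpus, word_labels)

# Zipf-style rank-frequency table: rank 1 is the most frequent word.
rank_frequency = Counter(w for s in corpus for w in s).most_common()
```

In this toy corpus only the last sentence consists mostly of words that occur once, so it is the only one placed in the rare group. In the pipeline the abstract describes, each group would then go to a text generation model, and rare entities in the generated sentences would be labeled by rules.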