Text simplification is a subfield of Natural Language Processing (NLP) that rewrites text into a simpler form so that it is easier to understand and explore. Understanding and retrieving knowledge from unstructured text is hard because such text usually consists of compound and complex sentences. State-of-the-art neural network-based methods simplify sentences for improved readability by replacing words with plain English substitutes and summarising sentences and paragraphs. In Knowledge Graph (KG) creation from unstructured text, however, summarising long sentences and substituting words is undesirable because it may lead to information loss: KG creation requires extracting all possible facts (triples) with the same mentions as in the text. In this work, we propose a controlled simplification based on the factual information in a sentence, i.e., its triples. We present a classical syntactic dependency-based approach that splits and rephrases a compound or complex sentence into a set of simplified sentences. This simplification retains the original wording while giving each sentence a simple structure that expresses a possible domain fact, i.e., a triple. The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity (SC), which is then reduced through the controlled syntactic simplification process. Last, we conduct an experiment on dataset re-annotation using GPT3; we aim to publish the refined corpus as a resource. This work was accepted and presented at the International Workshop on Learning with Knowledge Graphs (IWLKG) at the WSDM-2023 conference. The code and data are available at www.github.com/sallmanm/SynSim.
Title: Syntactic Complexity Identification, Measurement, and Reduction through Controlled Syntactic Simplification
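To make the idea of identifying and measuring syntactic complexity from a dependency parse more concrete, the following is a minimal sketch and not the algorithm from this paper. It assumes that a sentence has already been parsed into (word, part-of-speech, dependency label) tuples, and it uses a hypothetical proxy for SC: the number of tokens that introduce an additional clause. The label set `CLAUSE_DEPS`, the function names, and the hand-annotated example sentences are all illustrative assumptions.

```python
from typing import List, Tuple

# (word, coarse part-of-speech tag, dependency label), e.g. from any dependency parser.
Token = Tuple[str, str, str]

# ASSUMPTION: labels treated as introducing an extra clause. This set is
# illustrative only and is not the paper's definition of SC.
CLAUSE_DEPS = {"ccomp", "xcomp", "advcl", "acl", "relcl", "parataxis"}


def syntactic_complexity(tokens: List[Token]) -> int:
    """Count tokens that start an additional clause (a rough proxy for SC)."""
    score = 0
    for _word, pos, dep in tokens:
        if dep in CLAUSE_DEPS:
            score += 1
        # A coordinated verb signals a compound sentence; coordinated nouns
        # ("Alice and Bob") are deliberately not counted.
        elif dep == "conj" and pos == "VERB":
            score += 1
    return score


def needs_simplification(tokens: List[Token], threshold: int = 0) -> bool:
    """Flag a sentence whose proxy score exceeds the (hypothetical) threshold."""
    return syntactic_complexity(tokens) > threshold


# Hand-annotated toy parses (assumed, not produced by a real parser here).
compound = [
    ("Alice", "PROPN", "nsubj"), ("founded", "VERB", "ROOT"),
    ("Acme", "PROPN", "dobj"), ("and", "CCONJ", "cc"),
    ("hired", "VERB", "conj"), ("Bob", "PROPN", "dobj"),
]
simple = [
    ("Alice", "PROPN", "nsubj"), ("founded", "VERB", "ROOT"),
    ("Acme", "PROPN", "dobj"),
]

if __name__ == "__main__":
    for name, sent in [("compound", compound), ("simple", simple)]:
        print(name, syntactic_complexity(sent), needs_simplification(sent))
```

Under these assumptions, a sentence flagged by `needs_simplification` would be a candidate for splitting into simpler sentences, each keeping the original wording of one fact; the splitting step itself is not sketched here.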