Open Information Extraction models have shown promising results with sufficient supervision. However, these models face a fundamental challenge: the syntactic distribution of their training data covers only part of the syntactic variety found in the real world. In this paper, we propose a syntactically robust training framework that enables models to be trained on a syntactically abundant distribution built from diverse paraphrase generation. To tackle the intrinsic problem of knowledge deformation during paraphrasing, two algorithms based on semantic similarity matching and syntactic tree walking are used to restore the expressionally transformed knowledge. The training framework can be applied generally to other domains whose syntactic distributions are only partially observed. Based on the proposed framework, we build a new evaluation set, CaRB-AutoPara, a syntactically diverse dataset consistent with the real-world setting, for validating the robustness of models. Experiments, including a thorough analysis, show that model performance degrades as the difference in syntactic distribution increases, while our framework provides a robust bound. The source code is publicly available at https://github.com/qijimrc/RobustOIE.
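To make the semantic-similarity-matching step concrete: after a sentence is paraphrased, each argument of an original knowledge triple must be re-located in the paraphrase. A minimal sketch of this idea, using simple bag-of-words cosine similarity in place of whatever sentence encoder the full system would use (the function names `cosine_sim` and `match_argument` are illustrative, not from the released code):

```python
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two text spans."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_argument(original_arg: str, candidate_spans: list[str]) -> str:
    """Restore a triple argument by picking the paraphrased span
    most similar to the original argument."""
    return max(candidate_spans, key=lambda s: cosine_sim(original_arg, s))

# The span "the firm's founder" overlaps the original argument most,
# so it is selected as the restored argument.
print(match_argument("the company's founder",
                     ["the firm's founder", "in 2001", "resigned"]))
```

In the full framework, embeddings would replace lexical overlap so that purely synonymous rewrites (with no shared tokens) can still be matched; the control flow, however, is the same argument-by-argument maximization shown here.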