Confidentiality hinders the publication of authentic, labeled datasets of personal and enterprise data, even though such datasets would be useful for evaluating knowledge graph construction approaches in industrial scenarios. We therefore propose to generate such data synthetically so that it appears as authentic as possible. We assume that knowledge workers follow certain habits when they produce or manage data; if so, generation patterns can be discovered and used by data generators to imitate real datasets. In this paper, we derive an initial set of 11 distinct patterns found in real spreadsheets from industry and demonstrate a suitable generator, called Data Sprout, that is able to reproduce them. We describe how the generator produces spreadsheets in general and what altering effects the implemented patterns have.