With Big Data and data lakes, we face a mass of data that is very difficult to manage manually. Protecting personal data in this context requires automatic analysis for data discovery. Storing the names of attributes that have already been analyzed in a knowledge base could optimize this automatic discovery. To keep the knowledge base of high quality, attributes whose names carry no meaning should not be stored. In this article, to check whether an attribute name is meaningful, we propose a solution that calculates the distances between this name and the words in a dictionary. Our studies of distance functions such as N-Gram, Jaro-Winkler and Levenshtein show their limits when it comes to setting an acceptance threshold for admitting an attribute into the knowledge base. To overcome these limitations, our solution strengthens the score calculation by using an exponential function based on the longest sequence. In addition, a double scan of the dictionary is proposed in order to process attributes that have a compound name.
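The scoring idea can be illustrated with a minimal sketch. The abstract does not state the actual formula, so everything below is an assumption made for illustration: "longest sequence" is read as the longest common substring, the exponential score is normalized as (e^r - 1) / (e - 1) with r the ratio of that length to the longer string, the 0.8 threshold is arbitrary, and the function names `score` and `is_meaningful` are hypothetical. The double scan is approximated by removing the longest dictionary word found inside the name and scoring the remainder.

```python
import math


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def score(name: str, word: str) -> float:
    """Assumed exponential score in [0, 1]; it rewards long shared runs
    more than proportionally and reaches 1 only for identical strings."""
    if not name or not word:
        return 0.0
    ratio = longest_common_substring(name, word) / max(len(name), len(word))
    return (math.exp(ratio) - 1) / (math.e - 1)


def best_score(name: str, dictionary: list[str]) -> float:
    """Highest score of name against any dictionary word (0.0 if empty)."""
    return max((score(name, w) for w in dictionary), default=0.0)


def is_meaningful(name: str, dictionary: list[str], threshold: float = 0.8) -> bool:
    """First scan on the whole name; second scan on what remains after
    removing the longest dictionary word contained in the name."""
    name = name.lower()
    if best_score(name, dictionary) >= threshold:
        return True
    contained = [w for w in dictionary if w in name]
    if not contained:
        return False
    first_part = max(contained, key=len)
    rest = name.replace(first_part, "", 1).strip("_- ")
    # The compound name is accepted only if the remainder also matches a word.
    return bool(rest) and best_score(rest, dictionary) >= threshold


if __name__ == "__main__":
    words = ["birth", "date", "name", "address"]
    for attribute in ["date", "birthdate", "birth_date", "xqzt"]:
        print(attribute, is_meaningful(attribute, words))
```

In this sketch a compound name such as `birthdate` fails the first scan, because no single dictionary word covers enough of it, and is accepted only by the second scan. A real implementation would need the paper's own formula and a threshold tuned on data.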