For humans, it is often possible to predict data correlations from column names. We conduct experiments to find out whether deep neural networks can learn to do the same. If so, this would, for instance, open up the possibility of tuning tools that use NLP analysis of schema elements to prioritize their efforts for correlation detection. We analyze correlations for around 120,000 column pairs, taken from around 4,000 data sets. We try to predict correlations based on column names alone. For predictions, we exploit pre-trained language models based on the recently proposed Transformer architecture. We consider different types of correlations, multiple prediction methods, and various prediction scenarios. We study the impact of factors such as column name length and the amount of training data on prediction accuracy. Altogether, we find that deep neural networks can predict correlations with relatively high accuracy in many scenarios (e.g., with an accuracy of 95% for long column names).
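To make the approach concrete, the following is a minimal sketch of how a pre-trained Transformer could be applied to a pair of column names, framed as sequence-pair classification. The base model (bert-base-uncased), the example column names, and the two-label setup are illustrative assumptions, not details taken from the paper; in practice the classification head would first be fine-tuned on labeled column-name pairs.

```python
# Hypothetical sketch: scoring whether two column names are likely correlated,
# using a pre-trained Transformer with a sequence-pair classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed base model; the paper does not prescribe this particular checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # label 1 = "correlated" (assumption)
)
# NOTE: the classification head is randomly initialized here; it would be
# fine-tuned on column-name pairs labeled as correlated / not correlated.

# Encode the two column names as a sentence pair (illustrative names).
inputs = tokenizer("customer_age", "years_since_birth", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

prob_correlated = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(correlated) = {prob_correlated:.2f}")
```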