Detecting the semantic concepts of columns in tabular data is of particular interest to many applications, ranging from data integration, cleaning, and search to feature engineering and model building in machine learning. Recently, several works have proposed supervised learning-based or heuristic pattern-based approaches to semantic type annotation. Both have shortcomings that prevent them from generalizing over a large number of concepts or examples. Many neural-network-based methods also present scalability issues. Additionally, none of the known methods works well for numerical data. We propose $C^2$, a column-to-concept mapper that is based on a maximum likelihood estimation approach through ensembles. It is able to effectively utilize vast amounts of openly available, albeit somewhat noisy, table corpora, in addition to two popular knowledge graphs, to perform effective and efficient concept prediction for structured data. We demonstrate the effectiveness of $C^2$ over available techniques on 9 datasets, the most comprehensive comparison on this topic so far.