Data scientists routinely face the problem of improving prediction accuracy when the available tabular data is insufficient. We propose a table enrichment system that enriches a query table with external attributes (columns) drawn from data lakes and thereby improves the accuracy of machine learning predictive models. The system has four stages: join row search, task-related table selection, row and column alignment, and feature selection and evaluation. Together, these stages efficiently create an enriched table for a given query table and a specified machine learning task. We demonstrate the system through a web UI that illustrates use cases of table enrichment.
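To make the four stages concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not the system's actual implementation: every function name, the key-coverage heuristic standing in for task-related table selection, and the Pearson-correlation filter standing in for feature selection and evaluation are assumptions chosen for illustration. Tables are represented as lists of row dictionaries, and the data lake as a dictionary of such tables.

```python
from statistics import mean

# All names below are hypothetical; this is an illustrative sketch only.

def join_row_search(query, key, lake):
    """Stage 1: per lake table, collect rows whose key value occurs in the query table."""
    query_keys = {row[key] for row in query}
    hits = {}
    for name, table in lake.items():
        matched = [r for r in table if r.get(key) in query_keys]
        if matched:
            hits[name] = matched
    return hits

def select_tables(hits, query, key, min_coverage=0.5):
    """Stage 2 (assumed heuristic): keep tables that cover enough distinct query keys."""
    n_keys = len({row[key] for row in query})
    return {name: rows for name, rows in hits.items()
            if len({r[key] for r in rows}) / n_keys >= min_coverage}

def align(query, key, selected):
    """Stage 3: left-join matched rows onto the query table, one column per external attribute."""
    enriched = [dict(row) for row in query]
    for name, rows in selected.items():
        lookup = {r[key]: r for r in rows}
        cols = sorted({c for r in rows for c in r if c != key})
        for row in enriched:
            match = lookup.get(row[key], {})
            for c in cols:
                row[f"{name}.{c}"] = match.get(c)  # None when no row matched
    return enriched

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def select_features(enriched, original_cols, target, threshold=0.5):
    """Stage 4 (assumed stand-in): keep added numeric columns correlated with the target."""
    kept = []
    added = [c for c in enriched[0] if c not in original_cols]
    for col in added:
        pairs = [(r[col], r[target]) for r in enriched if r[col] is not None]
        if len(pairs) < 2 or not all(isinstance(x, (int, float)) for x, _ in pairs):
            continue
        xs, ys = zip(*pairs)
        if abs(pearson(xs, ys)) >= threshold:
            kept.append(col)
    return kept

def enrich(query, key, target, lake):
    """Run the four stages and return the enriched table plus the kept column names."""
    original_cols = list(query[0])
    hits = join_row_search(query, key, lake)
    selected = select_tables(hits, query, key)
    enriched = align(query, key, selected)
    kept = select_features(enriched, original_cols, target)
    table = [{c: r[c] for c in original_cols + kept} for r in enriched]
    return table, kept

# Toy data: "demo" is informative, "noise" is uncorrelated, "other" is not joinable.
query = [{"city": "A", "price": 10.0}, {"city": "B", "price": 20.0},
         {"city": "C", "price": 30.0}, {"city": "D", "price": 40.0}]
lake = {
    "demo": [{"city": "A", "income": 1.0}, {"city": "B", "income": 2.0},
             {"city": "C", "income": 3.0}, {"city": "D", "income": 4.0}],
    "noise": [{"city": "A", "code": 5.0}, {"city": "B", "code": 1.0},
              {"city": "C", "code": 1.0}, {"city": "D", "code": 5.0}],
    "other": [{"city": "Z", "x": 1.0}],
}
result, kept = enrich(query, "city", "price", lake)
```

In this toy run, only the `demo.income` column survives: the `other` table has no joinable rows, and `noise.code` has zero correlation with `price`. A realistic version would replace the correlation filter with an evaluation of the downstream predictive model on the enriched table, and exact key equality with a more tolerant join search.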