Many data analysis tasks rely heavily on a deep understanding of tables (multi-dimensional data). Across these tasks, there are commonly used metadata attributes of table fields / columns. In this paper, we identify four such analysis metadata: measure/dimension dichotomy, common field roles, semantic field type, and default aggregation function. Inferring these metadata faces three challenges: insufficient supervision signals, utilizing existing knowledge, and understanding field distributions. To infer these metadata for a raw table, we propose our multi-tasking Metadata model, which fuses field distribution and knowledge graph information into pre-trained tabular models. For model training and evaluation, we collect a large corpus (~582k tables from private spreadsheet and public tabular datasets) of analysis metadata by using diverse smart supervisions from downstream tasks. Our best model has accuracy = 98%, hit rate at top-1 > 67%, accuracy > 80%, and accuracy = 88% for the four analysis metadata inference tasks, respectively. It outperforms a series of baselines that are based on rules, traditional machine learning methods, and pre-trained tabular models. Analysis metadata models are deployed in a popular data analysis product, helping downstream intelligent features such as insights mining, chart / pivot table recommendation, and natural language QA...
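To make the four analysis metadata and the reported hit-rate metric concrete, the following is a minimal illustrative sketch, not the model or evaluation code of this work. The class `FieldMetadata`, the enum `FieldKind`, the example label strings, and the function `hit_rate_at_k` are all hypothetical names introduced here. The metric is assumed to follow the usual definition (the fraction of fields whose gold label appears among the top-k ranked predictions), which may differ in detail from the one used in the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class FieldKind(Enum):
    """Measure/dimension dichotomy of a table field."""
    MEASURE = "measure"
    DIMENSION = "dimension"


@dataclass
class FieldMetadata:
    """Hypothetical container for the four analysis metadata of one field.

    The string labels are illustrative placeholders, not the label
    vocabulary used in the paper.
    """
    name: str
    kind: FieldKind            # measure/dimension dichotomy
    role: str                  # common field role
    semantic_type: str         # semantic field type
    default_aggregation: str   # default aggregation function


def hit_rate_at_k(ranked_predictions: Sequence[Sequence[str]],
                  gold_labels: Sequence[str],
                  k: int = 1) -> float:
    """Fraction of examples whose gold label is in the top-k predictions.

    Assumes each inner sequence is sorted from most to least likely.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must align")
    if not gold_labels:
        raise ValueError("at least one example is required")
    if k < 1:
        raise ValueError("k must be a positive integer")
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Illustrative usage with made-up labels for a hypothetical "Revenue" column.
revenue = FieldMetadata(
    name="Revenue",
    kind=FieldKind.MEASURE,
    role="common measure",
    semantic_type="money",
    default_aggregation="SUM",
)

example_rankings: List[List[str]] = [["a", "b"], ["c", "d"], ["e", "f"]]
example_gold: List[str] = ["a", "d", "x"]
top1 = hit_rate_at_k(example_rankings, example_gold, k=1)
top2 = hit_rate_at_k(example_rankings, example_gold, k=2)
```

Under these assumptions, accuracy corresponds to the single-prediction case, while hit rate at top-k suits tasks where a ranked list of candidates is produced.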