Tabular data, comprising rows (samples) that share the same set of columns (attributes), is one of the most widely used data types across industries, including financial services, health care, research, retail, and logistics. Tables are becoming the natural way of storing data in both industry and academia, and the data stored in them serve as an essential source of information for decision-making. As computational power and internet connectivity increase, the volume of data stored by these organizations grows exponentially: not only do the databases become vast and challenging to maintain and operate, but the number of database tasks also increases. This has prompted a new line of research that applies learning techniques to support database tasks on such large and complex tables. In this work, we split the quest of learning on tabular data into two phases: the Classical Learning Phase and the Modern Machine Learning Phase. The Classical Learning Phase consists of models such as SVMs, linear and logistic regression, and tree-based methods. These models are best suited to small tables, but the tasks they can address are limited to classification and regression. In contrast, the Modern Machine Learning Phase contains models that use deep learning to learn latent-space representations of table entities. The objective of this survey is to scrutinize the varied approaches practitioners use to learn representations of structured data and to compare their efficacy.