Supervised deep learning is most commonly applied to difficult problems defined on large and often extensively curated datasets. Here we demonstrate the ability of deep representation learning to address problems of classification and regression from small and poorly formed tabular datasets by encoding input information as abstracted sequences composed of a fixed number of characters per input field. We find that small models have sufficient capacity to approximate various functions and achieve record accuracy on classification benchmarks. Such models are shown to form useful embeddings of various input features in their hidden layers, even if the learned task does not explicitly require knowledge of those features. These models are also amenable to input attribution, allowing for an estimation of the importance of each input element to the model output, as well as of which input features are effectively embedded in the model. We present a proof-of-concept for the application of small language models to mixed tabular data without explicit feature engineering, cleaning, or preprocessing, relying on the model to perform these tasks as part of the representation learning process.
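The following is a minimal sketch of the fixed-width character encoding described above, not the paper's exact pipeline: each field of a possibly messy tabular row is rendered as a string, then padded or truncated to a fixed width, so every row becomes one sequence with a constant number of characters per field. The names encode_row, FIELD_WIDTH, and PAD_CHAR are hypothetical, as are the specific width and padding character.

```python
FIELD_WIDTH = 8   # assumed fixed number of characters per input field
PAD_CHAR = "_"    # assumed padding character for short or missing fields

def encode_row(row):
    """Encode a list of heterogeneous field values as one character sequence."""
    pieces = []
    for value in row:
        text = "" if value is None else str(value)
        # Truncate long fields and pad short ones to exactly FIELD_WIDTH chars,
        # so downstream models see a fixed number of characters per field.
        pieces.append(text[:FIELD_WIDTH].ljust(FIELD_WIDTH, PAD_CHAR))
    return "".join(pieces)

# Example: a row with mixed types and a missing value, taken without
# any prior cleaning or feature engineering.
row = [3.14159, "blue", None, 42]
print(encode_row(row))  # "3.14159_blue____________42______"
```

Because each field occupies a fixed character span, the model can learn positional correspondences between characters and fields directly from the sequence, leaving cleaning and feature extraction to representation learning.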