Supervised deep learning is most commonly applied to difficult problems defined on large and often extensively curated datasets. Here we demonstrate the ability of deep representation learning to address problems of classification and regression from small and poorly formed tabular datasets by encoding input information as abstracted sequences composed of a fixed number of characters per input field. We find that small models have sufficient capacity to approximate various functions and achieve record accuracy on classification benchmarks. Such models are shown to form useful embeddings of various input features in their hidden layers, even if the learned task does not explicitly require knowledge of those features. These models are also amenable to input attribution, allowing for an estimation of the importance of each input element to the model output, as well as of which input features are effectively embedded in the model. We present a proof-of-concept for the application of small language models to mixed tabular data without explicit feature engineering, cleaning, or preprocessing, relying on the model to perform these tasks as part of the representation learning process.
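The following is a minimal sketch of the fixed-width character encoding described above, not the paper's exact pipeline: each field of a possibly messy tabular row is rendered as a string, then padded or truncated to a fixed width, so every row becomes one sequence with a constant number of characters per field. The names encode_row, FIELD_WIDTH, and PAD_CHAR are hypothetical, as are the specific width and padding character.

```python
FIELD_WIDTH = 8   # assumed fixed number of characters per input field
PAD_CHAR = "_"    # assumed padding character for short or missing fields

def encode_row(row):
    """Encode a list of heterogeneous field values as one character sequence."""
    pieces = []
    for value in row:
        text = "" if value is None else str(value)
        # Truncate long fields and pad short ones to exactly FIELD_WIDTH chars,
        # so downstream models see a fixed number of characters per field.
        pieces.append(text[:FIELD_WIDTH].ljust(FIELD_WIDTH, PAD_CHAR))
    return "".join(pieces)

# Example: a row with mixed types and a missing value, taken without
# any prior cleaning or feature engineering.
row = [3.14159, "blue", None, 42]
print(encode_row(row))  # "3.14159_blue____________42______"
```

Because each field occupies a fixed character span, the model can learn positional correspondences between characters and fields directly from the sequence, leaving cleaning and feature extraction to representation learning.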