We study the application of large language models to zero-shot and few-shot classification of tabular data. We prompt the large language model with a serialization of the tabular data to a natural-language string, together with a short description of the classification problem. In the few-shot setting, we fine-tune the large language model on a small number of labeled examples. We evaluate several serialization methods, including templates, table-to-text models, and large language models. Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method's ability to exploit prior knowledge encoded in large language models. Unlike many deep learning methods for tabular datasets, this approach is also competitive with strong traditional baselines like gradient-boosted trees, especially in the very-few-shot setting.
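To make the template-based serialization concrete, the minimal Python sketch below turns one tabular row into a natural-language prompt for zero-shot classification. The column names, example values, and prompt wording are illustrative assumptions, not the exact templates evaluated in the experiments.

```python
# Minimal sketch of template-based serialization for zero-shot tabular
# classification. Feature names and prompt phrasing are illustrative
# assumptions, not the exact templates from the paper.

def serialize_row(row: dict) -> str:
    """Render one tabular row as a natural-language string,
    one sentence per feature: 'The <column> is <value>.'"""
    return " ".join(f"The {col} is {val}." for col, val in row.items())


def build_prompt(row: dict, task_description: str, labels: list[str]) -> str:
    """Combine the serialized row with a short description of the
    classification problem and the admissible answer labels."""
    return (
        f"{serialize_row(row)}\n"
        f"{task_description}\n"
        f"Answer with one of: {', '.join(labels)}.\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Hypothetical income-prediction row, in the style of the Adult dataset.
    row = {
        "age": 42,
        "education": "Bachelors",
        "occupation": "teacher",
        "hours per week": 40,
    }
    prompt = build_prompt(
        row,
        task_description="Does this person earn more than 50000 dollars per year?",
        labels=["yes", "no"],
    )
    print(prompt)  # This string would be fed to the large language model.
```

In the few-shot setting, prompts of this form paired with their labels would serve as the fine-tuning data for the large language model.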