We study the application of large language models to zero-shot and few-shot classification of tabular data. We prompt the large language model with a serialization of the tabular data to a natural-language string, together with a short description of the classification problem. In the few-shot setting, we fine-tune the large language model on a small number of labeled examples. We evaluate several serialization methods, including templates, table-to-text models, and large language models. Despite its simplicity, this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method's ability to exploit the prior knowledge encoded in large language models. Unlike many deep learning methods for tabular data, this approach is also competitive with strong traditional baselines such as gradient-boosted trees, especially in the very-few-shot setting.
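To make the serialization step concrete, the following minimal sketch shows what template-based serialization and prompt construction might look like for this approach. The feature names, example row, task wording, and label options are illustrative assumptions in the style of a standard income-prediction benchmark, not the paper's actual templates.

```python
# Minimal sketch of template-based serialization for zero-shot tabular
# classification with an LLM. The column names, example row, and prompt
# wording below are illustrative assumptions, not the paper's templates.

def serialize_row(row: dict) -> str:
    """Turn one tabular row into a natural-language string using a
    simple template: 'The <column> is <value>.' for each feature."""
    return " ".join(f"The {col} is {val}." for col, val in row.items())

def build_prompt(row: dict, task_description: str, label_options: list) -> str:
    """Combine the serialized row with a short description of the
    classification problem and the candidate labels."""
    return (
        f"{serialize_row(row)}\n"
        f"{task_description}\n"
        f"Answer with one of: {', '.join(label_options)}.\n"
        "Answer:"
    )

if __name__ == "__main__":
    # Hypothetical row in the style of an income-prediction dataset.
    row = {
        "age": 42,
        "education": "Bachelors",
        "occupation": "Sales",
        "hours per week": 50,
    }
    prompt = build_prompt(
        row,
        task_description="Does this person earn more than 50000 dollars per year?",
        label_options=["yes", "no"],
    )
    print(prompt)
    # This prompt would then be passed to a large language model. In the
    # zero-shot setting its answer is used directly; in the few-shot
    # setting the model is first fine-tuned on a handful of labeled examples.
```

In this sketch, all prior knowledge resides in the pretrained language model; the serialization merely renders the row in a form the model can condition on, which is what allows non-trivial zero-shot performance.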