Deep neural networks (DNNs) are often used for text classification tasks because they usually achieve high accuracy. However, DNNs can be computationally intensive, requiring billions of parameters and large amounts of labeled data, which makes them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that is easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a $k$-nearest-neighbor classifier. Without any training, pre-training, or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings, where labeled data are too scarce to train DNNs to a satisfactory accuracy.
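As a concrete illustration of the compressor-plus-$k$NN idea, the sketch below shows one plausible way such a classifier could be implemented in Python: gzip-compressed lengths are used to compute a normalized compression distance (NCD) between texts, and a label is predicted by majority vote over the $k$ nearest training samples. The `ncd` and `classify` helpers and the default choice of $k=2$ are illustrative assumptions, not necessarily the exact implementation described in the paper.

```python
import gzip
import numpy as np

def ncd(x: str, y: str) -> float:
    # Normalized compression distance, approximated with gzip compressed lengths.
    cx = len(gzip.compress(x.encode("utf-8")))
    cy = len(gzip.compress(y.encode("utf-8")))
    cxy = len(gzip.compress((x + " " + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(test_text: str, train_texts: list[str], train_labels: list, k: int = 2):
    # Compute distances from the test sample to every training sample,
    # then take a majority vote among the k nearest neighbors.
    distances = [ncd(test_text, t) for t in train_texts]
    top_k_idx = np.argsort(distances)[:k]
    top_labels = [train_labels[i] for i in top_k_idx]
    # Simple majority vote; ties are broken arbitrarily.
    return max(set(top_labels), key=top_labels.count)

# Hypothetical usage with toy data:
train_texts = ["the team won the match", "stocks fell sharply today"]
train_labels = ["sports", "finance"]
print(classify("the striker scored twice in the match", train_texts, train_labels))
```

Because the method needs no parameters or training step, the entire "model" is just the labeled training texts themselves; the cost is shifted to compression at prediction time, which scales with the size of the training set.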