We present, to our knowledge, the first application of BERT to document classification. A few characteristics of the task might lead one to think that BERT is not the most appropriate model: syntactic structures matter less for content categories, documents are often longer than typical BERT input, and documents often have multiple labels. Nevertheless, we show that a straightforward classification model using BERT is able to achieve the state of the art across four popular datasets. To address the computational expense associated with BERT inference, we distill knowledge from BERT-large to small bidirectional LSTMs, reaching BERT-base parity on multiple datasets using 30× fewer parameters. The primary contribution of our paper is improved baselines that can provide the foundation for future work.
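As a rough illustration (not the authors' released code), the "straightforward classification model" can be sketched as BERT's pooled [CLS] representation feeding a single linear layer, trained with a sigmoid/BCE loss so that a document may carry multiple labels. The model name, dropout rate, and label count below are assumptions for the sketch.

```python
# Minimal sketch of multi-label document classification with BERT.
# Assumptions: bert-base-uncased, dropout 0.1, 90 labels (illustrative).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertDocClassifier(nn.Module):
    def __init__(self, num_labels: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        # One logit per label; sigmoid (not softmax) handles multi-label docs.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(out.pooler_output)  # pooled [CLS] vector
        return self.classifier(pooled)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertDocClassifier(num_labels=90)
# Documents longer than 512 tokens are truncated to fit BERT's input limit.
batch = tokenizer(["example document text ..."], truncation=True,
                  max_length=512, padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
targets = torch.zeros_like(logits)  # placeholder multi-hot label vector
loss = nn.BCEWithLogitsLoss()(logits, targets)
```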
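The distillation step can likewise be sketched as logit matching: a small bidirectional LSTM student is trained on a mix of the hard labels and the fine-tuned BERT-large teacher's logits. The architecture sizes, the max-pooling choice, and the weighting alpha below are assumed values, not the paper's exact hyperparameters.

```python
# Minimal sketch of distilling a BERT-large teacher into a BiLSTM student.
# Assumptions: 300-d embeddings, 256-d hidden state, alpha = 0.5.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMStudent(nn.Module):
    def __init__(self, vocab_size: int, num_labels: int,
                 embed_dim: int = 300, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))  # (batch, time, 2*hidden)
        pooled, _ = h.max(dim=1)                 # max-pool over time steps
        return self.classifier(pooled)

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    # Hard-label term (multi-label BCE) plus a soft term that regresses
    # the student's logits onto the teacher's; alpha is an assumed weight.
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    soft = F.mse_loss(student_logits, teacher_logits)
    return alpha * hard + (1 - alpha) * soft
```

At inference time only the student runs, which is where the roughly 30× parameter reduction relative to BERT-base comes from.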