Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec. One of the key properties of the obtained representations is their dimension. While the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is often unclear whether the default dimension is the most suitable choice for subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tuning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can not only significantly compress the initial representation, but also potentially improve its performance on the task of text classification. Smaller and less noisy representations are a desirable property during deployment, as models that are orders of magnitude smaller can significantly reduce the computational overhead and, with it, the deployment costs. We propose CoRe, a straightforward, representation-learner-agnostic framework for representation compression. CoRe's performance is showcased and studied on a collection of 17 real-life corpora from the biomedical, news, social media, and literary domains. We explore CoRe's behavior with contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Results based on more than 100,000 compression experiments indicate that recursive Singular Value Decomposition offers a very good trade-off between compression efficiency and performance, making CoRe useful in many existing, representation-dependent NLP pipelines.
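To make the idea of recursive SVD-based compression concrete, the following is a minimal illustrative sketch, not the authors' CoRe implementation: it repeatedly halves the dimension of a document-embedding matrix with truncated SVD until a target dimension is reached, which is the general flavor of compression the abstract describes. The function name, target dimension, and the random embedding matrix are assumptions introduced purely for illustration.

```python
# Illustrative sketch (assumed API, not the CoRe code): recursively compress
# document embeddings with truncated SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

def recursive_svd_compress(embeddings: np.ndarray, target_dim: int) -> np.ndarray:
    """Halve the representation dimension at each step until target_dim is reached."""
    compressed = embeddings
    while compressed.shape[1] > target_dim:
        # Halve the current dimension, but never drop below the requested target.
        next_dim = max(target_dim, compressed.shape[1] // 2)
        svd = TruncatedSVD(n_components=next_dim, random_state=0)
        compressed = svd.fit_transform(compressed)
    return compressed

# Hypothetical usage: compress 768-dimensional document representations to 96
# dimensions before feeding them to a downstream text classifier.
docs = np.random.rand(1000, 768)          # stand-in for real document embeddings
small = recursive_svd_compress(docs, 96)  # resulting shape: (1000, 96)
```

In a real pipeline the input would be the contextual or non-contextual document representations mentioned above, and the compressed matrix would replace them in the downstream classification step.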