In this paper, we propose a dictionary screening method for embedding compression in text classification tasks. The key purpose of this method is to evaluate the importance of each keyword in the dictionary. To this end, we first train a pre-specified recurrent neural network-based model using the full dictionary. This yields a benchmark model, which we then use to obtain the predicted class probabilities for each sample in the dataset. Next, we develop a novel method for assessing the importance of each keyword according to its impact on these predicted class probabilities. Each keyword can then be screened, and only the most important keywords are retained. With these screened keywords, a new dictionary of considerably reduced size can be constructed, and the original text sequences can be substantially compressed. The proposed method leads to significant reductions in the number of parameters, the average text sequence length, and the dictionary size, while the predictive power remains highly competitive with that of the benchmark model. Extensive numerical studies are presented to demonstrate the empirical performance of the proposed method.
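The abstract does not spell out the exact importance criterion, so the following is only a minimal illustrative sketch in Python. It assumes the trained benchmark model is exposed through a hypothetical `predict_proba` function, and it scores each keyword by how much masking its occurrences (with an assumed out-of-dictionary token `<unk>`) perturbs the benchmark's predicted class probabilities; the paper's actual importance measure may differ.

```python
import numpy as np

UNK = "<unk>"  # hypothetical out-of-dictionary placeholder token

def screen_dictionary(texts, dictionary, predict_proba, keep_ratio=0.1):
    """Rank keywords by their impact on the benchmark model's predicted
    class probabilities and keep only the top fraction.

    texts         : list of token lists (already segmented)
    dictionary    : list of keywords (the full dictionary)
    predict_proba : callable mapping a batch of token lists to an
                    (n_samples, n_classes) array of class probabilities
                    from the trained benchmark model (assumed given)
    keep_ratio    : fraction of keywords to retain (illustrative choice)
    """
    base = predict_proba(texts)  # benchmark predictions under the full dictionary
    scores = {}
    for kw in dictionary:
        # Mask every occurrence of this keyword and re-predict.
        masked = [[UNK if tok == kw else tok for tok in seq] for seq in texts]
        pert = predict_proba(masked)
        # Assumed importance score: mean absolute change in probabilities.
        scores[kw] = float(np.mean(np.abs(base - pert)))
    k = max(1, int(keep_ratio * len(dictionary)))
    kept = sorted(dictionary, key=lambda w: scores[w], reverse=True)[:k]
    return kept, scores

def compress(texts, kept_keywords):
    """Rewrite each text sequence using only the screened dictionary,
    dropping out-of-dictionary tokens; this shortens the average
    sequence length and shrinks the embedding table accordingly."""
    kept = set(kept_keywords)
    return [[tok for tok in seq if tok in kept] for seq in texts]
```

A reduced embedding layer can then be rebuilt over the screened dictionary and the classifier retrained or fine-tuned on the compressed sequences, which is where the parameter savings described above are realized.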