Stopwords carry little semantic information and are often removed from text data to reduce dataset size and improve machine learning model performance. Consequently, researchers have sought to develop techniques for generating effective stopword sets. Previous approaches have ranged from qualitative techniques relying on linguistic experts to statistical approaches that estimate word importance from correlations or frequency-based metrics computed on a corpus. We present a novel quantitative approach that employs iterative and recursive feature-deletion algorithms to identify which words can be deleted from a pre-trained transformer's vocabulary with the least degradation in its performance, specifically on the task of sentiment analysis. Empirically, stopword lists generated with this approach drastically reduce dataset size with negligible impact on model performance; in one example, the corpus shrank by 28.4% while the accuracy of a trained logistic regression model improved by 0.25%. In another, the corpus shrank by 63.7% with only a 2.8% decrease in accuracy. These promising results indicate that our approach can generate highly effective stopword sets for specific NLP tasks.
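To make the idea concrete, the following is a minimal sketch of the greedy iterative variant of feature deletion, assuming a bag-of-words logistic-regression evaluator rather than the pre-trained transformer described above; `generate_stopwords`, `n_rounds`, and `tolerance` are hypothetical names for illustration, not the paper's implementation.

```python
# Sketch only: greedily delete the vocabulary word whose removal hurts
# validation accuracy the least, until any further deletion costs more
# than `tolerance` accuracy. All names here are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def generate_stopwords(texts, labels, n_rounds=10, tolerance=0.001):
    stopwords = set()
    X_train, X_val, y_train, y_val = train_test_split(texts, labels, random_state=0)

    def accuracy(excluded):
        # Re-vectorize with the candidate stopwords removed and refit the model.
        vec = CountVectorizer(stop_words=list(excluded) or None)
        model = LogisticRegression(max_iter=1000)
        model.fit(vec.fit_transform(X_train), y_train)
        return model.score(vec.transform(X_val), y_val)

    baseline = accuracy(stopwords)
    vocab = CountVectorizer().fit(X_train).get_feature_names_out()
    for _ in range(n_rounds):
        # Score every remaining word by the validation accuracy after deleting it.
        scores = {w: accuracy(stopwords | {w}) for w in vocab if w not in stopwords}
        best_word, best_acc = max(scores.items(), key=lambda kv: kv[1])
        if baseline - best_acc > tolerance:
            break  # deleting any further word degrades accuracy too much
        stopwords.add(best_word)
    return stopwords
```

Note that this naive loop refits one model per candidate word per round, which is far costlier than scoring deletions against a single pre-trained transformer; it is meant only to illustrate the deletion-and-evaluation cycle.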