Is augmentation effective to improve prediction in imbalanced text datasets?

Gabriel O. Assunção, Rafael Izbicki, Marcos O. Prates

From arXiv, 21 pages, 5 figures

Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used in natural language processing (NLP) to generate new samples for the minority class. However, in this paper, we challenge the common assumption that data augmentation is always necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce similar results to oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data, and help researchers and practitioners make informed decisions about which methods to use for a given task.
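The claim that a cutoff adjustment can stand in for oversampling can be illustrated with a standard prior-shift argument. The sketch below is not taken from the paper. It assumes a binary problem, a classifier that outputs the true posterior P(y=1 | x), and oversampling done by replicating each minority (class 1) example k times. Under those assumptions replication multiplies the posterior odds by k, so a 0.5 cutoff on the oversampled model matches a cutoff of 1/(1+k) on the original one. The function names are hypothetical.

```python
def oversampled_posterior(p: float, k: float) -> float:
    """Posterior P(y=1 | x) after replicating class-1 examples k times.

    Assumption: replication scales the class-1 prior odds by k and leaves
    the class-conditional distributions unchanged, so posterior odds
    p / (1 - p) become k * p / (1 - p).
    """
    return k * p / (k * p + (1.0 - p))


def equivalent_cutoff(k: float, cutoff: float = 0.5) -> float:
    """Cutoff on the original posterior that reproduces `cutoff` applied
    to the oversampled posterior.

    Solving k*p / (k*p + 1 - p) > cutoff for p gives
    p > cutoff / (cutoff + k * (1 - cutoff)).
    """
    return cutoff / (cutoff + k * (1.0 - cutoff))


if __name__ == "__main__":
    k = 9.0  # hypothetical replication factor, e.g. a 1:9 class imbalance
    shifted = equivalent_cutoff(k)
    # Illustrative posteriors, chosen away from the decision boundary.
    for p in [0.02, 0.08, 0.12, 0.5, 0.9]:
        with_oversampling = oversampled_posterior(p, k) > 0.5
        with_cutoff_only = p > shifted
        # Both rules make the same decision for every p.
        assert with_oversampling == with_cutoff_only
        print(f"p={p:.2f} oversampled={with_oversampling} cutoff_only={with_cutoff_only}")
```

In practice a fitted classifier only approximates the posterior, and text augmentation generates new samples rather than exact copies, so the equivalence holds only approximately outside these assumptions.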
