Classification algorithms using Transformer architectures can be affected by the sequence length learning problem whenever observations from different classes have different length distributions. This problem leads models to use sequence length as a predictive feature instead of relying on important textual information. While most public datasets are not affected by this problem, private corpora in fields such as medicine and insurance may carry this data bias. This poses challenges throughout the value chain, given their use in machine learning applications. In this paper, we empirically expose this problem and present approaches to minimize its impact.
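The bias described above can be illustrated with a toy sketch (all names and the length distributions here are hypothetical, not from the paper): when two classes have different length distributions, a trivial "classifier" that only inspects sequence length, never the content, can still score well above chance.

```python
import random

random.seed(0)

# Hypothetical toy corpus: class 0 documents tend to be short,
# class 1 documents tend to be long. Token content is identical,
# so only length carries any signal.
docs = [(["tok"] * random.randint(5, 50), 0) for _ in range(500)] + \
       [(["tok"] * random.randint(40, 200), 1) for _ in range(500)]

THRESHOLD = 45  # picked between the two (overlapping) length ranges

def length_only_predict(tokens):
    # Ignores all textual information; decides purely on sequence length.
    return 0 if len(tokens) < THRESHOLD else 1

accuracy = sum(length_only_predict(t) == y for t, y in docs) / len(docs)
print(f"length-only accuracy: {accuracy:.2f}")
```

Despite ignoring every token, this predictor is far better than the 0.50 chance level, which is exactly the shortcut a Transformer classifier may learn when such a length skew is present in training data.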