Noisy labels in large e-commerce product data (i.e., product items placed in incorrect categories) are a critical issue for the product categorization task because they are unavoidable, non-trivial to remove, and significantly degrade prediction performance. Training a product title classification model that is robust to noisy labels is therefore important for making product classification applications more practical. In this paper, we study the impact of instance-dependent noise on the performance of product title classification by comparing our data denoising algorithm with different noise-resistant training algorithms, which are designed to prevent a classifier from overfitting to noise. We develop a simple yet effective deep neural network for product title classification to use as the base classifier. Alongside recent methods for simulating instance-dependent noise, we propose a novel noise simulation algorithm based on product title similarity. Our experiments cover multiple datasets, various noise methods, and different training solutions. The results reveal the limits of the classification task when the noise rate is not negligible and the data distribution is highly skewed.
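The abstract does not spell out how noise is simulated from title similarity, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It assumes token-level Jaccard similarity as the similarity measure and flips the labels of the items whose titles most resemble a title from a different category. The function names (`tokenize`, `jaccard`, `simulate_similarity_noise`) and the toy data are hypothetical.

```python
def tokenize(title):
    """Lowercase a title and split it into a set of whitespace tokens."""
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def simulate_similarity_noise(titles, labels, noise_rate):
    """Return a copy of `labels` with up to `noise_rate` of them flipped.

    Assumption (not from the paper): an item is a flip candidate if its
    title shares at least one token with a title from another category.
    Candidates with the highest similarity are flipped first, each to the
    category of its most similar other-category title.
    """
    tokens = [tokenize(t) for t in titles]
    candidates = []  # (similarity, item index, label to flip to)
    for i, (tok_i, lab_i) in enumerate(zip(tokens, labels)):
        best_sim, best_label = 0.0, None
        for tok_j, lab_j in zip(tokens, labels):
            if lab_j == lab_i:
                continue  # only titles from other categories can induce a flip
            sim = jaccard(tok_i, tok_j)
            if sim > best_sim:
                best_sim, best_label = sim, lab_j
        if best_label is not None:
            candidates.append((best_sim, i, best_label))

    # Most confusable items first; ties are broken by item index.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    n_flip = min(int(noise_rate * len(labels)), len(candidates))

    noisy = list(labels)
    for _, i, new_label in candidates[:n_flip]:
        noisy[i] = new_label
    return noisy


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    titles = [
        "apple iphone 12 case black",
        "apple iphone 12 64gb black",
        "samsung galaxy s21 128gb",
        "leather phone case brown",
    ]
    labels = ["accessories", "phones", "phones", "accessories"]
    print(simulate_similarity_noise(titles, labels, noise_rate=0.25))
```

Because the flipped label depends on the item's own title, the resulting noise is instance-dependent rather than uniform across a class. The all-pairs comparison is quadratic in the number of titles, so a dataset of realistic size would need an approximate nearest-neighbour search or a vectorized similarity computation instead.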