The label noise transition matrix, which gives the transition probabilities from clean labels to noisy labels, is crucial knowledge for designing statistically robust solutions. Existing estimators of the noise transition matrix, e.g., those based on anchor points or clusterability, focus on computer vision tasks, where high-quality representations are relatively easy to obtain. For other tasks with lower-quality features, however, the uninformative variables may obscure the informative ones and make the anchor-point or clusterability conditions hard to satisfy. We empirically observe failures of these approaches on a number of commonly used datasets. To handle this issue, we propose a generally practical information-theoretic approach that down-weights the less informative parts of the lower-quality features. The salient technical challenge is to compute the relevant information-theoretic metrics using only noisy labels instead of clean ones. We prove that the celebrated $f$-mutual information measure can often preserve the order when calculated using noisy labels. The necessity and effectiveness of the proposed method are also demonstrated by evaluating the estimation error on a varied set of tabular data and text classification tasks with lower-quality features. Code is available at github.com/UCSC-REAL/Est-T-MI.
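To make the two central objects concrete, the following toy sketch builds a binary noise transition matrix `T`, where `T[i][j]` is the probability that clean label `i` is observed as noisy label `j`, and compares the mutual information of two features with the label before and after noise is applied. It is an illustration only, not the estimator proposed in the paper: the matrix entries, the two joint distributions, and the helper names are all hypothetical, and KL-based mutual information is used as one member of the $f$-mutual information family. The example shows a single case in which the ordering of the two features survives label noise; it does not demonstrate the general result.

```python
import math


def mutual_information(joint):
    """KL-based mutual information (in nats) of a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def apply_label_noise(joint, T):
    """Map a joint over (feature, clean label) to (feature, noisy label).

    T[y][y_noisy] = P(noisy label = y_noisy | clean label = y).
    """
    k = len(T)
    return [
        [sum(row[y] * T[y][y_noisy] for y in range(k)) for y_noisy in range(k)]
        for row in joint
    ]


# Hypothetical asymmetric transition matrix; each row sums to 1.
T = [[0.8, 0.2],
     [0.3, 0.7]]

# Hypothetical joints P(feature = x, clean label = y), uniform clean labels.
joints = {
    "informative": [[0.45, 0.05], [0.05, 0.45]],  # feature agrees with label 90% of the time
    "weak": [[0.30, 0.20], [0.20, 0.30]],         # feature agrees with label 60% of the time
}

mi_clean = {name: mutual_information(j) for name, j in joints.items()}
mi_noisy = {
    name: mutual_information(apply_label_noise(j, T)) for name, j in joints.items()
}

for name in joints:
    print(f"{name}: clean MI = {mi_clean[name]:.4f}, noisy MI = {mi_noisy[name]:.4f}")
```

In this toy setting, label noise lowers both mutual information values, but the informative feature still ranks above the weak one when only noisy labels are available. A ranking of this kind is what a down-weighting scheme would rely on.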