Quantification, variously called "supervised prevalence estimation" or "learning to quantify", is the supervised learning task of generating predictors of the relative frequencies (a.k.a. "prevalence values") of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.
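To make the naïve baseline concrete, the following is a minimal sketch, not the authors' code, of what recasting the problem as independent binary problems amounts to in its simplest form: counting, class by class, the fraction of items labelled positive. The function names (`class_prevalences`, `naive_classify_and_count`, `cooccurrence_gap`) and the toy label matrix are hypothetical and chosen only for illustration. The last helper measures how far two classes are from independence, which is the kind of dependency the naïve baseline ignores.

```python
import numpy as np

def class_prevalences(Y):
    """Relative frequency of each class in a binary label matrix Y
    of shape (n_items, n_classes)."""
    Y = np.asarray(Y, dtype=float)
    return Y.mean(axis=0)

def naive_classify_and_count(predicted_labels):
    """Naive baseline (illustrative assumption): treat every class as an
    independent binary problem and report, per class, the fraction of
    items predicted positive."""
    return class_prevalences(predicted_labels)

def cooccurrence_gap(Y, i, j):
    """Joint frequency of classes i and j minus the product of their
    marginal frequencies. It is zero when the two classes are independent
    in the sample; a non-zero value signals dependence."""
    Y = np.asarray(Y, dtype=float)
    joint = np.mean(Y[:, i] * Y[:, j])
    return joint - Y[:, i].mean() * Y[:, j].mean()

# Toy multi-label sample (made up): classes 0 and 1 always co-occur.
Y_toy = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
])

prev = naive_classify_and_count(Y_toy)   # per-class prevalence estimates
gap = cooccurrence_gap(Y_toy, 0, 1)      # positive: classes 0 and 1 are dependent
```

In this toy sample, knowing the prevalence of class 0 tells us the prevalence of class 1 exactly, yet the per-class counting above never uses that information.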