Supervised machine learning relies on large datasets, often with ground-truth labels annotated by humans. While some data points are easy to classify, others are hard, which reduces inter-annotator agreement. This introduces noise for the classifier and may affect users' perception of the classifier's performance. In our research, we investigated whether the classification difficulty of a data point influences how strongly a prediction mistake reduces the "perceived accuracy". In an experimental online study, 225 participants interacted with three fictive classifiers of equal accuracy (73%). The classifiers made prediction mistakes on three different types of data points (easy, difficult, impossible). After the interaction, participants judged the classifiers' accuracy. We found that not all prediction mistakes reduced the perceived accuracy equally. Furthermore, the perceived accuracy differed significantly from the calculated accuracy. We conclude that accuracy and related measures seem unsuitable for representing how users perceive the performance of classifiers.
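To make the distinction concrete: calculated accuracy weights every prediction mistake equally, no matter how difficult the misclassified item was, whereas the study suggests human judgments do not. A minimal sketch below, using entirely hypothetical data (not the study's materials), shows how a 73% calculated accuracy arises regardless of where the mistakes fall.

```python
# Minimal sketch with hypothetical data (not the study's materials).
# Calculated accuracy counts each mistake the same, whether the
# misclassified item was easy, difficult, or impossible to label.

# 30 hypothetical items: (difficulty, prediction_was_correct)
items = (
    [("easy", True)] * 20 + [("easy", False)] * 2
    + [("difficult", True)] * 2 + [("difficult", False)] * 3
    + [("impossible", False)] * 3
)

correct = sum(ok for _, ok in items)
calculated_accuracy = correct / len(items)
print(f"calculated accuracy: {calculated_accuracy:.0%}")  # -> 73%
```

A second hypothetical classifier that instead makes all eight of its mistakes on easy items would report the same 73%, which is why a single accuracy figure cannot capture the difference in perceived performance the study observed.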