Fairness and underspecification in acoustic scene classification: The case for disaggregated evaluations

Underspecification and fairness in machine learning (ML) applications have recently become two prominent issues in the ML community. Acoustic scene classification (ASC) applications have so far remained unaffected by this discussion, but they are increasingly used in real-world systems where fairness and reliability are critical. In this work, we argue for a more holistic evaluation process for ASC models through disaggregated evaluations. This entails taking into account performance differences across several factors, such as city, location, and recording device. Although these factors play a well-understood role in the performance of ASC models, most works report a single evaluation metric aggregated over all strata of a particular dataset. We argue that metrics computed on specific sub-populations of the underlying data contain valuable information about the expected real-world behaviour of proposed systems, and that reporting them could improve the transparency and trustworthiness of such systems. We demonstrate the effectiveness of the proposed evaluation process in uncovering underspecification and fairness problems exhibited by several standard ML architectures when trained on two widely used ASC datasets. Our evaluation shows that all examined architectures exhibit large biases across all factors taken into consideration, and in particular with respect to the recording location. Additionally, different architectures exhibit different biases even though they are trained with the same experimental configuration.

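As a rough illustration of what a disaggregated evaluation involves, the sketch below computes one aggregate accuracy and then the same metric per stratum of a chosen factor. Everything in it is an assumption made for illustration: the record fields (`city`, `device`, `label`, `pred`), the helper names, the toy data, and the choice of accuracy as the metric are not taken from the paper.

```python
from collections import defaultdict


def accuracy(pairs):
    """Fraction of (true, predicted) label pairs that match."""
    pairs = list(pairs)
    return sum(t == p for t, p in pairs) / len(pairs)


def disaggregated_accuracy(records, factor):
    """Group records by `factor` and compute accuracy within each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[factor]].append((r["label"], r["pred"]))
    return {g: accuracy(pairs) for g, pairs in groups.items()}


# Toy, made-up predictions (hypothetical; not results from the paper).
records = [
    {"city": "A", "device": "d1", "label": "park", "pred": "park"},
    {"city": "A", "device": "d1", "label": "metro", "pred": "metro"},
    {"city": "A", "device": "d2", "label": "park", "pred": "metro"},
    {"city": "B", "device": "d1", "label": "metro", "pred": "metro"},
    {"city": "B", "device": "d2", "label": "park", "pred": "metro"},
    {"city": "B", "device": "d2", "label": "metro", "pred": "park"},
]

# Single aggregate number, as commonly reported.
overall = accuracy((r["label"], r["pred"]) for r in records)

# The same metric, broken down by one factor at a time.
by_city = disaggregated_accuracy(records, "city")
by_device = disaggregated_accuracy(records, "device")

# Spread between the best and worst stratum of a factor.
device_gap = max(by_device.values()) - min(by_device.values())

print(overall, by_city, by_device, device_gap)
```

In this toy data the aggregate accuracy of 0.5 hides the fact that every error falls on device `d2`, which is the kind of disparity a single aggregated metric cannot reveal.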