Title: Measuring agreement among several raters classifying subjects into one-or-more (hierarchical) nominal categories. A generalisation of Fleiss' kappa

Cohen's and Fleiss' kappa are well-known measures of inter-rater reliability. However, they only allow a rater to select exactly one category for each subject. This is a severe limitation in some research contexts: for example, measuring the inter-rater reliability of a group of psychiatrists who may each diagnose a patient with multiple disorders is impossible with these measures. This paper proposes a generalisation of the Fleiss' kappa coefficient that lifts this limitation. Specifically, the proposed $\kappa$ statistic measures inter-rater reliability between multiple raters classifying subjects into one-or-more nominal categories. These categories can be weighted according to their importance, and the measure can take into account the category hierarchy (e.g., categories consisting of subcategories that are only available when the main category is chosen, such as a primary psychiatric disorder and its sub-disorders; much more complex dependencies between categories are possible as well). The proposed $\kappa$ statistic can handle missing data and a varying number of raters for subjects or categories. The paper first briefly reviews existing methods that allow raters to classify subjects into multiple categories. Next, we derive our proposed measure step-by-step and prove that it equals Fleiss' kappa when a fixed number of raters each choose exactly one category for each subject. The measure was developed to investigate the reliability of a new mathematics assessment method, of which an example is elaborated. The paper concludes with a worked-out example of psychiatrists diagnosing patients with multiple disorders.
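
For orientation, the classical Fleiss' kappa that the abstract names as the special case can be sketched as follows. This is only the standard single-label formula, $\kappa = (\bar{P} - \bar{P}_e)/(1 - \bar{P}_e)$, and not the generalised measure proposed in the paper. The function name `fleiss_kappa` and the subjects-by-categories count-matrix input are assumptions made for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Classical Fleiss' kappa (illustrative sketch, not the paper's measure).

    counts[i][j] is the number of raters who assigned subject i to
    category j. Assumes every subject is rated by the same number of
    raters (at least two), each choosing exactly one category.
    """
    if not counts:
        raise ValueError("at least one subject is required")
    n_subjects = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per subject are required")
    if any(len(row) != n_categories or sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same categories and rater count")

    # Observed agreement: mean over subjects of the share of agreeing rater pairs.
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / pairs for row in counts
    ) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    total = n_subjects * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_bar - p_e) / (1.0 - p_e)


# Three raters, two subjects, two categories, complete agreement.
print(fleiss_kappa([[3, 0], [0, 3]]))  # → 1.0
```

The sketch rejects input where the number of raters varies between subjects, which is exactly the restriction (along with single-category choice) that the proposed statistic is said to lift.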
