We consider the problem of constructing distribution-free prediction sets for data drawn from two-layer hierarchical distributions. For i.i.d. data, prediction sets can be constructed by conformal prediction, whose validity hinges on the exchangeability of the data. Exchangeability fails when groups of observations come from distinct distributions, as with multiple observations on each patient in a medical database. We extend conformal methods to this hierarchical setting, developing CDF pooling, single subsampling, and repeated subsampling approaches that construct prediction sets in both unsupervised and supervised settings. We compare these approaches in terms of coverage and average set size. If asymptotic coverage is acceptable, we recommend CDF pooling for its balance between empirical coverage and average set size. If coverage guarantees are required, we recommend repeated subsampling.
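To make the single subsampling idea concrete, the following is a minimal sketch for the unsupervised case. It is not the paper's exact algorithm. It assumes that groups are drawn i.i.d., that the new observation comes from a new group, and that drawing one observation per group therefore yields exchangeable values, to which the standard conformal order-statistic rule applies. The function name `single_subsample_interval` and its arguments are hypothetical.

```python
import math
import numpy as np

def single_subsample_interval(groups, alpha=0.1, seed=None):
    """Illustrative sketch: two-sided conformal interval from one draw per group.

    groups: list of 1-D sequences, one per group (e.g. one per patient).
    alpha:  target miscoverage level.
    """
    rng = np.random.default_rng(seed)
    # Keep a single randomly chosen observation from each group, so the
    # retained values are exchangeable under the i.i.d.-groups assumption.
    sample = np.sort([rng.choice(np.asarray(g, dtype=float)) for g in groups])
    k = len(sample)
    # Ranks of the order statistics used as interval endpoints.
    lo_rank = math.floor((k + 1) * alpha / 2)
    hi_rank = math.ceil((k + 1) * (1 - alpha / 2))
    # With too few groups the corresponding endpoint is unbounded.
    lower = sample[lo_rank - 1] if lo_rank >= 1 else -math.inf
    upper = sample[hi_rank - 1] if hi_rank <= k else math.inf
    return lower, upper
```

Because only one observation per group is used, the interval depends on the random draw and discards the rest of each group. Under the stated assumptions, this is the kind of cost that approaches using more of each group's data, such as repeated subsampling or CDF pooling, would aim to reduce.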