Multicalibrated Partitions for Importance Weights

The ratios between the probabilities that two distributions $R$ and $P$ assign to points $x$ are known as importance weights or propensity scores, and they play a fundamental role in many fields, most notably statistics and machine learning. Among their applications, importance weights are central to domain adaptation, anomaly detection, and the estimation of various divergences such as the KL divergence. We consider the common setting where $R$ and $P$ are only given through samples from each distribution. The vast literature on estimating importance weights is either heuristic or makes strong assumptions about $R$ and $P$ or about the importance weights themselves. In this paper, we explore a computational perspective on the estimation of importance weights, which factors in the limitations and possibilities of bounded computational resources. We significantly strengthen previous work that uses the MaxEntropy approach, which defines the importance weights based on the distribution $Q$ closest to $P$ that looks the same as $R$ on every set $C \in \mathcal{C}$, where $\mathcal{C}$ may be a huge collection of sets. We show that the MaxEntropy approach may fail to assign high average scores to sets $C \in \mathcal{C}$, even when the average of the ground-truth weights for the set is evidently large. We similarly show that it may overestimate the average scores of sets $C \in \mathcal{C}$. We therefore formulate Sandwiching bounds as a notion of set-wise accuracy for importance weights. We study these bounds and show that they capture natural completeness and soundness requirements for the weights. We present an efficient algorithm that, under standard learnability assumptions, computes weights satisfying these bounds. Our techniques rely on a new notion of multicalibrated partitions of the domain of the distributions, which appear to be useful objects in their own right.
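To make the basic objects concrete, the following is a minimal sketch of pointwise importance weights $w(x) = R(x)/P(x)$ and their average over a set $C$. It is an illustration only, not the paper's algorithm: the four-point domain, the two toy distributions, and the helper names `importance_weights` and `average_weight_on_set` are all hypothetical, and the sketch assumes that the set average is taken with points drawn from $P$ conditioned on $C$ and that both distributions are known exactly rather than through samples. Under those assumptions the set average of the true weights equals $R(C)/P(C)$.

```python
from fractions import Fraction

# Hypothetical toy distributions over a four-point domain (not from the paper).
P = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 4), "d": Fraction(1, 4)}
R = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}


def importance_weights(R, P):
    """Pointwise ratio w(x) = R(x) / P(x); assumes P(x) > 0 on the whole domain."""
    return {x: R[x] / P[x] for x in P}


def average_weight_on_set(w, P, C):
    """Average of w over the set C, with points drawn from P conditioned on C."""
    mass = sum(P[x] for x in C)
    return sum(P[x] * w[x] for x in C) / mass


w = importance_weights(R, P)
C = {"a", "b"}

# Set-wise average of the true weights, and the ratio of set probabilities.
avg = average_weight_on_set(w, P, C)
ratio = sum(R[x] for x in C) / sum(P[x] for x in C)

print(avg, ratio)
```

Exact rational arithmetic via `fractions.Fraction` is used here so that the identity between the set average and $R(C)/P(C)$ can be checked without floating-point error.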
