Practical Valid Inferences for the Two-Sample Binomial Problem

From arXiv: 41 pages, 8 figures. To appear in Statistics Surveys. Version 2 incorporates changes based on reviewer comments. The main difference is that Sections 8 (Noninferiority and Equivalence Hypotheses) and 12 (Connection to Causal Inferences) of v1 were deleted for length; there was no issue with the correctness of those sections. Version 2 also contains other minor changes and additions, chiefly in Section 7.

Our interest is whether two binomial parameters differ, which parameter is larger, and by how much. This apparently simple problem was addressed by Fisher in the 1930s, and has been the subject of many review papers since then. Yet there continues to be new work on this issue and no consensus solution. Previous reviews have focused primarily on testing and the properties of validity and power, or primarily on confidence intervals, their coverage, and expected length. Here we evaluate both. For example, we consider whether a p-value and its matching confidence interval are compatible, meaning that the p-value rejects at level $\alpha$ if and only if the $1-\alpha$ confidence interval excludes all null parameter values. For focus, we examine only non-asymptotic inferences, so that most of the p-values and confidence intervals are valid (i.e., exact) by construction. Within this focus, we review different methods emphasizing many of the properties and interpretational aspects we desire from applied frequentist inference: validity, accuracy, good power, equivariance, compatibility, coherence, and parameterization and direction of effect. We show that no single method can meet all the desirable properties and give recommendations based on which properties are given more importance.
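
To make the non-asymptotic setting concrete, the following is a minimal Python sketch of the conditional exact test associated with Fisher. It is an illustration under stated assumptions, not the procedure recommended in the paper. The function names, the example counts, and the tie tolerance are hypothetical choices made here. The sketch shows that even for one exact test there is more than one reasonable two-sided p-value (the Fisher-Irwin version and the central version, defined as twice the smaller one-sided p-value), which is one reason compatibility with a given confidence interval has to be checked rather than assumed. Confidence intervals are not computed here.

```python
from math import comb


def conditional_pmf(n1, n2, m):
    """Null distribution of X1 given X1 + X2 = m when p1 == p2 (hypergeometric)."""
    lo, hi = max(0, m - n2), min(n1, m)
    total = comb(n1 + n2, m)
    return {x: comb(n1, x) * comb(n2, m - x) / total for x in range(lo, hi + 1)}


def fisher_exact_pvalues(x1, n1, x2, n2):
    """Exact conditional p-values for H0: p1 == p2, given x1 of n1 and x2 of n2 successes."""
    pmf = conditional_pmf(n1, n2, x1 + x2)
    p_obs = pmf[x1]
    # One-sided p-values: tails of the conditional distribution of X1.
    less = sum(p for x, p in pmf.items() if x <= x1)
    greater = sum(p for x, p in pmf.items() if x >= x1)
    # Fisher-Irwin two-sided p-value: total probability of all tables that are
    # no more probable than the observed one. The relative tolerance for
    # floating-point ties is an assumption of this sketch.
    fisher_irwin = min(1.0, sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-7)))
    # Central two-sided p-value: twice the smaller one-sided p-value, capped at 1.
    central = min(1.0, 2 * min(less, greater))
    return {
        "less": less,
        "greater": greater,
        "fisher_irwin": fisher_irwin,
        "central": central,
    }


if __name__ == "__main__":
    # Hypothetical data: 1 success in 5 trials versus 6 successes in 7 trials.
    for name, value in fisher_exact_pvalues(1, 5, 6, 7).items():
        print(f"{name}: {value:.4f}")
```

For the hypothetical counts above, the two two-sided p-values differ, so a decision at a fixed level $\alpha$ can depend on which definition is paired with the interval being reported.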
