An increasingly important challenge in data analysis is understanding the relationships between subpopulations. Visualization methods that give useful insight into those relationships are popular, especially in bioinformatics. This paper proposes a novel and rigorous approach to quantifying subpopulation relationships, called the Population Difference Criterion (PDC). PDC is simultaneously a quantitative and a visual approach to showing the separation of subpopulations. It uses the subpopulation centers, the variation about those centers, and the relative subpopulation sizes. The motivation for PDC comes from classical permutation-based hypothesis testing, but the idea is taken into non-standard conceptual territory. In particular, the domain of very small P values is seen to provide useful comparisons of data sets. Simulated permutation variation is carefully investigated, and a balanced permutation approach is found to be more informative in high-signal contexts (i.e., large subpopulation differences) than conventional approaches based on all permutations. This result is quite surprising in view of related work in low-signal contexts, which came to the opposite conclusion. This issue is resolved by a proposed adjustment. Permutation variation is also quantified by a proposed bootstrap confidence interval, which is demonstrated to be useful for understanding subpopulation relationships in cancer data.
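To make the contrast between all permutations and balanced permutations concrete, the following is a minimal sketch under stated assumptions, not the paper's PDC definition. It assumes univariate data, a difference-of-means statistic, and a z-score against the permutation distribution as a way to compare separations where empirical P values would all be zero. The function names (`permutation_stats`, `z_score`, `bootstrap_ci`) and the rule used to build a balanced relabeling are hypothetical choices made for illustration.

```python
import random
import statistics


def mean_difference(g1, g2):
    """Illustrative statistic (an assumption): difference of group means."""
    return statistics.fmean(g1) - statistics.fmean(g2)


def permutation_stats(x, y, n_perm=1000, balanced=False, seed=0):
    """Statistic values under random relabelings of the pooled data.

    balanced=False: any relabeling that preserves the group sizes.
    balanced=True:  each new group draws from the two original groups in
                    proportion to their sizes (assumed balancing rule).
    """
    rng = random.Random(seed)
    x, y = list(x), list(y)
    n1, n2 = len(x), len(y)
    k = round(n1 * n1 / (n1 + n2))  # members of new group 1 taken from x
    pooled = x + y
    stats = []
    for _ in range(n_perm):
        if balanced:
            sx = rng.sample(x, n1)  # shuffled copy of x
            sy = rng.sample(y, n2)  # shuffled copy of y
            g1 = sx[:k] + sy[:n1 - k]
            g2 = sx[k:] + sy[n1 - k:]
        else:
            shuffled = rng.sample(pooled, n1 + n2)
            g1, g2 = shuffled[:n1], shuffled[n1:]
        stats.append(mean_difference(g1, g2))
    return stats


def z_score(observed, stats):
    """Standardize the observed statistic against the permutation values."""
    return (observed - statistics.fmean(stats)) / statistics.stdev(stats)


def bootstrap_ci(observed, stats, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval for the z-score, resampling the
    permutation statistics to reflect permutation (Monte Carlo) variation."""
    rng = random.Random(seed)
    zs = sorted(
        z_score(observed, rng.choices(stats, k=len(stats)))
        for _ in range(n_boot)
    )
    alpha = 1.0 - level
    lo = zs[int(alpha / 2 * n_boot)]
    hi = zs[min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)]
    return lo, hi


if __name__ == "__main__":
    data_rng = random.Random(1)
    x = [data_rng.gauss(3.0, 1.0) for _ in range(20)]  # high-signal example
    y = [data_rng.gauss(0.0, 1.0) for _ in range(20)]
    observed = mean_difference(x, y)
    for balanced in (False, True):
        stats = permutation_stats(x, y, balanced=balanced)
        print(balanced, z_score(observed, stats),
              bootstrap_ci(observed, stats))
```

In this sketch, a balanced relabeling cancels the difference between the original group means, so the spread of the permutation statistics reflects within-group variation only. With a large true difference, that gives a larger z-score than relabeling over all permutations, which is consistent with the high-signal behavior the abstract describes.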