This article develops new tools and new statistical theory for a statistical problem we call Scale Reliant Inference (SRI). Many scientific fields collect multivariate data that lack scale: the size, sum, or total of each measurement is arbitrary and does not represent the scale of the underlying system being measured. For example, in the analysis of high-throughput sequencing data, it is well known that the number of sequencing reads (the sequencing depth) varies substantially due to non-biological (technical) factors. This article develops a formal problem statement for SRI that unifies problems seen in multiple scientific fields. Informally, we define SRI as an estimation problem in which an estimand of interest cannot be uniquely identified because the observed data lack scale information. This problem statement represents a reformulation of the related field of Compositional Data Analysis and allows us to prove fundamental limits on SRI. For example, we prove that inferential criteria such as consistency, calibration, and unbiasedness are unattainable for common SRI tasks. Moreover, we show that common methods often applied to SRI implicitly assume infinite knowledge of the system scale and can lead to a troubling phenomenon termed unacknowledged bias. Counter-intuitively, we show that this problem worsens with more data and can lead to substantially elevated Type-I and Type-II error rates. Still, we show that rigorous statistical inference is possible so long as models acknowledge the fundamental uncertainty in the system scale. We introduce a class of models we call Scale Simulation Random Variables (SSRVs) as a flexible, rigorous, and computationally efficient approach to SRI.
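The informal definition of SRI can be illustrated with a minimal sketch. The numbers and function names below are hypothetical and are not taken from the article; they only show that two systems with different total scale can yield identical proportions while implying different log fold changes, so an estimand that depends on scale cannot be recovered from proportions alone.

```python
import math


def proportions(abundances):
    """Drop scale: keep only each component's share of the total."""
    total = sum(abundances)
    return [a / total for a in abundances]


def log2_fold_change(before, after):
    """Per-component log2 ratio of absolute abundances (a scale-reliant estimand)."""
    return [math.log2(b / a) for a, b in zip(before, after)]


# Hypothetical absolute abundances for three components (illustrative only).
baseline = [100.0, 200.0, 700.0]

# Two candidate "after" states that differ only in total scale.
scenario_1 = [200.0, 200.0, 600.0]
scenario_2 = [2.0 * x for x in scenario_1]  # same composition, twice the total

# Scale-free data (proportions) cannot tell the two scenarios apart.
same_composition = all(
    math.isclose(p, q)
    for p, q in zip(proportions(scenario_1), proportions(scenario_2))
)

# The scale-reliant estimand differs between them by log2 of the scale ratio.
lfc_1 = log2_fold_change(baseline, scenario_1)
lfc_2 = log2_fold_change(baseline, scenario_2)

print("identical proportions:", same_composition)
print("log2 fold changes, scenario 1:", lfc_1)
print("log2 fold changes, scenario 2:", lfc_2)
```

In this toy setting, every log fold change shifts by the same constant when the unobserved total doubles, which is one way to see why a method that fixes the scale implicitly is making an assumption the data cannot check.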