In this article, we study the binomial mixture model in the regime where the binomial size $m$ can be relatively large compared to the sample size $n$. This project is motivated by the GeneFishing method (Liu et al., 2019), whose output is a combination of the parameter of interest and the subsampling noise. To tackle the noise in the output, we exploit the observation that the density of the output is U-shaped and model the output with a binomial mixture model under a U-shape constraint. We first analyze the estimation of the underlying distribution $F$ in the binomial mixture model under various conditions on $F$. Equipped with this theoretical understanding, we propose a simple method, Ucut, to identify the cutoffs of the U shape and recover the underlying distribution based on the Grenander estimator (Grenander, 1956). We show that when $m = \Omega(n^{2/3})$, the identified cutoffs converge at the rate $O(n^{-1/3})$. The $L_1$ distance between the recovered distribution and the true one decreases at the same rate. To demonstrate the performance, we apply our method to a variety of simulation studies, a GTEx dataset used in Liu et al. (2019), and a single-cell dataset from Tabula Muris.
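To make the setting concrete, the following is a minimal sketch, not the Ucut procedure itself. It simulates a binomial mixture, where each $p_i$ is drawn from $F$ and $X_i \sim \mathrm{Binomial}(m, p_i)$, and it computes the Grenander estimator of a nonincreasing density as the slopes of the least concave majorant of the empirical CDF. The choice $F = \mathrm{Beta}(0.5, 0.5)$ as a U-shaped example and the function names `simulate_binomial_mixture` and `grenander` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def simulate_binomial_mixture(n, m, rng, a=0.5, b=0.5):
    """Draw p_i ~ Beta(a, b), then X_i ~ Binomial(m, p_i); return (p, X/m).

    Beta(a, b) with a, b < 1 is an assumed stand-in for a U-shaped F.
    """
    p = rng.beta(a, b, size=n)
    x = rng.binomial(m, p)
    return p, x / m


def grenander(samples):
    """Grenander estimator of a nonincreasing density on (0, inf).

    Returns (knots, heights): the estimate equals heights[j]
    on the interval (knots[j], knots[j + 1]].
    """
    xs, counts = np.unique(np.asarray(samples, dtype=float), return_counts=True)
    if xs[0] <= 0:
        raise ValueError("samples must be strictly positive")
    # Points of the empirical CDF, starting from the origin.
    px = np.concatenate(([0.0], xs))
    py = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))
    hull = [0]  # indices of the vertices of the least concave majorant
    for i in range(1, len(px)):
        while len(hull) >= 2:
            u, v = hull[-2], hull[-1]
            # If slope(u, v) <= slope(v, i), vertex v lies on or below
            # the chord from u to i, so it is not on the majorant.
            if (py[v] - py[u]) * (px[i] - px[v]) <= (py[i] - py[v]) * (px[v] - px[u]):
                hull.pop()
            else:
                break
        hull.append(i)
    knots = px[hull]
    heights = np.diff(py[hull]) / np.diff(knots)
    return knots, heights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, y = simulate_binomial_mixture(n=2000, m=200, rng=rng)
    # Estimate the decreasing left branch from the strictly positive outputs
    # below 0.5 (the 0.5 split is an arbitrary choice for this illustration).
    left = y[(y > 0) & (y < 0.5)]
    knots, heights = grenander(left)
    print(len(knots), heights[:3])
```

The estimator returns a step function whose heights decrease from left to right and integrate to one over the range of the data. An increasing branch can be handled by the same routine after reflecting the data, for example by passing `1 - y` for outputs near one.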