The objective of a genome-wide association study (GWAS) is to associate subsequences of individuals' genomes to the observable characteristics called phenotypes (e.g., high blood pressure). Motivated by the GWAS problem, in this paper we introduce the information-theoretic problem of \emph{associated subsequence retrieval}, where a dataset of $N$ (possibly high-dimensional) sequences of length $G$, and their corresponding observable (binary) characteristics is given. The sequences are chosen independently and uniformly at random from $\mathcal{X}^G$, where $\mathcal{X}$ is a finite alphabet. The observable (binary) characteristic is only related to a specific unknown subsequence of length $L$ of the sequences, called \textit{associated subsequence}. For each sequence, if the associated subsequence of it belongs to a universal finite set, then it is more likely to display the observable characteristic (i.e., it is more likely that the observable characteristic is one). The goal is to retrieve the associated subsequence using a dataset of $N$ sequences and their observable characteristics. We demonstrate that as the parameters $N$, $G$, and $L$ grow, a threshold effect appears in the curve of probability of error versus the rate which is defined as ${Gh(L/G)}/{N}$, where $h(\cdot)$ is the binary entropy function. This effect allows us to define the capacity of associated subsequence retrieval. We develop an achievable scheme and a matching converse for this problem, and thus characterize its capacity in two scenarios: the zero-error-rate and the $\epsilon$-error-rate.
翻译:基因组整体关联研究(GWAS)的目标是将个人基因组的子序列(可能是高维)序列(GWAS)和相应的可观测(二进制)特性联系起来。序列是从$\ mathcal{X ⁇ G$(例如高血压)随机独立选择的。受GWAS问题驱动的。在本文中,我们引入了计算序列长度为$L$的具体未知子序列问题,称为\textit{相关的子序列}。对于每个序列,如果相关的子序列属于通用定值,则更可能显示可观测特性(例如,与美元相比,美元为美元/X ⁇ G$(X ⁇ G$)的随机随机选择序列,其中$\ mathcal{高血压{X}美元是一个固定的字母。观察结果(binteral)特性仅与该序列长度为$L$(美元)的未知的子序列相联。当我们以美元/美元值的直径比值表示其值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值值