Recent technological advancements have led to the rapid generation of high-throughput biological data, which can be used to address novel scientific questions in broad areas of research. These data can be thought of as a large matrix with covariates annotating both its rows and its columns. Matrix linear models provide a convenient way to model such data. In many situations, sparse estimation of these models is desired. We present fast, general methods for fitting sparse matrix linear models to structured high-throughput data. We induce model sparsity using an $L_1$ penalty and consider the case when the response matrix and the covariate matrices are large. Because of the data size, standard methods for estimating these penalized regression models fail when the problem is converted to the corresponding univariate regression problem. By leveraging matrix properties in the structure of our model, we develop several fast estimation algorithms (coordinate descent, FISTA, and ADMM) and discuss their trade-offs. We evaluate the performance of our methods on simulated data, E. coli chemical genetic screening data, and two Arabidopsis genetic datasets with multivariate responses. Our algorithms have been implemented in the Julia programming language and are available at https://github.com/senresearch/MatrixLMnet.jl.
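
To make the computational point concrete, the following sketch writes out the model and one estimation step under assumed notation, none of which is fixed by the abstract itself: $Y$ ($n \times m$) is the response matrix, $X$ ($n \times p$) and $Z$ ($m \times q$) hold the row and column covariates, $B$ ($p \times q$) is the coefficient matrix, $E$ is the error matrix, and $\lambda \ge 0$ is the penalty weight. The update shown is a plain proximal-gradient step, the building block that FISTA accelerates; it is an illustration of the idea, not necessarily the exact form used in the implementation.

$$
\begin{aligned}
Y &= X B Z^\top + E, \\
\hat{B} &= \arg\min_{B} \; \tfrac{1}{2}\,\lVert Y - X B Z^\top \rVert_F^2 + \lambda \lVert B \rVert_1, \\
\operatorname{vec}(Y) &= (Z \otimes X)\,\operatorname{vec}(B) + \operatorname{vec}(E), \\
B^{(k+1)} &= S_{t\lambda}\!\left( B^{(k)} + t\, X^\top \bigl( Y - X B^{(k)} Z^\top \bigr) Z \right),
\qquad S_{\tau}(b) = \operatorname{sign}(b)\,\max(\lvert b \rvert - \tau,\, 0).
\end{aligned}
$$

The third line is the equivalent univariate regression: its design matrix $Z \otimes X$ has $nm$ rows and $pq$ columns, which is why forming it explicitly becomes infeasible for large data. The fourth line needs only products with $X$, $Z$, and their transposes, with the soft-thresholding operator $S_\tau$ applied elementwise. A step size $t \le 1 / (\lVert X \rVert_2^2 \, \lVert Z \rVert_2^2)$ is a safe choice, since the largest eigenvalue of $(Z^\top Z) \otimes (X^\top X)$ is the product of the largest eigenvalues of its two factors.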