Genome-wide studies leveraging recent high-throughput sequencing technologies collect high-dimensional data. However, they usually include small cohorts of patients, and the resulting tabular datasets suffer from the "curse of dimensionality". Training neural networks on such datasets is typically unstable, and the models overfit. One problem is that modern weight initialisation strategies make simplistic assumptions unsuitable for small datasets. We propose Graph-Conditioned MLP (GC-MLP), a novel method to introduce priors on the parameters of an MLP. Instead of randomly initialising the first layer, we condition it directly on the training data. More specifically, we create a graph for each feature in the dataset (e.g., a gene), where each node represents a sample from the same dataset (e.g., a patient). We then use Graph Neural Networks (GNNs) to learn embeddings from these graphs and use the embeddings to initialise the MLP's parameters. Our approach opens the prospect of introducing additional biological knowledge when constructing the graphs. We present early results on 7 classification tasks from gene expression data and show that GC-MLP outperforms an MLP.
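To make the data flow concrete, the sketch below illustrates the conditioning idea under stated assumptions; it is not the authors' implementation. The k-nearest-neighbour graph construction, the single mean-aggregation message-passing layer, and the column-wise initialisation of the first layer are all choices made for the example, and in the actual method the GNN embeddings are learned rather than produced by an untrained network.

```python
# Illustrative sketch of the GC-MLP idea: one graph per feature (gene), whose
# nodes are training samples (patients); a GNN pools the graph into a feature
# embedding that initialises the corresponding column of the MLP's first layer.
# Graph construction, GNN architecture, and pooling are assumptions for this sketch.
import torch
import torch.nn as nn


def knn_edges(values: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Connect each sample (node) to its k nearest neighbours by |value_i - value_j|."""
    dist = (values[:, None] - values[None, :]).abs()
    dist.fill_diagonal_(float("inf"))
    nbrs = dist.topk(k, largest=False).indices               # (n_samples, k)
    src = torch.arange(values.numel()).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])              # (2, n_samples * k)


class SimpleGNN(nn.Module):
    """One round of mean-aggregation message passing followed by a linear map."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.lin = nn.Linear(2, hidden_dim)                  # [own value, neighbour mean]

    def forward(self, values: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        src, dst = edge_index
        agg = torch.zeros_like(values).index_add_(0, src, values[dst])
        deg = torch.zeros_like(values).index_add_(0, src, torch.ones_like(dst, dtype=values.dtype))
        h = torch.stack([values, agg / deg.clamp(min=1)], dim=1)
        node_emb = torch.relu(self.lin(h))                   # (n_samples, hidden_dim)
        return node_emb.mean(dim=0)                          # pooled per-feature embedding


class GCMLP(nn.Module):
    """MLP whose first-layer weights are conditioned on per-feature graph embeddings."""

    def __init__(self, x_train: torch.Tensor, hidden_dim: int, n_classes: int, k: int = 5):
        super().__init__()
        n_samples, n_features = x_train.shape
        gnn = SimpleGNN(hidden_dim)
        # One graph per feature: nodes are training samples, the node signal is
        # that feature's value; the pooled embedding becomes one column of W1.
        cols = [gnn(x_train[:, j], knn_edges(x_train[:, j], k)) for j in range(n_features)]
        self.first = nn.Linear(n_features, hidden_dim)
        with torch.no_grad():
            self.first.weight.copy_(torch.stack(cols, dim=1))  # (hidden_dim, n_features)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden_dim, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.first(x))


# Usage on toy data (shapes only; real inputs would be gene expression matrices).
x_train = torch.randn(64, 200)                               # 64 patients, 200 genes
model = GCMLP(x_train, hidden_dim=32, n_classes=2)
logits = model(x_train)                                      # (64, 2)
```

The sketch only shows where the data-dependent prior enters: because each column of the first-layer weight matrix corresponds to one input feature, a per-feature graph embedding of size hidden_dim can replace the usual random initialisation of that column, which is where biological knowledge about gene relationships could also be injected when building the graphs.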