We are in an era in which Big Data analytics has changed how biomedical phenomena are interpreted, and as the volume of generated data increases, so does the need for new machine learning methods that can handle it. An indicative example is single-cell RNA-seq (scRNA-seq), an emerging sequencing technology with promising capabilities but significant computational challenges due to the large scale of the data it generates. For classifying scRNA-seq data, the k Nearest Neighbor (kNN) classifier is an appropriate method, since it is commonly used for large-scale prediction tasks owing to its simplicity, minimal parameterization, and model-free nature. However, the ultra-high dimensionality that characterizes scRNA-seq imposes a computational bottleneck, while predictive power can be affected by the "Curse of Dimensionality". In this work, we propose the use of approximate nearest neighbor search algorithms for kNN classification of scRNA-seq data, focusing on a particular methodology tailored for high-dimensional data. We argue that even relaxed approximate solutions will not significantly affect prediction performance. The experimental results confirm this assumption and suggest potential for broader applicability.
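As a rough illustration of the idea that a relaxed, approximate neighbor search can still support kNN classification in high dimensions, here is a minimal sketch. It is not the methodology of this work, which the abstract does not specify. It assumes a simple Gaussian random projection as the approximation step and uses synthetic data in place of real scRNA-seq profiles. All names (`random_projection`, `approx_knn_predict`, `make_cells`) and all parameter values are hypothetical.

```python
import random
from collections import Counter


def random_projection(dim_in, dim_out, rng):
    """Gaussian random matrix (dim_out x dim_in), scaled by 1/sqrt(dim_out)."""
    scale = 1.0 / dim_out ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, x):
    """Map a high-dimensional vector x into the low-dimensional space."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def approx_knn_predict(train_low, train_labels, query_low, k):
    """Majority vote over the k nearest training points in the projected space.

    Neighbors are exact in the projected space, hence only approximate
    with respect to distances in the original space.
    """
    order = sorted(range(len(train_low)),
                   key=lambda i: sq_dist(train_low[i], query_low))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def make_cells(n, dim, shift, rng):
    """Synthetic stand-in for expression profiles: noise plus a class shift."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


# Assumed toy setup: 500 features, projected down to 10, k = 5.
rng = random.Random(0)
dim, low_dim, k = 500, 10, 5
train = make_cells(40, dim, 0.0, rng) + make_cells(40, dim, 3.0, rng)
train_labels = [0] * 40 + [1] * 40
test = make_cells(10, dim, 0.0, rng) + make_cells(10, dim, 3.0, rng)
test_labels = [0] * 10 + [1] * 10

matrix = random_projection(dim, low_dim, rng)
train_low = [project(matrix, x) for x in train]
predictions = [approx_knn_predict(train_low, train_labels, project(matrix, x), k)
               for x in test]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(f"accuracy in projected space: {accuracy:.2f}")
```

The neighbor search here runs over 10 coordinates instead of 500, which is where the computational saving would come from. Whether accuracy holds up on real scRNA-seq data depends on the data and on the approximation method actually used.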