A protein's function is inherently linked to its localization within the cell, and fluorescence microscopy data are an indispensable resource for learning protein representations. Despite major developments in molecular representation learning, extracting functional information from biological images remains a non-trivial computational task. Current state-of-the-art approaches use autoencoder models to learn high-quality features by reconstructing images. However, such methods are prone to capturing noise and imaging artifacts. In this work, we revisit deep learning models used for classifying major subcellular localizations and evaluate the representations extracted from their final layers. We show that simple convolutional networks trained on localization classification can learn protein representations that encapsulate diverse functional information and significantly outperform autoencoder-based models. We also propose a robust evaluation strategy to assess the quality of protein representations across different scales of biological function.