Machine learning has grown extensively in recent years and is now routinely applied in sensitive areas. To allow appropriate verification of predictive models before deployment, models must be deterministic. However, major machine learning libraries default to non-deterministic algorithms based on atomic operations, so fixing all random seeds alone is not sufficient for deterministic machine learning. To overcome this shortcoming, various machine learning libraries have released deterministic counterparts to the non-deterministic algorithms. We evaluated the effect of these algorithms on determinism and runtime. Based on the results, we formulated a set of requirements for deterministic machine learning and developed a new software solution, the mlf-core ecosystem, which helps machine learning projects meet and maintain these requirements. We applied mlf-core to develop deterministic models in various biomedical fields, including a single-cell autoencoder with TensorFlow, a PyTorch-based U-Net model for liver-tumor segmentation in CT scans, and a liver cancer classifier based on gene expression profiles with XGBoost.
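The point that seeds alone do not guarantee determinism can be illustrated with a minimal sketch. It is not taken from mlf-core or from any library; the function name `accumulate` and the input values are assumptions chosen for illustration. Floating-point addition is not associative, so a reduction whose accumulation order depends on thread scheduling, as happens with atomic adds, can return different results for identical inputs.

```python
def accumulate(values, order):
    """Sum `values` in the given index order.

    `order` stands in for the order in which parallel threads happen to
    apply their atomic adds (a hypothetical schedule, for illustration).
    """
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Identical inputs in both runs; no randomness is involved at all.
values = [1e16, 1.0, -1e16]

# Schedule A: (1e16 + 1.0) - 1e16. The 1.0 is lost to rounding.
schedule_a = accumulate(values, [0, 1, 2])

# Schedule B: (1e16 - 1e16) + 1.0. The 1.0 survives.
schedule_b = accumulate(values, [0, 2, 1])

print(schedule_a, schedule_b)    # 0.0 1.0
print(schedule_a == schedule_b)  # False

# Fixing the accumulation order restores run-to-run reproducibility.
print(accumulate(values, [0, 1, 2]) == schedule_a)  # True
```

Deterministic counterparts of such algorithms fix the accumulation order, which is one plausible reason they can differ in runtime from the atomic-based defaults.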