TRAPDOOR: Repurposing backdoors to detect dataset bias in machine learning-based genomic analysis

Machine Learning (ML) has achieved unprecedented performance in several applications, including image, speech, text, and data analysis. The use of ML to understand underlying patterns in gene mutations (genomics) has far-reaching results, not only in overcoming diagnostic pitfalls but also in designing treatments for life-threatening diseases like cancer. The success and sustainability of ML algorithms depend on the quality and diversity of the data collected and used for training. Under-representation of groups (ethnic groups, gender groups, etc.) in such a dataset can lead to inaccurate predictions for those groups, which can further exacerbate systemic discrimination. In this work, we propose TRAPDOOR, a methodology for identifying biased datasets by repurposing a technique that has mostly been proposed for nefarious purposes: neural network backdoors. We consider a typical collaborative learning setting in the genomics supply chain, where data may come from hospitals, collaborative projects, or research institutes to a central cloud, with no awareness of bias against a sensitive group. In this context, we develop a methodology that uses ML backdooring tailored to genomic applications to leak potential bias information about the collective data without hampering genuine performance. Using a real-world cancer dataset, we analyze both the bias that already exists in the dataset towards white individuals and biases that we introduce artificially. Our experimental results show that TRAPDOOR can detect the presence of dataset bias with 100% accuracy and can also extract the extent of bias by recovering the percentage with a small error.
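To make the notions of "presence of bias" and "extent of bias" concrete, the following is a minimal sketch of how group representation in a pooled dataset could be quantified and how the error of a recovered percentage could be scored. It is not the TRAPDOOR backdoor mechanism, which the abstract does not detail. The function names, the group labels, and the 30% threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of the dataset as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {group: 100.0 * n / total for group, n in counts.items()}


def is_under_represented(labels, group, threshold_pct):
    """Flag `group` if its share falls below `threshold_pct`.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return group_shares(labels).get(group, 0.0) < threshold_pct


def recovery_error(true_pct, recovered_pct):
    """Absolute error, in percentage points, of a recovered group share."""
    return abs(true_pct - recovered_pct)


# Synthetic labels for illustration only: 80 samples from group "A", 20 from "B".
labels = ["A"] * 80 + ["B"] * 20
shares = group_shares(labels)
flagged = is_under_represented(labels, "B", threshold_pct=30.0)
error = recovery_error(shares["B"], recovered_pct=18.5)
```

Here `shares` holds the true group percentages, `flagged` indicates whether group "B" falls below the assumed threshold, and `error` measures how far a hypothetical recovered percentage lies from the true share.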
