DNA motif discovery is an important problem in gene research. It aims to identify transcription factor binding sites (i.e., motifs) in DNA sequences in order to reveal the mechanisms that regulate gene expression. However, data silos and the risk of privacy leakage have seriously hindered progress in DNA motif discovery. On the one hand, data silos make data collection difficult. On the other hand, because DNA is sensitive private information, collecting and using DNA data is complicated. How to discover DNA motifs while ensuring privacy and security and alleviating data silos has therefore become an important question. This paper proposes a novel method, DP-FLMD, to address this problem. Note that DP-FLMD is the first application of federated learning to the field of genetics research. Federated learning is used to address the data silo problem: it enables multiple participants to train models together while providing privacy protection. To address the communication cost of federated learning, DP-FLMD applies a sampling method and a communication-reduction strategy. In addition, DP-FLMD applies differential privacy, a privacy protection technique with rigorous mathematical guarantees. Experiments on DNA datasets show that DP-FLMD achieves high mining accuracy and runtime efficiency, and that its performance is affected by several parameters.
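The abstract names three ingredients (federated participants, participant sampling, and differential privacy) without giving the algorithm. The following is a minimal sketch of how such ingredients can fit together in general; it is not the DP-FLMD method itself. All function names, the choice of k-mer support counts as the shared statistic, the Laplace mechanism, and the uniform sampling of participants are assumptions made for illustration only.

```python
import random
from collections import Counter


def local_kmer_counts(sequences, k):
    """For each k-mer, count how many local sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most 1 to any k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_report(sequences, candidates, k, epsilon, rng):
    """Participant side: release noisy support counts for the candidate motifs.

    Assumption: adding or removing one sequence changes each candidate's
    count by at most 1, so the L1 sensitivity is at most len(candidates).
    """
    counts = local_kmer_counts(sequences, k)
    scale = len(candidates) / epsilon
    return {c: counts[c] + laplace_noise(scale, rng) for c in candidates}


def federated_round(participants, candidates, k, epsilon, sample_fraction, rng):
    """Server side: sample a fraction of participants and sum their noisy reports.

    Sampling fewer participants per round is one (assumed) way to reduce
    communication; raw sequences never leave a participant.
    """
    n_pick = max(1, round(sample_fraction * len(participants)))
    chosen = rng.sample(participants, n_pick)
    totals = {c: 0.0 for c in candidates}
    for sequences in chosen:
        report = noisy_report(sequences, candidates, k, epsilon, rng)
        for c in candidates:
            totals[c] += report[c]
    return totals


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: two participants, each holding two short sequences.
    participants = [["ACGTAC", "GGACGT"], ["TTTTTT", "ACGTTT"]]
    totals = federated_round(participants, ["ACGT", "TTTT"], k=4,
                             epsilon=1.0, sample_fraction=1.0, rng=rng)
    print(max(totals, key=totals.get), totals)
```

In this sketch a smaller `epsilon` gives stronger privacy but noisier counts, and a smaller `sample_fraction` lowers communication at the cost of seeing less data per round, which is one plausible reading of the abstract's remark that performance depends on several parameters.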