Artificial Intelligence (AI) approaches to phishing email detection depend primarily on large-scale centralized datasets, which exposes them to a range of privacy, trust, and legal issues. Moreover, organizations are loath to share emails, given the risk of leaking commercially sensitive information. As a result, it is rarely possible to obtain enough emails to train a global AI model efficiently. Privacy-preserving distributed and collaborative machine learning, particularly Federated Learning (FL), is therefore highly desirable. Although FL is already prevalent in the healthcare sector, questions remain regarding the effectiveness and efficacy of FL-based phishing detection in multi-organization collaborations. To the best of our knowledge, this work is the first to investigate the use of FL in email anti-phishing. This paper builds upon deep neural network models, specifically RNN and BERT, for phishing email detection. It analyzes learning performance under FL in various settings, including balanced and asymmetric data distributions. Our results show that, for balanced datasets and low organization counts, FL achieves performance in phishing email detection comparable to centralized learning. We also observe that performance varies as the number of organizations increases. For a fixed total email dataset, the accuracy of the global RNN-based model drops by 1.8% when the number of organizations increases from 2 to 10. In contrast, BERT accuracy rises by 0.6% when going from 2 to 5 organizations. However, if the overall email dataset is allowed to grow as new organizations join the FL framework, organization-level performance improves through faster convergence. Finally, when the email dataset distribution is highly asymmetric, the overall performance of the global FL model suffers because its outputs become highly unstable.