Deep learning transformer models have become important through training on text data with self-attention mechanisms. This manuscript demonstrates a novel universal spam detection model that uses Google's pre-trained Bidirectional Encoder Representations from Transformers (BERT) base uncased model with four datasets to efficiently classify emails as ham or spam in real-time scenarios. Models were first trained individually on the Enron, SpamAssassin, Lingspam, and Spamtext message classification datasets, from which a single model with acceptable performance on all four datasets was obtained. The Universal Spam Detection Model (USDM) was trained on all four datasets, leveraging the hyperparameters from each individual model; the combined model was fine-tuned with the same hyperparameters taken from these four models separately. When each individual model was evaluated on its corresponding dataset, its F1-score was at or above 0.9. The USDM reached an overall accuracy of 97%, with an F1-score of 0.96. Research results and implications are discussed.
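The abstract does not specify the implementation details. As a minimal sketch, the snippet below shows how a BERT base uncased checkpoint with a binary classification head could be set up for ham/spam inference, assuming the Hugging Face transformers library; the library choice and the label mapping (0 = ham, 1 = spam) are assumptions, not stated in the manuscript, and the classification head shown here would first need fine-tuning on the four datasets as the paper describes.

```python
# Minimal sketch: BERT base uncased with a 2-way classification head
# for ham/spam inference, assuming the Hugging Face `transformers` library.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained BERT base uncased checkpoint. num_labels=2 attaches
# a binary classification head (label mapping 0 = ham, 1 = spam is an
# assumption for illustration).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.eval()

# Classify a sample message. The head here is randomly initialized; in the
# paper's setup it would be fine-tuned on the Enron, SpamAssassin, Lingspam,
# and Spamtext datasets before use.
text = "Congratulations! You have won a free prize. Click now."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prediction = "spam" if logits.argmax(dim=-1).item() == 1 else "ham"
print(prediction)
```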