In this paper, we assess the viability of transformer models in end-to-end InfoSec settings, in which no intermediate feature representations or processing steps occur outside the model. We implement transformer models for two distinct InfoSec data formats, URLs and PE files, in a novel end-to-end approach, and explore a variety of architectural designs, training regimes, and experimental settings to determine the ingredients necessary for performant detection models. We show that, in contrast to conventional transformers trained on more standard NLP-related tasks, our URL transformer model requires a different training approach to reach high performance levels. Specifically, we show that 1) pre-training on a massive corpus of unlabeled URL data for an auto-regressive task does not readily transfer to binary classification of malicious or benign URLs, but 2) using an auxiliary auto-regressive loss improves performance when training from scratch. We introduce a method for mixed objective optimization that dynamically balances the contributions of both loss terms so that neither dominates. We show that this method yields quantitative evaluation metrics comparable to those of several top-performing benchmark classifiers. Unlike URLs, binary executables contain longer and more distributed sequences of information-rich bytes. To accommodate such lengthy byte sequences, we introduce additional context length into the transformer by providing its self-attention layers with an adaptive span similar to Sukhbaatar et al. We demonstrate that this approach performs comparably to well-established malware detection models on benchmark PE file datasets, but also point out the need for further exploration into improvements in model scalability and compute efficiency.
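The mixed objective optimization mentioned above can be sketched as follows. This is a minimal illustration only: the abstract does not specify the balancing rule, so the magnitude-matching scheme and the function name `mixed_objective_loss` here are assumptions, not the paper's actual method.

```python
def mixed_objective_loss(cls_loss, ar_loss, eps=1e-8):
    """Combine a binary-classification loss with an auxiliary auto-regressive
    loss, rescaling the auxiliary term to the magnitude of the classification
    term so that neither dominates the total objective.

    Illustrative sketch: the magnitude-matching rule is an assumption. In a
    real training loop the scale factor would be computed from detached
    (non-gradient-tracked) loss values.
    """
    # Scale the auto-regressive loss so its contribution matches the
    # current magnitude of the classification loss.
    scale = cls_loss / (ar_loss + eps)
    return cls_loss + scale * ar_loss
```

With this rule, the rescaled auxiliary term always contributes the same magnitude as the classification term, so the combined loss is roughly twice the classification loss regardless of how large the raw auto-regressive loss is.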