PowerShell is a command-line shell that is widely used in organizations for configuration management and task automation. Unfortunately, PowerShell is also increasingly used by cybercriminals to launch attacks against organizations, mainly because it is pre-installed on Windows machines and exposes powerful functionality that attackers can leverage. This makes the problem of detecting malicious PowerShell scripts both urgent and challenging. We address this problem by presenting several novel deep-learning-based detectors of malicious PowerShell scripts. Our best model obtains a true positive rate of nearly 90% while maintaining a false positive rate of less than 0.1%, indicating that it can be of practical value. Our models employ pre-trained contextual embeddings of words from the PowerShell "language". A contextual word embedding projects semantically similar words to nearby vectors in the embedding space. A known problem in the cybersecurity domain is that labeled data is scarce relative to unlabeled data, which makes it difficult to devise effective supervised detectors for many types of malicious activity. This is also the case with PowerShell scripts. Our work shows that this problem can be largely mitigated by learning a pre-trained contextual embedding from unlabeled data. We trained our models' embedding layer on a script dataset enriched with a large corpus of unlabeled PowerShell scripts collected from public repositories. As established by our performance analysis, using unlabeled data for the embedding significantly improved the performance of our detectors. We expect that pre-trained contextual embeddings based on unlabeled data will find additional applications in improving classification accuracy in the cybersecurity domain.
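The headline result pairs a true positive rate with a false positive rate cap, so a short illustration of how such a figure is typically computed may help. The sketch below is not the authors' evaluation code. The function name `tpr_at_fpr`, the strictly-greater flagging rule, and the toy scores are all assumptions made for illustration. It picks the lowest score threshold that keeps the false positive rate on benign scripts at or below the cap, then reports the fraction of malicious scripts scored above that threshold.

```python
def tpr_at_fpr(benign_scores, malicious_scores, max_fpr=0.001):
    """Return (threshold, tpr, fpr) for the lowest threshold with FPR <= max_fpr.

    Assumption: a script is flagged as malicious when its detector score is
    strictly greater than the threshold.
    """
    if not benign_scores or not malicious_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")

    # Number of benign scripts that may be flagged without exceeding the cap.
    allowed_fp = int(max_fpr * len(benign_scores))

    # With benign scores sorted in descending order, at most `allowed_fp`
    # of them are strictly greater than the score at index `allowed_fp`.
    ranked = sorted(benign_scores, reverse=True)
    threshold = ranked[allowed_fp]

    false_positives = sum(1 for s in benign_scores if s > threshold)
    true_positives = sum(1 for s in malicious_scores if s > threshold)

    fpr = false_positives / len(benign_scores)
    tpr = true_positives / len(malicious_scores)
    return threshold, tpr, fpr


if __name__ == "__main__":
    # Toy detector scores, invented for illustration only.
    benign = [0.1, 0.2, 0.3, 0.9]
    malicious = [0.95, 0.8, 0.25, 0.5]
    threshold, tpr, fpr = tpr_at_fpr(benign, malicious, max_fpr=0.25)
    print(f"threshold={threshold} TPR={tpr} FPR={fpr}")
```

With a cap of 0.1%, at least 1,000 benign scripts are needed before even one false positive is permitted, which is why such low false positive rates can only be measured meaningfully on large benign test sets.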