In this article, we present a Shell Language Preprocessing (SLP) library, which implements tokenization and encoding aimed at parsing Unix and Linux shell commands. We describe the rationale for a new approach, with specific examples where conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and achieve a significant improvement in F1-score, from 0.392 to 0.874.
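
To illustrate the kind of failure referred to above, the following minimal sketch contrasts a generic word-and-punctuation tokenizer with a shell-aware one. It is not the SLP library's API: both function names and the sample command are hypothetical, and the shell-aware variant uses Python's standard `shlex` module (Python 3.8 or later) as a stand-in for a dedicated shell parser.

```python
import re
import shlex


def nlp_style_tokenize(text):
    """Generic word/punctuation split, similar in spirit to a basic NLP tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)


def shell_aware_tokenize(command):
    """Tokenize with POSIX quoting rules, keeping operators such as | and && separate.

    Illustrative stand-in only; this is not how the SLP library is implemented.
    """
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


# Hypothetical sample command, chosen to contain a path, a flag,
# a quoted argument, and two shell operators.
command = "cat /etc/passwd | grep -v 'nologin user' && echo done"

naive_tokens = nlp_style_tokenize(command)
shell_tokens = shell_aware_tokenize(command)

print(naive_tokens)
print(shell_tokens)
```

The generic tokenizer breaks the path `/etc/passwd` and the flag `-v` into fragments and splits the quoted argument, whereas the shell-aware tokenizer keeps each of them as a single token and preserves `|` and `&&` as operators.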