In this article, we present a Shell Language Preprocessing (SLP) library, which implements tokenization and encoding designed for Unix and Linux shell commands. We motivate the need for a new approach with specific examples of where conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and improve the F1 score from 0.392 to 0.874.
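To illustrate the kind of failure the abstract refers to, the following minimal sketch contrasts a generic word-level tokenizer with a shell-aware one. It does not use the SLP library or reproduce its API: the function names `naive_nlp_tokenize` and `shell_aware_tokenize` and the sample command are hypothetical, and Python's standard `shlex` module (Python 3.8 or later) stands in for a shell-aware tokenizer.

```python
import re
import shlex


def naive_nlp_tokenize(command):
    """Generic text tokenizer (hypothetical baseline): keep alphanumeric runs only."""
    return re.findall(r"[A-Za-z0-9]+", command)


def shell_aware_tokenize(command):
    """Tokenize with the standard-library shlex lexer, not with SLP itself.

    punctuation_chars=True makes operators such as | and > separate tokens,
    and whitespace_split=True keeps paths and flags in one piece.
    """
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


# Hypothetical sample command, chosen only for illustration.
command = "cat /etc/passwd | grep -v 'no login' > /tmp/out.txt"

naive_tokens = naive_nlp_tokenize(command)
shell_tokens = shell_aware_tokenize(command)

if __name__ == "__main__":
    print(naive_tokens)
    print(shell_tokens)
```

The word-level tokenizer splits `/etc/passwd` into `etc` and `passwd`, reduces the flag `-v` to `v`, and drops the pipe and redirection operators entirely, while the shell-aware lexer keeps paths, flags, quoted arguments, and operators as distinct tokens.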