HTTP2vec: Embedding of HTTP Requests for Detection of Anomalous Traffic

Hypertext Transfer Protocol (HTTP) is one of the most widely used protocols on the Internet. As a consequence, most attacks (e.g., SQL injection, XSS) use HTTP as the transport mechanism. It is therefore crucial to develop an intelligent solution that effectively detects and filters out anomalies in HTTP traffic. Currently, most anomaly detection systems are either rule-based or trained on manually selected features. We propose using a modern unsupervised language representation model to embed HTTP requests and then using the embeddings to classify anomalies in the traffic. The solution is motivated by methods used in Natural Language Processing (NLP), such as Doc2Vec, which could potentially capture a true understanding of HTTP messages and therefore improve the efficiency of intrusion detection systems. In our work, we aim not only at generating a suitable embedding space but also at the interpretability of the proposed model. We use RoBERTa, the current state of the art, which, as far as we know, has never been applied to a similar problem. To verify how the solution would work in real-world conditions, we train the model using only legitimate traffic. We also try to explain the results based on clusters that occur in the vectorized request space and on a simple logistic regression classifier. We compare our approach with similar, previously proposed methods. We evaluate the feasibility of our method on three datasets: CSIC2010, CSE-CIC-IDS2018, and one that we prepared ourselves. The results are comparable to or better than those of other methods and, most importantly, interpretable.
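To make the "embed requests, train on legitimate traffic only, then flag outliers" idea concrete, the following is a minimal sketch and not the method of the paper. It assumes a hashed character n-gram vector as a stand-in for the RoBERTa embedding, and a cosine distance to the centroid of legitimate requests as a stand-in for the clustering and logistic regression steps. The names `embed`, `fit_centroid`, `anomaly_score`, the dimension `DIM`, and the example requests are all hypothetical.

```python
import hashlib
import math
from collections import Counter

DIM = 256  # assumed embedding size for this toy example


def embed(request, n=3, dim=DIM):
    """Hash character n-grams into a fixed-size, L2-normalised vector.

    Stand-in for a learned encoder such as RoBERTa (an assumption of this sketch).
    """
    grams = Counter(request[i:i + n] for i in range(len(request) - n + 1))
    vec = [0.0] * dim
    for gram, count in grams.items():
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fit_centroid(legitimate_requests):
    """Average the embeddings of legitimate traffic only (no attack samples)."""
    vectors = [embed(r) for r in legitimate_requests]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def anomaly_score(request, centroid):
    """Return 1 - cosine similarity to the legitimate centroid (higher = more unusual)."""
    vec = embed(request)
    dot = sum(a * b for a, b in zip(vec, centroid))
    centroid_norm = math.sqrt(sum(c * c for c in centroid)) or 1.0
    return 1.0 - dot / centroid_norm


# Hypothetical legitimate traffic used for fitting.
legit = [f"GET /shop/item?id={i} HTTP/1.1" for i in range(1, 6)]
centroid = fit_centroid(legit)

normal = "GET /shop/item?id=7 HTTP/1.1"
attack = "GET /shop/item?id=1' OR '1'='1' UNION SELECT password FROM users-- HTTP/1.1"
normal_score = anomaly_score(normal, centroid)
attack_score = anomaly_score(attack, centroid)
```

In this sketch the unseen legitimate request stays close to the centroid while the SQL injection request scores noticeably higher. A threshold on the score, or a classifier trained on the vectors, would then separate the two; how that is done in practice depends on the encoder and data and is not specified here.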
