Many services today continuously produce large volumes of log files in varying formats. These logs are important because they record application activity, and analyzing that activity is necessary for improving the system and maintaining its security and stability. Log files are commonly stored in compressed form to reduce their size. A compression algorithm identifies frequent patterns in a log file in order to remove redundant information. This work presents an approach for detecting frequent patterns in textual data that registers the patterns during the file compression process itself, with low resource consumption. The log file can then be visualized, and the extracted patterns explored using metrics based on properties such as the frequency, length, and root prefixes of each pattern. This allows an analyst to gain relevant insights more efficiently and reduces the need for labor-intensive manual inspection of the log data. Because the approach extends a dictionary-based compression algorithm, it recognizes patterns in log files of any format and eliminates the need for manual preparation or preprocessing of the log files.
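To make the idea of registering patterns during compression concrete, the following is a minimal sketch. It assumes an LZW-style coder as the dictionary-based algorithm; the abstract does not name the specific algorithm, so LZW is used here only as a familiar stand-in. The function names (`lzw_compress_with_patterns`, `frequent_patterns`, `lzw_decompress`) and the thresholds `min_length` and `min_count` are hypothetical and chosen for illustration.

```python
from collections import Counter


def lzw_compress_with_patterns(text):
    """LZW-style compression that also counts how often each phrase is emitted.

    Illustrative stand-in only: the source does not specify which
    dictionary-based algorithm it extends.
    """
    # Seed the dictionary with every distinct character of the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    usage = Counter()  # phrase -> number of times its code was emitted
    codes = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate  # keep extending the current match
        else:
            codes.append(dictionary[current])
            usage[current] += 1  # register the pattern while compressing
            dictionary[candidate] = len(dictionary)  # learn a new phrase
            current = ch
    if current:
        codes.append(dictionary[current])
        usage[current] += 1
    return codes, dictionary, usage


def frequent_patterns(usage, min_length=2, min_count=2):
    """Return (phrase, count) pairs, most frequent first, then longest first."""
    hits = [(p, c) for p, c in usage.items()
            if len(p) >= min_length and c >= min_count]
    return sorted(hits, key=lambda pc: (-pc[1], -len(pc[0]), pc[0]))


def lzw_decompress(codes, alphabet):
    """Inverse of the compressor, given the same initial alphabet."""
    table = {i: ch for i, ch in enumerate(sorted(alphabet))}
    if not codes:
        return ""
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # Code refers to the phrase that is being defined right now.
            entry = previous + previous[0]
        out.append(entry)
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    log = "\n".join(
        f"2024-01-01 12:00:{s:02d} INFO user={s % 3} action=login status=ok"
        for s in range(40)
    )
    codes, dictionary, usage = lzw_compress_with_patterns(log)
    assert lzw_decompress(codes, set(log)) == log
    for phrase, count in frequent_patterns(usage)[:10]:
        print(count, repr(phrase))
```

The pattern bookkeeping adds only one counter update per emitted code, which fits the low-overhead goal described above. In an LZW dictionary every learned phrase extends an earlier phrase by one character, so the entries form a prefix tree; frequency and length can be read directly from `usage`, and prefix-based metrics could be derived from that tree. Whether the original work computes its metrics this way is an assumption.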