Because stability testing execution logs can be very long, software engineers need help locating anomalous events. We develop and evaluate two models for scoring individual log events for anomalousness: an N-Gram model and a deep learning model based on LSTM (Long Short-Term Memory). Both are trained on normal log sequences only. We evaluate the models on long log sequences from Android stability testing at our case company and on short log sequences from the public HDFS (Hadoop Distributed File System) dataset. We evaluate next-event prediction accuracy and computational efficiency. The LSTM model is more accurate on the stability testing logs (0.865 vs 0.848 for the N-Gram model), whereas on the HDFS logs the N-Gram model is slightly more accurate (0.904 vs 0.900 for the LSTM model). The N-Gram model is far more computationally efficient than the LSTM model (4 to 13 seconds vs 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness appears to be a useful aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in the use of deep learning for software system anomaly detection, we found limited benefit in doing so. However, future work should examine whether our finding holds with different LSTM hyper-parameters, with other datasets, and with other deep learning approaches that promise better accuracy and computational efficiency than LSTM-based models.
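To make the N-Gram idea concrete, the following is a minimal sketch of next-event prediction and per-event anomaly scoring, trained on normal sequences only. It is not the implementation evaluated here: the class name `NGramEventModel`, the `<s>` padding token, the toy event names, and the choice of `1 - P(event | context)` as the anomaly score are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

PAD = "<s>"  # hypothetical start-of-sequence padding token


class NGramEventModel:
    """Illustrative N-Gram model over log event IDs (assumes n >= 2)."""

    def __init__(self, n=3):
        if n < 2:
            raise ValueError("n must be at least 2")
        self.n = n
        # Maps a context of n-1 events to counts of the event that followed it.
        self.counts = defaultdict(Counter)

    def _pairs(self, seq):
        """Yield (context, event) pairs for one sequence, padded at the start."""
        padded = [PAD] * (self.n - 1) + list(seq)
        for i in range(self.n - 1, len(padded)):
            yield tuple(padded[i - self.n + 1:i]), padded[i]

    def fit(self, normal_sequences):
        """Count n-grams in sequences that are assumed to be normal."""
        for seq in normal_sequences:
            for context, event in self._pairs(seq):
                self.counts[context][event] += 1
        return self

    def predict_next(self, context):
        """Return the most frequent follower of the context, or None if unseen."""
        ctx = tuple(([PAD] * (self.n - 1) + list(context))[-(self.n - 1):])
        followers = self.counts.get(ctx)
        return followers.most_common(1)[0][0] if followers else None

    def score(self, seq):
        """Return one anomaly score per event: 1 - P(event | context)."""
        scores = []
        for context, event in self._pairs(seq):
            followers = self.counts.get(context)
            if not followers:
                scores.append(1.0)  # context never seen in training
            else:
                scores.append(1.0 - followers[event] / sum(followers.values()))
        return scores

    def accuracy(self, sequences):
        """Fraction of events matching the top-1 next-event prediction."""
        hits = total = 0
        for seq in sequences:
            for context, event in self._pairs(seq):
                hits += self.predict_next(context) == event
                total += 1
        return hits / total if total else 0.0


# Toy training data: made-up event names, not taken from either dataset.
normal = [["open", "read", "close"]] * 3 + [["open", "write", "close"]]
model = NGramEventModel(n=2).fit(normal)
scores = model.score(["open", "write", "delete"])
```

In this toy run, the unseen `delete` event receives the highest score, which is the kind of signal an engineer could use to jump to suspicious lines in a long log. Training reduces to counting, which is consistent with the large runtime gap reported between the two models.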