Knowing which applications HPC jobs run and analyzing their performance behavior play important roles in system management and optimization. Existing approaches detect and identify HPC applications with machine learning models. However, these approaches rely heavily on features manually extracted from resource utilization data to achieve high prediction accuracy. In this study, we propose ARcode, a novel application recognition method that encodes job monitoring data into images and leverages the automatic feature learning capability of convolutional neural networks to detect and identify applications. Our extensive evaluations on a dataset collected from a large-scale production HPC system show that ARcode outperforms the state-of-the-art method by up to 18.87% in accuracy at high confidence thresholds. For some specific applications (BerkeleyGW and e3sm), ARcode outperforms it by over 20% at a confidence threshold of 0.8.
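The abstract does not say how ARcode maps monitoring data to pixels. Purely as an illustration of the general idea of turning job telemetry into an image a CNN can consume, the sketch below uses one hypothetical encoding. Each monitored metric becomes one row. Its time series is linearly resampled to a fixed width and min-max scaled to 0 to 255 grayscale intensities. The function names, the row-per-metric layout, and the scaling are all assumptions and are not ARcode's actual scheme.

```python
from typing import List, Sequence


def resample(series: Sequence[float], width: int) -> List[float]:
    """Linearly interpolate a time series onto `width` evenly spaced points (hypothetical helper)."""
    n = len(series)
    if n == 0 or width < 1:
        raise ValueError("series must be non-empty and width must be >= 1")
    if n == 1:
        # A single sample becomes a constant row.
        return [float(series[0])] * width
    out = []
    for i in range(width):
        # Position of output column i on the input index axis [0, n - 1].
        pos = i * (n - 1) / (width - 1) if width > 1 else 0.0
        left = int(pos)
        right = min(left + 1, n - 1)
        frac = pos - left
        out.append(series[left] * (1 - frac) + series[right] * frac)
    return out


def encode_job_as_image(metrics: Sequence[Sequence[float]], width: int = 64) -> List[List[int]]:
    """Hypothetical encoding: one grayscale pixel row per metric, values in 0..255."""
    image = []
    for series in metrics:
        row = resample(series, width)
        lo, hi = min(row), max(row)
        # Constant metrics map to an all-zero row instead of dividing by zero.
        span = hi - lo if hi > lo else 1.0
        image.append([round(255 * (v - lo) / span) for v in row])
    return image
```

For example, `encode_job_as_image([[0, 1, 2, 3], [5, 5, 5, 5]], width=7)` yields a 2 x 7 image whose first row rises from 0 to 255 and whose second row is all zeros. Normalizing each metric independently keeps metrics with very different units (such as CPU utilization and memory bytes) on a comparable pixel scale, at the cost of discarding absolute magnitudes.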