Pseudonymization at Scale: OLCF's Summit Usage Data Case Study

From arXiv: 9 pages, 5 figures; accepted to the BTSD 2022 workshop (see https://sites.google.com/view/btsd2022 for more information); to be published in the proceedings of IEEE Big Data 2022.

The analysis of vast amounts of data and the processing of complex computational jobs have traditionally relied upon high performance computing (HPC) systems. Understanding the needs of these analyses is paramount for designing solutions that can lead to better science, and understanding the characteristics of user behavior on those systems is likewise important for improving the user experience. A common approach to gathering data about user behavior is to analyze system log data available only to system administrators. Recently at the Oak Ridge Leadership Computing Facility (OLCF), however, we revealed user behavior on the Summit supercomputer by collecting data from a user's point of view with ordinary Unix commands. Here, we discuss the process, challenges, and lessons learned while preparing this dataset for publication and submission to an open data challenge. The original dataset contains personally identifiable information (PII) about OLCF users, which needed to be masked prior to publication. We determined that anonymization, which scrubs PII completely, destroyed too much of the structure of the data for it to remain interesting for the data challenge. We instead chose to pseudonymize the dataset to reduce its linkability to users' identities. Pseudonymization is significantly more computationally expensive than anonymization, and the size of our dataset, approximately 175 million lines of raw text, necessitated the development of a parallelized workflow that could be reused on different HPC machines. We demonstrate the scaling behavior of the workflow on two leadership-class HPC systems at OLCF, and we show that we were able to bring the overall makespan down from an impractical 20+ hours on a single node to around 2 hours. As a result of this work, we release the entire pseudonymized dataset and make the workflows and source code publicly available.
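
The contrast between anonymization and pseudonymization drawn in the abstract can be illustrated with a minimal Python sketch. This is not the authors' workflow: the keyed-hash (HMAC) mapping, the `SECRET_KEY` value, the `user_` prefix, and the function names `pseudonym`, `pseudonymize_line`, and `anonymize_line` are all assumptions made for illustration. The sketch only shows why a consistent replacement token preserves structure (the same user stays recognizable as the same entity across lines) while a blanket redaction does not.

```python
import hashlib
import hmac
import re

# Hypothetical secret key (an assumption for this sketch); a real key would be
# kept private and never published alongside the data.
SECRET_KEY = b"example-key-not-from-the-paper"


def pseudonym(identifier, key=SECRET_KEY, prefix="user"):
    """Map an identifier to a stable keyed token: same input, same output."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


def _user_pattern(known_users):
    """Build a regex matching any known username as a whole word."""
    # Longest names first so a longer name is preferred over a shorter prefix.
    names = sorted(known_users, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")


def pseudonymize_line(line, known_users):
    """Replace each known username with its consistent pseudonym."""
    if not known_users:
        return line
    return _user_pattern(known_users).sub(lambda m: pseudonym(m.group(1)), line)


def anonymize_line(line, known_users):
    """Replace each known username with one fixed marker, losing identity links."""
    if not known_users:
        return line
    return _user_pattern(known_users).sub("REDACTED", line)
```

For example, given the hypothetical lines `"alice 12345 python train.py"` and `"bob 222 /home/alice/run.sh"`, `pseudonymize_line` replaces both occurrences of `alice` with the same token, so per-user patterns can still be studied, whereas `anonymize_line` maps `alice` and `bob` to the same `REDACTED` marker. Because each line is processed independently, a transformation of this shape can be split across many workers, which is the property a parallelized workflow relies on.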
