SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis

5G is the fifth-generation cellular network protocol. It is the state-of-the-art global wireless standard, enabling an advanced kind of network designed to connect virtually everyone and everything with increased speed and reduced latency. Its development, analysis, and security are therefore critical. However, all approaches to 5G protocol development and security analysis, e.g., property extraction, protocol summarization, and semantic analysis of the protocol specifications and implementations, are completely manual. To reduce this manual effort, in this paper we curate SPEC5G, the first-ever public 5G dataset for NLP research. The dataset contains 3,547,586 sentences with 134M words, drawn from 13,094 cellular network specifications and 13 online websites. Leveraging large-scale pre-trained language models that have achieved state-of-the-art results on NLP tasks, we use this dataset for security-related text classification and summarization. Security-related text classification can be used to extract relevant security-related properties for protocol testing. Summarization, in turn, can help developers and practitioners understand the protocol at a high level, which is itself a daunting task. Our results show the value of our 5G-centric dataset in automating 5G protocol analysis. We believe that SPEC5G will enable a new research direction in automatic analysis of the 5G cellular network protocol and numerous related downstream tasks. Our data and code are publicly available.
