Background: Ad hoc parsers are pieces of code that, in effect, perform parsing using common string functions like split, trim, or slice. Whether for handling command-line arguments, reading configuration files, parsing custom file formats, or any number of other minor string-processing tasks, ad hoc parsing is ubiquitous yet poorly understood. Objective: This study aims to reveal the common syntactic and semantic characteristics of ad hoc parsing code in real-world Python projects. Our goal is to understand the nature of ad hoc parsers in order to inform future program analysis efforts in this area. Method: We plan to conduct an exploratory study based on large-scale mining of open-source Python repositories from GitHub. We will use program slicing to identify program fragments related to ad hoc parsing, and we will analyze these parsers and their surrounding contexts across 9 research questions using 25 initial syntactic and semantic metrics. Beyond descriptive statistics, we will attempt to identify common parsing patterns through cluster analysis.
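
To make the term concrete, the following is a minimal sketch of the kind of code the abstract calls an ad hoc parser. It is not taken from the study's dataset. The input formats (a `key = value  # comment` configuration line and a `v1.2.3` version string) and the function names `parse_config_line` and `parse_version` are hypothetical, chosen only to illustrate parsing with plain string operations such as `split`, `strip`, and slicing.

```python
def parse_config_line(line):
    """Ad hoc parser for a hypothetical 'key = value  # comment' line format."""
    # Drop anything after the first '#', then trim surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None  # blank line or comment-only line
    # Split on the first '=' only, so values may themselves contain '='.
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError(f"expected 'key = value', got {line!r}")
    return key.strip(), value.strip()


def parse_version(text):
    """Ad hoc parser for a hypothetical 'v1.2.3' style version string."""
    # Slice off an optional leading 'v', then split on dots.
    if text.startswith("v"):
        text = text[1:]
    return tuple(int(part) for part in text.split("."))


if __name__ == "__main__":
    print(parse_config_line("port = 8080  # default port"))
    print(parse_version("v1.2.3"))
```

Neither function declares a grammar. The accepted input language is implied by the sequence of string operations, which is part of what makes this kind of code hard to analyze.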