Regular expressions (regexes) are a powerful mechanism for solving string-matching problems. They are supported by all modern programming languages and have been estimated to appear in more than a third of Python and JavaScript projects. Yet existing studies have focused mostly on one aspect of regex programming: readability. We know little about how developers perceive and program regexes, or about the difficulties that they face. In this paper, we provide the first study of the regex development cycle, with a focus on (1) how developers make decisions throughout the process, (2) what difficulties they face, and (3) how aware they are of the serious risks involved in programming regexes. We took a mixed-methods approach, surveying 279 professional developers from diverse backgrounds (including top tech firms) for a high-level perspective, and interviewing 17 developers to learn in detail about the difficulties that they face and the solutions that they prefer. In brief, regexes are hard. Not only are they hard to read; our participants said that they are also hard to search for, hard to validate, and hard to document. They are hard to master as well: the majority of the developers we studied were unaware of critical security risks that can arise when using regexes, and those who knew of the risks did not deal with them effectively. Our findings have multiple implications for future work, including semantic regex search engines for regex reuse and improved input generators for regex validation.