Mining Idioms in the Wild

Existing code repositories contain numerous instances of code patterns that are idiomatic ways of accomplishing a particular programming task. Sometimes, the programming language in use supports specific operators or APIs that can express the same idiomatic imperative code much more succinctly. However, those code patterns linger in repositories because the developers may be unaware of the new APIs or have not gotten around to adopting them. Detection of idiomatic code can also point to the need for new APIs. We share our experiences in mining idiomatic patterns from the Hack repo at Facebook. We found that existing techniques either cannot identify meaningful patterns from syntax trees or require test-suite-based dynamic analysis to incorporate semantic properties to mine useful patterns. The key insight of the approach proposed in this paper -- \emph{Jezero} -- is that semantic idioms from a large codebase can be learned from \emph{canonicalized} dataflow trees. We propose a scalable, lightweight static analysis-based approach to construct such a tree that is well suited to mining semantic idioms using nonparametric Bayesian methods. Our experiments with \emph{Jezero} on Hack code show a clear advantage of adding canonicalized dataflow information to ASTs: \emph{Jezero} was significantly more effective than a baseline without the dataflow augmentation at finding refactoring opportunities in unannotated legacy code.
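
To make the role of canonicalization concrete, the following is a minimal sketch in Python rather than Hack. It is not the construction used by \emph{Jezero}, which the abstract does not detail: it only renames identifiers to positional placeholders in a plain AST, with no dataflow analysis. The helper `canonical_form` and the sample snippets are hypothetical names introduced for illustration. The point is that two instances of the same imperative "filter into a list" idiom, written with different variable names, collapse to one tree that a pattern miner could count as a single recurring pattern.

```python
import ast


class _Canonicalizer(ast.NodeTransformer):
    """Rename every identifier to a placeholder (v0, v1, ...) in order of first use.

    Illustrative assumption: a real system would canonicalize using dataflow
    information; this toy version only normalizes names.
    """

    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        # Assign the next free placeholder the first time a name is seen.
        placeholder = self.names.setdefault(node.id, f"v{len(self.names)}")
        return ast.copy_location(ast.Name(id=placeholder, ctx=node.ctx), node)


def canonical_form(source: str) -> str:
    """Return a string dump of the AST after identifier renaming."""
    tree = _Canonicalizer().visit(ast.parse(source))
    return ast.dump(tree)


# Two instances of the same idiom with different variable names.
LOOP_A = """
result = []
for item in items:
    if item > 0:
        result.append(item)
"""

LOOP_B = """
out = []
for x in xs:
    if x > 0:
        out.append(x)
"""

# A structurally different loop (the comparison operator changes).
LOOP_C = """
out = []
for x in xs:
    if x < 0:
        out.append(x)
"""

same_idiom = canonical_form(LOOP_A) == canonical_form(LOOP_B)
different_idiom = canonical_form(LOOP_A) != canonical_form(LOOP_C)
```

Under these assumptions, `same_idiom` and `different_idiom` are both true: renaming alone is enough to merge the first two loops while keeping the third distinct. Name-level normalization cannot merge instances that differ in statement order or in how intermediate values are stored, which is the kind of variation that dataflow information is meant to address.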
