Large language model-generated code (LLMgCode) has become increasingly prevalent in software development. Many studies report that LLMgCode has more quality and security issues than human-authored code (HaCode). It is common for LLMgCode to be mixed with HaCode in a code change that is signed off by human developers alone, often without careful review. Many automated methods have been proposed to distinguish LLMgCode from HaCode, among which the perplexity-based method (PERPLEXITY for short) is the state of the art. However, evaluations of PERPLEXITY have focused on detection accuracy alone. In this article, we are interested in whether PERPLEXITY remains effective across a wider range of realistic evaluation settings. To this end, we construct a large-scale dataset of 11,664 HaCode snippets and 13,164 LLMgCode snippets, and on this basis we carry out a family of experiments that compare PERPLEXITY against feature-based and pre-training-based methods from three perspectives: (1) detection accuracy with respect to programming language, degree of difficulty, and scale of solution, (2) generalization capability, and (3) inference efficiency. The experimental results show that PERPLEXITY has the best generalization capability but low accuracy and efficiency in most cases. Based on the experimental results and the detection mechanism of PERPLEXITY, we discuss its strengths and limitations, e.g., PERPLEXITY is unsuitable for high-level programming languages but offers good interpretability. As the first large-scale investigation of detecting LLMgCode from HaCode, this article provides a wide range of evidence for future improvements.
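To make the detection mechanism concrete, the sketch below shows how a perplexity-based detector can be built: a causal language model scores a code snippet, and snippets with unusually low perplexity (i.e., highly predictable to the model) are flagged as likely LLM-generated. The model choice, threshold value, and scoring rule here are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of a perplexity-based detector.
# Assumptions: "gpt2" as the scoring model, a fixed threshold of 20.0, and a
# simple "lower perplexity => LLM-generated" rule are all hypothetical choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; a code-specific causal LM could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def perplexity(code: str) -> float:
    """Return the perplexity of a code snippet under the causal LM."""
    enc = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy loss,
        # whose exponential is the snippet's perplexity.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()


def looks_llm_generated(code: str, threshold: float = 20.0) -> bool:
    """Flag snippets whose perplexity falls below a (hypothetical) threshold:
    LLM-generated code tends to be more predictable to a language model."""
    return perplexity(code) < threshold


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(f"perplexity = {perplexity(snippet):.2f}, "
          f"flagged = {looks_llm_generated(snippet)}")
```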