It is natural to suppose that a Large Language Model (LLM) is more likely to generate correct test cases when prompted with correct code under test than with incorrect code under test. However, the size of this effect has never previously been measured, despite its obvious importance for both practicing software engineers and researchers. To answer this question, we conducted a comprehensive empirical study with 5 open-source and 6 closed-source language models, on 3 widely used benchmark data sets together with 41 repo-level examples drawn from two different real-world data sets. Our results reveal that, compared to prompting with incorrect code under test, LLMs prompted with correct code achieve improvements of 57\%, 12\%, and 24\% in test accuracy, code coverage, and bug detection, respectively. We further show that these conclusions carry over from the three benchmark data sets to the real-world code, where tests generated for incorrect code suffer a 47\% worse bug detection rate. Finally, we report that providing natural language code descriptions yields improvements of +18\% in accuracy, +4\% in coverage, and +34\% in bug detection. These findings have actionable implications. For example, the 47\% reduction in real-world bug detection is a clear concern. Fortunately, it is a concern for which our findings about the added value of descriptions offer an immediately actionable remedy.