Code translation between programming languages (PLs) is a critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. Most existing studies instruct LLMs to perform code translation and evaluate their performance by either running the generated outputs through test suites or comparing them to reference outputs (ground truth). These outputs, however, may contain not only executable source code but also additional non-code elements, such as natural language explanations or formatting tokens. We refer to the combination of source code and non-code elements as the output format. It is crucial to understand and address variations in output format, as non-code elements can interfere with evaluation metrics, resulting in biased assessments and comparisons of model performance. We conduct an empirical analysis of the outputs from eleven instruction-tuned open-source LLMs across five PLs: C, C++, Go, Java, and Python. The results show that between 26.4% and 73.7% of outputs produced by our evaluated LLMs necessitate post-processing. To mitigate output format bias, we propose a strategic combination of prompt engineering and regular expressions that effectively extracts source code from mixed-format outputs, enabling the eleven open-source models to achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our empirical study confirms that output format bias affects widely used execution-based metrics, i.e., Computational Accuracy (CA), and text-based metrics, i.e., BLEU, CodeBLEU, and CrystalBLEU. Additionally, we test five closed-source LLMs and observe that they also generate varying distributions of output formats, which could lead to output format biases. Our results highlight the need to mitigate output format bias to enable reliable evaluations of LLMs for code translation.
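To illustrate the kind of regex-based extraction the abstract refers to, the sketch below shows one minimal way to pull source code out of a mixed-format LLM output; the paper's actual prompts and patterns are not reproduced here, so the fence-matching regex, the `extract_code` helper, and the sample output are illustrative assumptions only.

```python
import re

# Hypothetical sketch of regex-based code extraction from a mixed-format
# LLM output; the paper's exact extraction rules may differ.
FENCE_PATTERN = re.compile(
    r"```(?:[a-zA-Z+#]*)\s*\n(.*?)```",  # fenced block with optional language tag
    re.DOTALL,
)

def extract_code(llm_output: str) -> str:
    """Return the first fenced code block if present, else the raw output."""
    match = FENCE_PATTERN.search(llm_output)
    if match:
        return match.group(1).strip()
    # No fence found: assume the whole output is code (may still need review).
    return llm_output.strip()

if __name__ == "__main__":
    sample = (
        "Here is the translated Java code:\n"
        "```java\npublic class A {}\n```\n"
        "Hope this helps!"
    )
    print(extract_code(sample))  # -> public class A {}
```

In practice, such extraction is typically paired with prompt instructions that ask the model to emit only fenced code, so the regex mainly serves as a fallback for outputs that still include explanations or other non-code elements.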