Writing pseudocode descriptions of legacy source code for software maintenance is labor-intensive manual work. Recent encoder-decoder language models have shown promise for automating pseudocode generation for high-resource programming languages such as C++, but they rely heavily on the availability of a large parallel code-pseudocode corpus. Collecting such pseudocode annotations for code written in legacy programming languages (PLs) is time-consuming and costly, and it requires a thorough understanding of the source PL. In this paper, we focus on transferring the knowledge acquired by a code-to-pseudocode neural model trained on parallel code-pseudocode data for a high-resource PL (C++) to a legacy PL (C) for which no parallel PL-pseudocode training data is available. To achieve this, we use an Iterative Back Translation (IBT) approach with a novel test-case-based filtration strategy to adapt the trained C++-to-pseudocode model into a C-to-pseudocode model. We observe an improvement of 23.27% in the success rate of the C code generated through back translation over successive IBT iterations, illustrating the efficacy of our approach.
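The abstract does not spell out the IBT loop, so the following is only a minimal sketch of how iterative back translation with test-case filtering might be organized. All names here (`Sample`, `to_pseudo`, `to_code`, `execute`, `fine_tune`) are hypothetical stand-ins rather than the paper's components, and the assumption that a pair is kept only when the back-translated program passes every test case is ours.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases: a "model" is any callable mapping text to text.
Model = Callable[[str], str]
Pair = Tuple[str, str]  # (code, pseudocode)


@dataclass
class Sample:
    code: str                     # program in the low-resource PL (e.g. C)
    tests: List[Tuple[str, str]]  # (stdin, expected stdout) pairs


def passes_all(code: str, tests: List[Tuple[str, str]],
               execute: Callable[[str, str], str]) -> bool:
    """True if `code` produces the expected output on every test case."""
    return all(execute(code, stdin) == expected for stdin, expected in tests)


def ibt_round(samples: List[Sample], to_pseudo: Model, to_code: Model,
              execute: Callable[[str, str], str]) -> Tuple[List[Pair], float]:
    """One round: code -> pseudocode -> code, filtered by test cases."""
    kept: List[Pair] = []
    for s in samples:
        pseudo = to_pseudo(s.code)      # forward translation
        regenerated = to_code(pseudo)   # back translation
        if passes_all(regenerated, s.tests, execute):
            kept.append((s.code, pseudo))  # keep as synthetic parallel data
    rate = len(kept) / len(samples) if samples else 0.0
    return kept, rate


def iterative_back_translation(samples: List[Sample], to_pseudo: Model,
                               to_code: Model,
                               execute: Callable[[str, str], str],
                               fine_tune: Callable[[Model, List[Pair]], Model],
                               rounds: int = 2) -> Tuple[Model, List[float]]:
    """Repeat filtering and fine-tuning; return the model and per-round rates."""
    rates: List[float] = []
    for _ in range(rounds):
        kept, rate = ibt_round(samples, to_pseudo, to_code, execute)
        rates.append(rate)
        # Assumption: only the code-to-pseudocode model is updated here.
        to_pseudo = fine_tune(to_pseudo, kept)
    return to_pseudo, rates
```

In this sketch, `execute` would compile and run a program on one input, and the per-round success rate is simply the fraction of back-translated programs that pass all of their tests. Whether the pseudocode-to-code model is also updated between rounds is not stated in the abstract and is left out here.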