In this paper, we propose psc2code, an approach to denoising the process of extracting source code from programming screencasts. First, psc2code uses Convolutional Neural Network (CNN) based image classification to remove non-code and noisy-code frames. Then, it applies edge detection and clustering-based image segmentation to detect sub-windows in a code frame and, based on the detected sub-windows, identifies and crops the screen region that is most likely to be a code editor. Finally, psc2code calls the API of a professional OCR tool to extract source code from the cropped code regions, and uses both the cross-frame information in the OCRed programming screencast and a statistical language model built from a large corpus of source code to correct errors in the OCRed source code. We conduct an experiment on 1,142 programming screencasts from YouTube. We find that our CNN-based image classification technique effectively removes non-code and noisy-code frames, achieving an F1-score of 0.95 on the valid code frames. Based on the source code denoised by psc2code, we implement two applications: 1) a programming screencast search engine; 2) an interaction-enhanced programming screencast watching tool. On the source code extracted from the 1,142 collected programming screencasts, our experiments show that the search engine achieves precision@5, precision@10, and precision@20 of 0.93, 0.81, and 0.63, respectively.
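To illustrate how cross-frame information can help correct OCR errors, the following is a minimal sketch, not the psc2code implementation. It assumes that the same code line has been OCRed from several frames and that the readings split into the same number of whitespace-separated tokens; it then keeps the most frequent token at each position. The function name `vote_line` and the sample readings are hypothetical, and the statistical language model step described above is not modeled here.

```python
from collections import Counter
from itertools import zip_longest


def vote_line(ocr_readings):
    """Token-wise majority vote over several OCR readings of one code line.

    Assumption (not from the paper): readings are aligned token by token
    after splitting on whitespace. Readings that merge or drop tokens
    would need a proper alignment step first.
    """
    tokenized = [reading.split() for reading in ocr_readings]
    voted = []
    # zip_longest pads shorter readings with None so no position is lost.
    for column in zip_longest(*tokenized):
        candidates = [tok for tok in column if tok is not None]
        # Keep the token seen most often at this position across frames.
        voted.append(Counter(candidates).most_common(1)[0][0])
    return " ".join(voted)


# Hypothetical readings of the same line taken from three frames.
readings = [
    "for (int i = 0; i < n; i++)",
    "for (int i = O; i < n; i++)",  # OCR confused the digit 0 with the letter O
    "for (int 1 = 0; i < n; i++)",  # OCR confused the letter i with the digit 1
]
corrected = vote_line(readings)
```

Voting works here because each OCR error appears in only one frame, so the correct token wins at every position. When errors are systematic across frames, an additional signal such as a language model over source code is needed to choose among candidates.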