In this paper, we propose a novel approach for mining program features by analysing the internal behaviour of a deep neural network trained on source code. Using an unlabelled dataset of Java programs and three different embedding strategies for the methods in the dataset, we train an autoencoder for each program embedding and then test the emergent ability of its internal neurons to autonomously build internal representations of different program features. We define three binary classification labelling policies inspired by real programming issues in order to test how well each neuron classifies programs according to these rules, and we show that some neurons can indeed detect different program properties. We also analyse how the program representation chosen as input affects performance on these tasks. In addition, we are interested in finding the most informative neurons in the network overall, regardless of any given task. To this end, we propose and evaluate two methods for ranking neurons independently of any property. Finally, we discuss how these ideas can be applied in different settings to simplify programmers' work, for instance when integrated into environments such as software repositories or code editors.
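To make the per-neuron evaluation concrete, the following is a minimal sketch of one way a single hidden neuron could be scored as a binary classifier. It is not taken from the paper: the single-threshold rule, the accuracy metric, and the names `neuron_accuracy` and `rank_neurons` are assumptions made for illustration only. The ranking shown here depends on a given labelling and is therefore not one of the two property-independent ranking methods mentioned above.

```python
from typing import List, Sequence, Tuple


def neuron_accuracy(activations: Sequence[float], labels: Sequence[int]) -> float:
    """Best accuracy of a single-threshold classifier on one neuron's activations.

    Assumption for illustration: a neuron "detects" a property if some
    threshold on its activation separates the two classes well.
    """
    if len(activations) == 0 or len(activations) != len(labels):
        raise ValueError("activations and labels must be non-empty and of equal length")
    n = len(labels)
    # Candidate thresholds: every observed value, plus one above the maximum
    # so that the "predict all negative" rule is also considered.
    candidates = sorted(set(activations))
    candidates.append(candidates[-1] + 1.0)
    best = 0.0
    for t in candidates:
        # Predict the positive class when the activation is at least t.
        correct = sum(1 for a, y in zip(activations, labels) if (a >= t) == (y == 1))
        # The flipped polarity (positive when below t) gets n - correct right.
        best = max(best, correct / n, (n - correct) / n)
    return best


def rank_neurons(hidden: Sequence[Sequence[float]],
                 labels: Sequence[int]) -> List[Tuple[int, float]]:
    """Rank neurons by threshold accuracy; hidden[i][j] is neuron j on program i."""
    n_neurons = len(hidden[0])
    scores = []
    for j in range(n_neurons):
        column = [row[j] for row in hidden]
        scores.append((j, neuron_accuracy(column, labels)))
    # Highest accuracy first; ties keep the original neuron order.
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Hypothetical toy data: 4 programs, 2 hidden neurons, one binary labelling.
hidden = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2]]
labels = [0, 0, 1, 1]
ranking = rank_neurons(hidden, labels)
```

In this toy data the first neuron separates the two classes perfectly while the second does not, so the first is ranked higher. In practice the activations would come from the trained autoencoder's hidden layers, and the scores would be measured on held-out programs.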