On Using Machine Learning to Identify Knowledge in API Reference Documentation

Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge into 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) and deep learning approaches trained on manually annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., with multiple binary classifiers), the best AUPRC was up to 87%. The deep learning and SVM classifiers seem complementary. For four knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly outperforms deep learning, which in turn is more accurate at identifying the remaining types. When considering multiple knowledge types at once (i.e., multi-label classification), deep learning outperforms naive baselines and traditional machine learning, achieving a MacroAUC of up to 79%. We also compared classifiers using embeddings pre-trained on generic text corpora and StackOverflow but did not observe significant improvements. Finally, to assess the generalizability of the classifiers, we re-tested them on a different, unseen Python documentation dataset. Classifiers for Functionality, Concept, Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python documentation. The accuracy for the remaining types seems API-specific. We discuss our results and how they inform the development of tools for supporting developers sharing and accessing API knowledge. Published article: https://doi.org/10.1145/3338906.3338943
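The two evaluation settings mentioned in the abstract (one binary classifier per knowledge type scored by AUPRC, and multi-label classification scored by MacroAUC) can be illustrated with a small sketch. This is not the authors' evaluation code: the abstract does not say how the metrics were computed, so the sketch assumes average precision as the AUPRC estimate and an unweighted mean of per-type ROC AUC as MacroAUC. The labels and scores are invented toy values, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a step-wise estimate of the area under the PR curve."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("average precision is undefined without positive examples")
    # Rank documents from the highest to the lowest classifier score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / positives


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the chance that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(labels: Dict[str, List[int]], scores: Dict[str, List[float]]) -> float:
    """Unweighted mean of per-type ROC AUC (assumed definition of MacroAUC)."""
    return sum(roc_auc(labels[k], scores[k]) for k in labels) / len(labels)


# Toy data (invented): five documentation units, two knowledge types.
labels = {
    "Functionality": [1, 1, 0, 1, 0],
    "Directive": [0, 1, 0, 0, 1],
}
scores = {
    "Functionality": [0.9, 0.8, 0.7, 0.6, 0.2],
    "Directive": [0.1, 0.8, 0.3, 0.4, 0.7],
}

for knowledge_type in labels:
    ap = average_precision(labels[knowledge_type], scores[knowledge_type])
    print(f"{knowledge_type}: AUPRC (average precision) = {ap:.3f}")
print(f"MacroAUC = {macro_auc(labels, scores):.3f}")
```

Each knowledge type gets its own score in the binary setting, which is what allows the per-type comparison between SVM and deep learning, while the macro average weights every type equally regardless of how often it occurs.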
