Integrated Development Environments (IDEs) are designed to make users more productive and their work more comfortable. To achieve this, IDEs embed many diverse tools, and IDE developers can use anonymous usage logs to learn how these tools are used and to improve them. Code completion is a particularly important component to which this applies, since improving code completion with statistical learning techniques is a well-established research area. In this work, we propose an approach for collecting completion usage logs from IDE users and using them to train a machine-learning-based model for ranking completion candidates. We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. We used the logs to build a dataset of code completions from users, and used it to train a CatBoost ranking model. We then evaluated the model in two settings: on a held-out set of the collected completions, and in a separate A/B test on two different groups of users in the IDE. Our evaluation shows that a simple ranking model trained on past user behavior logs significantly improved the code completion experience. Compared to the default heuristics-based ranking, our model decreased the number of typing actions necessary to perform a completion in the IDE from 2.073 to 1.832. The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information and performs all the necessary anonymization on the client side. Importantly, it can be improved continuously by implementing new features, collecting new data, and evaluating new models; we have been using it in production this way since the end of 2020.
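As a rough illustration of the held-out evaluation described above, the following Python sketch groups logged candidates by completion session and compares two scoring functions by the average position of the candidate the user actually selected. Everything in it is an assumption made for illustration: the `CandidateRow` record, the feature names `prefix_match` and `recently_used`, and both scoring functions are hypothetical, and the mean selected position is only a simple stand-in, not the typing-actions metric reported in the abstract. In the approach itself, a trained CatBoost ranking model would take the place of the `learned` function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical log record: one row per candidate shown in a completion popup.
# Only numeric features are kept; no source code or identifiers are stored.
@dataclass(frozen=True)
class CandidateRow:
    session_id: int             # groups candidates shown in the same popup
    features: Dict[str, float]  # anonymized features (names are made up)
    selected: bool              # True for the candidate the user picked


def group_by_session(rows: List[CandidateRow]) -> Dict[int, List[CandidateRow]]:
    """Collect candidate rows into per-session groups."""
    groups: Dict[int, List[CandidateRow]] = {}
    for row in rows:
        groups.setdefault(row.session_id, []).append(row)
    return groups


def mean_selected_position(
    rows: List[CandidateRow],
    score: Callable[[CandidateRow], float],
) -> float:
    """Average 1-based rank of the selected candidate when sorting by `score`."""
    positions = []
    for candidates in group_by_session(rows).values():
        ranked = sorted(candidates, key=score, reverse=True)
        for position, candidate in enumerate(ranked, start=1):
            if candidate.selected:
                positions.append(position)
                break
    if not positions:
        raise ValueError("no session contains a selected candidate")
    return sum(positions) / len(positions)


# Toy held-out data: two sessions with two candidates each.
rows = [
    CandidateRow(1, {"prefix_match": 1.0, "recently_used": 0.0}, False),
    CandidateRow(1, {"prefix_match": 0.5, "recently_used": 1.0}, True),
    CandidateRow(2, {"prefix_match": 1.0, "recently_used": 1.0}, True),
    CandidateRow(2, {"prefix_match": 0.2, "recently_used": 0.0}, False),
]


def baseline(candidate: CandidateRow) -> float:
    """Stand-in for a heuristic ranking: prefix match only."""
    return candidate.features["prefix_match"]


def learned(candidate: CandidateRow) -> float:
    """Stand-in for a trained model: also uses a usage-based feature."""
    return candidate.features["prefix_match"] + candidate.features["recently_used"]


baseline_position = mean_selected_position(rows, baseline)
learned_position = mean_selected_position(rows, learned)
print(baseline_position, learned_position)
```

A lower mean position means the selected candidate appears closer to the top of the popup. Grouping by session matters because candidates are only comparable within the popup in which they were shown together.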