PYInfer: Deep Learning Semantic Type Inference for Python Variables

Python type inference is challenging in practice. Because of Python's dynamic properties and its extensive dependence on third-party libraries that lack type annotations, traditional static analysis techniques have limited performance. Although the semantics in source code can help reveal the intended usage of variables (and thus help infer their types), existing tools usually ignore them. In this paper, we propose PYInfer, an end-to-end learning-based type inference tool that automatically generates type annotations for Python variables. The key insight is that contextual code semantics are critical in inferring the type of a variable. For each use of a variable, we collect a few tokens within its contextual scope and design a neural network to predict its type. One challenge is that it is difficult to collect a high-quality human-labeled training dataset for this purpose. To address this issue, we apply an existing static analyzer to generate the ground truth for variables in source code. Our main contribution is a novel approach to statically inferring variable types effectively and efficiently. By formulating type inference as a classification problem, we can handle user-defined types and predict type probabilities for each variable. Our model achieves 91.2% accuracy on classifying 11 basic Python types and 81.2% accuracy on classifying the 500 most common types. Our results substantially outperform the state-of-the-art type annotators. Moreover, PYInfer achieves 5.2X more code coverage and is 187X faster than a state-of-the-art learning-based tool. With similar time consumption, our model annotates 5X more variables than a state-of-the-art static analysis tool. Our model also outperforms a learning-based function-level annotator on annotating types for variables and function arguments. All our tools and datasets are publicly available to facilitate future research in this direction.
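
The context-collection step described above (gathering a few tokens around each use of a variable) can be illustrated with a minimal sketch. It assumes a fixed symmetric token window and uses Python's standard `tokenize` module; the function name `context_windows` and the window size are hypothetical choices for illustration, not PYInfer's actual pipeline. The sketch also treats every non-keyword identifier as a candidate, so function names such as `len` appear alongside variables.

```python
import io
import keyword
import tokenize

# Token kinds that carry no semantic context for this illustration.
_SKIP = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER,
}


def context_windows(source, window=3):
    """Return (identifier, surrounding_tokens) for each identifier use.

    Hypothetical helper: a fixed window of `window` tokens on each side
    is an assumption, not the scope definition used by PYInfer.
    """
    tokens = [
        (tok.type, tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in _SKIP
    ]
    samples = []
    for i, (kind, text) in enumerate(tokens):
        # Keep identifiers only; keywords such as `if` are not variables.
        if kind == tokenize.NAME and not keyword.iskeyword(text):
            start = max(0, i - window)
            context = [s for _, s in tokens[start:i + window + 1]]
            samples.append((text, context))
    return samples


if __name__ == "__main__":
    for name, context in context_windows("count = len(items)\n", window=2):
        print(name, context)
```

In a learning-based setup like the one the abstract outlines, each such window would be paired with a type label produced by a static analyzer and fed to a classifier that outputs a probability per type; those later stages are not shown here.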
