Data science (DS) pipelines take a long time to run, so even small programming mistakes can be very costly if they are not detected statically. However, even basic static type checking of DS pipelines is difficult because most are written in Python, where static typing is available only via external linters. These linters require static type annotations for the parameters and results of functions, which many DS libraries do not provide. In this paper, we show how the wealth of Python DS libraries can be used in a statically safe way via Safe-DS, a domain-specific language (DSL) for DS. Safe-DS catches conventional type errors as well as errors related to range restrictions, data manipulation, and the call order of functions, going well beyond the abilities of current Python linters. Python libraries are integrated into Safe-DS via a stub language for specifying the interface of their declarations and an API-Editor that extracts type information from the code and documentation of Python libraries and automatically generates suitable stubs. Moreover, Safe-DS complements textual DS pipelines with a graphical representation that eases safe development by preventing syntax errors. The seamless synchronization of the textual and graphical views lets developers always choose the one best suited to their skills and current task. We think that Safe-DS can make DS development easier, faster, and more reliable, significantly reducing development costs.
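To make the classes of errors named above concrete, the following minimal Python sketch shows a range restriction and a call-order restriction that type annotations alone do not express. It is an illustration only: `ToyClassifier`, `run_checked`, and the specific bounds are hypothetical names and values chosen for this example, not part of Safe-DS or of any real DS library, and Safe-DS syntax is not shown here. A conventional annotation-based checker would typically accept both faulty calls, because `5.0` is a valid `float` and `predict` has a well-typed signature; the mistakes surface only when the code runs.

```python
from typing import Optional, Sequence


class ToyClassifier:
    """Hypothetical model used only for illustration; not from a real library."""

    def __init__(self, learning_rate: float) -> None:
        # Range restriction: the annotation `float` cannot express
        # the constraint 0 < learning_rate <= 1 (bounds are assumed).
        if not 0.0 < learning_rate <= 1.0:
            raise ValueError("learning_rate must be in (0, 1]")
        self.learning_rate = learning_rate
        self._threshold: Optional[float] = None

    def fit(self, values: Sequence[float]) -> None:
        # "Training" here is just computing the mean of the inputs.
        self._threshold = sum(values) / len(values)

    def predict(self, value: float) -> bool:
        # Call-order restriction: fit() must have run before predict().
        if self._threshold is None:
            raise RuntimeError("predict() called before fit()")
        return value > self._threshold


def run_checked(action):
    """Run a zero-argument callable; return the error class name, or None."""
    try:
        action()
        return None
    except (ValueError, RuntimeError) as error:
        return type(error).__name__


# Both calls are well-typed, yet both fail at runtime.
range_error = run_checked(lambda: ToyClassifier(learning_rate=5.0))
order_error = run_checked(lambda: ToyClassifier(learning_rate=0.1).predict(2.0))
print(range_error, order_error)
```

In a long-running pipeline, a failure like the second one may appear only after earlier, expensive steps have completed, which is the cost that static detection is meant to avoid.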