High-throughput sequencing file formats and tools encode coordinate intervals with respect to a reference sequence in at least four distinct, incompatible ways. Integrating data from, and moving data between, different formats can introduce subtle off-by-one errors. Here, we introduce the notion of typesafe coordinates: a coordinate interval is not merely a pair of integers but a member of a type class comprising four types, the Cartesian product of a zero or one basis and an open or closed interval end. By leveraging the type system of statically and strongly typed compiled languages, we can provide static guarantees that an entire class of errors is eliminated. We provide a reference implementation in D as part of a larger work (dhtslib), and proofs of concept in Rust, OCaml, and Python. Exploratory implementations are available at https://github.com/blachlylab/typesafe-coordinates.
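To make the idea concrete, the following is a minimal Rust sketch of how the four coordinate systems could be encoded as types. It is not the dhtslib API or the authors' Rust proof of concept; every name in it (`ZeroBased`, `OneBased`, `Open`, `Closed`, `Interval`, `convert`, `overlaps`) is hypothetical and chosen for illustration. The sketch also assumes that an "open" end means the stored end position is excluded from the interval and a "closed" end means it is included.

```rust
use std::marker::PhantomData;

// Hypothetical marker types: one pair for the basis, one pair for the interval end.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ZeroBased;
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OneBased;
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Open;
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Closed;

/// Number assigned to the first position of the reference sequence.
pub trait Basis {
    const OFFSET: i64;
}
impl Basis for ZeroBased {
    const OFFSET: i64 = 0;
}
impl Basis for OneBased {
    const OFFSET: i64 = 1;
}

/// Distance from the last included position to the stored end value.
pub trait End {
    const EXTRA: i64;
}
impl End for Open {
    const EXTRA: i64 = 1; // stored end is one past the last included position
}
impl End for Closed {
    const EXTRA: i64 = 0; // stored end is the last included position
}

/// An interval whose coordinate system is part of its type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval<B, E> {
    start: i64,
    end: i64,
    _system: PhantomData<(B, E)>,
}

impl<B: Basis, E: End> Interval<B, E> {
    pub fn new(start: i64, end: i64) -> Self {
        Interval { start, end, _system: PhantomData }
    }

    pub fn start(self) -> i64 {
        self.start
    }

    pub fn end(self) -> i64 {
        self.end
    }

    /// Number of positions covered, independent of the coordinate system.
    pub fn len(self) -> i64 {
        self.end - self.start + 1 - E::EXTRA
    }

    /// Explicit conversion to any of the four systems; the arithmetic lives here only.
    pub fn convert<B2: Basis, E2: End>(self) -> Interval<B2, E2> {
        Interval::new(
            self.start - B::OFFSET + B2::OFFSET,
            self.end - B::OFFSET - E::EXTRA + B2::OFFSET + E2::EXTRA,
        )
    }

    /// Accepts only an interval of the same system; mixing systems fails to compile.
    pub fn overlaps(self, other: Self) -> bool {
        let self_end_exclusive = self.end + 1 - E::EXTRA;
        let other_end_exclusive = other.end + 1 - E::EXTRA;
        self.start < other_end_exclusive && other.start < self_end_exclusive
    }
}
```

Under these assumptions, a zero-based, open-ended interval (the convention used by BED) built as `Interval::<ZeroBased, Open>::new(9, 20)` converts to the one-based, closed interval (the convention used by GFF) with start 10 and end 20, and both report a length of 11. Calling `overlaps` on two intervals of different systems is rejected by the compiler until one of them is passed through `convert`, which is the static guarantee the abstract describes.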