Modern data collection and analysis pipelines often involve a sophisticated mix of applications written in general-purpose and specialized programming languages. Many formats commonly used to import and export data between different programs or systems, such as CSV or JSON, are verbose, inefficient, not type-safe, or tied to a specific programming language. Protocol Buffers are a popular method of serializing structured data between applications while remaining independent of programming languages and operating systems. They offer a unique combination of features, performance, and maturity that seems particularly well suited to data-driven applications and numerical computing. The RProtoBuf package provides a complete interface to Protocol Buffers from the R environment for statistical computing. This paper outlines the general class of data serialization requirements for statistical computing, describes the implementation of the RProtoBuf package, and illustrates its use with example applications in large-scale data collection pipelines and web services.