Today's big data science communities manage data publication and replication at the application layer. They rely on a myriad of mechanisms to publish, discover, and retrieve datasets, and the result is an ecosystem made up of either centralized repositories or collections of ad hoc ones. Publishing datasets to centralized repositories can be process-intensive, and those repositories do not accept all datasets. Ad hoc repositories are difficult to find and use because of differences in data names, metadata standards, and access methods. To address the problem of scientific data publication and storage, we have designed Hydra, a secure, distributed, and decentralized data repository built as a loose federation of storage servers (nodes) provided by user communities. Hydra runs over Named Data Networking (NDN) and uses the State Vector Sync (SVS) protocol, which lets individual nodes maintain a "global view" of the system. Hydra provides a scalable and resilient data retrieval service: data distribution scales through NDN's built-in data anycast and in-network caching, and resiliency against individual server failures comes from automated failure detection and from maintaining a specific degree of replication. Hydra uses "Favor", a locally calculated numerical value, to decide which nodes will replicate a file. Finally, Hydra uses data-centric security for data publication and node authentication. A Network Operation Center (NOC) bootstraps trust in Hydra nodes and data publishers; it distributes user and node certificates and performs the proof-of-possession challenges. This technical report serves as the reference for Hydra. It outlines the design decisions, the rationale behind them, the functional modules, and the protocol specifications.
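As a rough illustration of two mechanisms named above, the following Python sketch shows how a node might fold a received state vector into its own "global view" and how a Favor ranking could select replicating nodes. It is a minimal sketch under stated assumptions, not Hydra's implementation: the function names, the node names, the use of plain dictionaries, and the rule "higher Favor wins, ties broken by node name" are all hypothetical. The abstract states only that Favor is a locally calculated numerical value and that a specific degree of replication is maintained.

```python
from typing import Dict, List

# Assumption: a state vector maps a node name to the latest sequence
# number known for that node. The names used here are hypothetical.
StateVector = Dict[str, int]


def merge_state_vectors(local: StateVector, received: StateVector) -> StateVector:
    """Return the element-wise maximum of two state vectors.

    Each node keeps the highest sequence number it has seen per node,
    which is how a vector-based sync protocol converges on a shared view.
    """
    merged = dict(local)
    for node, seq in received.items():
        if seq > merged.get(node, 0):
            merged[node] = seq
    return merged


def pick_replicas(favors: Dict[str, float], degree: int) -> List[str]:
    """Pick the `degree` nodes with the highest Favor value.

    Assumption: a higher Favor means a better replication candidate, and
    ties are broken by node name so every node derives the same answer
    from the same view. The source does not specify either rule.
    """
    ranked = sorted(favors.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:degree]]


if __name__ == "__main__":
    local = {"/hydra/node/a": 3, "/hydra/node/b": 7}
    received = {"/hydra/node/b": 5, "/hydra/node/c": 2}
    print(merge_state_vectors(local, received))

    favors = {"/hydra/node/a": 0.4, "/hydra/node/b": 0.9,
              "/hydra/node/c": 0.9, "/hydra/node/d": 0.1}
    print(pick_replicas(favors, degree=2))
```

Because both functions are deterministic given the same inputs, nodes that hold the same view would reach the same replication decision without a central coordinator, which is the property a decentralized repository needs.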