As the number of genomics samples grows rapidly with increased access to high-resolution DNA sequencing technology, a scalable platform that aggregates dispersed datasets and enables easy access to the vast wealth of available DNA sequences becomes paramount. In this work, we introduce and demonstrate a novel way to use Named Data Networking (NDN) in conjunction with a Kubernetes cluster to design a flexible and scalable genomics Data Lake in the cloud. In addition, the use of the NDN Data Plane Development Kit (DPDK) provides efficient and accessible distribution of the datasets to researchers anywhere. This report explains the need for a genomics Data Lake, describes what is required to deploy one successfully, and gives detailed instructions for replicating the proposed design. Finally, it outlines options for future enhancement.