Data stores are the foundation on which data science, in all its variations, is built upon. They provide a queryable interface to structured and unstructured data. Data science often starts by leveraging these query features to perform initial data preparation. However, most data stores are designed to run continuously to service disparate user requests with little or no downtime. Many HPC architectures process user requests by job queue scheduler and maintain a shard filesystem to store a jobs persistent data. We deploy a MongoDB sharded cluster with a run script that is designed to run a data science workload concurrently. As our test piece, we run data ingest and data queries to measure the performance with different configurations on the Blue Waters supper computer.
翻译:数据存储处是数据科学及其所有变异的基础。 它们为结构化和无结构化的数据提供了一个可查询的界面。 数据科学通常从利用这些查询功能进行初步数据准备开始。 然而, 多数数据存储处的设计旨在持续运行, 以在很少或没有故障的情况下为不同的用户请求服务。 许多 HPC 架构通过工作队列调度器处理用户请求, 并维持一个硬文件系统以存储持续工作的数据 。 我们安装了一个运行脚本的模块, 用于同时运行数据科学工作量。 作为我们的测试片, 我们运行数据存储和数据查询, 以测量蓝水晚宴计算机上不同配置的性能 。