Simulation and evaluation of cloud storage caching for data intensive science

A common task in scientific computing is the derivation of data. This workflow extracts the most important information from large input data and stores it in smaller derived data objects, which can then be used for further analysis tasks. Typically, such workflows use distributed storage and computing resources. A straightforward configuration of storage media combines low-cost tape storage with higher-cost disk storage: the large, infrequently accessed input data is stored on tape, and the smaller, frequently accessed derived data is stored on disk. In the best case, the large input data is accessed only rarely and in a well-planned pattern. However, practice shows that the data often has to be processed continuously and unpredictably, which can significantly reduce tape storage performance. A common countermeasure is to store copies of the large input data on disk storage. This contribution evaluates an approach that uses cloud storage resources as a flexible cache or buffer, depending on the computational workflow. The proposed model is elaborated for the case of continuously processed data. For the evaluation, a simulation was developed that can be used to evaluate models related to storage and network resources. We show that using commercial cloud storage can reduce the on-premises disk storage requirements while maintaining an equal throughput of jobs. Moreover, the key metrics of the model are discussed, and an approach is described that uses the simulation to assist with the decision process of using commercial cloud storage. The goal is to investigate approaches and propose new evaluation methods to overcome future data challenges.
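The tiering idea can be illustrated with a toy sketch. This is not the simulation developed in this work; it is a minimal illustration under strong assumptions: storage is counted in files rather than bytes, both cache tiers use least-recently-used eviction, accesses are uniformly random, and bandwidth, latency, and cost are ignored. All names (`LRUCache`, `simulate`, `make_accesses`) are hypothetical.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache (counted in files) with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def hit(self, key):
        # On a hit, mark the file as most recently used.
        if key in self.items:
            self.items.move_to_end(key)
            return True
        return False

    def put(self, key):
        if self.capacity <= 0:
            return  # a zero-capacity tier stores nothing
        self.items[key] = True
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used file


def simulate(accesses, disk_files, cloud_files):
    """Count which tier serves each access: disk, then cloud, then tape."""
    disk = LRUCache(disk_files)
    cloud = LRUCache(cloud_files)
    counts = {"disk": 0, "cloud": 0, "tape": 0}
    for f in accesses:
        if disk.hit(f):
            counts["disk"] += 1
        elif cloud.hit(f):
            counts["cloud"] += 1
            disk.put(f)  # copy the file to on-premises disk for reuse
        else:
            counts["tape"] += 1  # recall from tape, then populate both caches
            cloud.put(f)
            disk.put(f)
    return counts


def make_accesses(n_files=1000, n_accesses=20000, seed=1):
    """Assumed workload: uniformly random, unpredictable file accesses."""
    rng = random.Random(seed)
    return [rng.randrange(n_files) for _ in range(n_accesses)]


if __name__ == "__main__":
    acc = make_accesses()
    print("disk only:   ", simulate(acc, disk_files=200, cloud_files=0))
    print("disk + cloud:", simulate(acc, disk_files=100, cloud_files=1000))
```

Comparing the `tape` counts of the two configurations gives a rough feel for the trade-off described above: a cloud tier large enough to hold the whole input data set reduces tape recalls to one per distinct file, even with a smaller on-premises disk. A realistic evaluation would additionally need to model file sizes, network throughput, and storage cost.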
