As big data processing and analysis have come to dominate the usage of distributed and cloud infrastructures, the demand for distributed metadata access and transfer has increased. In many application domains, the volume of data generated exceeds petabytes, while the corresponding metadata amounts to terabytes or more. This paper proposes SMURF, a novel solution for efficient and scalable metadata access for distributed applications across wide-area networks. Our solution combines novel pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and prefetching strategies to sidestep fetching latency, and achieves scalable, high-performance metadata fetch/prefetch services in the cloud. We also study semantic locality in real trace logs, a phenomenon that is not well utilized in metadata access prediction. Based on this observation, we implement a novel prefetch predictor and compare it with three existing state-of-the-art prefetch schemes on Yahoo! Hadoop audit traces. By caching and prefetching metadata according to access patterns, our continuum caching and prefetching mechanism significantly improves the local cache hit rate and reduces the average fetching latency. We replayed approximately 20 million metadata access operations from real audit traces; our system achieved 90% prefetch prediction accuracy and reduced the average fetch latency by 50% compared to the state-of-the-art mechanisms.