Data lakes hold a growing amount of cold data that is infrequently accessed, yet requires interactive response times. Serverless functions are seen as a way to address this use case, since they offer an appealing alternative to maintaining (and paying for) a fixed infrastructure. Recent research has analyzed the potential of serverless for data processing. In this paper, we expand on that work by examining how serverless resources (the number and size of the functions) should be allocated to data processing tasks. We formulate a general model that roughly estimates completion time and financial cost, and we apply it to augment an existing serverless data processing system with an advisory tool. The tool automatically identifies configurations that strike a good balance, which we define as being close to the "knee" of their Pareto frontier. The model takes into account key aspects of serverless: start-up, computation, network transfers, and overhead, each as a function of the input sizes and intermediate result exchanges. Using (micro)benchmarks and parts of TPC-H, we show that this advisor is capable of pinpointing configurations desirable to the user. Moreover, we identify and discuss several aspects of data processing on serverless that affect efficiency. Configuring the resources with an automated tool lowers the barrier to using serverless for data processing, and the narrow window in which serverless is cost-effective can be widened by choosing a better allocation instead of having to over-provision.
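To make the idea of such an advisor concrete, the following Python sketch estimates completion time and cost for a grid of configurations, keeps the Pareto-optimal ones, and picks a point near the knee. It is only an illustration under stated assumptions: the names (`Config`, `estimate`, `pareto_front`, `knee`, `advise`), the shape of the formulas, and every constant are hypothetical and are not taken from the paper's model or measurements. The knee is approximated here as the frontier point closest to the ideal corner after normalization, which is one common heuristic and not necessarily the definition used in the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    workers: int      # number of functions
    memory_gb: float  # size of each function


# All constants are illustrative placeholders, not measured values.
STARTUP_S = 0.5              # assumed fixed start-up latency
COORD_S_PER_WORKER = 0.01    # assumed overhead growing with fan-out
SCAN_GBPS_PER_GB_MEM = 0.05  # assumed compute throughput per GB of memory
NET_GBPS_PER_WORKER = 0.08   # assumed network bandwidth per function
PRICE_PER_GB_S = 0.0000167   # assumed price per GB-second


def estimate(cfg, input_gb, shuffle_gb):
    """Return (time_s, cost) for one configuration under the toy model."""
    per_worker_in = input_gb / cfg.workers
    per_worker_shuffle = shuffle_gb / cfg.workers
    compute = per_worker_in / (SCAN_GBPS_PER_GB_MEM * cfg.memory_gb)
    network = (per_worker_in + per_worker_shuffle) / NET_GBPS_PER_WORKER
    overhead = COORD_S_PER_WORKER * cfg.workers
    time_s = STARTUP_S + compute + network + overhead
    cost = cfg.workers * cfg.memory_gb * time_s * PRICE_PER_GB_S
    return time_s, cost


def pareto_front(points):
    """Keep (cfg, time, cost) entries that no other entry dominates."""
    front = []
    best_cost = float("inf")
    # Scan by increasing time; an entry survives only if it is cheaper
    # than every faster entry seen so far.
    for cfg, t, c in sorted(points, key=lambda p: (p[1], p[2])):
        if c < best_cost:
            front.append((cfg, t, c))
            best_cost = c
    return front


def knee(front):
    """Pick the frontier entry closest to the normalized ideal corner."""
    ts = [t for _, t, _ in front]
    cs = [c for _, _, c in front]
    t_min, c_min = min(ts), min(cs)
    t_span = (max(ts) - t_min) or 1.0  # avoid division by zero
    c_span = (max(cs) - c_min) or 1.0

    def dist(entry):
        _, t, c = entry
        return ((t - t_min) / t_span) ** 2 + ((c - c_min) / c_span) ** 2

    return min(front, key=dist)


def advise(input_gb, shuffle_gb):
    """Evaluate a small grid of configurations and return the knee entry."""
    grid = product([1, 2, 4, 8, 16, 32, 64], [0.5, 1.0, 2.0, 4.0, 8.0])
    candidates = [Config(w, m) for w, m in grid]
    points = [(cfg, *estimate(cfg, input_gb, shuffle_gb)) for cfg in candidates]
    return knee(pareto_front(points))


if __name__ == "__main__":
    cfg, time_s, cost = advise(input_gb=10.0, shuffle_gb=2.0)
    print(cfg, round(time_s, 2), cost)
```

In a real system the constants would be replaced by values calibrated from (micro)benchmarks, and the grid by the configurations the platform actually offers.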