We observe that many system policies that make threshold decisions involving a resource (e.g., time, memory, cores) naturally reveal additional, or implicit, feedback. For example, if a system waits X minutes for an event to occur, then it automatically learns what would have happened had it waited less than X minutes, because time has a cumulative property. This feedback tells us about alternative decisions, and can be used to improve the system policy. However, leveraging implicit feedback is difficult because it tends to be one-sided or incomplete, and may depend on the outcome of the event. As a result, existing practices for using feedback, such as simply incorporating it into a data-driven model, suffer from bias. We develop a methodology, called Sayer, that leverages implicit feedback to evaluate and train new system policies. Sayer builds on two ideas from reinforcement learning -- randomized exploration and unbiased counterfactual estimators -- to leverage data collected by an existing policy to estimate the performance of new candidate policies, without actually deploying those policies. Sayer uses implicit exploration and implicit data augmentation to generate implicit feedback in an unbiased form, which is then used by an implicit counterfactual estimator to evaluate and train new policies. The key idea underlying these techniques is to assign implicit probabilities to decisions that are not actually taken but whose feedback can be inferred; these probabilities are carefully calculated to ensure statistical unbiasedness. We apply Sayer to two production scenarios in Azure, and show that it can evaluate arbitrary policies accurately, and train new policies that outperform the production policies.
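To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of implicit data augmentation combined with an inverse-propensity-style counterfactual estimator for a timeout (threshold) policy. It assumes a logging policy that randomizes over a small set of candidate timeouts with known probabilities, and that waiting x minutes reveals the outcome for every timeout of at most x (time is cumulative). All names, the reward definition, and the candidate timeouts are hypothetical.

```python
"""Sketch of an implicit counterfactual estimator for a timeout policy.

Assumptions (illustrative, not from the paper): the logging policy picks a
timeout at random from TIMEOUTS with the probabilities in LOG_PROBS; a logged
run with timeout x reveals the outcome for every candidate timeout <= x.
"""

TIMEOUTS = [1, 2, 4, 8]                        # candidate thresholds (minutes)
LOG_PROBS = {1: 0.1, 2: 0.2, 4: 0.3, 8: 0.4}   # randomized logging policy


def reward(timeout, event_time):
    # +1 if the event occurred within the timeout, otherwise a fixed penalty.
    # event_time may be float('inf') if the event never occurred in the log.
    return 1.0 if event_time <= timeout else -0.5


def implicit_prob(timeout):
    # "Implicit probability" that the logging policy reveals feedback for
    # this timeout: any logged threshold >= timeout exposes its outcome.
    return sum(p for t, p in LOG_PROBS.items() if t >= timeout)


def evaluate(target_policy, log):
    """Estimate the average reward of target_policy from logged data.

    log is a list of (context, logged_timeout, event_time) tuples;
    target_policy(context) returns the timeout the candidate policy would
    pick. Each inferable decision is weighted by 1 / implicit_prob so the
    estimate stays unbiased despite the one-sided feedback.
    """
    total = 0.0
    for context, logged_timeout, event_time in log:
        chosen = target_policy(context)
        if chosen <= logged_timeout:   # feedback for `chosen` is inferable
            total += reward(chosen, event_time) / implicit_prob(chosen)
    return total / len(log)
```

Under these assumptions, the indicator that feedback for the chosen timeout is revealed has expectation exactly implicit_prob(chosen) over the logging randomization, so dividing by it yields an unbiased estimate of the candidate policy's reward without deploying it.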