Sales Channel Optimization via Simulations Based on Observational Data with Delayed Rewards: A Case Study at LinkedIn

Training models on data obtained from randomized experiments is ideal for making good decisions. However, randomized experiments are often time-consuming, expensive, risky, infeasible, or unethical to perform, leaving decision makers little choice but to train models on observational data collected under historical policies. This raises questions not only about which decision-making policies would perform best in practice, but also about how different data collection protocols affect the performance of policies trained on the resulting data, and about how robust policy performance is to changes in problem characteristics such as action- or reward-specific delays in observing outcomes. We aim to answer such questions for the problem of optimizing sales channel allocations at LinkedIn, where sales accounts (leads) need to be allocated to one of three channels, with the goal of maximizing the number of successful conversions over a period of time. A key feature of the problem is the presence of stochastic delays in observing allocation outcomes, whose distribution is both channel- and outcome-dependent. We built a discrete-time simulation that can handle these problem features and used it to evaluate: a) a historical rule-based policy; b) a supervised machine learning policy (XGBoost); and c) multi-armed bandit (MAB) policies, under different scenarios involving: i) the data collection used for training (observational vs. randomized); ii) lead conversion scenarios; iii) delay distributions. Our simulation results indicate that LinUCB, a simple MAB policy, consistently outperforms the other policies, achieving an 18-47% lift relative to a rule-based policy.
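To make the setup concrete, the following is a minimal sketch of a discrete-time simulation in which a disjoint LinUCB policy allocates leads to one of three channels and observes each outcome only after a stochastic delay that depends on both channel and outcome. It is not the authors' simulator: the conversion weights (`TRUE_W`), the mean delays (`MEAN_DELAY`), the exponential delay model, the three-dimensional lead features, and all function and class names are assumptions chosen purely for illustration.

```python
import heapq
import math
import random

# Hypothetical ground truth, NOT taken from the paper.
# TRUE_W[k] gives channel k's linear conversion probability weights.
TRUE_W = [[0.05, 0.30, 0.00], [0.10, 0.00, 0.25], [0.15, 0.05, 0.05]]
# MEAN_DELAY[k][outcome]: assumed mean delay in time steps (outcome 1 = conversion).
MEAN_DELAY = [[30.0, 10.0], [20.0, 5.0], [40.0, 15.0]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class LinUCBArm:
    """One arm of disjoint LinUCB; stores A^{-1} and b of a ridge regression."""

    def __init__(self, dim, alpha):
        self.alpha = alpha
        # A starts as the identity matrix, so its inverse is the identity too.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_x(self, x):
        return [dot(row, x) for row in self.a_inv]

    def ucb(self, x):
        ax = self._a_inv_x(x)
        # theta^T x = b^T A^{-1} x, because A^{-1} is symmetric.
        return dot(self.b, ax) + self.alpha * math.sqrt(max(dot(x, ax), 0.0))

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} for A <- A + x x^T.
        ax = self._a_inv_x(x)
        denom = 1.0 + dot(x, ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
            self.b[i] += reward * x[i]


def simulate(n_steps=2000, alpha=1.0, seed=0):
    """Allocate one lead per step; return (conversions, outcomes still unobserved)."""
    rng = random.Random(seed)
    arms = [LinUCBArm(len(TRUE_W[0]), alpha) for _ in TRUE_W]
    pending = []  # min-heap of (reveal_time, allocation_time, channel, x, reward)
    conversions = 0
    for t in range(n_steps):
        # Reveal every outcome whose delay has elapsed, then update that arm.
        while pending and pending[0][0] <= t:
            _, _, k, x_old, r = heapq.heappop(pending)
            arms[k].update(x_old, r)
        x = [1.0, rng.random(), rng.random()]  # bias term plus two lead features
        k = max(range(len(arms)), key=lambda i: arms[i].ucb(x))
        r = 1.0 if rng.random() < dot(TRUE_W[k], x) else 0.0
        conversions += int(r)
        # The delay distribution depends on both the channel and the outcome.
        delay = 1 + int(rng.expovariate(1.0 / MEAN_DELAY[k][int(r)]))
        heapq.heappush(pending, (t + delay, t, k, x, r))
    return conversions, len(pending)
```

Calling `conversions, unobserved = simulate(n_steps=2000, seed=0)` runs one episode. The policy only learns from outcomes popped off the `pending` heap, so decisions made at step `t` never use feedback that has not yet arrived, which is the property that delayed rewards impose on any policy evaluated this way.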
