Sequential incentive marketing is an important approach for online businesses to acquire customers, increase loyalty, and boost sales. How to effectively allocate the incentives so as to maximize the return (e.g., business objectives) under a budget constraint, however, is less studied in the literature. This problem is technically challenging due to the fact that 1) the allocation strategy has to be learned from historically logged data, which is counterfactual in nature, and 2) both the optimality and the feasibility (i.e., that cost cannot exceed budget) of the learned policy need to be assessed before it is deployed to online systems. In this paper, we formulate the problem as a constrained Markov decision process (CMDP). To solve the CMDP with logged counterfactual data, we propose an efficient learning algorithm that combines bisection search and model-based planning. First, the CMDP is converted into its dual problem via Lagrangian relaxation, which we prove to be monotonic with respect to the dual variable. We then show that the dual problem can be solved by policy learning, with the optimal dual variable found efficiently via bisection search (i.e., by exploiting the monotonicity). Lastly, we show that model-based planning can be used to effectively accelerate the joint optimization process without retraining the policy for every candidate dual variable. Empirical results on synthetic and real marketing datasets confirm the effectiveness of our method.
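As a rough sketch of the relaxation described above (the notation here is ours, not taken verbatim from the paper), let $\pi$ be a policy, $r_t$ the per-step reward, $c_t$ the incentive cost, $B$ the budget, and $\gamma$ a discount factor. The CMDP and its Lagrangian dual can then be written as

$$\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} r_t\Big] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t} c_t\Big] \le B,$$

$$g(\lambda) \;=\; \max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t}\big(r_t - \lambda\, c_t\big)\Big] \;+\; \lambda B, \qquad \lambda \ge 0 .$$

Under this reading, the expected cost of the $\lambda$-optimal policy is non-increasing in $\lambda$, so the smallest dual variable whose policy still satisfies the budget can be located by bisection rather than by a full grid search.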
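The search loop that this suggests can be sketched as below; this is a minimal illustration under our own assumptions, where `plan_policy` and `estimate_cost` are hypothetical stand-ins for the paper's model-based planner and cost estimator, not the authors' API.

```python
def bisection_search(plan_policy, estimate_cost, budget,
                     lam_lo=0.0, lam_hi=10.0, tol=1e-4, max_iter=50):
    """Find the smallest dual variable whose lambda-adjusted policy respects
    the budget, assuming the expected cost is monotone (non-increasing) in lambda."""
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        # Model-based planning against the penalized reward r_t - lam * c_t;
        # only re-planning is needed for each new lambda, not retraining.
        policy = plan_policy(lam)
        cost = estimate_cost(policy)
        if cost > budget:      # infeasible: penalize cost more heavily
            lam_lo = lam
        else:                  # feasible: try a smaller penalty
            lam_hi = lam
        if lam_hi - lam_lo < tol:
            break
    return plan_policy(lam_hi)  # policy for the smallest feasible lambda found
```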