Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow

In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence. Unlike the general sequential decision-making process, these use cases have a simpler flow: after seeing the recommended content on each page, customers can only respond by moving forward in the process or dropping out of it, until a termination state is reached. We refer to this type of problem as sequential decision making in linear-flow. We propose to formulate the problem as an MDP with Bandits, where Bandits are employed to model the transition probability matrix. At recommendation time, we use Thompson sampling (TS) to sample the transition probabilities and allocate the best series of actions with an analytical solution through exact dynamic programming. This formulation allows us to leverage TS's efficiency in balancing exploration and exploitation and Bandits' convenience in modeling actions' incompatibility. In the simulation study, we observe that the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits. We also find that the proposed algorithm's performance is the most robust to changes in the across-page interdependence strength.
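
The recommendation step described above (sample transition probabilities with TS, then solve for the best series of actions by exact dynamic programming) might look roughly like the following. This is a minimal sketch under assumptions the abstract does not state: each page's forward probability is given a Beta-Bernoulli posterior, it depends only on the current action and the action shown on the previous page, and the objective is the probability of reaching the end of the flow. The function names `sample_transitions`, `best_sequence`, and `update` are hypothetical, as is the `alpha`/`beta` layout.

```python
import random


def sample_transitions(alpha, beta, rng):
    """Thompson sampling step: draw one forward probability per
    (page, previous action, action) from assumed Beta posteriors."""
    return [[[rng.betavariate(a, b) for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(page_a, page_b)]
            for page_a, page_b in zip(alpha, beta)]


def best_sequence(p):
    """Exact backward dynamic programming over sampled probabilities.

    p[i][prev][a] is the (sampled) probability of moving forward from page i
    when action a is shown and action prev was shown on page i - 1. The first
    page has a single dummy previous action, index 0.
    Returns (actions, value), where value is the product of the forward
    probabilities along the chosen sequence.
    """
    n_pages = len(p)
    n_actions = len(p[0][0])
    value = [1.0] * n_actions  # value after the last page, by last action
    choice = []
    for i in range(n_pages - 1, -1, -1):
        new_value, best = [], []
        for prev in range(len(p[i])):
            scores = [p[i][prev][a] * value[a] for a in range(n_actions)]
            a_star = max(range(n_actions), key=scores.__getitem__)
            best.append(a_star)
            new_value.append(scores[a_star])
        value = new_value
        choice.append(best)
    choice.reverse()
    # Forward pass: read off the optimal action on each page.
    actions, prev = [], 0
    for i in range(n_pages):
        prev = choice[i][prev]
        actions.append(prev)
    return actions, value[0]


def update(alpha, beta, actions, pages_passed):
    """Update the Beta counts in place after one customer.

    pages_passed is the number of pages the customer moved forward from;
    if it is smaller than len(actions), the customer dropped on that page.
    """
    prev = 0
    for i, a in enumerate(actions):
        if i < pages_passed:
            alpha[i][prev][a] += 1
        else:
            beta[i][prev][a] += 1
            break
        prev = a
```

In this sketch, one round would call `sample_transitions(alpha, beta, random.Random())`, pass the result to `best_sequence`, show the returned actions, and then call `update` with the observed outcome. Incompatible action pairs could be excluded by setting the corresponding sampled probability to zero before the dynamic programming step, though how the paper itself handles incompatibility is not specified in the abstract.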
