Online decision making aims to learn the optimal decision rule by making personalized decisions and updating the decision rule recursively. Big data has made this easier than before, but it also brings new challenges. Because the decision rule must be updated at every step, an offline update that uses all historical data is inefficient in both computation and storage. To this end, we propose a fully online algorithm that makes decisions and updates the decision rule via stochastic gradient descent. It is not only computationally efficient but also supports a broad class of parametric reward models. Focusing on statistical inference for online decision making, we establish the asymptotic normality of the parameter estimator produced by our algorithm and of the online inverse-probability-weighted value estimator used to estimate the optimal value. Online plug-in estimators for the variances of the parameter and value estimators are also provided and shown to be consistent, so that interval estimation and hypothesis testing are possible with our method. The proposed algorithm and theoretical results are validated by simulations and a real-data application to news article recommendation.
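The paradigm described above can be illustrated with a minimal sketch. This is not the paper's exact algorithm; it assumes, for illustration, a linear reward model with two actions, epsilon-greedy exploration, a per-step stochastic gradient descent update of the parameter estimate, and a running inverse-probability-weighted (IPW) estimate of the value of the current greedy rule. All names and constants here are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (a sketch, not the paper's algorithm): two actions,
# reward r = theta_true[a] @ x + noise, epsilon-greedy decisions.
d, n, eps, lr = 3, 5000, 0.1, 0.05
theta_true = np.array([[1.0, -0.5, 0.2],
                       [-0.3, 0.8, 0.1]])   # one coefficient row per action
theta_hat = np.zeros((2, d))                 # parameters estimated online

ipw_running = 0.0                            # running IPW value estimate
for t in range(1, n + 1):
    x = rng.normal(size=d)                   # context observed at step t
    greedy = int(np.argmax(theta_hat @ x))   # action favored by current estimate
    probs = np.full(2, eps / 2)              # epsilon-greedy propensities
    probs[greedy] += 1 - eps
    a = rng.choice(2, p=probs)               # randomized personalized decision
    r = theta_true[a] @ x + rng.normal(scale=0.1)
    # Online update: one SGD step on squared-error loss for the chosen action,
    # using only the current observation (no stored history).
    theta_hat[a] += lr * (r - theta_hat[a] @ x) * x
    # Online IPW value estimate of the current greedy rule, weighting the
    # observed reward by the inverse propensity of the action taken.
    w = (a == greedy) / probs[a]
    ipw_running += (w * r - ipw_running) / t

print(np.round(theta_hat, 2))
print(round(ipw_running, 2))
```

Because each step touches only the current observation, both the decision rule and the value estimate are maintained in O(d) memory, which is the computational point the abstract makes against offline re-estimation from full history.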