In this paper, we consider the problem of learning online to manage Demand Response (DR) resources. A typical DR mechanism requires the DR manager to assign a baseline to the participating consumer, where the baseline is an estimate of the counterfactual consumption of the consumer had it not been called to provide the DR service. A challenge in estimating the baseline is the incentive the consumer has to inflate the baseline estimate. We consider the problem of learning online to estimate the baseline and to optimize the operating costs over a period of time under such incentives. We propose an online learning scheme that employs least squares for estimation, with a perturbation to the reward price (for the DR services or load curtailment) designed to balance the exploration and exploitation trade-off that arises in online learning. We show that our proposed scheme achieves a very low regret of $\mathcal{O}\left((\log{T})^2\right)$ with respect to the optimal operating cost over $T$ days of the DR program with full knowledge of the baseline, and that it is individually rational for the consumers to participate. Our scheme significantly outperforms the averaging-type approach, which only achieves $\mathcal{O}(T^{1/3})$ regret.
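To make the idea concrete, the following is a minimal numerical sketch of the price-perturbation principle described above, not the paper's actual algorithm. It assumes a hypothetical linear baseline model $b_t = \theta^\top x_t$ and a hypothetical linear consumer response to the reward price; the $1/\sqrt{t}$ perturbation schedule is chosen purely for illustration. The point it demonstrates is that randomizing the reward price makes the baseline parameters identifiable by least squares even though consumption is always observed under some price.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 200        # horizon: days of the DR program
d = 3          # dimension of hypothetical covariates (e.g., weather features)
theta = rng.normal(size=d)   # true baseline parameters, unknown to the manager
p_star = 1.0                 # nominal reward price for load curtailment

X, y = [], []                # running data set for the least-squares fit
theta_hat = np.zeros(d)

for t in range(1, T + 1):
    x_t = rng.normal(size=d)             # today's covariates
    # Perturb the reward price to induce exploration; the decay rate here
    # is an illustrative choice, not the schedule analyzed in the paper.
    p_t = p_star + rng.choice([-1.0, 1.0]) / np.sqrt(t)

    baseline_t = theta @ x_t + 0.1 * rng.normal()   # counterfactual consumption
    # Assumed linear price response: the consumer curtails 0.5 units of load
    # per unit of reward price (hypothetical response model).
    consumption_t = baseline_t - 0.5 * p_t

    # The manager observes (x_t, p_t, consumption_t); since
    # consumption = theta^T x - 0.5 p, regressing consumption on (x, p)
    # recovers the baseline parameters in the first d coefficients.
    X.append(np.concatenate([x_t, [p_t]]))
    y.append(consumption_t)
    coef, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    theta_hat = coef[:d]

print("estimated baseline params:", theta_hat)
print("true baseline params:     ", theta)
```

Without the perturbation, every day's price would equal $p^*$ and the price column of the regression would be constant, so the consumer's price response could not be separated from the baseline; the decaying perturbation supplies exactly the exploration that the scheme trades off against exploitation.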