We consider the problem of content caching at the wireless edge, where a set of end users is served over unreliable wireless channels and the goal is to minimize the average latency that users experience under constrained edge cache capacity. We formulate this problem as a Markov decision process, or more specifically a restless multi-armed bandit problem, which is provably hard to solve. We begin by investigating a discounted counterpart and prove that it admits a threshold-type optimal policy. We then show that this result also holds for the average latency problem. Using this structural result, we establish the indexability of our problem and employ the Whittle index policy to minimize average latency. Since system parameters such as content request rates and wireless channel conditions are often unknown and time-varying, we further develop a model-free reinforcement learning algorithm, dubbed Q^{+}-Whittle, that relies on the Whittle index policy. However, Q^{+}-Whittle requires storing the Q-function values for all state-action pairs, the number of which can be extremely large for wireless edge caching. To address this, we approximate the Q-function by a parameterized function class of much smaller dimension and design a Q^{+}-Whittle algorithm with linear function approximation, called Q^{+}-Whittle-LFA. We provide a finite-time bound on the mean-square error of Q^{+}-Whittle-LFA. Simulation results using real traces demonstrate that Q^{+}-Whittle-LFA yields excellent empirical performance.
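As a rough illustration of how an index policy is applied once the indices are available, the sketch below caches the contents with the largest index values, up to the cache capacity. This is a generic top-K selection and not the paper's algorithm: the function name `whittle_index_policy`, the content names, and the numeric index values are all hypothetical, and the computation of the Whittle indices themselves (which depends on the model in the paper) is not shown.

```python
import heapq


def whittle_index_policy(indices, capacity):
    """Pick the contents to cache: the `capacity` entries with the largest index.

    `indices` maps a content id to its current index value (assumed to be
    computed elsewhere); `capacity` is the number of contents the cache holds.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    # nlargest returns (content, index) pairs sorted by index, largest first.
    chosen = heapq.nlargest(capacity, indices.items(), key=lambda kv: kv[1])
    return {content for content, _ in chosen}


# Hypothetical index values, for illustration only.
indices = {"video_a": 0.9, "video_b": 0.2, "video_c": 0.5, "video_d": 0.7}
cached = whittle_index_policy(indices, capacity=2)
```

In a learning setting such as the one the abstract describes, the index values would be re-estimated over time and the selection repeated at each decision epoch; that loop is omitted here.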