We consider a stochastic lost-sales inventory control system with a lead time $L$ over a planning horizon $T$. Supply is uncertain, and is a function of the order quantity (due to random yield/capacity, etc). We aim to minimize the $T$-period cost, a problem that is known to be computationally intractable even under known distributions of demand and supply. In this paper, we assume that both the demand and supply distributions are unknown and develop a computationally efficient online learning algorithm. We show that our algorithm achieves a regret (i.e. the performance gap between the cost of our algorithm and that of an optimal policy over $T$ periods) of $O(L+\sqrt{T})$ when $L\geq\log(T)$. We do so by 1) showing our algorithm cost is higher by at most $O(L+\sqrt{T})$ for any $L\geq 0$ compared to an optimal constant-order policy under complete information (a well-known and widely-used algorithm) and 2) leveraging its known performance guarantee from the existing literature. To the best of our knowledge, a finite-sample $O(\sqrt{T})$ (and polynomial in $L$) regret bound when benchmarked against an optimal policy is not known before in the online inventory control literature. A key challenge in this learning problem is that both demand and supply data can be censored; hence only truncated values are observable. We circumvent this challenge by showing that the data generated under an order quantity $q^2$ allows us to simulate the performance of not only $q^2$ but also $q^1$ for all $q^1<q^2$, a key observation to obtain sufficient information even under data censoring. By establishing a high probability coupling argument, we are able to evaluate and compare the performance of different order policies at their steady state within a finite time horizon. Since the problem lacks convexity, we develop an active elimination method that adaptively rules out suboptimal solutions.
翻译:我们考虑的是具有周转时间(L$)的库存损失控制系统。 供应是不确定的, 并且是订单数量( 随机收益/ 能力等) 的函数。 我们的目标是在美元( 随机收益/ 能力) 时, 将美元( T) 期成本降至最低, 即便在已知的需求和供应分配情况下, 这个问题也已知是计算上难以解决的问题。 在本文中, 我们假设, 供需分配是未知的, 并且开发了一个计算高效的在线学习算法。 我们的算法实现了一个遗憾( 即, 我们的算法成本与美元( 美元) 之间的性能差异。 美元( 美元) 成本( 美元) 成本( 美元) 成本( 美元) 价格( 美元) 价格( 美元( 美元) 价格( 美元) ( 美元) 成本( 美元) 成本( 美元) 成本( 美元) 成本( 美元( 美元) 美元( 美元) 美元( 美元( 美元) ( 美元) 美元( 美元) 美元( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( 美元) ( ) ( ) ( ) ( 美元) ( ) ( 美元) ( ) ( 美元) ( 美元) ( ) ( ) ( ) ( 美元) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( 美元) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) (