Optimal No-regret Learning in Repeated First-price Auctions

We study online learning in repeated first-price auctions with censored feedback, where a bidder, only observing the winning bid at the end of each auction, learns to adaptively bid in order to maximize her cumulative payoff. To achieve this goal, the bidder faces a challenging dilemma: if she wins the bid (the only way to achieve positive payoffs), then she is not able to observe the highest bid of the other bidders, which we assume is drawn i.i.d. from an unknown distribution. This dilemma, despite being reminiscent of the exploration-exploitation trade-off in contextual bandits, cannot be directly addressed by the existing UCB or Thompson sampling algorithms. In this paper, by exploiting the structural properties of first-price auctions, we develop the first learning algorithm that achieves an $O(\sqrt{T}\log^{2.5} T)$ regret bound, which is minimax optimal up to $\log$ factors, when the bidder's private values are stochastically generated. We do so by providing an algorithm for a general class of problems, called partially ordered contextual bandits, which combine graph feedback across actions, cross learning across contexts, and a partial order over the contexts. We establish both the strengths and the weaknesses of this framework by showing a curious separation: a regret nearly independent of the action/context sizes is possible under stochastic contexts, but is impossible under adversarial contexts. Despite the limitation of this general framework, we further exploit the structure of first-price auctions and develop a learning algorithm that operates sample-efficiently (and computationally efficiently) in the presence of adversarially generated private values. We establish an $O(\sqrt{T}\log^3 T)$ regret bound for this algorithm, hence providing a complete characterization of optimal learning guarantees for first-price auctions.
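The censored feedback described above can be illustrated with a minimal sketch of a single round. It is not the paper's algorithm; it only shows what the bidder learns after bidding. The function names `play_round` and `counterfactual_payoffs`, the tie-breaking rule (the bidder wins when her bid is at least the others' highest bid), and the finite bid grid are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def play_round(value: float, bid: float, others_max: float) -> Tuple[float, Optional[float]]:
    """Simulate one first-price auction round (hypothetical helper).

    Returns (payoff, observed_others_max). Assumption: ties go to the bidder.
    On a win she pays her own bid and the others' highest bid stays hidden
    (None); on a loss the winning bid, i.e. others_max, is revealed.
    """
    if bid >= others_max:
        return value - bid, None
    return 0.0, others_max


def counterfactual_payoffs(
    value: float, bid: float, others_max: float, grid: List[float]
) -> Dict[float, Optional[float]]:
    """Payoffs of alternative bids that this round's feedback pins down.

    After a loss, others_max is known, so every alternative bid can be scored.
    After a win, she only knows others_max <= bid, so bids at or above her own
    are known wins, while lower bids remain unknown (None).
    """
    _, observed = play_round(value, bid, others_max)
    known: Dict[float, Optional[float]] = {}
    for b in grid:
        if observed is not None:
            known[b] = (value - b) if b >= observed else 0.0
        elif b >= bid:
            known[b] = value - b
        else:
            known[b] = None
    return known
```

Under these assumptions, a win reveals the payoff of every higher bid but nothing about lower ones, which is one way to picture feedback shared across actions. The same win/lose information could be reused to score other private values, in the spirit of the cross learning across contexts mentioned in the abstract.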
