In this thesis, we study learning algorithms for optimal decision making in two different contexts: Reinforcement Learning (RL) in Part I and Auction Design in Part II. Reinforcement learning is an area of machine learning concerned with how an agent should act in an environment in order to maximize its cumulative reward over time. In Chapter 2, inspired by statistical physics, we develop a novel approach to RL that not only learns optimal policies with enhanced desirable properties but also sheds new light on maximum entropy RL. In Chapter 3, we tackle the generalization problem in RL from a Bayesian perspective. We show that imperfect knowledge of the environment's dynamics effectively turns a fully observed Markov Decision Process (MDP) into a Partially Observed MDP (POMDP), which we call the Epistemic POMDP. Informed by this observation, we develop a new policy learning algorithm, LEEP, with improved generalization properties.

Designing an incentive-compatible, individually rational auction that maximizes revenue is a challenging and intractable problem. Recently, deep-learning-based approaches have been proposed to learn optimal auctions from data. While successful, these approaches suffer from several limitations, including sample inefficiency, lack of generalization to new auctions, and training difficulties. In Chapter 4, we construct a symmetry-preserving neural network architecture, EquivariantNet, suitable for anonymous auctions. EquivariantNet is not only more sample efficient but also learns auction rules that generalize well to other settings. In Chapter 5, we propose a novel formulation of the auction learning problem as a two-player game. The resulting learning algorithm, ALGNet, is easier to train, more reliable, and better suited for non-stationary settings.