In this work, the trick-taking game Wizard, with its separate bidding and playing phases, is modeled by two interleaved partially observable Markov decision processes (POMDPs). Deep Q-Networks (DQN) are used to empower self-improving agents capable of tackling the challenges of a highly non-stationary environment. To compare algorithms against each other, the accuracy between bid and trick count is monitored; it correlates strongly with the actual rewards and provides well-defined upper and lower performance bounds. The trained DQN agents achieve accuracies between 66% and 87% in self-play, outperforming both a random baseline and a rule-based heuristic. The analysis also reveals a strong information asymmetry across player positions during bidding. To compensate for the missing Markov property of imperfect-information games, a long short-term memory (LSTM) network is implemented to integrate historic information into the decision-making process. Additionally, a forward-directed tree search is conducted by sampling a state of the environment, thereby turning the game into a perfect-information setting. To our surprise, neither approach surpasses the performance of the basic DQN agent.
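To make the evaluation metric concrete, the following is a minimal sketch of how the bid-trick accuracy could be computed for a single round; the function name, data layout, and aggregation are illustrative assumptions and not taken verbatim from the paper.

```python
from typing import Sequence

def bid_trick_accuracy(bids: Sequence[int], tricks_won: Sequence[int]) -> float:
    """Fraction of players whose trick count exactly matches their bid.

    Illustrative reconstruction of the accuracy metric described above
    (hit rate between announced bid and tricks actually won); the exact
    aggregation over rounds used in the paper may differ.
    """
    if len(bids) != len(tricks_won):
        raise ValueError("bids and tricks_won must have the same length")
    hits = sum(1 for b, t in zip(bids, tricks_won) if b == t)
    return hits / len(bids)

# Example: in a four-player round, two players hit their bid exactly.
print(bid_trick_accuracy(bids=[1, 0, 2, 1], tricks_won=[1, 1, 2, 0]))  # 0.5
```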