When omega-regular objectives were first proposed in model-free reinforcement learning (RL) for controlling MDPs, deterministic Rabin automata were used in an attempt to provide a direct translation from their transitions to scalar values. While these translations failed, it has turned out that it is possible to repair them by using good-for-MDPs (GFM) B\"uchi automata instead. These are nondeterministic B\"uchi automata with a restricted type of nondeterminism, albeit not as restricted as in good-for-games automata. Indeed, deterministic Rabin automata have a straightforward translation to such GFM automata, which is bi-linear in the number of states and pairs. Interestingly, the same cannot be said for deterministic Streett automata: a translation to nondeterministic Rabin or B\"uchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs. Do we have to pay more than that to obtain a good-for-MDPs automaton? The surprising answer is that we have to pay significantly less when we instead expand the good-for-MDPs property to alternating automata: like the nondeterministic GFM automata obtained from deterministic Rabin automata, the alternating good-for-MDPs automata we produce from deterministic Streett automata are bi-linear in the size of the deterministic automaton and its index, and can therefore be exponentially more succinct than minimal nondeterministic B\"uchi automata.