We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. On a sequential machine, planning time is proportional to the number of elementary operations used to achieve a good policy. For episodic tasks, the number of elementary operations depends on the number of options composed by the policy in an episode and the number of options considered at each decision point. To reduce the amount of computation in planning, for a given set of episodic tasks and a given number of options, our objective prefers options with which a high return can be achieved by composing few options, and also prefers a smaller set of options to choose from at each decision point. We develop an algorithm that optimizes the proposed objective. In a variant of the classic four-room domain, we show that 1) a higher objective value is typically associated with fewer elementary planning operations used by the option-value iteration algorithm to obtain a near-optimal value function, 2) our algorithm achieves an objective value that matches that achieved by two human-designed options, 3) the amount of computation used by option-value iteration with options discovered by our algorithm matches that used with the human-designed options, and 4) the options produced by our algorithm also make intuitive sense: they seem to move to and terminate at the entrances of rooms.
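To make the cost model concrete, the following is a minimal sketch, not the paper's objective or discovery algorithm. It assumes a hypothetical 9-state deterministic chain (not the four-room domain) with a reward of 1 on entering the goal, and it counts one elementary operation per option backup in synchronous option-value iteration. The names `primitive`, `jump`, and `option_value_iteration`, and the constants `GAMMA` and `GOAL`, are illustrative assumptions.

```python
GAMMA = 0.9   # assumed per-step discount
GOAL = 8      # assumed chain: states 0..8, state 8 is terminal


def primitive(s):
    """One step right. Returns (reward, discount, next_state)."""
    nxt = s + 1
    return (1.0 if nxt == GOAL else 0.0), GAMMA, nxt


def jump(s, length=4):
    """Hypothetical multi-step option: move up to `length` steps right."""
    nxt = min(s + length, GOAL)
    steps = nxt - s
    # The reward of 1 arrives on the last step, so it is discounted steps-1 times.
    reward = GAMMA ** (steps - 1) if nxt == GOAL else 0.0
    return reward, GAMMA ** steps, nxt


def option_value_iteration(option_models, tol=1e-9):
    """Synchronous sweeps; counts one backup per (state, option) pair."""
    v = [0.0] * (GOAL + 1)
    backups = 0
    while True:
        new_v = list(v)
        for s in range(GOAL):  # the terminal state keeps value 0
            candidates = []
            for model in option_models:
                reward, discount, nxt = model(s)
                candidates.append(reward + discount * v[nxt])
                backups += 1
            new_v[s] = max(candidates)
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if delta < tol:
            return v, backups


v_prim, backups_prim = option_value_iteration([primitive])
v_jump, backups_jump = option_value_iteration([primitive, jump])
print("backups with primitives only:", backups_prim)
print("backups with the extra option:", backups_jump)
```

In this toy, adding the option doubles the cost of each sweep but reduces the number of sweeps, because every state reaches the goal by composing at most two options. Both factors named in the abstract therefore show up in the backup count: how many options a good policy composes, and how many options are considered at each decision point.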