Learning to Generate All Feasible Actions

Many machine learning (ML) applications involve searching for an optimal solution to a complex task. The search space is often so large that the optimal solution cannot be computed. Part of the problem is that many candidate solutions found via ML are infeasible and have to be discarded. Restricting the search space to feasible candidates simplifies finding an optimal solution for a task. Further, the set of feasible solutions can be reused across multiple problems with different tasks. In particular, we observe that complex tasks can be decomposed into subtasks and corresponding skills. We propose to learn a reusable and transferable skill by training an actor to generate all feasible actions. The trained actor can then propose feasible actions, among which an optimal one can be chosen according to a specific task. The actor is trained by interpreting the feasibility of each action as a target distribution. The training procedure minimizes a divergence between the actor's output distribution and this target. We derive the general optimization target for arbitrary f-divergences using a combination of kernel density estimates, resampling, and importance sampling. We further use an auxiliary critic to reduce the number of interactions with the environment. A preliminary comparison to related strategies shows that our approach learns to visit all the modes in the feasible action space, demonstrating the framework's potential for learning skills that can be used in various downstream tasks.
