We formulate learning for control as an $\textit{inverse problem}$ -- inverting a dynamical system to give the actions which yield desired behavior. The key challenge in this formulation is a $\textit{distribution shift}$ -- the learning agent only observes the forward mapping (its actions' consequences) on trajectories that it can execute, yet must learn the inverse mapping for input-output pairs that correspond to a different, desired behavior. We propose a general recipe for inverse problems with a distribution shift that we term $\textit{iterative inversion}$ -- learn the inverse mapping under the current input distribution (policy), then use it on the desired output samples to obtain new inputs, and repeat. As we show, iterative inversion can converge to the desired inverse mapping, but only under rather strict conditions on the mapping itself. We next apply iterative inversion to learn control. Our input is a set of demonstrations of desired behavior, given as video embeddings of trajectories, and our method iteratively learns to imitate trajectories generated by the current policy, perturbed by random exploration noise. We find that constantly adding the demonstrated trajectory embeddings $\textit{as input}$ to the policy when generating trajectories to imitate, à la iterative inversion, steers the learning towards the desired trajectory distribution. To the best of our knowledge, this is the first exploration of learning control from the viewpoint of inverse problems, and our main advantage is simplicity -- we do not require rewards, and only employ supervised learning, which easily scales to state-of-the-art trajectory embedding techniques and policy representations. With a VQ-VAE embedding and a transformer-based policy, we demonstrate non-trivial continuous control on several tasks. We also report improved performance on imitating diverse behaviors compared to reward-based methods.
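As a minimal sketch of the recipe described above (the notation here -- $f$ for the forward mapping, $g_t$ for the learned inverse, $p_t$ for the current input distribution, $p^*$ for the desired output distribution, and $\ell$ for a supervised loss -- is introduced only for illustration and is not taken from the text), one iteration of iterative inversion can be written as
\begin{align*}
g_t &\in \arg\min_{g} \; \mathbb{E}_{x \sim p_t}\big[\, \ell\big(g(f(x)),\, x\big) \,\big] && \text{(fit the inverse on the current input distribution)} \\
p_{t+1} &= \text{law of } g_t(y), \quad y \sim p^* && \text{(apply the learned inverse to the desired outputs)}
\end{align*}
and the two steps are repeated. In the control setting, $x$ would correspond to a policy rollout (with exploration noise), $f(x)$ to its trajectory embedding, and $p^*$ to the distribution of demonstrated trajectory embeddings.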