In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. In general, one mistake during learning can lead to completely different events. In the special setting of environments that restart, existing work provides formal guidance on how to imitate so that events unfold similarly, but outside that setting, no formal guidance exists. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes, and we allow our imitator to learn online from the demonstrator. Our new conservative Bayesian imitation learner underestimates the probability of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency. If any such event qualifies as "dangerous", our imitator would have the notable distinction of being relatively "safe".
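To make the underestimate-and-query mechanism described above concrete, here is a minimal Python sketch of a single decision step. It is an illustrative assumption, not the paper's exact construction: the finite model class, the `alpha` rule for selecting a plausible set of models, and all names (`conservative_imitate`, `models`, `posterior`) are hypothetical.

```python
import random

def conservative_imitate(history, models, posterior, actions, alpha=0.5):
    """One step of a conservative Bayesian imitator (illustrative sketch).

    `models` maps a model name to a function returning that model's
    action-probability dict for the current history; `posterior` maps
    model names to posterior weights. All names here are hypothetical.
    """
    # Keep only models retaining at least an `alpha` fraction of the
    # maximum-a-posteriori model's weight (one way to pick a "plausible
    # set"; the paper's exact construction may differ).
    top_weight = max(posterior.values())
    plausible = [m for m, w in posterior.items() if w >= alpha * top_weight]

    # Underestimate each action's probability by taking the minimum
    # over all plausible models of the demonstrator. Since each model's
    # probabilities sum to 1, these minima sum to at most 1.
    under = {a: min(models[m](history)[a] for m in plausible) for a in actions}

    # The leftover probability mass becomes the chance of querying
    # the demonstrator for more data instead of acting.
    query_prob = 1.0 - sum(under.values())

    r = random.random()
    if r < query_prob:
        return "QUERY"  # defer to the demonstrator
    # Otherwise sample an action in proportion to its underestimate.
    r -= query_prob
    for a in actions:
        if r < under[a]:
            return a
        r -= under[a]
    return actions[-1]  # numerical fallback
```

Under these assumptions, the behavior matches the abstract's claims: as the posterior concentrates, the plausible set shrinks, the underestimates tighten toward the true action probabilities, and the query probability falls, so queries diminish in frequency over time.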