Behavioral cloning (BC) can recover a good policy from abundant expert data, but it may fail when expert data is scarce. This paper considers the setting where, in addition to a small amount of expert data, a supplementary dataset is available that can be collected cheaply from sub-optimal policies. Imitation learning with a supplementary dataset is an emerging practical framework, but its theoretical foundation remains under-developed. To advance understanding, we first investigate a direct extension of BC, called NBCU, that learns from the union of all available data. Our analysis shows that, although NBCU suffers an imitation gap larger than that of BC in the worst case, there exist special cases where NBCU performs as well as or better than BC. This discovery implies that noisy data can also be helpful if utilized carefully. Motivated by this, we introduce a discriminator-based importance-sampling technique to re-weight the supplementary data, yielding the WBCU method. With our newly developed landscape-based analysis, we prove that WBCU can outperform BC under mild conditions. Empirical studies show that WBCU simultaneously achieves the best performance on two challenging tasks where prior state-of-the-art methods fail.
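As a rough illustration of the re-weighting idea (not the authors' exact formulation), the sketch below shows a weighted behavioral-cloning loss in which a hypothetical discriminator output `disc_probs` supplies importance weights for supplementary samples via the density-ratio estimate d/(1-d); expert samples keep weight one. All names, the clipping constants, and the discrete-action assumption are illustrative choices, not details from the paper.

```python
import torch
import torch.nn.functional as F

def weighted_bc_loss(policy_logits, actions, disc_probs, is_supplementary):
    """Sketch of a discriminator-weighted behavioral cloning loss.

    policy_logits:    (N, A) logits of the learner's policy pi(a|s)
    actions:          (N,)   actions observed in the dataset
    disc_probs:       (N,)   discriminator estimate that (s, a) is expert-like
    is_supplementary: (N,)   1 for supplementary data, 0 for expert data
    """
    # Importance weights: supplementary samples are re-weighted by the
    # (clipped) density-ratio estimate d / (1 - d); expert samples keep
    # weight 1. Clipping keeps the weights numerically stable.
    ratio = disc_probs / (1.0 - disc_probs).clamp(min=1e-6)
    weights = torch.where(is_supplementary.bool(),
                          ratio.clamp(max=10.0),
                          torch.ones_like(ratio))

    # Standard BC negative log-likelihood, weighted per sample.
    nll = F.cross_entropy(policy_logits, actions, reduction="none")
    return (weights * nll).mean()
```

In this sketch, down-weighting state-action pairs that the discriminator judges unlikely under the expert distribution is what lets noisy supplementary data contribute without overwhelming the scarce expert demonstrations.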