Recent advances in ML suggest that the quantity of data available to a model is one of the primary bottlenecks to high performance. Although for language-based tasks there exist almost unlimited amounts of reasonably coherent text to train on, this is generally not the case in Reinforcement Learning, especially when dealing with a novel environment. Indeed, even a relatively trivial continuous environment has an almost limitless number of states, but simply sampling random states and actions will likely not yield transitions that are interesting or useful for any potential downstream task. How should one generate massive amounts of useful data given only an MDP with no indication of downstream tasks? Are the quantity and quality of data truly transformative for the performance of a general controller? We propose to answer both of these questions. First, we introduce ChronoGEM, a principled unsupervised exploration method that aims for uniform coverage over the manifold of achievable states, which we argue is the most reasonable objective in the absence of prior task information. Second, we investigate the effects of both data quantity and data quality on the training of a downstream goal-achievement policy, and show that both large quantities and high quality of data are essential to train a general controller: a high-precision pose-achievement policy capable of attaining a large number of poses across numerous continuous-control embodiments, including a humanoid.