Training novice users to operate an excavator and acquire different skills requires the presence of expert teachers. Given the complexity of the task, finding skilled experts is comparatively expensive, as the process is time-consuming and demands sustained attention. Moreover, because human evaluators tend to be biased, the evaluation process is noisy and leads to high variance in the final scores of operators with similar skills. In this work, we address these issues and propose a novel strategy for the automatic evaluation of excavator operators. We take into account the internal dynamics of the excavator and the safety criterion at every time step to evaluate performance. To further validate our approach, we use this score prediction model as a source of reward for a reinforcement learning agent learning to maneuver an excavator in a simulated environment that closely replicates real-world dynamics. Our results demonstrate that a policy learned with these external reward prediction models yields safer solutions that respect the required dynamic constraints, compared to a policy trained with task-based reward functions only, bringing it one step closer to real-life adoption. To support future research, we release our codebase at https://github.com/pranavAL/InvRL_Auto-Evaluate and video results at https://drive.google.com/file/d/1jR1otOAu8zrY8mkhUOUZW9jkBOAKK71Z/view?usp=share_link .
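As an illustrative aside, below is a minimal, self-contained Python sketch of the idea summarized above: a learned score-prediction model supplies the per-step reward for a reinforcement learning agent in place of a hand-crafted task reward. All names (ScorePredictor, ExcavatorEnvWithLearnedReward), the placeholder dynamics, and the placeholder scoring rule are hypothetical assumptions for illustration and are not taken from the released codebase.

```python
# Conceptual sketch (not the authors' implementation) of using a learned
# score-prediction model as the external reward signal at every time step.
import random


class ScorePredictor:
    """Hypothetical stand-in for the learned operator-score model. It maps the
    excavator's internal dynamics and a safety signal at the current time step
    to a scalar score that is used as the reward."""

    def predict(self, dynamics, safety_violation):
        # Placeholder scoring: penalize safety violations, reward smooth motion.
        smoothness = -sum(abs(v) for v in dynamics) / len(dynamics)
        return smoothness - (10.0 if safety_violation else 0.0)


class ExcavatorEnvWithLearnedReward:
    """Hypothetical environment wrapper that replaces a task-based reward with
    the score predicted by the learned model at every step."""

    def __init__(self, score_model, horizon=200):
        self.score_model = score_model
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0]  # placeholder observation (e.g., joint velocities)

    def step(self, action):
        self.t += 1
        # Placeholder dynamics: next state is a noisy echo of the action.
        state = [a + random.gauss(0.0, 0.05) for a in action]
        safety_violation = max(abs(s) for s in state) > 1.0
        reward = self.score_model.predict(state, safety_violation)
        done = self.t >= self.horizon
        return state, reward, done


if __name__ == "__main__":
    env = ExcavatorEnvWithLearnedReward(ScorePredictor())
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = [random.uniform(-0.5, 0.5) for _ in state]  # random policy
        state, reward, done = env.step(action)
        total += reward
    print(f"episode return under learned reward: {total:.2f}")
```

In the paper's setting, an RL algorithm would optimize a policy against this learned reward rather than the random policy used here for demonstration.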