Large industrial machines such as waste cranes in waste incineration plants often operate in weakly observable environments, where the observations contain little information about the environmental state because of technical difficulty or maintenance cost (e.g., no sensors observe the state of the garbage to be handled). Based on the finding that skilled operators in such environments choose predetermined control strategies (e.g., grasping and scattering) and their durations based on sensor values, thereby improving the robustness of their actions, we propose a novel non-parametric policy search algorithm: Gaussian process self-triggered policy search (GPSTPS). GPSTPS has two types of control policies: an action policy and a duration policy. A gating mechanism either maintains the action selected by the action policy for the duration specified by the duration policy or updates the action and duration by passing new observations to the policies; the method is therefore categorized as self-triggered. GPSTPS learns both policies simultaneously by trial and error, using sparse GP priors and variational learning to maximize the return. To verify the performance of the proposed method, we conducted experiments on a garbage-grasping-scattering task for a waste crane with weak observations, using both a simulation and a robotic waste crane system. In these experiments, the proposed method acquired suitable policies that determine the action and duration based on the garbage's characteristics.
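The gating mechanism described above can be illustrated with a minimal sketch. It covers only the self-triggered control loop, not GPSTPS itself: the learned GP policies are replaced by hand-written placeholder functions, and every name here (`run_self_triggered_episode`, `toy_observe`, `toy_action`, `toy_duration`) is hypothetical and does not come from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Observation = Sequence[float]


def run_self_triggered_episode(
    observe: Callable[[int], Observation],
    action_policy: Callable[[Observation], str],
    duration_policy: Callable[[Observation], int],
    horizon: int,
) -> List[Tuple[int, str, bool]]:
    """Return (time step, action, triggered) for every step of one episode."""
    log: List[Tuple[int, str, bool]] = []
    action: Optional[str] = None
    remaining = 0  # steps left before the gate opens again
    for t in range(horizon):
        triggered = remaining == 0
        if triggered:
            # Gate open: pass a new observation to both policies.
            obs = observe(t)
            action = action_policy(obs)
            # Assumption: durations are whole steps and at least 1.
            remaining = max(1, int(duration_policy(obs)))
        # Gate closed: the previous action is simply held.
        log.append((t, action, triggered))
        remaining -= 1
    return log


# Placeholder policies (assumptions for illustration, not learned GP policies).
def toy_observe(t: int) -> Observation:
    return [float(t)]


def toy_action(obs: Observation) -> str:
    return "grasp" if obs[0] < 3 else "scatter"


def toy_duration(obs: Observation) -> int:
    return 3 if obs[0] < 3 else 2


log = run_self_triggered_episode(toy_observe, toy_action, toy_duration, horizon=7)
trigger_steps = [t for t, _, triggered in log if triggered]
print(trigger_steps)  # [0, 3, 5]
```

In this toy run the policies are queried only at steps 0, 3, and 5; between those steps the selected action is held, which is the behavior that makes the scheme self-triggered.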