Benchmarks offer a scientific way to compare algorithms using objective performance metrics. Good benchmarks have two features: (a) they should be useful to many research groups, and (b) they should produce reproducible findings. In robotic manipulation research, there is a trade-off between reproducibility and broad accessibility. If the benchmark is kept restrictive (fixed hardware, fixed objects), the numbers are reproducible but the setup becomes less general. On the other hand, a benchmark could be a loose set of protocols (e.g. object sets), but the underlying variation in setups makes the results non-reproducible. In this paper, we re-imagine a robotic manipulation benchmark as a set of state-of-the-art (SOTA) algorithmic implementations, provided alongside the usual set of tasks and experimental protocols. The added baseline implementations make it easy to recreate SOTA numbers in a new local robotic setup, and thus provide credible relative rankings between existing approaches and new work. However, these local rankings could vary between setups. To resolve this issue, we build a mechanism for pooling experimental data between labs, and thus establish a single global ranking for existing (and proposed) SOTA algorithms. Our benchmark, called Ranking-Based Robotics Benchmark (RB2), is evaluated on tasks inspired by the clinically validated Southampton Hand Assessment Procedure. The benchmark was run across two different labs and reveals several surprising findings. For example, extremely simple baselines like open-loop behavior cloning outperform more complicated models (e.g. closed-loop, RNN, Offline-RL, etc.) that are preferred by the field. We hope our fellow researchers will use RB2 to improve the quality and rigor of their research.
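The difference between local rankings and a single pooled ranking can be illustrated with a minimal Python sketch. This is not the RB2 pooling mechanism, which the abstract does not specify. It assumes each lab records binary success/failure outcomes per method and ranks methods by success rate. The lab names, method names, trial outcomes, and the helper functions `local_ranking` and `pooled_ranking` are all hypothetical.

```python
from collections import defaultdict


def local_ranking(trials):
    """Rank methods by success rate, best first.

    `trials` maps a method name to a list of 0/1 trial outcomes.
    Ties are broken alphabetically so the result is deterministic.
    """
    rates = {method: sum(outcomes) / len(outcomes)
             for method, outcomes in trials.items()}
    return sorted(rates, key=lambda method: (-rates[method], method))


def pooled_ranking(labs):
    """Concatenate every lab's trials per method, then rank once.

    Assumption: simple concatenation of outcomes stands in for the
    paper's (unspecified here) pooling mechanism.
    """
    pooled = defaultdict(list)
    for trials in labs.values():
        for method, outcomes in trials.items():
            pooled[method].extend(outcomes)
    return local_ranking(pooled)


# Hypothetical outcomes (1 = success, 0 = failure); not data from the paper.
labs = {
    "lab_a": {
        "method_a": [1, 1, 1, 1],
        "method_b": [1, 1, 0, 0],
        "method_c": [1, 0, 0, 0],
    },
    "lab_b": {
        "method_a": [1, 1, 0, 0],
        "method_b": [1, 1, 1, 0],
        "method_c": [0, 0, 1, 0],
    },
}

for lab_name, trials in labs.items():
    print(lab_name, local_ranking(trials))
print("pooled", pooled_ranking(labs))
```

In this toy data the two labs disagree on whether `method_a` or `method_b` ranks first, while pooling all trials yields one ordering. A real pooling scheme would also need to account for differences in task difficulty and trial counts between labs, which this sketch ignores.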