Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.