Online competitive games have become a mainstream entertainment platform. To create a fair and exciting experience, these games rely on rating systems to match players of similar skill. While a growing body of research aims to improve the performance of these systems, less attention has been paid to how that performance is evaluated. In this paper, we explore the utility of several metrics for evaluating three popular rating systems on a real-world dataset of over 25,000 team battle royale matches. Our results reveal considerable differences in the metrics' evaluation patterns. Some metrics were heavily affected by the inclusion of new players, and many could not capture real differences between certain groups of players. Among all the metrics studied, normalized discounted cumulative gain (NDCG) demonstrated more reliable performance and greater flexibility: it alleviated most of the challenges faced by the other metrics while adding the freedom to adjust the focus of the evaluation toward different groups of players.