Prediction of the FIFA World Cup 2018 - A random forest approach with an emphasis on estimated team ability parameters

From arXiv. First revised version: corrected a typo in the introduction when referring to the winning probabilities derived by Zeileis, Leitner, and Hornik (2018), which for Germany are 15.8% instead of 12.8%. Second revised version: slight changes in notation in Section 3.3.

In this work, we compare three different modeling approaches for the scores of soccer matches with regard to their predictive performance, based on all matches from the four previous FIFA World Cups 2002-2014: Poisson regression models, random forests, and ranking methods. While the former two are based on the teams' covariate information, the latter estimates adequate ability parameters that best reflect the current strength of the teams. Within this comparison, the best-performing prediction methods on the training data turn out to be the ranking methods and the random forests. However, we show that by combining the random forest with the team ability parameters from the ranking methods as an additional covariate, we can improve the predictive power substantially. This combination of methods is chosen as the final model and, based on its estimates, the FIFA World Cup 2018 is simulated repeatedly and winning probabilities are obtained for all teams. The model slightly favors Spain ahead of the defending champion Germany. Additionally, we provide survival probabilities for all teams at all tournament stages, as well as the most probable tournament outcome.
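The repeated tournament simulation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a fitted model supplies an expected number of goals for each ordered pairing, that scores are drawn independently from Poisson distributions with those rates, and that drawn knockout matches are settled by a fair coin flip. The team labels, the `strength` values, the base rate of 1.3, and all function names are hypothetical placeholders, and the bracket covers only a four-team knockout stage rather than the full World Cup format.

```python
import math
import random
from collections import Counter


def sample_poisson(rate, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def play_knockout_match(team_a, team_b, expected_goals, rng):
    """Return the winner; a draw is settled by a coin flip (an assumption)."""
    goals_a = sample_poisson(expected_goals[(team_a, team_b)], rng)
    goals_b = sample_poisson(expected_goals[(team_b, team_a)], rng)
    if goals_a != goals_b:
        return team_a if goals_a > goals_b else team_b
    return rng.choice([team_a, team_b])


def simulate_bracket(teams, expected_goals, rng):
    """Play a single-elimination bracket; len(teams) must be a power of two."""
    remaining = list(teams)
    while len(remaining) > 1:
        remaining = [
            play_knockout_match(remaining[i], remaining[i + 1], expected_goals, rng)
            for i in range(0, len(remaining), 2)
        ]
    return remaining[0]


def winning_probabilities(teams, expected_goals, n_runs=10_000, seed=0):
    """Estimate each team's title probability as its share of simulated wins."""
    rng = random.Random(seed)
    wins = Counter(
        simulate_bracket(teams, expected_goals, rng) for _ in range(n_runs)
    )
    return {team: wins[team] / n_runs for team in teams}


# Hypothetical placeholder inputs, NOT taken from the paper: in practice the
# expected goals would come from the fitted model (e.g. a random forest that
# uses the estimated team abilities as one of its covariates).
teams = ["A", "B", "C", "D"]
strength = {"A": 1.6, "B": 1.0, "C": 1.2, "D": 0.7}
expected_goals = {
    (a, b): 1.3 * strength[a] / strength[b]
    for a in teams
    for b in teams
    if a != b
}

probabilities = winning_probabilities(teams, expected_goals, n_runs=2000, seed=1)
print(probabilities)
```

Fixing the seed makes a run reproducible. Survival probabilities for earlier stages could be obtained in the same way by also counting which teams reach each round.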
