As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and deployment-specific tests must be employed to ensure the desired quality in actual deployments. In this paper, we propose RecList, a behavioral-based testing methodology. RecList organizes recommender systems by use case and introduces a general plug-and-play procedure to scale up behavioral testing. We demonstrate its capabilities by analyzing known algorithms and black-box commercial systems, and we release RecList as an open source, extensible package for the community.