Machine learning has the potential to fuel further advances in data science, but it is greatly hindered by an ad hoc design process, poor data hygiene, and a lack of statistical rigor in model evaluation. Recently, these issues have begun to attract more attention as they have caused public and embarrassing issues in research and development. Drawing from our experience as machine learning researchers, we follow the machine learning process from algorithm design to data collection to model evaluation, drawing attention to common pitfalls and providing practical recommendations for improvements. At each step, case studies are introduced to highlight how these pitfalls occur in practice, and where things could be improved.
翻译:机器学习有可能推动数据科学的进一步发展,但受到特别设计过程、数据卫生差和模型评估缺乏统计严谨性等极大阻碍。 最近,这些问题开始引起更多关注,因为它们在研发过程中引起了公众和尴尬的问题。 根据我们作为机器学习研究人员的经验,我们遵循机器学习过程,从算法设计到数据收集到模型评估,提请注意常见的缺陷,并提出切实可行的改进建议。 每一步,都进行个案研究,以突出这些缺陷如何在实践中发生,哪些方面可以改进。