The field of data science currently enjoys a broad definition that includes a wide array of activities borrowed from many other established fields of study. Such a vague characterization may be natural in the early stages of a field, but over time maintaining so broad a definition becomes unwieldy and impedes progress. In particular, the teaching of data science is hampered by the seeming need to cover many different points of interest. Data scientists must ultimately identify the core of the field by determining what makes it unique and what it means to develop new knowledge in data science. In this review we attempt to distill some core ideas from data science by focusing on the iterative process of data analysis and developing some generalizations from past experience. Generalizations of this nature could form the basis of a theory of data science and would serve to unify the teaching of data science and scale it to large audiences.