Data is often assumed to be simply given, but reality shows otherwise: it is costly, situation-dependent, and fraught with dilemmas, and it constantly requires human intervention. The ML community's focus on data quality is growing accordingly, since good data is vital to successful ML systems. Nonetheless, few works have investigated dataset builders themselves: what they specifically do, and what they struggle with, in order to make good data. In this study, through semi-structured interviews with 19 ML experts, we present what humans actually do and consider at each step of the data construction pipeline. We further organize their struggles under three themes: 1) trade-offs arising from real-world constraints; 2) harmonizing assorted data workers for consistency; 3) the necessity of human intuition and tacit knowledge for processing data. Finally, we discuss why such struggles are inevitable for good data and what practitioners aspire to, with the aim of providing systematic support for data work.