The shift towards publicly available text sources has enabled language processing at unprecedented scale, yet it leaves underserved the domains where public and openly licensed data is scarce. Proactively collecting text data for research is a viable strategy to address this scarcity, but it lacks a systematic methodology that takes into account the many ethical, legal and confidentiality-related aspects of data collection. Our work presents a case study on proactive data collection in peer review -- a challenging and under-resourced NLP domain. We outline ethical and legal desiderata for proactive data collection and introduce "Yes-Yes-Yes", the first donation-based peer reviewing data collection workflow that meets these requirements. We report on the implementation of Yes-Yes-Yes at ACL Rolling Review and empirically study the implications of proactive data collection for dataset size and the biases induced by donation behavior on the peer reviewing platform.