Dataset vs Reality: Understanding Model Performance from the Perspective of Information Need

Deep learning has produced many models that outperform humans on some benchmarks. An interesting question is whether these models can solve real-world problems well when those problems share the benchmark's settings (e.g., the same input and output). We argue that a model is trained to answer the same information need for which its training dataset was created. Although some datasets are structurally very similar, e.g., question-answer pairs for the question answering (QA) task and image-caption pairs for the image captioning (IC) task, not all datasets are created for the same information need. To support our argument, we conduct a comprehensive analysis of widely used benchmark datasets for both QA and IC tasks. We compare the dataset creation processes (e.g., crowdsourcing versus collecting data from real users or content providers) from the perspective of information need in the context of information retrieval. To show the differences between datasets, we perform both word-level and sentence-level analyses. At the word level, data collected from real users or content providers tend to have richer, more diverse, and more specific words than data annotated by crowdworkers. At the sentence level, data from crowdworkers share similar dependency distributions and show higher similarity in sentence structure than data collected from content providers. We believe our findings could partially explain why, for similar tasks, some datasets are considered more challenging than others. Our findings may also help guide the construction of new datasets.
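As an illustration of what a word-level comparison could look like, the sketch below computes vocabulary size and type-token ratio for two small sets of captions. The abstract does not say which measures the authors use, so the metric, the function names (`tokenize`, `lexical_stats`), and the example sentences are all assumptions made up for this sketch, not the paper's method or data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_stats(sentences):
    """Return vocabulary size, token count, and type-token ratio (TTR)."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    n_tokens = sum(counts.values())
    n_types = len(counts)
    ttr = n_types / n_tokens if n_tokens else 0.0
    return {"types": n_types, "tokens": n_tokens, "ttr": ttr}


# Hypothetical toy captions, invented for illustration only.
crowd_captions = [
    "a man riding a bike",
    "a man riding a horse",
]
provider_captions = [
    "cyclist sprints past the finish line in Lyon",
    "wild stallion gallops across the Camargue marshes",
]

if __name__ == "__main__":
    print("crowdworker-style:", lexical_stats(crowd_captions))
    print("provider-style:   ", lexical_stats(provider_captions))
```

Type-token ratio is sensitive to corpus size, so a comparison across real datasets would need equal-sized samples or a length-normalized measure.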
