Dataset vs Reality: Understanding Model Performance from the Perspective of Information Need

Deep learning technologies have brought us many models that outperform human beings on a few benchmarks. An interesting question is: can these models solve real-world problems well when those problems share similar settings (e.g., identical input/output) with the benchmark datasets? We argue that a model is trained to answer the same information need for which its training dataset was created. Although some datasets may share high structural similarity, e.g., question-answer pairs for the question answering (QA) task and image-caption pairs for the image captioning (IC) task, they may represent different research tasks that aim to answer different information needs. To support our argument, we use the QA task and the IC task as two case studies and compare their widely used benchmark datasets. From the perspective of information need in the context of information retrieval, we show the differences in the dataset creation processes and in the morphosyntactic properties of the datasets. These differences can be attributed to the different information needs of the specific research tasks. We encourage all researchers to consider the information need perspective of a research task before using a dataset to train a model. Likewise, when creating a dataset, researchers may also use the information need perspective as a factor in determining how accurately the dataset reflects the research task they intend to tackle.

