Goal Driven Discovery of Distributional Differences via Language Descriptions

Mining large corpora can generate useful discoveries but is time-consuming for humans. We formulate a new task, D5, that automatically discovers differences between two large corpora in a goal-driven way. The task input is a problem comprising a research goal "$\textit{comparing the side effects of drug A and drug B}$" and a corpus pair (two large collections of patients' self-reported reactions after taking each drug). The output is a language description (discovery) of how these corpora differ (patients taking drug A "$\textit{mention feelings of paranoia}$" more often). We build a D5 system, and to quantitatively measure its performance, we 1) contribute a meta-dataset, OpenD5, aggregating 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health, and 2) propose a set of unified evaluation metrics: validity, relevance, novelty, and significance. With the dataset and the unified metrics, we confirm that language models can use the goals to propose more relevant, novel, and significant candidate discoveries. Finally, our system produces discoveries previously unknown to the authors on a wide range of applications in OpenD5, including temporal and demographic differences in discussion topics, political stances and stereotypes in speech, insights in commercial reviews, and error patterns in NLP models.
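The task format described above (a research goal plus a corpus pair, with a language description as output) can be pictured with a small sketch. Everything in it is an illustrative assumption rather than the authors' system: the `D5Problem` container, the `satisfaction_rate` and `rate_difference` helpers, and the keyword-based `mentions_paranoia` judge are hypothetical names, and the corpora are invented toy samples. It only shows one simple way to check whether a candidate description holds more often in one corpus than in the other.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class D5Problem:
    """Hypothetical container: a research goal plus a corpus pair."""
    goal: str
    corpus_a: List[str]
    corpus_b: List[str]


def satisfaction_rate(corpus: List[str], holds: Callable[[str], bool]) -> float:
    """Fraction of samples in the corpus for which the description holds."""
    if not corpus:
        return 0.0
    return sum(1 for text in corpus if holds(text)) / len(corpus)


def rate_difference(problem: D5Problem, holds: Callable[[str], bool]) -> float:
    """Positive when the description holds more often in corpus A than in corpus B."""
    rate_a = satisfaction_rate(problem.corpus_a, holds)
    rate_b = satisfaction_rate(problem.corpus_b, holds)
    return rate_a - rate_b


def mentions_paranoia(text: str) -> bool:
    """Toy keyword stand-in for judging 'mentions feelings of paranoia'."""
    return "paranoi" in text.lower()


# Invented toy data, for illustration only.
problem = D5Problem(
    goal="comparing the side effects of drug A and drug B",
    corpus_a=[
        "I felt paranoid all night.",
        "Mild headache.",
        "Paranoia and nausea.",
        "No issues.",
    ],
    corpus_b=[
        "Slight dizziness.",
        "No issues.",
        "Dry mouth.",
        "I felt paranoid once.",
    ],
)

diff = rate_difference(problem, mentions_paranoia)
print(f"Rate difference (A minus B): {diff:.2f}")
```

A keyword match is a crude proxy; in practice the judgment of whether a description applies to a sample would need something far more capable, and a difference in rates on real data would also need a significance test before being reported as a discovery.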

