Recent research has shown that language models exploit `artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain-, model-, task-, and metric-agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate VAIDA via expert review and a user study with NASA-TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both user groups, with a 45.8% decrease in the level of artifacts in created samples. As a by-product of our user study, we observe that the created samples are adversarial across models, leading to performance decreases of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).
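To make the human-and-metric-in-the-loop idea concrete, the following is a minimal, hypothetical sketch of such a feedback loop, not VAIDA's actual implementation: the artifact metric (`score_artifacts`), the suggestion generator (`suggest_fixes`), and the acceptance threshold (`QUALITY_THRESHOLD`) are toy stand-ins that only illustrate where a quality metric and real-time feedback plug into sample creation.

```python
# Hypothetical sketch of a human-and-metric-in-the-loop creation workflow.
# All names below are illustrative stand-ins, not VAIDA's API.
from dataclasses import dataclass

QUALITY_THRESHOLD = 0.8  # assumed acceptance bar for illustration


@dataclass
class Feedback:
    artifact_score: float    # lower = fewer exploitable artifacts
    suggestions: list[str]   # human-readable correction hints


def score_artifacts(sample: str) -> float:
    """Toy artifact metric: fraction of giveaway negation tokens.

    A real system would use trained metrics; this heuristic only shows
    where such a metric sits in the loop.
    """
    giveaways = {"not", "never", "no"}
    tokens = sample.lower().split()
    return sum(t in giveaways for t in tokens) / max(len(tokens), 1)


def suggest_fixes(score: float) -> list[str]:
    return [] if score == 0 else ["Rephrase to avoid giveaway negation cues."]


def review_loop(get_sample) -> str:
    """Return feedback to the crowdworker until the sample clears the bar."""
    while True:
        sample = get_sample()
        score = score_artifacts(sample)
        if 1.0 - score >= QUALITY_THRESHOLD:
            return sample  # accepted into the benchmark
        fb = Feedback(artifact_score=score, suggestions=suggest_fixes(score))
        print(f"artifact score {fb.artifact_score:.2f}: {fb.suggestions}")


# Example: a stub "crowdworker" that revises once after feedback.
drafts = iter(["The answer is not never no.", "The capital of France is Paris."])
print("accepted:", review_loop(lambda: next(drafts)))
```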