Recent research has shown that language models exploit `artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain-, model-, task-, and metric-agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate VAIDA via expert review and a user study with NASA-TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both user groups, with a 45.8% decrease in the level of artifacts in created samples. As a by-product of our user study, we observe that the created samples are adversarial across models, leading to performance decreases of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).
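To make the human-and-metric-in-the-loop idea concrete, the following is a minimal, hypothetical sketch of such a feedback loop, not VAIDA's actual implementation: the artifact metric (`score_artifacts`), the suggestion generator (`suggest_fixes`), and the acceptance threshold (`QUALITY_THRESHOLD`) are toy stand-ins that only illustrate where a quality metric and real-time feedback plug into sample creation.

```python
# Hypothetical sketch of a human-and-metric-in-the-loop creation workflow.
# All names below are illustrative stand-ins, not VAIDA's API.
from dataclasses import dataclass

QUALITY_THRESHOLD = 0.8  # assumed acceptance bar for illustration


@dataclass
class Feedback:
    artifact_score: float    # lower = fewer exploitable artifacts
    suggestions: list[str]   # human-readable correction hints


def score_artifacts(sample: str) -> float:
    """Toy artifact metric: fraction of giveaway negation tokens.

    A real system would use trained metrics; this heuristic only shows
    where such a metric sits in the loop.
    """
    giveaways = {"not", "never", "no"}
    tokens = sample.lower().split()
    return sum(t in giveaways for t in tokens) / max(len(tokens), 1)


def suggest_fixes(score: float) -> list[str]:
    return [] if score == 0 else ["Rephrase to avoid giveaway negation cues."]


def review_loop(get_sample) -> str:
    """Return feedback to the crowdworker until the sample clears the bar."""
    while True:
        sample = get_sample()
        score = score_artifacts(sample)
        if 1.0 - score >= QUALITY_THRESHOLD:
            return sample  # accepted into the benchmark
        fb = Feedback(artifact_score=score, suggestions=suggest_fixes(score))
        print(f"artifact score {fb.artifact_score:.2f}: {fb.suggestions}")


# Example: a stub "crowdworker" that revises once after feedback.
drafts = iter(["The answer is not never no.", "The capital of France is Paris."])
print("accepted:", review_loop(lambda: next(drafts)))
```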