Identifying analytic tasks from free text is critical for visualization-oriented natural language interfaces (V-NLIs) to suggest effective visualizations. However, it is challenging due to the ambiguous and complex nature of human language. To address this challenge, we present a new dataset, called Quda, that aims to help V-NLIs recognize analytic tasks from free-form natural language by training and evaluating cutting-edge multi-label classification models. Our dataset contains $14,035$ diverse user queries, each annotated with one or more analytic tasks. We built the dataset by first gathering seed queries with data analysts and then employing a large crowd workforce for paraphrase generation and validation. We demonstrate the usefulness of Quda through three applications. This work is the first attempt to construct a large-scale corpus for recognizing analytic tasks. We hope that the release of Quda will boost the research and development of V-NLIs in data analysis and visualization.