Food computing is a fast-growing field of research, and natural language processing (NLP) is increasingly essential to it, especially for recognising food entities. However, there are still only a few well-defined tasks that serve as benchmarks for solutions in this area. We introduce a new dataset, \textit{TASTEset}, to bridge this gap. In this dataset, Named Entity Recognition (NER) models are expected to find or infer various types of entities helpful in processing recipes, e.g.~food products, quantities and their units, names of cooking processes, physical qualities of ingredients, their purpose, and taste. The dataset consists of 700 recipes with more than 13,000 entities to extract. We provide several state-of-the-art NER baselines, which show that our dataset poses a solid challenge to existing models. The best model achieved an average $F_1$ score of 0.95, with scores for individual entity types ranging from 0.781 to 0.982. We share the dataset and the task to encourage progress on more in-depth and complex information extraction from recipes.
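To make the per-entity-type $F_1$ figures concrete, the following is a minimal sketch of exact-match, span-level $F_1$ computed separately for each entity type. It is an illustration only: the abstract does not state which matching scheme the baselines use, so exact span matching is an assumption here, and the function name `per_type_f1`, the label names, and the toy spans are all hypothetical rather than taken from TASTEset.

```python
def per_type_f1(gold, pred):
    """Exact-match F1 per entity type.

    gold, pred: sets of (start, end, label) character spans.
    A predicted span counts as correct only if its offsets and label
    both match a gold span exactly (an assumed scoring scheme).
    """
    labels = {label for _, _, label in gold | pred}
    scores = {}
    for label in sorted(labels):
        g = {span for span in gold if span[2] == label}
        p = {span for span in pred if span[2] == label}
        tp = len(g & p)  # true positives: spans present in both sets
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy ingredient line (hypothetical annotation): "2 cups chopped onions"
gold = {(0, 1, "QUANTITY"), (2, 6, "UNIT"), (7, 14, "PROCESS"), (15, 21, "FOOD")}
# A hypothetical model output that merges "chopped onions" into one FOOD span
pred = {(0, 1, "QUANTITY"), (2, 6, "UNIT"), (15, 21, "FOOD"), (7, 21, "FOOD")}

scores = per_type_f1(gold, pred)
for label, f1 in scores.items():
    print(f"{label}: {f1:.3f}")
```

In this toy case the missed PROCESS span and the spurious merged FOOD span lower only the scores for those two types. This is the kind of per-type spread that a single averaged score hides.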