Food is central to human daily life. In this paper, we study how to learn structural representations for lengthy recipes, which can benefit both recipe generation and food retrieval. We mainly investigate the open research task of generating cooking instructions from food images and ingredients, which is similar to image captioning. However, unlike the targets in image captioning datasets, the target recipes are lengthy paragraphs and carry no annotations of their structure. To address these limitations, we propose a novel framework, the Structure-aware Generation Network (SGN), for the food recipe generation task. Our approach brings together several novel ideas in a systematic framework: (1) exploiting an unsupervised learning approach to obtain sentence-level tree structure labels before training; (2) generating trees of target recipes from images under the supervision of the tree structure labels learned in (1); and (3) integrating the inferred tree structures into the recipe generation procedure. Our proposed model produces high-quality, coherent recipes and achieves state-of-the-art performance on the benchmark Recipe1M dataset. We also validate the usefulness of the learned tree structures in the food cross-modal retrieval task, where the proposed model with tree representations outperforms state-of-the-art benchmark results.
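To make the notion of a sentence-level tree concrete, the following is a minimal illustrative sketch, not the SGN implementation. It assumes a binary tree whose leaves are the recipe's sentences. The names `Node`, `build_placeholder_tree`, and `leaves` are hypothetical, and the balanced split is only a stand-in for the unsupervised approach of step (1), which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Node of a hypothetical sentence-level tree; a leaf holds one sentence."""
    sentence: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_placeholder_tree(sentences: List[str]) -> Node:
    """Stand-in for step (1): a balanced split, NOT a learned structure."""
    if not sentences:
        raise ValueError("need at least one sentence")
    if len(sentences) == 1:
        return Node(sentence=sentences[0])
    mid = len(sentences) // 2
    return Node(
        left=build_placeholder_tree(sentences[:mid]),
        right=build_placeholder_tree(sentences[mid:]),
    )


def leaves(node: Node) -> List[str]:
    """Read the sentences back in left-to-right order."""
    if node.sentence is not None:
        return [node.sentence]
    return leaves(node.left) + leaves(node.right)


# Toy recipe, invented for illustration only.
recipe = ["Chop the onions.", "Heat the oil.", "Fry the onions."]
tree = build_placeholder_tree(recipe)
```

In this toy setting the tree groups sentences into nested spans while preserving their order, so a generator conditioned on such a tree, as in step (3), could in principle emit one sentence per leaf.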