This paper presents a facial expression recognition model and a description generation model that together build descriptive sentences for images and for the facial expressions of people in them. Our study shows that YOLOv5 achieves better results than a traditional CNN for all emotions on the KDEF dataset: the emotion-recognition accuracies of the CNN and YOLOv5 models are 0.853 and 0.938, respectively. A model for generating image descriptions is proposed based on a merge architecture, using VGG16 for image features and an LSTM to encode the descriptions. YOLOv5 is also used to recognize the dominant colors of objects in the images and to correct the color words in the generated descriptions when necessary. If a description contains words referring to a person, we recognize the emotion of that person in the image. Finally, we combine the results of all models to create sentences that describe both the visual content and the human emotions in the images. Experimental results on the Flickr8k dataset in Vietnamese achieve BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.628, 0.425, 0.280, and 0.174, respectively.
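For context on the reported scores, a minimal sketch of the BLEU-1 metric (modified unigram precision with a brevity penalty) is shown below. This is an illustrative stdlib-only implementation, not the paper's evaluation code, and the example sentences are invented; production evaluation would typically use a library implementation with smoothing.

```python
from collections import Counter
import math

def bleu1(candidate, references):
    """BLEU-1: clipped unigram precision times a brevity penalty.

    candidate: hypothesis sentence (str); references: list of reference strs.
    """
    cand = candidate.split()
    refs = [r.split() for r in references]
    # Clip each candidate word's count by its maximum count in any reference.
    max_ref = Counter()
    for r in refs:
        for w, c in Counter(r).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision

# Hypothetical example: every candidate word appears in the reference,
# so precision is 1.0 and only the brevity penalty reduces the score.
print(bleu1("a man rides a bike", ["a man rides a red bike"]))
```

Higher-order BLEU-n scores extend the same idea to n-gram precisions combined by a geometric mean, which is why the reported values fall from 0.628 (BLEU-1) to 0.174 (BLEU-4).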