教程 | 从头开始在Python中开发深度学习字幕生成模型

会员服务 ·

教程 | 从头开始在Python中开发深度学习字幕生成模型

2017 年 12 月 12 日 机器之心

选自Machine Learning Mastery

作者：Jason Brownlee

机器之心编译

参与：路雪、蒋思源

本文从数据预处理开始详细地描述了如何使用 VGG 和循环神经网络构建图像描述系统，对读者使用 Keras 和 TensorFlow 理解与实现自动图像描述很有帮助。本文的代码都有解释，非常适合图像描述任务的入门读者详细了解这一过程。

图像描述是一个有挑战性的人工智能问题，涉及为给定图像生成文本描述。字幕生成是一个有挑战性的人工智能问题，涉及为给定图像生成文本描述。

一般图像描述或字幕生成需要使用计算机视觉方法来了解图像内容，也需要自然语言处理模型将对图像的理解转换成正确顺序的文字。近期，深度学习方法在该问题的多个示例上获得了顶尖结果。

深度学习方法在字幕生成问题上展现了顶尖的结果。这些方法最令人印象深刻的地方：给定一个图像，我们无需复杂的数据准备和特殊设计的流程，就可以使用端到端的方式预测字幕。

本教程将介绍如何从头开发能生成图像字幕的深度学习模型。

完成本教程，你将学会：

如何为训练深度学习模型准备图像和文本数据。
如何设计和训练深度学习字幕生成模型。
如何评估一个训练后的字幕生成模型，并使用它为全新的图像生成字幕。

教程概览

该教程共分为 6 部分：

1. 图像和字幕数据集

2. 准备图像数据

3. 准备文本数据

4. 开发深度学习模型

5. 评估模型

6. 生成新的图像字幕

Python 环境

本教程假设你已经安装了 Python SciPy 环境，该环境完美适合 Python 3。你必须安装 Keras（2.0 版本或更高），TensorFlow 或 Theano 后端。本教程还假设你已经安装了 scikit-learn、Pandas、NumPy 和 Matplotlib 等科学计算与绘图软件库。

我推荐在 GPU 系统上运行代码。你可以在 Amazon Web Services 上用廉价的方式获取 GPU：如何在 AWS GPU 上运行 Jupyter noterbook？

图像和字幕数据集

图像字幕生成可使用的优秀数据集有 Flickr8K 数据集。原因在于它逼真且相对较小，即使你的工作站使用的是 CPU 也可以下载它，并用于构建模型。

对该数据集的明确描述见 2013 年的论文《Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics》。

作者对该数据集的描述如下：

我们介绍了一种用于基于句子的图像描述和搜索的新型基准集合，包括 8000 张图像，每个图像有五个不同的字幕描述对突出实体和事件提供清晰描述。

图像选自六个不同的 Flickr 组，往往不包含名人或有名的地点，而是手动选择多种场景和情形。

该数据集可免费获取。你必须填写一份申请表，然后就可以通过电子邮箱收到数据集。申请表链接：https://illinois.edu/fb/sec/1713398。

很快，你会收到电子邮件，包含以下两个文件的链接：

Flickr8k_Dataset.zip（1 Gigabyte）包含所有图像。
Flickr8k_text.zip（2.2 Megabytes）包含所有图像文本描述。

下载数据集，并在当前工作文件夹里进行解压缩。你将得到两个目录：

Flicker8k_Dataset：包含 8092 张 JPEG 格式图像。
Flickr8k_text：包含大量不同来源的图像描述文件。

该数据集包含一个预制训练数据集（6000 张图像）、开发数据集（1000 张图像）和测试数据集（1000 张图像）。

用于评估模型技能的一个指标是 BLEU 值。对于推断，下面是一些精巧的模型在测试数据集上进行评估时获得的大概 BLEU 值（来源：2017 年论文《Where to put the Image in an Image Caption Generator》）：

BLEU-1: 0.401 to 0.578.
BLEU-2: 0.176 to 0.390.
BLEU-3: 0.099 to 0.260.
BLEU-4: 0.059 to 0.170.

稍后在评估模型部分将详细介绍 BLEU 值。下面，我们来看一下如何加载图像。

准备图像数据

我们将使用预训练模型解析图像内容，且目前有很多可选模型。在这种情况下，我们将使用 Oxford Visual Geometry Group 或 VGG（该模型赢得了 2014 年 ImageNet 竞赛冠军）。

Keras 可直接提供该预训练模型。注意，第一次使用该模型时，Keras 将从互联网上下载模型权重，大概 500Megabytes。这可能需要一段时间（时间长度取决于你的网络连接）。

我们可以将该模型作为更大的图像字幕生成模型的一部分。问题在于模型太大，每次我们想测试新语言模型配置（下行）时在该网络中运行每张图像非常冗余。

我们可以使用预训练模型对「图像特征」进行预计算，并保存至文件中。然后加载这些特征，将其馈送至模型中作为数据集中给定图像的描述。在完整的 VGG 模型中运行图像也是这样，我们需要提前运行该步骤。

优化可以加快模型训练过程，消耗更少内存。我们可以使用 VGG class 在 Keras 中运行 VGG 模型。我们将移除加载模型的最后一层，因为该层用于预测图像的分类。我们对图像分类不感兴趣，我们感兴趣的是分类之前图像的内部表征。这些就是模型从图像中提取出的「特征」。

Keras 还提供工具将加载图像改造成模型的偏好大小（如 3 通道 224 x 224 像素图像）。

下面是 extract_features() 函数，即给出一个目录名，该函数将加载每个图像、为 VGG 准备图像数据，并从 VGG 模型中收集预测到的特征。图像特征是包含 4096 个元素的向量，该函数向图像特征返回一个图像标识符（identifier）词典。

# extract features from each photo in the directorydef extract_features(directory): # load the model model = VGG16() # re-structure the model model.layers.pop() model = Model(inputs=model.inputs, outputs=model.layers[-1].output) # summarize print(model.summary()) # extract features from each photo features = dict() for name in listdir(directory): # load an image from file filename = directory + '/' + name image = load_img(filename, target_size=(224, 224)) # convert the image pixels to a numpy array image = img_to_array(image) # reshape data for the model image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2])) # prepare the image for the VGG model image = preprocess_input(image) # get features feature = model.predict(image, verbose=0) # get image id image_id = name.split('.')[0] # store feature features[image_id] = feature print('>%s' % name) return features

我们调用该函数为模型测试准备图像数据，然后将词典保存至 features.pkl 文件。

完整示例如下：

from os import listdirfrom pickle import dumpfrom keras.applications.vgg16 import VGG16from keras.preprocessing.image import load_imgfrom keras.preprocessing.image import img_to_arrayfrom keras.applications.vgg16 import preprocess_inputfrom keras.models import Model# extract features from each photo in the directorydef extract_features(directory): # load the model model = VGG16() # re-structure the model model.layers.pop() model = Model(inputs=model.inputs, outputs=model.layers[-1].output) # summarize print(model.summary()) # extract features from each photo features = dict() for name in listdir(directory): # load an image from file filename = directory + '/' + name image = load_img(filename, target_size=(224, 224)) # convert the image pixels to a numpy array image = img_to_array(image) # reshape data for the model image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2])) # prepare the image for the VGG model image = preprocess_input(image) # get features feature = model.predict(image, verbose=0) # get image id image_id = name.split('.')[0] # store feature features[image_id] = feature print('>%s' % name) return features# extract features from all imagesdirectory = 'Flicker8k_Dataset'features = extract_features(directory) print('Extracted Features: %d' % len(features))# save to filedump(features, open('features.pkl', 'wb'))

运行该数据准备步骤可能需要一点时间，时间长度取决于你的硬件，带有 CPU 的现代工作站可能需要一个小时。

运行结束时，提取出的特征将存储在 features.pkl 文件中以备后用。该文件大概 127 Megabytes 大小。

准备文本数据

该数据集中每个图像有多个描述，文本描述需要进行最低限度的清洗。首先，加载包含所有文本描述的文件。

# load doc into memorydef load_doc(filename): # open the file as read only file = open(filename, 'r') # read all text text = file.read() # close the file file.close() return text filename = 'Flickr8k_text/Flickr8k.token.txt'# load descriptionsdoc = load_doc(filename)

每个图像有一个独有的标识符，该标识符出现在文件名和文本描述文件中。

接下来，我们将逐步对图像描述进行操作。下面定义一个 load_descriptions() 函数：给出一个需要加载的文本文档，该函数将返回图像标识符词典。每个图像标识符映射到一或多个文本描述。

# extract descriptions for imagesdef load_descriptions(doc): mapping = dict() # process lines for line in doc.split('\n'): # split line by white space tokens = line.split() if len(line) < 2: continue # take the first token as the image id, the rest as the description image_id, image_desc = tokens[0], tokens[1:] # remove filename from image id image_id = image_id.split('.')[0] # convert description tokens back to string image_desc = ' '.join(image_desc) # create the list if needed if image_id not in mapping: mapping[image_id] = list() # store description mapping[image_id].append(image_desc) return mapping# parse descriptionsdescriptions = load_descriptions(doc) print('Loaded: %d ' % len(descriptions))

下面，我们需要清洗描述文本。因为描述已经经过符号化，所以它十分易于处理。

我们将用以下方式清洗文本，以减少需要处理的词汇量：

所有单词全部转换成小写。
移除所有标点符号。
移除所有少于或等于一个字符的单词（如 a）。
移除所有带数字的单词。

下面定义了 clean_descriptions() 函数：给出描述的图像标识符词典，遍历每个描述，清洗文本。

import stringdef clean_descriptions(descriptions): # prepare translation table for removing punctuation table = str.maketrans('', '', string.punctuation) for key, desc_list in descriptions.items(): for i in range(len(desc_list)): desc = desc_list[i] # tokenize desc = desc.split() # convert to lower case desc = [word.lower() for word in desc] # remove punctuation from each token desc = [w.translate(table) for w in desc] # remove hanging 's' and 'a' desc = [word for word in desc if len(word)>1] # remove tokens with numbers in them desc = [word for word in desc if word.isalpha()] # store as string desc_list[i] =  ' '.join(desc)# clean descriptionsclean_descriptions(descriptions)

清洗后，我们可以总结词汇量。

理想情况下，我们希望使用尽可能少的词汇而得到强大的表达性。词汇越少则模型越小、训练速度越快。

对于推断，我们可以将干净的描述转换成一个集，将它的规模打印出来，这样就可以了解我们的数据集词汇量的大小了。

# convert the loaded descriptions into a vocabulary of wordsdef to_vocabulary(descriptions): # build a list of all description strings all_desc = set() for key in descriptions.keys(): [all_desc.update(d.split()) for d in descriptions[key]] return all_desc# summarize vocabularyvocabulary = to_vocabulary(descriptions) print('Vocabulary Size: %d' % len(vocabulary))

最后，我们保存图像标识符词典和描述至一个新文本 descriptions.txt，该文件中每行只有一个图像和一个描述。

下面我们定义了 save_doc() 函数，即给出一个包含标识符和描述之间映射的词典和文件名，将该映射保存至文件中。

# save descriptions to file, one per linedef save_descriptions(descriptions, filename): lines = list() for key, desc_list in descriptions.items(): for desc in desc_list: lines.append(key + ' ' + desc) data = '\n'.join(lines) file = open(filename, 'w') file.write(data) file.close()# save descriptionssave_doc(descriptions, 'descriptions.txt')

汇总起来，完整的函数定义如下所示：

import string# load doc into memorydef load_doc(filename): # open the file as read only file = open(filename, 'r') # read all text text = file.read() # close the file file.close() return text# extract descriptions for imagesdef load_descriptions(doc): mapping = dict() # process lines for line in doc.split('\n'): # split line by white space tokens = line.split() if len(line) < 2: continue # take the first token as the image id, the rest as the description image_id, image_desc = tokens[0], tokens[1:] # remove filename from image id image_id = image_id.split('.')[0] # convert description tokens back to string image_desc = ' '.join(image_desc) # create the list if needed if image_id not in mapping: mapping[image_id] = list() # store description mapping[image_id].append(image_desc) return mappingdef clean_descriptions(descriptions): # prepare translation table for removing punctuation table = str.maketrans('', '', string.punctuation) for key, desc_list in descriptions.items(): for i in range(len(desc_list)): desc = desc_list[i] # tokenize desc = desc.split() # convert to lower case desc = [word.lower() for word in desc] # remove punctuation from each token desc = [w.translate(table) for w in desc] # remove hanging 's' and 'a' desc = [word for word in desc if len(word)>1] # remove tokens with numbers in them desc = [word for word in desc if word.isalpha()] # store as string desc_list[i] =  ' '.join(desc)# convert the loaded descriptions into a vocabulary of wordsdef to_vocabulary(descriptions): # build a list of all description strings all_desc = set() for key in descriptions.keys(): [all_desc.update(d.split()) for d in descriptions[key]] return all_desc# save descriptions to file, one per linedef save_descriptions(descriptions, filename): lines = list() for key, desc_list in descriptions.items(): for desc in desc_list: lines.append(key + ' ' + desc) data = '\n'.join(lines) file = open(filename, 'w') file.write(data) file.close() filename = 'Flickr8k_text/Flickr8k.token.txt'# load descriptionsdoc = load_doc(filename)# parse descriptionsdescriptions = load_descriptions(doc) print('Loaded: %d ' % len(descriptions))# clean descriptionsclean_descriptions(descriptions)# summarize vocabularyvocabulary = to_vocabulary(descriptions) print('Vocabulary Size: %d' % len(vocabulary))# save to filesave_descriptions(descriptions, 'descriptions.txt')

运行示例首先打印出加载图像描述的数量（8092）和干净词汇量的规模（8763 个单词）。

Loaded: 8,092Vocabulary Size: 8,763

最后，把干净的描述写入 descriptions.txt。

查看文件，我们能够看到该描述可用于建模。文件中描述的顺序可能会发生改变。

2252123185_487f21e336 bunch on people are seated in stadium2252123185_487f21e336 crowded stadium is full of people watching an event2252123185_487f21e336 crowd of people fill up packed stadium2252123185_487f21e336 crowd sitting in an indoor stadium2252123185_487f21e336 stadium full of people watch game ...

开发深度学习模型

本节我们将定义深度学习模型，在训练数据集上进行拟合。本节分为以下几部分：

1. 加载数据。

2. 定义模型。

3. 拟合模型。

4. 完成示例。

加载数据

首先，我们必须加载准备好的图像和文本数据来拟合模型。

我们将在训练数据集中的所有图像和描述上训练数据。训练过程中，我们计划在开发数据集上监控模型性能，使用该性能确定什么时候保存模型至文件。

训练和开发数据集已经预制好，并分别保存在 Flickr_8k.trainImages.txt 和 Flickr_8k.devImages.txt 文件中，二者均包含图像文件名列表。从这些文件名中，我们可以提取图像标识符，并使用它们为每个集过滤图像和描述。

如下所示，load_set() 函数将根据训练或开发集文件名加载一个预定义标识符集。

# load doc into memorydef load_doc(filename): # open the file as read only file = open(filename, 'r') # read all text text = file.read() # close the file file.close() return text# load a pre-defined list of photo identifiersdef load_set(filename): doc = load_doc(filename) dataset = list() # process line by line for line in doc.split('\n'): # skip empty lines if len(line) < 1: continue # get the image identifier identifier = line.split('.')[0] dataset.append(identifier) return set(dataset)

现在，我们可以使用预定义训练或开发标识符集加载图像和描述了。

下面是 load_clean_descriptions() 函数，该函数从给定标识符集的 descriptions.txt 中加载干净的文本描述，并向文本描述列表返回标识符词典。

我们将要开发的模型能够生成给定图像的字幕，一次生成一个单词。先前生成的单词序列作为输入。因此，我们需要一个 first word 来开启生成步骤和一个 last word 来表示字幕生成结束。

我们将使用字符串 startseq 和 endseq 完成该目的。这些标记被添加至加载描述，像它们本身就是加载出的那样。在对文本进行编码之前进行该操作非常重要，这样这些标记才能得到正确编码。

# load clean descriptions into memorydef load_clean_descriptions(filename, dataset): # load document doc = load_doc(filename) descriptions = dict() for line in doc.split('\n'): # split line by white space tokens = line.split() # split id from description image_id, image_desc = tokens[0], tokens[1:] # skip images not in the set if image_id in dataset: # create list if image_id not in descriptions: descriptions[image_id] = list() # wrap description in tokens desc = 'startseq ' + ' '.join(image_desc) + ' endseq' # store descriptions[image_id].append(desc) return descriptions

接下来，我们可以为给定数据集加载图像特征。

下面定义了 load_photo_features() 函数，该函数加载了整个图像描述集，然后返回给定图像标识符集你感兴趣的子集。

这不是很高效，但是，这可以帮助我们启动，快速运行。

# load photo featuresdef load_photo_features(filename, dataset): # load all features all_features = load(open(filename, 'rb')) # filter features features = {k: all_features[k] for k in dataset} return features

我们可以在这里暂停一下，测试目前开发的所有内容。

完整的代码示例如下：

# load doc into memorydef load_doc(filename): # open the file as read only file = open(filename, 'r') # read all text text = file.read() # close the file file.close() return text# load a pre-defined list of photo identifiersdef load_set(filename): doc = load_doc(filename) dataset = list() # process line by line for line in doc.split('\n'): # skip empty lines if len(line) < 1: continue # get the image identifier identifier = line.split('.')[0] dataset.append(identifier) return set(dataset)# load clean descriptions into memorydef load_clean_descriptions(filename, dataset): # load document doc = load_doc(filename) descriptions = dict() for line in doc.split('\n'): # split line by white space tokens = line.split() # split id from description image_id, image_desc = tokens[0], tokens[1:] # skip images not in the set if image_id in dataset: # create list if image_id not in descriptions: descriptions[image_id] = list() # wrap description in tokens desc = 'startseq ' + ' '.join(image_desc) + ' endseq' # store descriptions[image_id].append(desc) return descriptions# load photo featuresdef load_photo_features(filename, dataset): # load all features all_features = load(open(filename, 'rb')) # filter features features = {k: all_features[k] for k in dataset} return features# load training dataset (6K)filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'train = load_set(filename) print('Dataset: %d' % len(train))# descriptionstrain_descriptions = load_clean_descriptions('descriptions.txt', train) print('Descriptions: train=%d' % len(train_descriptions))# photo featurestrain_features = load_photo_features('features.pkl', train) print('Photos: train=%d' % len(train_features))

运行该示例首先在测试数据集中加载 6000 张图像标识符。这些特征之后将用于加载干净描述文本和预计算的图像特征。

Dataset: 6,000Descriptions: train=6,000Photos: train=6,000

描述文本在作为输入馈送至模型或与模型预测进行对比之前需要先编码成数值。

编码数据的第一步是创建单词到唯一整数值之间的持续映射。Keras 提供 Tokenizer class，可根据加载的描述数据学习该映射。

下面定义了用于将描述词典转换成字符串列表的 to_lines() 函数，和对加载图像描述文本拟合 Tokenizer 的 create_tokenizer() 函数。

# convert a dictionary of clean descriptions to a list of descriptionsdef to_lines(descriptions): all_desc = list() for key in descriptions.keys(): [all_desc.append(d) for d in descriptions[key]] return all_desc# fit a tokenizer given caption descriptionsdef create_tokenizer(descriptions): lines = to_lines(descriptions) tokenizer = Tokenizer() tokenizer.fit_on_texts(lines) return tokenizer# prepare tokenizertokenizer = create_tokenizer(train_descriptions) vocab_size = len(tokenizer.word_index) + 1print('Vocabulary Size: %d' % vocab_size)

我们现在对文本进行编码。

每个描述将被分割成单词。我们向该模型提供一个单词和图像，然后模型生成下一个单词。描述的前两个单词和图像将作为模型输入以生成下一个单词，这就是该模型的训练方式。

例如，输入序列「a little girl running in field」将被分割成 6 个输入-输出对来训练该模型：

X1, X2 (text sequence), y (word) photo startseq, little photo startseq, little, girl photo startseq, little, girl, running photo startseq, little, girl, running, inphoto startseq, little, girl, running, in, field photo startseq, little, girl, running, in, field, endseq

稍后，当模型用于生成描述时，生成的单词将被连结起来，递归地作为输入以生成图像字幕。

下面是 create_sequences() 函数，给出 tokenizer、最大序列长度和所有描述和图像的词典，该函数将这些数据转换成输入-输出对来训练模型。该模型有两个输入数组：一个用于图像特征，一个用于编码文本。模型输出是文本序列中编码的下一个单词。

输入文本被编码为整数，被馈送至词嵌入层。图像特征将被直接馈送至模型的另一部分。该模型输出的预测是所有单词在词汇表中的概率分布。

因此，输出数据是每个单词的 one-hot 编码，它表示一种理想化的概率分布，即除了实际词位置之外所有词位置的值都为 0，实际词位置的值为 1。

# create sequences of images, input sequences and output words for an imagedef create_sequences(tokenizer, max_length, descriptions, photos): X1, X2, y = list(), list(), list() # walk through each image identifier for key, desc_list in descriptions.items(): # walk through each description for the image for desc in desc_list: # encode the sequence seq = tokenizer.texts_to_sequences([desc])[0] # split one sequence into multiple X,y pairs for i in range(1, len(seq)): # split into input and output pair in_seq, out_seq = seq[:i], seq[i] # pad input sequence in_seq = pad_sequences([in_seq], maxlen=max_length)[0] # encode output sequence out_seq = to_categorical([out_seq], num_classes=vocab_size)[0] # store X1.append(photos[key][0]) X2.append(in_seq) y.append(out_seq) return array(X1), array(X2), array(y)

我们需要计算最长描述中单词的最大数量。下面是一个有帮助的函数 max_length()。

# calculate the length of the description with the most wordsdef max_length(descriptions): lines = to_lines(descriptions) return max(len(d.split()) for d in lines)

现在我们可以为训练和开发数据集加载数据，并将加载数据转换成输入-输出对来拟合深度学习模型。

定义模型

我们将根据 Marc Tanti, et al. 在 2017 年论文中描述的「merge-model」定义深度学习模型。

Where to put the Image in an Image Caption Generator，2017
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?，2017

论文作者提供了该模型的简图，如下所示：

我们将从三部分描述该模型：

图像特征提取器：这是一个在 ImageNet 数据集上预训练的 16 层 VGG 模型。我们已经使用 VGG 模型（没有输出层）对图像进行预处理，并将使用该模型预测的提取特征作为输入。
序列处理器：合适一个词嵌入层，用于处理文本输入，后面是长短期记忆（LSTM）循环神经网络层。
解码器：特征提取器和序列处理器输出一个固定长度向量。这些向量由密集层（Dense layer）融合和处理，来进行最终预测。

图像特征提取器模型的输入图像特征是维度为 4096 的向量，这些向量经过全连接层处理并生成图像的 256 元素表征。

序列处理器模型期望馈送至嵌入层的预定义长度（34 个单词）输入序列使用掩码来忽略 padded 值。之后是具备 256 个循环单元的 LSTM 层。

两个输入模型均输出 256 元素的向量。此外，输入模型以 50% 的 dropout 率使用正则化，旨在减少训练数据集的过拟合情况，因为该模型配置学习非常快。

解码器模型使用额外的操作融合来自两个输入模型的向量。然后将其馈送至 256 个神经元的密集层，然后输送至最终输出密集层，从而在所有输出词汇上对序列中的下一个单词进行 softmax 预测。

下面的 define_model() 函数定义和返回要拟合的模型。

# define the captioning modeldef define_model(vocab_size, max_length): # feature extractor model inputs1 = Input(shape=(4096,)) fe1 = Dropout(0.5)(inputs1) fe2 = Dense(256, activation='relu')(fe1) # sequence model inputs2 = Input(shape=(max_length,)) se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2) se2 = Dropout(0.5)(se1) se3 = LSTM(256)(se2) # decoder model decoder1 = add([fe2, se3]) decoder2 = Dense(256, activation='relu')(decoder1) outputs = Dense(vocab_size, activation='softmax')(decoder2) # tie it together [image, seq] [word] model = Model(inputs=[inputs1, inputs2], outputs=outputs) model.compile(loss='categorical_crossentropy', optimizer='adam') # summarize model print(model.summary()) plot_model(model, to_file='model.png', show_shapes=True) return model

要了解模型结构，特别是层的形状，请参考下表中的总结。

____________________________________________________________________________________________________ Layer (type)                     Output Shape          Param #     Connected to==================================================================================================== input_2 (InputLayer)             (None, 34)            0____________________________________________________________________________________________________ input_1 (InputLayer)             (None, 4096)          0____________________________________________________________________________________________________ embedding_1 (Embedding)          (None, 34, 256)       1940224     input_2[0][0] ____________________________________________________________________________________________________ dropout_1 (Dropout)              (None, 4096)          0           input_1[0][0] ____________________________________________________________________________________________________ dropout_2 (Dropout)              (None, 34, 256)       0           embedding_1[0][0] ____________________________________________________________________________________________________ dense_1 (Dense)                  (None, 256)           1048832     dropout_1[0][0] ____________________________________________________________________________________________________ lstm_1 (LSTM)                    (None, 256)           525312      dropout_2[0][0] ____________________________________________________________________________________________________ add_1 (Add)                      (None, 256)           0           dense_1[0][0]                                                                   lstm_1[0][0] ____________________________________________________________________________________________________ dense_2 (Dense)                  (None, 256)           65792       add_1[0][0] ____________________________________________________________________________________________________ dense_3 (Dense)                  (None, 7579)          1947803     dense_2[0][0] ==================================================================================================== Total params: 5,527,963Trainable params: 5,527,963Non-trainable params: 0____________________________________________________________________________________________________

我们还创建了一幅图来可视化网络结构，帮助理解两个输入流。

图像字幕生成深度学习模型示意图

拟合模型

现在我们已经了解如何定义模型了，那么接下来我们要在训练数据集上拟合模型。

该模型学习速度快，很快就会对训练数据集产生过拟合。因此，我们需要在留出的开发数据集上监控训练模型的泛化情况。如果模型在开发数据集上的技能在每个 epoch 结束时有所提升，则我们将整个模型保存至文件。

在运行结束时，我们能够使用训练数据集上具备最优技能的模型作为最终模型。

通过在 Keras 中定义 ModelCheckpoint，使之监控验证数据集上的最小损失，我们可以实现以上目的。然后将该模型保存至文件名中包含训练损失和验证损失的文件中。

# define checkpoint callbackfilepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

之后，通过 fit() 中的 callbacks 参数指定检查点。我们还需要 fit() 中的 validation_data 参数指定开发数据集。

我们仅拟合模型 20 epoch，给出一定量的训练数据，在一般硬件上每个 epoch 可能需要 30 分钟。

# fit modelmodel.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

完成示例

在训练数据上拟合模型的完整示例如下：

from numpy import arrayfrom pickle import loadfrom keras.preprocessing.text import Tokenizerfrom keras.preprocessing.sequence import pad_sequencesfrom keras.utils import to_categoricalfrom keras.utils import plot_modelfrom keras.models import Modelfrom keras.layers import Inputfrom keras.layers import Densefrom keras.layers import LSTMfrom keras.layers import Embeddingfrom keras.layers import Dropoutfrom keras.layers.merge import addfrom keras.callbacks import ModelCheckpoint# load doc into memorydef load_doc(filename): # open the file as read only file = open(filename, 'r') # read all text text = file.read() # close the file file.close() return text# load a pre-defined list of photo identifiersdef load_set(filename): doc = load_doc(filename) dataset = list() # process line by line for line in doc.split('\n'): # skip empty lines if len(line) < 1: continue # get the image identifier identifier = line.split('.')[0] dataset.append(identifier) return set(dataset)# load clean descriptions into memorydef load_clean_descriptions(filename, dataset): # load document doc = load_doc(filename) descriptions = dict() for line in doc.split('\n'): # split line by white space tokens = line.split() # split id from description image_id, image_desc = tokens[0], tokens[1:] # skip images not in the set if image_id in dataset: # create list if image_id not in descriptions: descriptions[image_id] = list() # wrap description in tokens desc = 'startseq ' + ' '.join(image_desc) + ' endseq' # store descriptions[image_id].append(desc) return descriptions# load photo featuresdef load_photo_features(filename, dataset): # load all features all_features = load(open(filename, 'rb')) # filter features features = {k: all_features[k] for k in dataset} return features# covert a dictionary of clean descriptions to a list of descriptionsdef to_lines(descriptions): all_desc = list() for key in descriptions.keys(): [all_desc.append(d) for d in descriptions[key]] return all_desc# fit a tokenizer given caption descriptionsdef create_tokenizer(descriptions): lines = to_lines(descriptions) tokenizer = Tokenizer() tokenizer.fit_on_texts(lines) return tokenizer# calculate the length of the description with the most wordsdef max_length(descriptions): lines = to_lines(descriptions) return max(len(d.split()) for d in lines)# create sequences of images, input sequences and output words for an imagedef create_sequences(tokenizer, max_length, descriptions, photos): X1, X2, y = list(), list(), list() # walk through each image identifier for key, desc_list in descriptions.items(): # walk through each description for the image for desc in desc_list: # encode the sequence seq = tokenizer.texts_to_sequences([desc])[0] # split one sequence into multiple X,y pairs for i in range(1, len(seq)): # split into input and output pair in_seq, out_seq = seq[:i], seq[i] # pad input sequence in_seq = pad_sequences([in_seq], maxlen=max_length)[0] # encode output sequence out_seq = to_categorical([out_seq], num_classes=vocab_size)[0] # store X1.append(photos[key][0]) X2.append(in_seq) y.append(out_seq) return array(X1), array(X2), array(y)# define the captioning modeldef define_model(vocab_size, max_length): # feature extractor model inputs1 = Input(shape=(4096,)) fe1 = Dropout(0.5)(inputs1) fe2 = Dense(256, activation='relu')(fe1) # sequence model inputs2 = Input(shape=(max_length,)) se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2) se2 = Dropout(0.5)(se1) se3 = LSTM(256)(se2) # decoder model decoder1 = add([fe2, se3]) decoder2 = Dense(256, activation='relu')(decoder1) outputs = Dense(vocab_size, activation='softmax')(decoder2) # tie it together [image, seq] [word] model = Model(inputs=[inputs1, inputs2], outputs=outputs) model.compile(loss='categorical_crossentropy', optimizer='adam') # summarize model print(model.summary()) plot_model(model, to_file='model.png', show_shapes=True) return model# train dataset# load training dataset (6K)filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'train = load_set(filename) print('Dataset: %d' % len(train))# descriptionstrain_descriptions = load_clean_descriptions('descriptions.txt', train) print('Descriptions: train=%d' % len(train_descriptions))# photo featurestrain_features = load_photo_features('features.pkl', train) print('Photos: train=%d' % len(train_features))# prepare tokenizertokenizer = create_tokenizer(train_descriptions) vocab_size = len(tokenizer.word_index) + 1print('Vocabulary Size: %d' % vocab_size)# determine the maximum sequence lengthmax_length = max_length(train_descriptions) print('Description Length: %d' % max_length)# prepare sequencesX1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)# dev dataset# load test setfilename = 'Flickr8k_text/Flickr_8k.devImages.txt'test = load_set(filename) print('Dataset: %d' % len(test))# descriptionstest_descriptions = load_clean_descriptions('descriptions.txt', test) print('Descriptions: test=%d' % len(test_descriptions))# photo featurestest_features = load_photo_features('features.pkl', test) print('Photos: test=%d' % len(test_features))# prepare sequencesX1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)# fit model# define the modelmodel = define_model(vocab_size, max_length)# define checkpoint callbackfilepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')# fit modelmodel.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

运行该示例首先打印加载训练和开发数据集的摘要。

Dataset: 6,000Descriptions: train=6,000Photos: train=6,000Vocabulary Size: 7,579Description Length: 34Dataset: 1,000Descriptions: test=1,000Photos: test=1,000

之后，我们可以了解训练和验证（开发）输入-输出对的整体数量。

Train on 306,404 samples, validate on 50,903 samples

然后运行模型，将最优模型保存至.h5 文件。

在运行过程中，我把最优验证结果的模型保存至文件中：

model-ep002-loss3.245-val_loss3.612.h5

该模型在第 2 个 epoch 中结束时被保存，在训练数据集上的损失为 3.245，在开发数据集上的损失为 3.612，每个人的具体结果不同。如果你在 AWS 中运行上述示例，那么将模型文件复制回你当前的工作文件夹。

评估模型

模型拟合之后，我们可以在留出的测试数据集上评估它的预测技能。

使模型对测试数据集中的所有图像生成描述，使用标准代价函数评估预测，从而评估模型。

首先，我们需要使用训练模型对图像生成描述。输入开始描述的标记「startseq」，生成一个单词，然后递归地用生成单词作为输入启用模型直到序列标记到「endseq」或达到最大描述长度。

下面的 generate_desc() 函数实现该行为，并基于给定训练模型和作为输入的准备图像生成文本描述。它启用 word_for_id() 函数以映射整数预测至单词。

# map an integer to a worddef word_for_id(integer, tokenizer): for word, index in tokenizer.word_index.items(): if index == integer: return word return None# generate a description for an imagedef generate_desc(model, tokenizer, photo, max_length): # seed the generation process in_text = 'startseq' # iterate over the whole length of the sequence for i in range(max_length): # integer encode input sequence sequence = tokenizer.texts_to_sequences([in_text])[0] # pad input sequence = pad_sequences([sequence], maxlen=max_length) # predict next word yhat = model.predict([photo,sequence], verbose=0) # convert probability to integer yhat = argmax(yhat) # map integer to word word = word_for_id(yhat, tokenizer) # stop if we cannot map the word if word is None: break # append as input for generating the next word in_text += ' ' + word # stop if we predict the end of the sequence if word == 'endseq': break return in_text

我们将为测试数据集和训练数据集中的所有图像生成预测。

下面的 evaluate_model() 基于给定图像描述数据集和图像特征评估训练模型。收集实际和预测描述，使用语料库 BLEU 值对它们进行评估。语料库 BLEU 值总结了生成文本和期望文本之间的相似度。

# evaluate the skill of the modeldef evaluate_model(model, descriptions, photos, tokenizer, max_length): actual, predicted = list(), list() # step over the whole set for key, desc_list in descriptions.items(): # generate description yhat = generate_desc(model, tokenizer, photos[key], max_length) # store actual and predicted references = [d.split() for d in desc_list] actual.append(references) predicted.append(yhat.split()) # calculate BLEU score print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0))) print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0))) print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0))) print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

BLEU 值用于在文本翻译中评估译文和一或多个参考译文的相似度。

这里，我们将每个生成描述与该图像的所有参考描述进行对比，然后计算 1、2、3、4 等 n 元语言模型的 BLEU 值。

NLTK Python 库在 corpus_bleu() 函数中实现了 BLEU 值计算。分值越接近 1.0 越好，越接近 0 越差。

我们可以结合前面加载数据部分中的函数。首先加载训练数据集来准备 Tokenizer，以使我们将生成单词编码成模型的输入序列。使用模型训练时使用的编码机制对生成单词进行编码非常关键。

然后使用这些函数加载测试数据集。完整示例如下：

from numpy import argmaxfrom pickle import loadfrom keras.preprocessing.text import Tokenizerfrom keras.preprocessing.sequence import pad_sequencesfrom keras.models import load_modelfrom nltk.translate.bleu_score import corpus_bleu# load doc into memorydef load_doc(filename): # open the file as read only file = open(filename, 'r') # read all text text = file.read() # close the file file.close() return text# load a pre-defined list of photo identifiersdef load_set(filename): doc = load_doc(filename) dataset = list() # process line by line for line in doc.split('\n'): # skip empty lines if len(line) < 1: continue # get the image identifier identifier = line.split('.')[0] dataset.append(identifier) return set(dataset)# load clean descriptions into memorydef load_clean_descriptions(filename, dataset): # load document doc = load_doc(filename) descriptions = dict() for line in doc.split('\n'): # split line by white space tokens = line.split() # split id from description image_id, image_desc = tokens[0], tokens[1:] # skip images not in the set if image_id in dataset: # create list if image_id not in descriptions: descriptions[image_id] = list() # wrap description in tokens desc = 'startseq ' + ' '.join(image_desc) + ' endseq' # store descriptions[image_id].append(desc) return descriptions# load photo featuresdef load_photo_features(filename, dataset): # load all features all_features = load(open(filename, 'rb')) # filter features features = {k: all_features[k] for k in dataset} return features# covert a dictionary of clean descriptions to a list of descriptionsdef to_lines(descriptions): all_desc = list() for key in descriptions.keys(): [all_desc.append(d) for d in descriptions[key]] return all_desc# fit a tokenizer given caption descriptionsdef create_tokenizer(descriptions): lines = to_lines(descriptions) tokenizer = Tokenizer() tokenizer.fit_on_texts(lines) return tokenizer# calculate the length of the description with the most wordsdef max_length(descriptions): lines = to_lines(descriptions) return max(len(d.split()) for d in lines)# map an integer to a worddef word_for_id(integer, tokenizer): for word, index in tokenizer.word_index.items(): if index == integer: return word return None# generate a description for an imagedef generate_desc(model, tokenizer, photo, max_length): # seed the generation process in_text = 'startseq' # iterate over the whole length of the sequence for i in range(max_length): # integer encode input sequence sequence = tokenizer.texts_to_sequences([in_text])[0] # pad input sequence = pad_sequences([sequence], maxlen=max_length) # predict next word yhat = model.predict([photo,sequence], verbose=0) # convert probability to integer yhat = argmax(yhat) # map integer to word word = word_for_id(yhat, tokenizer) # stop if we cannot map the word if word is None: break # append as input for generating the next word in_text += ' ' + word # stop if we predict the end of the sequence if word == 'endseq': break return in_text# evaluate the skill of the modeldef evaluate_model(model, descriptions, photos, tokenizer, max_length): actual, predicted = list(), list() # step over the whole set for key, desc_list in descriptions.items(): # generate description yhat = generate_desc(model, tokenizer, photos[key], max_length) # store actual and predicted references = [d.split() for d in desc_list] actual.append(references) predicted.append(yhat.split()) # calculate BLEU score print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0))) print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0))) print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0))) print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))# prepare tokenizer on train set# load training dataset (6K)filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'train = load_set(filename) print('Dataset: %d' % len(train))# descriptionstrain_descriptions = load_clean_descriptions('descriptions.txt', train) print('Descriptions: train=%d' % len(train_descriptions))# prepare tokenizertokenizer = create_tokenizer(train_descriptions) vocab_size = len(tokenizer.word_index) + 1print('Vocabulary Size: %d' % vocab_size)# determine the maximum sequence lengthmax_length = max_length(train_descriptions) print('Description Length: %d' % max_length)# prepare test set# load test setfilename = 'Flickr8k_text/Flickr_8k.testImages.txt'test = load_set(filename) print('Dataset: %d' % len(test))# descriptionstest_descriptions = load_clean_descriptions('descriptions.txt', test) print('Descriptions: test=%d' % len(test_descriptions))# photo featurestest_features = load_photo_features('features.pkl', test) print('Photos: test=%d' % len(test_features))# load the modelfilename = 'model-ep002-loss3.245-val_loss3.612.h5'model = load_model(filename)# evaluate modelevaluate_model(model, test_descriptions, test_features, tokenizer, max_length)

运行示例打印 BLEU 值。我们可以看到 BLEU 值处于该问题较优的期望范围内，且接近最优水平。并且我们并没有对选择的模型配置进行特别的优化。

BLEU-1: 0.579114BLEU-2: 0.344856BLEU-3: 0.252154BLEU-4: 0.131446

生成新的图像字幕

现在我们了解了如何开发和评估字幕生成模型，那么我们如何使用它呢？

我们需要模型文件中全新的图像，还需要 Tokenizer 用于对模型生成单词进行编码，生成序列和定义模型时使用的输入序列最大长度。

我们可以对最大序列长度进行硬编码。文本编码后，我们就可以创建 tokenizer，并将其保存至文件，这样我们可以在需要的时候快速加载，无需整个 Flickr8K 数据集。另一个方法是使用我们自己的词汇文件，在训练过程中将其映射到取整函数。

我们可以按照之前的方式创建 Tokenizer，并将其保存为 pickle 文件 tokenizer.pkl。完整示例如下：

from keras.preprocessing.text import Tokenizerfrom pickle import dump# load doc into memorydef load_doc(filename): # open the file as read only file = open(filename, 'r') # read all text text = file.read() # close the file file.close() return text# load a pre-defined list of photo identifiersdef load_set(filename): doc = load_doc(filename) dataset = list() # process line by line for line in doc.split('\n'): # skip empty lines if len(line) < 1: continue # get the image identifier identifier = line.split('.')[0] dataset.append(identifier) return set(dataset)# load clean descriptions into memorydef load_clean_descriptions(filename, dataset): # load document doc = load_doc(filename) descriptions = dict() for line in doc.split('\n'): # split line by white space tokens = line.split() # split id from description image_id, image_desc = tokens[0], tokens[1:] # skip images not in the set if image_id in dataset: # create list if image_id not in descriptions: descriptions[image_id] = list() # wrap description in tokens desc = 'startseq ' + ' '.join(image_desc) + ' endseq' # store descriptions[image_id].append(desc) return descriptions# covert a dictionary of clean descriptions to a list of descriptionsdef to_lines(descriptions): all_desc = list() for key in descriptions.keys(): [all_desc.append(d) for d in descriptions[key]] return all_desc# fit a tokenizer given caption descriptionsdef create_tokenizer(descriptions): lines = to_lines(descriptions) tokenizer = Tokenizer() tokenizer.fit_on_texts(lines) return tokenizer# load training dataset (6K)filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'train = load_set(filename) print('Dataset: %d' % len(train))# descriptionstrain_descriptions = load_clean_descriptions('descriptions.txt', train) print('Descriptions: train=%d' % len(train_descriptions))# prepare tokenizertokenizer = create_tokenizer(train_descriptions)# save the tokenizerdump(tokenizer, open('tokenizer.pkl', 'wb'))

现在我们可以在需要的时候加载 tokenizer，无需加载整个标注训练数据集。下面，我们来为一个新图像生成描述，下面这张图是我从 Flickr 中随机选的一张图像。

海滩上的狗

我们将使用模型为它生成描述。首先下载图像，保存至本地文件夹，文件名设置为「example.jpg」。然后，我们必须从 tokenizer.pkl 中加载 Tokenizer，定义生成序列的最大长度，在对输入数据进行填充时需要该信息。

# load the tokenizertokenizer = load(open('tokenizer.pkl', 'rb'))# pre-define the max sequence length (from training)max_length = 34

然后我们必须加载模型，如前所述。

# load the modelmodel = load_model('model-ep002-loss3.245-val_loss3.612.h5')

接下来，我们必须加载要描述和提取特征的图像。

重定义该模型、向其中添加 VGG-16 模型，或者使用 VGG 模型来预测特征，使用这些特征作为现有模型的输入。我们将使用后一种方法，使用数据准备阶段所用的 extract_features() 函数的修正版本，该版本适合处理单个图像。

# extract features from each photo in the directorydef extract_features(filename): # load the model model = VGG16() # re-structure the model model.layers.pop() model = Model(inputs=model.inputs, outputs=model.layers[-1].output) # load the photo image = load_img(filename, target_size=(224, 224)) # convert the image pixels to a numpy array image = img_to_array(image) # reshape data for the model image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2])) # prepare the image for the VGG model image = preprocess_input(image) # get features feature = model.predict(image, verbose=0) return feature# load and prepare the photographphoto = extract_features('example.jpg')

之后使用评估模型定义的 generate_desc() 函数生成图像描述。为单个全新图像生成描述的完整示例如下：

from pickle import loadfrom numpy import argmaxfrom keras.preprocessing.sequence import pad_sequencesfrom keras.applications.vgg16 import VGG16from keras.preprocessing.image import load_imgfrom keras.preprocessing.image import img_to_arrayfrom keras.applications.vgg16 import preprocess_inputfrom keras.models import Modelfrom keras.models import load_model# extract features from each photo in the directorydef extract_features(filename): # load the model model = VGG16() # re-structure the model model.layers.pop() model = Model(inputs=model.inputs, outputs=model.layers[-1].output) # load the photo image = load_img(filename, target_size=(224, 224)) # convert the image pixels to a numpy array image = img_to_array(image) # reshape data for the model image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2])) # prepare the image for the VGG model image = preprocess_input(image) # get features feature = model.predict(image, verbose=0) return feature# map an integer to a worddef word_for_id(integer, tokenizer): for word, index in tokenizer.word_index.items(): if index == integer: return word return None# generate a description for an imagedef generate_desc(model, tokenizer, photo, max_length): # seed the generation process in_text = 'startseq' # iterate over the whole length of the sequence for i in range(max_length): # integer encode input sequence sequence = tokenizer.texts_to_sequences([in_text])[0] # pad input sequence = pad_sequences([sequence], maxlen=max_length) # predict next word yhat = model.predict([photo,sequence], verbose=0) # convert probability to integer yhat = argmax(yhat) # map integer to word word = word_for_id(yhat, tokenizer) # stop if we cannot map the word if word is None: break # append as input for generating the next word in_text += ' ' + word # stop if we predict the end of the sequence if word == 'endseq': break return in_text# load the tokenizertokenizer = load(open('tokenizer.pkl', 'rb'))# pre-define the max sequence length (from training)max_length = 34# load the modelmodel = load_model('model-ep002-loss3.245-val_loss3.612.h5')# load and prepare the photographphoto = extract_features('example.jpg')# generate descriptiondescription = generate_desc(model, tokenizer, photo, max_length) print(description)