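Focusing on manufacturing use cases: how Coral makes it easy to build production-ready ML solutions

August 24, 2021 TensorFlow

Posted by Michael Brooks, Coral

For more than 3 years, Coral has focused on enabling privacy-preserving Edge ML with low-power, high-performance products. We have released many examples and projects designed to help you get up to speed with ML quickly for your specific needs. One of the most common requests we receive from people who have explored the Coral models and projects is: how do we move to production?

Projects
https://github.com/google-coral/
Models
https://coral.ai/models/

With this in mind, this post introduces the first of our use-case-specific demos. These demos are intended to take full advantage of the Coral Edge TPU™ with high-performance, production-quality code that is easy to customize to your ML requirements. In this demo we focus on two use cases specific to manufacturing: worker safety and quality grading / visual inspection.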

Coral Edge TPU
https://coral.ai/docs/edgetpu/faq/

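Demo overview

The Coral manufacturing demo targets an x86 system, or a powerful ARM64 system with OpenGL acceleration, that can process and display two simultaneous inputs. The default demo, using the included example videos, looks like this:

Coral manufacturing demo
https://github.com/google-coral/demo-manufacturing

The two examples being run are: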

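● Worker safety: performs generic person detection (powered by COCO-trained SSDLite MobileDet), then runs a simple algorithm that detects bounding box collisions to determine whether a person is inside an unsafe region.

● Visual inspection: performs apple detection (using the same COCO-trained SSDLite MobileDet as worker safety), then crops the frame to each detected apple and runs a retrained MobileNetV2 that classifies the apple as fresh or rotten.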
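The bounding box collision check mentioned for worker safety can be illustrated with a minimal sketch. It assumes a rectangular keep-out region and axis-aligned person boxes; the `Box` type and the function names are hypothetical and are not taken from the demo's code, which may define its regions differently.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def boxes_collide(a: Box, b: Box) -> bool:
    """Return True if the boxes overlap; boxes that only touch do not count."""
    return (a.xmin < b.xmax and b.xmin < a.xmax and
            a.ymin < b.ymax and b.ymin < a.ymax)


def unsafe_people(people: List[Box], keepout: Box) -> List[Box]:
    """Return the person boxes that intersect the keep-out region."""
    return [p for p in people if boxes_collide(p, keepout)]


# Assumed example values: one keep-out zone and two detected people.
keepout_zone = Box(100, 100, 200, 200)
detections = [Box(150, 150, 250, 300), Box(300, 50, 350, 180)]
flagged = unsafe_people(detections, keepout_zone)
print(flagged)
```

Only the first person box overlaps the zone, so it is the only one flagged.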

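By combining these two examples, we can demonstrate several Coral features that make this processing possible, including:

● Co-compilation

● Cascading models (using the output of one model to feed another)

● Classification retraining

● Real-time processing of multiple inputs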
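To make the cascading idea concrete, here is a minimal sketch of a two-stage pipeline in which each box returned by a detector is cropped out of the frame and passed to a classifier. The detector and classifier are toy stand-ins (hypothetical functions over a nested-list "image"), not the demo's models; only the control flow is the point.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]            # toy grayscale image: rows of pixel values
BBox = Tuple[int, int, int, int]   # (xmin, ymin, xmax, ymax), max exclusive


def crop(image: Image, box: BBox) -> Image:
    """Cut the region described by box out of the image."""
    xmin, ymin, xmax, ymax = box
    return [row[xmin:xmax] for row in image[ymin:ymax]]


def cascade(image: Image,
            detect: Callable[[Image], List[BBox]],
            classify: Callable[[Image], str]) -> List[Tuple[BBox, str]]:
    """Run the detector, then classify the crop for every detected box."""
    return [(box, classify(crop(image, box))) for box in detect(image)]


def toy_detector(image: Image) -> List[BBox]:
    """Stand-in detector: always reports the two top quadrants."""
    return [(0, 0, 2, 2), (2, 0, 4, 2)]


def toy_classifier(patch: Image) -> str:
    """Stand-in classifier: bright patches are 'fresh', dark ones 'rotten'."""
    pixels = [p for row in patch for p in row]
    return "fresh" if sum(pixels) / len(pixels) >= 128 else "rotten"


frame = [[200, 200, 10, 10],
         [200, 200, 10, 10],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
results = cascade(frame, toy_detector, toy_classifier)
print(results)
```

Because the second stage only sees the cropped regions, the classifier can stay small and focused on a single decision.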

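Creating the demo

When designing a new ML application, it is critical to make sure you can meet your latency and accuracy requirements. For both applications described here, we went through the same process: choose the models, train them, and deploy them to the Edge TPU. This is the process to follow when starting any Coral application.

Choosing the models

When deciding which model to use, the Coral models page is the ideal place to start. For this demo, we know that we need a detection model (to detect both people and apples) and a classification model.

Coral models page
https://coral.ai/models/all/

Detection

When picking a detection model from the detection models page, we consider four aspects of the model:

Detection models page
https://coral.ai/models/object-detection/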

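1. Training dataset: On the models page, all of the general-purpose detection models are trained on the COCO dataset. Checking the labels, we find both "apple" and "person", so a single model can handle both detection tasks.

2. Latency: We need to run at least 3 inferences per frame and keep up with our input rate (30 FPS), so detection has to be as fast as possible. The models page shows two good options: SSD MobileNet v2 (7.4 ms) and MobileDet (8.0 ms). This is the first place where Coral has a clear advantage: looking at the benchmarks at the bottom of the x86+USB CTS output, even on a powerful workstation the same work takes 90 ms and 123 ms, respectively.

3. Accuracy/precision: We also want a model that is as accurate as possible, so we compare models using the primary challenge metric from the COCO evaluation metrics. As you can see, MobileDet (32.8%) clearly outperforms MobileNet V2 (25.7%).

COCO evaluation metrics
https://cocodataset.org/#detection-eval

4. Size: To fully co-compile this detection model with the classification model below, both models need to fit in the 8 MB of cache on the Edge TPU, so we want the smallest model possible. MobileDet is 5.1 MB, while MobileNet V2 is 6.6 MB.

Given these factors, we chose SSDLite MobileDet.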
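The latency argument can be checked with a little arithmetic. The sketch below assumes that two of the three per-frame inferences are MobileDet detections (one per input stream) and uses the 8.0 ms Edge TPU and 123 ms workstation figures quoted above; the helper names are hypothetical, and the classification latency is left as remaining headroom because it is not stated at this point.

```python
from typing import Sequence


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at the given frame rate."""
    return 1000.0 / fps


def fits_budget(latencies_ms: Sequence[float], fps: float) -> bool:
    """True if running all inferences back to back fits within one frame."""
    return sum(latencies_ms) <= frame_budget_ms(fps)


budget = frame_budget_ms(30)             # about 33.3 ms per frame at 30 FPS
edge_tpu_detections = [8.0, 8.0]         # assumed: MobileDet on two inputs
workstation_detections = [123.0, 123.0]  # assumed: same work on the workstation

# Time left in each frame for classification and everything else.
headroom = budget - sum(edge_tpu_detections)

print(f"budget per frame: {budget:.1f} ms")
print(f"Edge TPU detections fit: {fits_budget(edge_tpu_detections, 30)}")
print(f"workstation detections fit: {fits_budget(workstation_detections, 30)}")
print(f"headroom after detection: {headroom:.1f} ms")
```

Under these assumptions the two detections leave roughly half of each frame for classification, while the workstation figures exceed the frame budget several times over.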

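Classification

For classifying apples as fresh or rotten, the Coral classification page offers many more options. We check the same four things:

Coral classification page
https://coral.ai/models/image-classification/

1. Training dataset: We will retrain on a new dataset, so this is not critical for this application.

2. Latency: We want classification to be as fast as possible. Fortunately, many of the models on our page are very fast relative to the 30 FPS frame rate we need. With that in mind, we can eliminate all of the Inception models and ResNet-50.

3. Accuracy: Both Top-1 and Top-5 accuracy are provided. We want Top-1 accuracy to be as high as possible (we are only checking fresh versus rotten), but we still have to take latency into account. With this in mind, we eliminate MobileNet v1.

4. Size: As noted above, we want to fit both the detection and classification models in the cache (or come as close as possible), so we can easily eliminate the EfficientNet options.

This leaves MobileNet v2 and MobileNet v3. We chose v2 because tutorials on retraining this model already exist.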

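Retraining classification

With the model chosen, we now need to retrain the classification model to recognize fresh and rotten apples. Coral.ai offers training tutorials in CoLab (using post-training quantization) and Docker (using quantization-aware training) formats, but we have also included a retraining Python script in this demo's repository.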

CoLab
https://colab.sandbox.google.com/github/google-coral/tutorials/blob/master/retrain_classification_ptq_tf2.ipynb
Docker
https://coral.ai/docs/edgetpu/retrain-classification/

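Our fresh/rotten data comes from the "Fruits fresh and rotten for classification" dataset, from which we simply omit everything except apples.

In the script, we first load the standard Keras MobileNetV2, freeze the first 100 layers, and add a few extra layers at the end:

Script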
https://github.com/google-coral/demo-manufacturing/blob/main/models/retraining/train_classifier.py

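import tensorflow as tf

# Assumed input resolution: 224x224 RGB, the MobileNetV2 default.
input_size = (224, 224)
input_shape = input_size + (3,)

# Load MobileNetV2 pretrained on ImageNet, without its classification head.
base_model = tf.keras.applications.MobileNetV2(input_shape=input_shape,
                                               include_top=False,
                                               # Ignored when include_top=False.
                                               classifier_activation='softmax',
                                               weights='imagenet')
# Freeze first 100 layers; only the later layers are fine-tuned.
base_model.trainable = True
for layer in base_model.layers[:100]:
  layer.trainable = False
# Add a small head that outputs two classes (fresh / rotten).
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(units=2, activation='softmax')
])
model.compile(loss='categorical_crossentropy',
              optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-5),
              metrics=['accuracy'])
# summary() prints the table itself; wrapping it in print() also prints "None".
model.summary()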

Next, we train the model using the dataset downloaded to ./dataset:

import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images; validation images are only rescaled to [0, 1].
train_datagen = ImageDataGenerator(rescale=1./255,
                                   zoom_range=0.3,
                                   rotation_range=50,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   horizontal_flip=True,
                                   fill_mode='nearest')
val_datagen = ImageDataGenerator(rescale=1./255)
# Each split is a directory with one sub-directory per class.
dataset_path = './dataset'
train_set_path = os.path.join(dataset_path, 'train')
val_set_path = os.path.join(dataset_path, 'test')
batch_size = 64
train_generator = train_datagen.flow_from_directory(train_set_path,
                                                    target_size=input_size,
                                                    batch_size=batch_size,
                                                    class_mode='categorical')
val_generator = val_datagen.flow_from_directory(val_set_path,
                                                target_size=input_size,
                                                batch_size=batch_size,
                                                class_mode='categorical')
epochs = 15
history = model.fit(train_generator,
                    steps_per_epoch=train_generator.n // batch_size,
                    epochs=epochs,
                    validation_data=val_generator,
                    validation_steps=val_generator.n // batch_size,
                    verbose=1)

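Note that we only run 15 epochs. On the apple dataset, the model reaches very high accuracy very quickly.

With your own dataset and model, more epochs will likely be needed (the script generates accuracy plots that you can use to verify this).

We now have a Keras model that can be used to check apple quality. To run it on the Coral Edge TPU, the model must be quantized and converted to TF Lite. To do this we use post-training quantization, which quantizes the model after training with the help of a representative dataset: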

def representative_data_gen():
  # Yield up to 100 sample images so the converter can calibrate value ranges.
  # list_files shuffles by default, so this is a random sample of the test set.
  dataset_list = tf.data.Dataset.list_files('./dataset/test/*/*')
  for image_path in dataset_list.take(100):
    image = tf.io.read_file(image_path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, input_size)
    image = tf.cast(image / 255., tf.float32)
    image = tf.expand_dims(image, 0)
    yield [image]
# Fix the batch dimension to 1 before conversion.
model.input.set_shape((1,) + model.input.shape[1:])
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Require full integer quantization with uint8 input and output tensors.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.target_spec.supported_types = [tf.int8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
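As background for the conversion step, the following sketch illustrates the affine mapping that 8-bit quantization is based on, where a real value is approximated as scale * (q - zero_point). It is a simplified illustration with hypothetical helper names, not the converter's actual implementation; in practice the converter derives the value ranges from the representative dataset.

```python
from typing import Tuple


def quant_params(rmin: float, rmax: float,
                 qmin: int = 0, qmax: int = 255) -> Tuple[float, int]:
    """Derive scale and zero point so that [rmin, rmax] maps onto [qmin, qmax]."""
    # The represented range must contain 0.0 so that zero is exactly encodable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        raise ValueError("range must have non-zero width")
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int,
             qmin: int = 0, qmax: int = 255) -> int:
    """Map a real value to the nearest integer level, clamped to the range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover the approximate real value from its integer level."""
    return scale * (q - zero_point)


# Images rescaled by 1/255 occupy the real range [0, 1].
scale, zero_point = quant_params(0.0, 1.0)
for value in (0.1, 0.5, 0.9):
    q = quantize(value, scale, zero_point)
    print(value, q, round(dequantize(q, scale, zero_point), 4))
```

The round trip loses at most half a quantization step per value, which is why a representative dataset that covers the real input range matters.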

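The script then compiles the model and evaluates both the Keras and TF Lite models, but there is one extra step to take outside the script: we must use the Edge TPU Compiler to co-compile the classification model and the detection model.

Co-compiling the models

We now have two quantized TF Lite models: classifier.tflite and the default CPU model for MobileDet (taken from the Coral models page). We can compile them together so that they share the same caching token, which means the parameter data for both models stays cached no matter which model is requested. All this requires is passing both models to the compiler:

Default CPU model for MobileDet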
https://github.com/google-coral/test_data/raw/master/ssdlite_mobiledet_coco_qat_postprocess.tflite

edgetpu_compiler ssdlite_mobiledet_coco_qat_postprocess.tflite classifier.tflite
Edge TPU Compiler version 15.0.340273435

Models compiled successfully in 1770 ms.

Input model: ssdlite_mobiledet_coco_qat_postprocess.tflite
Input size: 4.08MiB
Output model: ssdlite_mobiledet_coco_qat_postprocess_edgetpu.tflite
Output size: 5.12MiB
On-chip memory used for caching model parameters: 4.89MiB
On-chip memory remaining for caching model parameters: 2.74MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 125
Operation log: ssdlite_mobiledet_coco_qat_postprocess_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 124
Number of operations that will run on CPU: 1
See the operation log file for individual operation details.

Input model: classifier.tflite
Input size: 3.07MiB
Output model: classifier_edgetpu.tflite
Output size: 3.13MiB
On-chip memory used for caching model parameters: 2.74MiB
On-chip memory remaining for caching model parameters: 0.00B
Off-chip memory used for streaming uncached model parameters: 584.06KiB
Number of Edge TPU subgraphs: 1
Total number of operations: 72
Operation log: classifier_edgetpu.log
See the operation log file for individual operation details.

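There are two things to note in this log. First, as expected, one operation of the detection model runs on the CPU: the TF Lite SSD post-processing always runs on the CPU. Second, we could not quite fit everything in on-chip memory, so the classifier needs about 584 KiB of off-chip memory. This is fine, because we have still greatly reduced the I/O time required. Both compiled models will now be in the same folder, and because we co-compiled them they are aware of each other, so the parameters of both models are kept in the cache.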
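The memory figures in the log can be totalled with a short script. This is a sketch that assumes the log line format shown above; the `memory_summary` helper is hypothetical and not part of the Edge TPU tooling.

```python
import re
from typing import Dict

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2}

# Matches the "memory used" lines only, not the "memory remaining" ones.
_PATTERN = re.compile(
    r"^(On-chip|Off-chip) memory used[^:]*:\s*([\d.]+)(B|KiB|MiB)\s*$",
    re.MULTILINE)


def memory_summary(log_text: str) -> Dict[str, float]:
    """Sum on-chip and off-chip parameter memory (in bytes) over all models."""
    totals = {"on_chip": 0.0, "off_chip": 0.0}
    for kind, value, unit in _PATTERN.findall(log_text):
        key = "on_chip" if kind == "On-chip" else "off_chip"
        totals[key] += float(value) * _UNITS[unit]
    return totals


# Memory lines copied from the compiler output above (detector, then classifier).
LOG = """\
On-chip memory used for caching model parameters: 4.89MiB
On-chip memory remaining for caching model parameters: 2.74MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
On-chip memory used for caching model parameters: 2.74MiB
On-chip memory remaining for caching model parameters: 0.00B
Off-chip memory used for streaming uncached model parameters: 584.06KiB
"""

totals = memory_summary(LOG)
print(f"on-chip:  {totals['on_chip'] / 1024 ** 2:.2f} MiB")
print(f"off-chip: {totals['off_chip'] / 1024:.2f} KiB")
```

The two models together fill the on-chip cache, and only the classifier's remaining parameters are streamed from off-chip memory.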