A lack of sufficient training data, in both variety and quantity, is often the bottleneck in developing machine learning (ML) applications in any domain. In agriculture, ML-based models designed for tasks such as autonomous plant classification are typically tied to just one or perhaps a few plant species. As a consequence, each crop-specific task is very likely to require its own specialized training data, and the question of how to meet this need for data now often overshadows the more routine exercise of actually training such models. To tackle this problem, we have developed an embedded robotic system that automatically generates and labels large datasets of plant images for ML applications in agriculture. The system can image plants from virtually any angle, thereby ensuring a wide variety of data, and with an imaging rate of up to one image per second, it can produce labeled datasets on the scale of thousands to tens of thousands of images per day. As such, the system offers an important alternative to time- and cost-intensive manual generation and labeling. Furthermore, the use of a uniform background made of blue keying fabric enables additional image processing techniques such as background replacement and plant segmentation. The uniform background also helps during training, essentially forcing the model to focus on plant features and eliminating spurious correlations. To demonstrate the capabilities of our system, we generated a dataset of over 34,000 labeled images, with which we trained an ML model to distinguish grasses from non-grasses in test data from a variety of sources. We now plan to generate much larger datasets of Canadian crop plants and weeds, which will be made publicly available in the hope of further enabling ML applications in the agriculture sector.
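As an illustration of how a uniform blue backdrop can support plant segmentation and background replacement, the following minimal sketch classifies each pixel by hue and swaps backdrop pixels for a new background. It is not the authors' pipeline: the function names (`is_key_blue`, `plant_mask`, `replace_background`) and the hue, saturation, and brightness thresholds are assumptions chosen for this example, and a real system would tune them to the fabric and lighting and would likely use an image library instead of nested lists.

```python
import colorsys

# Assumed thresholds for "keying blue" in HSV space (all values in 0..1).
# These are illustrative guesses, not values taken from the paper.
HUE_RANGE = (0.55, 0.72)
MIN_SATURATION = 0.4
MIN_VALUE = 0.2


def is_key_blue(pixel):
    """Return True if an (R, G, B) pixel with 0-255 channels looks like the backdrop."""
    r, g, b = (channel / 255.0 for channel in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return HUE_RANGE[0] <= h <= HUE_RANGE[1] and s >= MIN_SATURATION and v >= MIN_VALUE


def plant_mask(image):
    """Build a boolean mask: True where a pixel is NOT backdrop (assumed to be plant)."""
    return [[not is_key_blue(pixel) for pixel in row] for row in image]


def replace_background(image, background):
    """Keep plant pixels and take every backdrop pixel from `background` instead."""
    mask = plant_mask(image)
    return [
        [fg if keep else bg for fg, bg, keep in zip(img_row, bg_row, mask_row)]
        for img_row, bg_row, mask_row in zip(image, background, mask)
    ]


if __name__ == "__main__":
    blue = (0, 0, 255)      # saturated backdrop blue
    leaf = (34, 139, 34)    # a green "plant" pixel
    soil = (200, 180, 150)  # replacement background colour

    image = [[blue, leaf], [blue, blue]]
    background = [[soil, soil], [soil, soil]]

    print(plant_mask(image))
    print(replace_background(image, background))
```

Because the mask depends only on colour, the same labeled plant image could in principle be composited onto many different backgrounds, which is one way a keyed backdrop can add variety to a dataset.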