AI application developers typically begin with a dataset of interest and a vision of the end analytic or insight they wish to gain from it. Although these are two very important components of an AI workflow, developers often spend the first few weeks (sometimes months) in a phase we refer to as data conditioning. This step typically includes tasks such as figuring out how to prepare data for analytics, dealing with inconsistencies in the dataset, and determining which algorithm (or set of algorithms) is best suited to the application. Larger, faster, and messier datasets, such as those from Internet of Things sensors, medical devices, or autonomous vehicles, only amplify these issues. These challenges, often referred to as the three Vs (volume, velocity, variety) of Big Data, require low-level tools for data management, preparation, and integration. In most applications, data can come from structured and/or unstructured sources and often includes inconsistencies, formatting differences, and a lack of ground-truth labels. In this report, we highlight a number of tools that can be used to simplify data integration and preparation steps. Specifically, we cover data integration tools and techniques, a deep dive into an exemplar data integration tool, and a deep dive into the evolving field of knowledge graphs. Finally, we provide readers with a list of practical steps and considerations that they can use to simplify the data integration challenge. The goal of this report is to provide readers with a view of the state of the art as well as practical tips that data creators can use to make data integration more seamless.