Many Big Data applications process data streams in semi-structured formats such as JSON. A disadvantage of such formats is that an application may spend a significant share of its processing time on parsing all data unselectively. To mitigate this issue, the concept of raw filtering has been proposed, with the idea of removing data from a stream prior to the costly parsing stage. However, as accurate filtering of raw data is often only possible after the data has been parsed, raw filters are designed to be approximate, in the sense that they allow false positives, so that they can be implemented efficiently. In contrast to previously proposed CPU-based raw filtering techniques, which are restricted to string matching, we present FPGA-based primitives for filtering strings, numbers, and number ranges. In addition, we propose a primitive that respects the basic structure of JSON data and can be used to further increase the accuracy of the introduced raw filters. The proposed raw filter primitives are designed to be composed according to the filter expression of a given query. Thus, complex raw filters can be created for FPGAs that drastically decrease the number of generated false positives, particularly for IoT workloads. As there is a trade-off between accuracy and resource consumption, we evaluate the primitives as well as composed raw filters using different queries from the RiotBench benchmark. Our results show that up to 94.3% of the raw data can be filtered without producing any observed false positives using only a few hundred LUTs.
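The idea of composing approximate raw filters ahead of the parser can be illustrated with a minimal software sketch. This is not the FPGA design described above, only an analogy in Python under stated assumptions: the helper names (`string_filter`, `range_filter`, `all_of`, `exact_match`), the field names `sensor` and `value`, and the example query are all hypothetical. Each primitive inspects the unparsed text and may only err on the side of keeping a record (false positives), so the exact predicate is still applied after parsing.

```python
import json
import re

# Hypothetical raw-filter primitives. Each takes the unparsed record text and
# returns True if the record MAY satisfy the query (false positives allowed).

def string_filter(needle):
    """Keep records whose raw text contains the given substring.
    Assumes the producer does not emit the string in an escaped form."""
    return lambda raw: needle in raw

# Matches JSON-style number literals anywhere in the raw text.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")

def range_filter(lo, hi):
    """Keep records containing ANY number within [lo, hi].
    The number is not tied to a particular key, hence the approximation."""
    def check(raw):
        return any(lo <= float(tok) <= hi for tok in _NUMBER.findall(raw))
    return check

def all_of(*filters):
    """Conjunction of raw filters, mirroring an AND in the filter expression."""
    return lambda raw: all(f(raw) for f in filters)

def exact_match(record):
    """The accurate predicate, only evaluable after parsing."""
    value = record.get("value")
    return (
        record.get("sensor") == "temperature"
        and isinstance(value, (int, float))
        and 30.0 <= value <= 50.0
    )

# Example query (an assumption): sensor == "temperature" AND 30 <= value <= 50.
raw_filter = all_of(string_filter('"temperature"'), range_filter(30.0, 50.0))

def process(stream):
    """Parse only the records that pass the raw filter."""
    results, parsed = [], 0
    for raw in stream:
        if not raw_filter(raw):
            continue  # dropped before the costly parsing stage
        parsed += 1
        record = json.loads(raw)
        if exact_match(record):
            results.append(record)
    return results, parsed

stream = [
    '{"sensor": "temperature", "value": 35.5}',          # true match
    '{"sensor": "humidity", "value": 40}',               # dropped: string filter
    '{"sensor": "temperature", "value": 80, "id": 42}',  # false positive (42)
    '{"sensor": "temperature", "value": 12}',            # dropped: range filter
]
results, parsed = process(stream)
```

In this sketch only two of the four records reach `json.loads`, and one of those is a false positive caused by the unrelated `id` field. A structure-aware primitive, such as the one the abstract mentions, would aim to reduce exactly this kind of false positive by relating a value to its key.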