Unstructured data (e.g., video or text) is now commonly queried by using computationally expensive deep neural networks or human labelers to produce structured information, e.g., object types and positions in video. To accelerate queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, and SUPG) train a query-specific proxy model to approximate the target labelers (i.e., these expensive neural networks or human labelers). These models return proxy scores that are then used in query processing algorithms. Unfortunately, proxy models usually have to be trained per query and require large amounts of annotations from the target labelers. In this work, we develop an index, the trainable semantic index (TASTI), that simultaneously removes the need for per-query proxies and is more efficient to construct than prior indexes. TASTI accomplishes this by leveraging semantic similarity across records in a given dataset. Specifically, it produces an embedding for each record such that records with close embeddings have similar target labeler outputs. TASTI then generates high-quality proxy scores from these embeddings without training a per-query proxy. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation and selection). We theoretically analyze TASTI and show that a low embedding training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on five video, text, and speech datasets and on three query types. We show that TASTI's indexes can be 10$\times$ less expensive to construct than generating annotations for current proxy-based methods, and that they accelerate queries by up to 24$\times$.
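To make the idea of deriving proxy scores from embeddings concrete, the following is a minimal sketch, not the method as implemented in the paper. The abstract does not say how scores are computed from embeddings, so the sketch assumes one plausible rule: a small set of representative records is annotated by the target labeler, and every other record receives the inverse-distance-weighted average of the labels of its k nearest representatives. The function name `proxy_scores` and the parameters `k` and `eps` are hypothetical.

```python
import numpy as np


def proxy_scores(embeddings, rep_idx, rep_labels, k=2, eps=1e-8):
    """Propagate labeler outputs from annotated representatives to all records.

    Assumption (not from the abstract): a record's proxy score is the
    inverse-distance-weighted mean of the labels of its k nearest
    representatives in embedding space.
    """
    reps = embeddings[rep_idx]  # (m, d) embeddings of annotated records
    # Pairwise Euclidean distances between all records and representatives: (n, m)
    dists = np.linalg.norm(embeddings[:, None, :] - reps[None, :, :], axis=2)
    k = min(k, len(rep_idx))
    nearest = np.argsort(dists, axis=1)[:, :k]            # (n, k) indices into reps
    near_d = np.take_along_axis(dists, nearest, axis=1)   # (n, k) distances
    weights = 1.0 / (near_d + eps)                        # closer => larger weight
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights * rep_labels[nearest]).sum(axis=1)


# Toy data: six records in two well-separated groups of a 2-D embedding space.
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
    [5.0, 5.0], [5.1, 5.0], [4.9, 5.2],
])
rep_idx = np.array([0, 3])          # records annotated by the target labeler
rep_labels = np.array([0.0, 2.0])   # e.g., number of cars in each annotated frame

scores = proxy_scores(embeddings, rep_idx, rep_labels)
print(scores)
```

In this toy setup, records in the first group receive scores near 0 and records in the second group receive scores near 2, without a proxy model being trained for the query. The scores would then be passed to an existing proxy-based query processing algorithm.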