Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics

As modern data pipelines continue to collect, produce, and store data in a variety of formats, extracting and combining value from traditional and context-rich sources such as strings, text, video, audio, and logs becomes a manual process, because such formats are a poor fit for an RDBMS. To tap into this dark data, domain experts analyze it, extract insights, and integrate them into data repositories. This process can involve ad-hoc analysis and processing outside the DBMS, which results in ETL, engineering effort, and suboptimal performance. While AI systems based on ML models can automate the analysis, their answers are often context-rich as well. Using multiple sources of truth, either for training the models or in the form of knowledge bases, further exacerbates the problem of consolidating the data of interest. We envision an analytical engine co-optimized with components that enable context-rich analysis. First, because data that comes from different sources or from model answers cannot be cleaned ahead of time, we propose online data integration via model-assisted similarity operations. Second, we aim for holistic cost- and rule-based optimization of the pipeline across relational and model-based operators. Third, with increasingly heterogeneous hardware and equally heterogeneous workloads, ranging from traditional relational analytics to generative model inference, we envision a system that adapts just in time to the requirements of complex analytical queries. To solve increasingly complex analytical problems, ML offers attractive solutions that must be combined with traditional analytical processing and benefit from decades of database community research, so that scalability and performance come effortlessly to the end user.
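
As an illustration of what a model-assisted similarity operation could look like at query time, the following minimal sketch joins two uncleaned key columns on similarity instead of equality. Everything in it is an assumption made for illustration: `embed` is a hypothetical stand-in that uses character trigram counts where a real system would call a learned embedding model, and `similarity_join`, its nested-loop strategy, and the `threshold` parameter are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def embed(text: str, n: int = 3) -> Counter:
    """Hypothetical stand-in for a model embedding: character n-gram counts."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def similarity_join(left, right, threshold=0.5):
    """Nested-loop join pairing keys that are similar rather than equal.

    Returns (left_key, right_key, score) for every pair whose cosine
    similarity reaches the (assumed) threshold.
    """
    # Embed the right side once so each vector is reused across left rows.
    right_vecs = [(key, embed(key)) for key in right]
    matches = []
    for left_key in left:
        left_vec = embed(left_key)
        for right_key, right_vec in right_vecs:
            score = cosine(left_vec, right_vec)
            if score >= threshold:
                matches.append((left_key, right_key, score))
    return matches


if __name__ == "__main__":
    customers = ["Acme Corp", "Globex Inc."]
    vendors = ["acme corp", "Zebra Ltd"]
    for left_key, right_key, score in similarity_join(customers, vendors):
        print(f"{left_key!r} ~ {right_key!r} (score={score:.2f})")
```

The nested loop compares every pair of keys and is only meant to make the operator's semantics concrete. Treating the similarity join as an ordinary operator is what would let an optimizer reason about its cost alongside relational operators.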
