Integrating machine learning techniques into RDBMSs is an important task, since many real-world applications (e.g., business intelligence, strategic analysis) require modeling as well as querying data in RDBMSs. In this paper, we provide an SQL solution that has the potential to support different kinds of machine learning modeling. As an example, we study how to support unsupervised probabilistic modeling, which has a wide range of applications in clustering, density estimation, and data summarization, and we focus on the Expectation-Maximization (EM) algorithm, a general technique for finding maximum likelihood estimators. Training a model with EM requires updating the model parameters with an E-step and an M-step inside a while-loop, iterating until the model converges to a level controlled by some threshold or until a certain number of iterations has been performed. To support EM in RDBMSs, we present our answers to four questions: how to represent matrices and vectors in RDBMSs, which relational algebra operations support the linear algebra operations required by EM, how to update parameters with relational algebra, and how to support a while-loop. It is important to note that SQL'99 recursion cannot be used to handle such a while-loop, since the M-step is non-monotonic. In addition, assuming that a model has been trained by an EM algorithm, we design an automatic in-database model maintenance mechanism that maintains the model when the underlying training data changes. We have conducted experimental studies and report our findings in this paper.
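To make the loop structure described above concrete, the following is a minimal sketch of EM for a one-dimensional Gaussian mixture in plain Python. It is not the SQL formulation proposed in the paper; it only illustrates the E-step, the M-step, and the two stopping conditions (a convergence threshold and an iteration cap). The function name `em_gmm_1d`, the parameters `tol` and `max_iter`, the variance floor, and the sample data are all assumptions made for this illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, init_means, init_vars, init_weights, tol=1e-6, max_iter=200):
    """Fit a 1-D Gaussian mixture by EM (illustrative sketch, hypothetical API)."""
    means = list(init_means)
    variances = list(init_vars)
    weights = list(init_weights)
    k, n = len(means), len(data)
    prev_ll = -math.inf
    iterations = 0

    # The while-loop: stop on convergence or after max_iter iterations.
    while iterations < max_iter:
        # E-step: responsibility of each component for each data point,
        # computed from the current parameters.
        resp = []
        ll = 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])

        # M-step: overwrite the parameters using the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # assumed floor to avoid zero variance

        iterations += 1
        # Convergence test on the change in log-likelihood.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return means, variances, weights, iterations


# Usage on made-up data with two well-separated groups.
sample = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
means, variances, weights, iterations = em_gmm_1d(
    sample, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

Note that each M-step replaces the previous parameter values rather than adding to them. This replace-in-place behavior is what the abstract's remark about non-monotonicity refers to, and it is why an ordinary loop is used here instead of a fixpoint-style recursion.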