Regression uses supervised machine learning to find a model that combines several independent variables to predict a dependent variable based on ground truth (labeled) data, i.e., tuples of independent and dependent variables (labels). Similarly, aggregation combines several independent variables into a dependent variable. The dependent variable should preserve properties of the independent variables, e.g., the ranking or relative distance of the independent variable tuples, and/or represent a latent ground truth that is a function of these independent variables. However, ground truth data is not available for finding the aggregation model. Consequently, aggregation models are either data-agnostic or can only be derived with unsupervised machine learning approaches. We introduce a novel unsupervised aggregation approach based on intrinsic properties of unlabeled training data, such as the cumulative probability distributions of the individual independent variables and their mutual dependencies. We present an empirical evaluation framework that allows assessing the proposed approach against other aggregation approaches from two perspectives: (i) how well the aggregation output represents properties of the input tuples, and (ii) how well the aggregated output can predict a latent ground truth. To this end, we use data sets intended for assessing supervised regression approaches, which contain explicit ground truth labels. The ground truth is not used for deriving the aggregation models; it only enables the assessment from perspective (ii). More specifically, we use regression data sets from the UCI machine learning repository and benchmark several data-agnostic and unsupervised aggregation approaches against ours. The benchmark results indicate that our approach outperforms the other data-agnostic and unsupervised aggregation approaches and is almost on par with linear regression.
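To make the idea of distribution-based aggregation and the two-perspective evaluation more concrete, the following is a minimal sketch and not the method proposed in this work. It assumes a deliberately simplified aggregator: each independent variable is mapped to its empirical cumulative distribution value, and the per-variable values are averaged, ignoring the mutual dependencies between variables that the abstract mentions. The function names (`ecdf_transform`, `aggregate`, `spearman`) are hypothetical, and the choice of Spearman rank correlation as the measure for perspective (ii) is likewise an assumption.

```python
from bisect import bisect_right


def ecdf_transform(column):
    """Map each value to its empirical CDF value: the share of values <= it."""
    ordered = sorted(column)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in column]


def aggregate(rows):
    """Aggregate each tuple by averaging its per-variable ECDF scores.

    Assumption: variables are treated as independent and as "higher is
    better"; no labels are used at any point.
    """
    columns = list(zip(*rows))
    scored = [ecdf_transform(c) for c in columns]
    return [sum(s) / len(s) for s in zip(*scored)]


def ranks(values):
    """Ordinal ranks; ties are broken by position (adequate for continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    # Toy data: three tuples of two independent variables plus a held-out label.
    rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
    labels = [0.5, 1.5, 2.5]  # used only for evaluation, never for fitting
    scores = aggregate(rows)
    print(scores)
    print(spearman(scores, labels))
```

In this sketch the labels enter only in the final `spearman` call, mirroring the setup in which ground truth is withheld from model derivation and used solely to assess how well the aggregated output tracks a latent target.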