Many real-world data sets can be represented as a matrix whose entries correspond to the interaction between two entities of different kinds (the number of times a web user visits a web page, a student's grade in a subject, a patient's rating of a doctor, etc.). We assume in this paper that this interaction is determined by unobservable latent variables describing each entity. Our objective is to estimate the conditional expectation of the data matrix given the unobservable variables. This is cast as the problem of estimating a bivariate function referred to as a graphon. We study the cases of piecewise constant and Hölder-continuous graphons. We establish finite-sample risk bounds for the least-squares estimator and the exponentially weighted aggregate. These bounds highlight the dependence of the estimation error on the size of the data set, the maximum intensity of the interactions, and the level of noise. Because the least-squares estimator is computationally intractable, we propose an adaptation of Lloyd's alternating minimization algorithm to compute an approximation of it. Finally, we present numerical experiments to illustrate the empirical performance of the graphon estimator on synthetic data sets.
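To make the alternating-minimization idea concrete, the following is a minimal sketch of a Lloyd-type procedure for the piecewise constant case. It is not the adaptation proposed in the paper. It assumes the data matrix is fitted by a block-constant matrix with K row clusters and L column clusters, and alternates between averaging over blocks and reassigning rows and columns to their best-fitting cluster. The names `block_means` and `lloyd_block_fit`, the random initialization, and the fallback value for empty blocks are all illustrative assumptions.

```python
import numpy as np


def block_means(A, z, w, K, L):
    """Average of A over each (row-cluster, column-cluster) block."""
    Zr = np.eye(K)[z]                 # n x K one-hot row memberships
    Zc = np.eye(L)[w]                 # m x L one-hot column memberships
    sums = Zr.T @ A @ Zc              # K x L block sums
    counts = np.outer(Zr.sum(axis=0), Zc.sum(axis=0))
    # Assumption: empty blocks fall back to the global mean of A.
    return np.where(counts > 0, sums / np.maximum(counts, 1), A.mean())


def lloyd_block_fit(A, K, L, n_iter=50, seed=0):
    """Heuristic block-constant least-squares fit by alternating minimization.

    Returns the fitted n x m matrix, the row labels and the column labels.
    Only a local minimum of the least-squares criterion is reached in general.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    z = rng.integers(K, size=n)       # random initial row labels
    w = rng.integers(L, size=m)       # random initial column labels
    for _ in range(n_iter):
        # Step 1: block values given the current labels.
        Q = block_means(A, z, w, K, L)
        # Step 2: row_cost[i, k] = sum_j (A[i, j] - Q[k, w[j]])**2
        row_cost = ((A[:, None, :] - Q[:, w][None, :, :]) ** 2).sum(axis=2)
        z_new = row_cost.argmin(axis=1)
        # Step 3: refresh block values, then reassign columns symmetrically.
        Q = block_means(A, z_new, w, K, L)
        col_cost = ((A.T[:, None, :] - Q[z_new, :].T[None, :, :]) ** 2).sum(axis=2)
        w_new = col_cost.argmin(axis=1)
        if np.array_equal(z_new, z) and np.array_equal(w_new, w):
            break                     # labels are stable: stop early
        z, w = z_new, w_new
    Q = block_means(A, z, w, K, L)
    return Q[np.ix_(z, w)], z, w
```

Calling `lloyd_block_fit(A, K=3, L=2)` on an `n x m` array `A` returns a fitted matrix of the same shape together with the two label vectors. Since the result depends on the initialization, running the procedure from several seeds and keeping the fit with the smallest residual sum of squares is a common precaution.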