We study a regression problem in which, for one part of the data, we observe both the label variable ($Y$) and the predictors (${\bf X}$), while for the other part only the predictors are given. Such a problem arises, for example, when observations of the label variable are costly and may require a skilled human agent. When the conditional expectation $E[Y | {\bf X}]$ is not exactly linear, one can consider the best linear approximation to the conditional expectation, which can be estimated consistently by the least squares estimates (LSE). The LSE depends only on the labeled data. We suggest improved alternative estimates to the LSE that also use the unlabeled data. Our estimation method is easy to implement and has asymptotic properties that are simple to describe. The new estimates asymptotically dominate the usual standard procedures under a certain non-linearity condition on $E[Y | {\bf X}]$; otherwise, they are asymptotically equivalent. The performance of the new estimator for small sample sizes is investigated in an extensive simulation study. A real data example of inferring the homeless population is used to illustrate the new methodology.
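To make the setting concrete, the following is a minimal sketch of the data structure and of the labeled-only LSE baseline. It does not implement the proposed estimator, whose construction is not given in this abstract. The single standard normal predictor, the conditional mean $E[Y | X] = X + X^2$, the noise level, the sample sizes, and the function names `simulate` and `least_squares` are all assumptions chosen for illustration. Under these assumptions the best linear approximation has intercept 1 and slope 1, since $Cov(X, Y) = Var(X) + E[X^3] = 1$ and $E[Y] = 1$.

```python
import random


def simulate(n_labeled, n_unlabeled, seed=0):
    """Draw predictors for both parts of the data; labels only for the labeled part."""
    rng = random.Random(seed)
    x_lab = [rng.gauss(0.0, 1.0) for _ in range(n_labeled)]
    # Assumed non-linear conditional mean: E[Y | X] = X + X^2 (illustrative only).
    y_lab = [x + x * x + rng.gauss(0.0, 0.5) for x in x_lab]
    # The unlabeled part carries predictors but no labels.
    x_unlab = [rng.gauss(0.0, 1.0) for _ in range(n_unlabeled)]
    return x_lab, y_lab, x_unlab


def least_squares(x, y):
    """Closed-form LSE (intercept, slope) for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


x_lab, y_lab, x_unlab = simulate(n_labeled=20000, n_unlabeled=50000)
# The baseline LSE uses the labeled pairs only and ignores x_unlab.
intercept, slope = least_squares(x_lab, y_lab)
print(intercept, slope)
```

Although the assumed conditional mean is quadratic, the LSE still targets a well-defined quantity, the best linear approximation; the unlabeled predictors in `x_unlab` are the extra information that the baseline leaves unused.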