In many data analysis tasks, only the explanatory variables are available, and evaluating the response values is quite expensive. When it is impractical or too costly to obtain the responses of all units, a natural remedy is to judiciously select a good sample of units for which the responses are to be evaluated. In this paper, we adopt the classical criteria from the design of experiments to quantify the information that a given sample carries about parameter estimation. We then provide a theoretical justification for approximating the optimal sample problem by a continuous problem, for which fast algorithms with guaranteed global convergence can be developed. Our results have the following novelties: (i) the statistical efficiency of any candidate sample can be evaluated without knowing the exact optimal sample; (ii) the approach applies to a very wide class of statistical models; (iii) it can be integrated with a broad class of information criteria; (iv) it is much faster than existing algorithms; (v) a geometric interpretation is adopted to theoretically justify the relaxation of the original combinatorial problem to a continuous optimization problem.
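To make the relaxation idea concrete, the following is a minimal sketch under stated assumptions, not the paper's algorithm. It assumes a linear model and the D-criterion (log-determinant of the information matrix), and it uses the classical multiplicative weight update for the relaxed problem over weights on the simplex. The function names `information_matrix`, `variance_function`, `relaxed_d_optimal_weights`, and `d_efficiency_lower_bound` are hypothetical. The efficiency bound is the standard one that follows from concavity of the D-criterion: a design with variance function d_i = x_i^T M^{-1} x_i has D-efficiency of at least p / max_i d_i, which can be computed without knowing the optimal sample.

```python
import numpy as np


def information_matrix(X, w):
    """Weighted information matrix M(w) = sum_i w_i x_i x_i^T (linear model)."""
    return X.T @ (w[:, None] * X)


def variance_function(X, w):
    """d_i = x_i^T M(w)^{-1} x_i for every candidate unit i."""
    M_inv = np.linalg.inv(information_matrix(X, w))
    return np.einsum("ij,jk,ik->i", X, M_inv, X)


def relaxed_d_optimal_weights(X, n_iter=500):
    """Classical multiplicative update for the relaxed D-optimal problem.

    Assumption: this is an illustrative stand-in, not the paper's method.
    Since sum_i w_i d_i = p, the update keeps the weights on the simplex.
    """
    N, p = X.shape
    w = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        w = w * variance_function(X, w) / p
    return w


def d_efficiency_lower_bound(X, w):
    """Lower bound p / max_i d_i on the D-efficiency of the design w."""
    p = X.shape[1]
    return p / variance_function(X, w).max()


# Illustrative data: intercept plus one covariate for N candidate units.
rng = np.random.default_rng(0)
N, n = 200, 10
X = np.column_stack([np.ones(N), rng.uniform(-1.0, 1.0, size=N)])

w_relaxed = relaxed_d_optimal_weights(X)

# A simple (hypothetical) rounding rule: keep the n units with largest weight.
chosen = np.argsort(w_relaxed)[-n:]
w_sample = np.zeros(N)
w_sample[chosen] = 1.0 / n

# Efficiency guarantee for the candidate sample, computed without
# solving the combinatorial problem exactly.
bound = d_efficiency_lower_bound(X, w_sample)
```

The rounding step here is deliberately naive; the point of the sketch is only that the bound certifies a candidate sample against the unknown optimum, because the normalized information matrix of any size-n sample lies in the feasible set of the relaxed problem.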