We consider the problem of estimating the joint distribution of $n$ independent random variables. Our approach is based on a family of candidate probabilities, which we call a model, chosen either to contain the true distribution of the data or at least to approximate it well with respect to some loss function. The aim of this paper is to describe a general estimation strategy that adapts to both the specific features of the model and the choice of the loss function, with a view to designing an estimator with good estimation properties. The losses we have in mind are based on the total variation, Hellinger, Wasserstein and $\mathbb{L}_p$-distances, to name a few. We show that the risk of the resulting estimator with respect to the loss function can be bounded by the sum of an approximation term, which accounts for the loss between the true distribution and the model, and a complexity term, which corresponds to the bound we would get if this distribution did belong to the model. Our results hold under mild assumptions on the true distribution of the data and are based on exponential deviation inequalities that are non-asymptotic and involve explicit constants. When the model reduces to two distinct probabilities, we show how our estimation strategy leads to a robust test whose errors of the first and second kinds depend only on the losses between the true distribution and the two tested probabilities.
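Schematically, a risk bound of the kind described in this abstract can be written as follows. The notation is illustrative and not taken from the paper: $\ell$ stands for the loss, $P^\star$ for the true distribution, $\mathscr{M}$ for the model, $\widehat{P}$ for the estimator, $C$ for a placeholder numerical constant and $D_n(\mathscr{M})$ for a placeholder complexity term. The exact statement, the constants and any normalization in $n$ are assumptions here and may differ in the paper.

$$
\mathbb{E}\left[\ell\big(P^\star,\widehat{P}\big)\right] \;\le\; C\left[\,\inf_{P\in\mathscr{M}} \ell\big(P^\star,P\big) \;+\; D_n(\mathscr{M})\right].
$$

Under this reading, the infimum plays the role of the approximation term and vanishes when $P^\star\in\mathscr{M}$, in which case only the complexity term $D_n(\mathscr{M})$ remains.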