Should data ever be thrown away? Pooling interval-censored data sets with different precision

Data quality is an important consideration in many engineering applications and projects. Data collection procedures do not always use the most precise instruments or the strictest protocols. As a consequence, data are invariably affected by imprecision, and the quality of the data can vary sharply. Different mathematical representations of imprecision have been suggested, including a classical approach to censored data, which is considered optimal when the proposed error model is correct, and a weaker approach called interval statistics, based on partial identification, which makes fewer assumptions. Maximizing the quality of statistical results is often crucial to the success of engineering projects, and a natural question is whether data of differing quality should be pooled, or whether only precise measurements should be included and imprecise data disregarded. Some worry that combining precise and imprecise measurements can degrade the overall quality of the pooled data. Others fear that excluding data of lesser precision can increase the overall uncertainty about results, because a smaller sample size implies more sampling uncertainty. This paper explores these concerns and describes simulation results that show when it is advisable to combine fairly precise data with rather imprecise data, comparing analyses that use different mathematical representations of imprecision. Pooling data sets is preferred when the low-quality data set does not exceed a certain level of uncertainty. However, so long as the data are random, it may be legitimate to reject the low-quality data if its reduction of sampling uncertainty does not counterbalance the effect of its imprecision on the overall uncertainty.
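The trade-off between imprecision and sampling uncertainty can be illustrated with a minimal sketch. It is not the simulation used in the paper. It assumes, purely for illustration, that every measurement is an interval with a known half-width, that the sampling standard deviation `sigma` is known, and that the overall uncertainty is the width of the envelope of normal-theory 95% confidence intervals for the mean (the width of the interval-valued sample mean plus the sampling term). The function names `total_width` and `should_pool` are hypothetical.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% standard normal quantile


def total_width(groups, sigma, z=Z95):
    """Width of the confidence-interval envelope for the mean.

    groups: list of (count, half_width) pairs, one per data set.
    sigma:  assumed known sampling standard deviation (an assumption
            of this sketch, not something stated in the abstract).
    """
    n = sum(count for count, _ in groups)
    # Width of the interval-valued sample mean: average of interval widths.
    imprecision = sum(count * 2.0 * half for count, half in groups) / n
    # Width contributed by sampling uncertainty at sample size n.
    sampling = 2.0 * z * sigma / math.sqrt(n)
    return imprecision + sampling


def should_pool(n_precise, hw_precise, n_imprecise, hw_imprecise, sigma):
    """True if pooling gives a narrower envelope than the precise data alone."""
    precise_only = total_width([(n_precise, hw_precise)], sigma)
    pooled = total_width(
        [(n_precise, hw_precise), (n_imprecise, hw_imprecise)], sigma
    )
    return pooled < precise_only


if __name__ == "__main__":
    # Moderately imprecise extra data: the sample-size gain dominates.
    print(should_pool(10, 0.1, 10, 0.2, sigma=1.0))
    # Very imprecise extra data: the added imprecision dominates.
    print(should_pool(10, 0.1, 10, 2.0, sigma=1.0))
```

Under these assumptions, pooling is favored only while the half-width of the low-quality data stays below a threshold that depends on both sample sizes and on `sigma`, which mirrors the qualitative conclusion stated above.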
