In a subjective experiment to evaluate the perceptual audiovisual quality of multimedia and television services, raw opinion scores collected from test subjects are often noisy and unreliable. To produce the final mean opinion scores (MOS), recommendations such as ITU-R BT.500, ITU-T P.910 and ITU-T P.913 standardize post-test screening procedures to clean up the raw opinion scores, using techniques such as subject outlier rejection and bias removal. In this paper, we analyze the prior standardized techniques to demonstrate their weaknesses. As an alternative, we propose a simple model to account for the two most dominant behaviors of subject inaccuracy: bias and inconsistency. We further show that this model can also effectively deal with inattentive subjects who give random scores. We propose to use maximum likelihood estimation to jointly solve for the model parameters, and present two numerical solvers: the first based on the Newton-Raphson method, and the second based on alternating projection (AP). We show that the AP solver generalizes the ITU-T P.913 post-test screening procedure by weighting a subject's contribution to the true quality score by her consistency (thus, the estimated quality scores can be interpreted as bias-subtracted consistency-weighted MOS). We compare the proposed methods with the standardized techniques using real datasets and synthetic simulations, and demonstrate that the proposed methods are most valuable when the test conditions are challenging (for example, crowdsourcing and cross-lab studies), offering advantages such as better model-data fit, tighter confidence intervals, better robustness against subject outliers, the absence of hard-coded parameters and thresholds, and auxiliary information on test subjects. The code for this work is open-sourced at https://github.com/Netflix/sureal.
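To make the idea of a bias-subtracted consistency-weighted MOS concrete, the following is a minimal sketch of an alternating-projection style iteration. It is not taken from the sureal code base. It assumes a model of the form u_ij = psi_j + delta_i + v_i * noise (true quality of stimulus j, bias and inconsistency of subject i), which the abstract does not spell out, and a fully populated score matrix. The function name `ap_solve` and its parameters are hypothetical.

```python
import numpy as np


def ap_solve(scores, n_iter=100, tol=1e-8):
    """Illustrative alternating-projection sketch (assumed model, see lead-in).

    scores: array of shape (n_subjects, n_stimuli) with no missing entries.
    Returns (quality, bias, inconsistency).
    """
    u = np.asarray(scores, dtype=float)
    psi = u.mean(axis=0)  # start from the plain MOS of each stimulus

    for _ in range(n_iter):
        resid = u - psi  # each subject's deviation from current quality
        bias = resid.mean(axis=1)  # per-subject mean offset
        # Per-subject spread of what is left after removing the bias.
        incons = (resid - bias[:, None]).std(axis=1)
        incons = np.maximum(incons, 1e-6)  # guard against division by zero

        # Consistent subjects (small spread) receive larger weights.
        w = 1.0 / incons**2
        psi_new = (w[:, None] * (u - bias[:, None])).sum(axis=0) / w.sum()

        converged = np.max(np.abs(psi_new - psi)) < tol
        psi = psi_new
        if converged:
            break

    return psi, bias, incons
```

Calling `ap_solve(scores)` on a subjects-by-stimuli matrix returns the estimated quality per stimulus along with a bias and an inconsistency value per subject. With equal weights and a single pass, the quality update reduces to a bias-subtracted MOS; the weighting step is what lets an inattentive subject with scattered scores contribute less.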