Distant and weak supervision make it possible to obtain large amounts of labeled training data quickly and cheaply, but these automatic annotations tend to contain many errors. A popular technique to overcome the negative effects of these noisy labels is noise modelling, where the underlying noise process is modelled. In this work, we study the quality of these estimated noise models from the theoretical side by deriving the expected error of the noise model. Apart from evaluating the theoretical results on commonly used synthetic noise, we also publish NoisyNER, a new noisy label dataset from the NLP domain that was obtained through a realistic distant supervision technique. It provides seven sets of labels with differing noise patterns to evaluate different noise levels on the same instances. Parallel, clean labels are available, making it possible to study scenarios where a small amount of gold-standard data can be leveraged. Our theoretical results and the corresponding experiments give insights into the factors that influence noise model estimation, such as the noise distribution and the sampling technique.
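The noise model discussed here can be pictured as a class-confusion matrix whose entry (i, j) gives the probability that an instance with clean label i receives noisy label j, estimated from a small paired sample of clean and noisy labels. The following minimal sketch (all function and variable names are hypothetical, not the paper's implementation) illustrates this estimation step under that assumption.

```python
import numpy as np

def estimate_noise_matrix(clean_labels, noisy_labels, num_classes):
    """Estimate a label-noise (confusion) matrix from a small sample of
    instances for which both clean and noisy labels are available.
    Entry [i, j] approximates p(noisy = j | clean = i)."""
    counts = np.zeros((num_classes, num_classes))
    for c, n in zip(clean_labels, noisy_labels):
        counts[c, n] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for classes unseen in the clean sample.
    row_sums[row_sums == 0] = 1
    return counts / row_sums

# Toy example: 3 classes, a small paired sample of clean and noisy labels.
clean = [0, 0, 1, 1, 2, 2, 0, 1]
noisy = [0, 1, 1, 1, 2, 0, 0, 2]
print(estimate_noise_matrix(clean, noisy, num_classes=3))
```

Since each row is estimated from only the instances of that class in the small gold-standard sample, the estimate's expected error depends on how that sample is drawn and on the noise distribution itself, which is exactly the effect the theoretical analysis quantifies.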