Training accurate intent classifiers requires labeled data, which can be costly to obtain. Data augmentation methods may ameliorate this issue, but the quality of the generated data varies significantly across techniques. We study how to systematically produce pseudo-labeled data from a small seed set using a wide variety of data augmentation techniques, including combinations of methods. We find that while certain methods dramatically improve qualitative and quantitative performance, others have minimal or even negative impact. We also analyze key considerations when implementing data augmentation methods in production.