In natural language understanding (NLU) production systems, users' evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation of this incremental symbol learning scenario. Our analysis reveals a troubling quirk in building broad-coverage NLU systems: as the training dataset grows, performance on the new symbol often decreases unless we increase its training data accordingly. This suggests that it becomes more difficult to learn new symbols with a larger training dataset. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated with an effect we call source signal dilution, where strong lexical cues for the new symbol become diluted as the training dataset grows. Selectively dropping training examples to prevent dilution often reverses the trend, showing the over-reliance of mainstream neural NLU models on simple lexical cues. Code, models, and data are available at https://aka.ms/nlu-incremental-symbol-learning.
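To make the "source signal dilution" intuition concrete, the following is a minimal illustrative sketch (not the released code linked above): it approximates the strength of a lexical cue as P(new symbol | cue token) over a labeled dataset, shows how that probability drops as the dataset grows with other-symbol examples containing the same token, and applies a selective-dropping filter in the spirit of the abstract. The intent names, cue tokens, and helper functions are hypothetical, chosen only for the example.

```python
# Illustrative sketch (assumptions, not the paper's released code):
# "source signal dilution" approximated as the drop in P(symbol | cue token)
# when the training set grows, plus a selective-dropping filter that discards
# newly added examples containing a strong cue token but carrying another label.
from collections import Counter

def cue_strength(dataset, token, symbol):
    """P(symbol | token): how strongly a lexical cue predicts the new symbol."""
    with_token = [label for text, label in dataset if token in text.split()]
    if not with_token:
        return 0.0
    return Counter(with_token)[symbol] / len(with_token)

def drop_diluting_examples(existing, added, cue_tokens, symbol):
    """Keep only added examples that do not dilute the cue tokens for `symbol`."""
    kept = [
        (text, label) for text, label in added
        if label == symbol or not any(t in text.split() for t in cue_tokens)
    ]
    return existing + kept

# Toy data: "reminder" is a strong cue for the (hypothetical) new intent CreateReminder.
base = [("set a reminder for noon", "CreateReminder"),
        ("remind me to call mom", "CreateReminder")]
growth = [("delete that reminder email", "DeleteEmail"),
          ("play some music", "PlayMusic")]

print(cue_strength(base, "reminder", "CreateReminder"))           # 1.0
print(cue_strength(base + growth, "reminder", "CreateReminder"))  # 0.5 (diluted)
filtered = drop_diluting_examples(base, growth, {"reminder"}, "CreateReminder")
print(cue_strength(filtered, "reminder", "CreateReminder"))       # 1.0 (dilution prevented)
```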