In natural language understanding (NLU) production systems, users' evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation into this incremental symbol learning scenario. Our analyses reveal a troubling quirk in building (broad-coverage) NLU systems: as the training dataset grows, more data is needed to learn new symbols, forming a vicious cycle. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated with an effect we call source signal dilution, where strong lexical cues for the new symbol become diluted as the training dataset grows. Selectively dropping training examples to prevent dilution often reverses the trend, showing the over-reliance of mainstream neural NLU models on simple lexical cues and their lack of contextual understanding.
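To make the "source signal dilution" effect concrete, here is a minimal sketch under assumed, simplified conditions: a toy intent-recognition dataset, a single lexical cue word, and a plain conditional probability P(symbol | cue) as a stand-in for cue strength. The function and variable names, the toy data, and the dropping criterion (remove examples that contain the cue but are not labeled with the new symbol) are illustrative assumptions, not the paper's actual metric or filtering procedure.

```python
# Toy illustration (assumed setup, not the paper's code) of source signal
# dilution: the association between a lexical cue and a new symbol weakens
# as unrelated training examples accumulate, and selective dropping can
# restore it.

def prob_symbol_given_cue(dataset, cue, symbol):
    """P(symbol | cue appears in the utterance): a simple proxy for how
    strongly the cue signals the symbol in the training data."""
    with_cue = [label for text, label in dataset if cue in text.split()]
    if not with_cue:
        return 0.0
    return sum(1 for label in with_cue if label == symbol) / len(with_cue)

def drop_diluting_examples(dataset, cue, symbol):
    """Assumed selective-dropping rule: remove examples that contain the cue
    but are NOT labeled with the new symbol, so the cue-symbol association
    is not diluted."""
    return [(text, label) for text, label in dataset
            if cue not in text.split() or label == symbol]

if __name__ == "__main__":
    # Hypothetical data: "alarm" is a strong cue for the new intent
    # SET_ALARM, but as the dataset grows, older intents also mention it.
    small = [("set an alarm for 7", "SET_ALARM"),
             ("wake me with an alarm", "SET_ALARM")]
    grown = small + [("turn off the fire alarm notification", "SMART_HOME"),
                     ("is the car alarm covered by insurance", "FAQ"),
                     ("play some music", "PLAY_MUSIC")] * 10
    print(prob_symbol_given_cue(small, "alarm", "SET_ALARM"))    # 1.0
    print(prob_symbol_given_cue(grown, "alarm", "SET_ALARM"))    # ~0.09, diluted
    filtered = drop_diluting_examples(grown, "alarm", "SET_ALARM")
    print(prob_symbol_given_cue(filtered, "alarm", "SET_ALARM")) # 1.0, restored
```

The sketch only shows why a model that leans on the lexical cue would find the new symbol harder to learn as the dataset grows; the paper's claim that dropping diluting examples often reverses the trend refers to retraining the NLU models on the filtered data.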