Recently, attention-enhanced multi-layer encoders, such as the Transformer, have been extensively studied in Machine Reading Comprehension (MRC). To predict the answer, it is common practice to employ a predictor that draws information only from the final encoder layer, which produces \textit{coarse-grained} representations of the source sequences, i.e., the passage and the question. Previous studies have shown that the representations of the source sequences shift from \textit{fine-grained} to \textit{coarse-grained} as the encoding depth increases. It is generally believed that as the number of layers in a deep neural network grows, each position progressively aggregates information from relevant positions, yielding increasingly \textit{coarse-grained} representations that become more similar to one another (i.e., homogeneous). This homogeneity can mislead the model into wrong predictions and thus degrade performance. To this end, we propose a novel approach called Adaptive Bidirectional Attention, which adaptively exposes source representations from different encoder levels to the predictor. Experimental results on the SQuAD 2.0 benchmark demonstrate the effectiveness of our approach, which outperforms the previous state-of-the-art model by 2.5$\%$ EM and 2.3$\%$ F1 scores.
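To make the idea of adaptively exploiting multiple encoder levels concrete, below is a minimal sketch of one plausible realization: a learned, token-conditioned softmax gate that mixes per-layer encoder outputs into a single fused representation for the predictor. The class name \texttt{AdaptiveLayerFusion} and the gating scheme are illustrative assumptions, not the paper's actual implementation of Adaptive Bidirectional Attention.

```python
import torch
import torch.nn as nn

class AdaptiveLayerFusion(nn.Module):
    """Illustrative sketch: adaptively mix encoder-layer outputs before the predictor.

    Instead of feeding the predictor only the final (coarse-grained) layer,
    every layer contributes, weighted by a learned, token-conditioned gate.
    """
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        # Token-conditioned score: one scalar per (layer, token).
        self.score = nn.Linear(hidden_dim, 1, bias=False)
        # Learned per-layer prior, broadcast over batch and sequence.
        self.layer_bias = nn.Parameter(torch.zeros(num_layers, 1, 1, 1))

    def forward(self, layer_outputs):
        # layer_outputs: list of L tensors, each (batch, seq_len, hidden_dim)
        stacked = torch.stack(layer_outputs, dim=0)            # (L, batch, seq, dim)
        scores = self.score(stacked) + self.layer_bias         # (L, batch, seq, 1)
        weights = torch.softmax(scores, dim=0)                 # normalize over layers
        return (weights * stacked).sum(dim=0)                  # (batch, seq, dim)

# Hypothetical usage: fuse all encoder layers, then hand the result to a span predictor.
fusion = AdaptiveLayerFusion(num_layers=6, hidden_dim=128)
hidden_states = [torch.randn(2, 50, 128) for _ in range(6)]
fused = fusion(hidden_states)  # (2, 50, 128), mixes fine- and coarse-grained levels
```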