Training recurrent neural networks on long texts, in particular scholarly documents, causes problems for learning. While hierarchical attention networks (HANs) are effective in solving these problems, they still lose important information about the structure of the text. To address this, we propose the use of HANs combined with structure-tags that mark the role of sentences in the document. Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction. The proposed system is applied to the task of accept/reject prediction on the PeerRead dataset and compared against a recent BiLSTM-based model, a joint textual+visual model, and plain HANs. Compared to plain HANs, accuracy increases on all three domains. On the computation and language domain our new model works best overall, increasing accuracy by 4.7% over the best result in the literature. We also obtain improvements when introducing the tags for prediction of the number of citations for 88k scientific publications that we compiled from the Allen AI S2ORC dataset. With structure-tags, our HAN system reaches 28.5% explained variance, an improvement of 1.8% over our reimplementation of the BiLSTM-based model and of 1.0% over plain HANs.
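As a minimal sketch of the idea described above (not the authors' released code), the snippet below shows one plausible way to attach structure-tags to sentences before they are passed to a HAN's word-level encoder. The tag tokens (`<TITLE>`, `<ABSTRACT>`, `<BODY>`) and the helper `add_structure_tags` are hypothetical names introduced here for illustration.

```python
# Sketch: prepend a structure-tag token to each sentence so the HAN can learn
# to exploit the sentence's role (title, abstract, or main body text).
# Assumption: tags are treated as ordinary vocabulary tokens.

from typing import List, Tuple

# Hypothetical tag vocabulary marking the role of each sentence in the document.
STRUCTURE_TAGS = {
    "title": "<TITLE>",
    "abstract": "<ABSTRACT>",
    "body": "<BODY>",
}

def add_structure_tags(sections: List[Tuple[str, List[str]]]) -> List[List[str]]:
    """Prepend a structure-tag token to the token list of every sentence.

    `sections` is a list of (role, sentences) pairs, e.g.
    [("title", [...]), ("abstract", [...]), ("body", [...])].
    Returns tokenized sentences ready for the HAN's word-level encoder.
    """
    tagged_sentences = []
    for role, sentences in sections:
        tag = STRUCTURE_TAGS[role]
        for sentence in sentences:
            # The tag becomes the first token of the sentence; word-level
            # attention can then use it as a cue for the sentence's role.
            tagged_sentences.append([tag] + sentence.split())
    return tagged_sentences

if __name__ == "__main__":
    doc = [
        ("title", ["Structure-tags improve scholarly document quality prediction"]),
        ("abstract", ["We add role tags to every sentence ."]),
        ("body", ["The HAN encodes words , then sentences ."]),
    ]
    for sent in add_structure_tags(doc):
        print(sent)
```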