Training recurrent neural networks on long texts, in particular scholarly documents, causes problems for learning. While hierarchical attention networks (HANs) are effective in solving these problems, they still lose important information about the structure of the text. To tackle these problems, we propose the use of HANs combined with structure-tags that mark the role of sentences in the document. Adding tags to sentences, marking them as corresponding to the title, abstract, or main body text, yields improvements over the state of the art for scholarly document quality prediction: substantial gains on average against other models and consistent improvements over HANs without structure-tags. The proposed system is applied to the task of accept/reject prediction on the PeerRead dataset and compared against a recent BiLSTM-based model and a joint textual+visual model. It gains 4.7% accuracy over the better of the two models on the computation and language domain and loses 2.4% against the better of the two on the machine learning domain. Compared to plain HANs, accuracy increases on both domains, by 1.5% and 2% respectively. We also obtain improvements when introducing the tags for the prediction of the number of citations of 88k scientific publications that we compiled from the Allen AI S2ORC dataset. With structure-tags, our HAN system reaches 28.5% explained variance, an improvement of 1.0% over HANs without structure-tags.
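A minimal sketch of how structure-tags might be attached to sentences before they enter a HAN, assuming tags are realized as special tokens prepended to each tokenized sentence. The tag strings (<TITLE>, <ABSTRACT>, <BODY>) and the helper name `add_structure_tags` are illustrative assumptions, not the paper's exact implementation.

```python
from typing import List, Tuple

# Hypothetical tag tokens marking the document role of each sentence.
STRUCTURE_TAGS = {"title": "<TITLE>", "abstract": "<ABSTRACT>", "body": "<BODY>"}


def add_structure_tags(sentences: List[Tuple[str, List[str]]]) -> List[List[str]]:
    """Prepend a structure-tag token to every tokenized sentence.

    `sentences` is a list of (section, tokens) pairs, where `section` is one of
    'title', 'abstract', or 'body'. The returned token lists can then be fed to
    the word-level encoder of a HAN; the tag token is embedded and attended over
    like any other word, letting the model condition on sentence role.
    """
    return [[STRUCTURE_TAGS[section]] + tokens for section, tokens in sentences]


if __name__ == "__main__":
    doc = [
        ("title", ["structure", "tags", "for", "quality", "prediction"]),
        ("abstract", ["we", "propose", "hans", "with", "structure", "tags"]),
        ("body", ["the", "model", "is", "evaluated", "on", "peerread"]),
    ]
    for tagged_sentence in add_structure_tags(doc):
        print(tagged_sentence)
```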