Training recurrent neural networks on long texts, in particular scholarly documents, causes problems for learning. While hierarchical attention networks (HANs) are effective in solving these problems, they still lose important information about the structure of the text. To address this, we propose the use of HANs combined with structure-tags that mark the role of sentences in the document. Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction. The proposed system is applied to the task of accept/reject prediction on the PeerRead dataset and compared against a recent BiLSTM-based model, a joint textual+visual model, and plain HANs. Compared to plain HANs, accuracy increases on all three domains. On the computation and language domain our new model works best overall, increasing accuracy by 4.7% over the best result in the literature. We also obtain improvements when introducing the tags for prediction of the number of citations for 88k scientific publications that we compiled from the Allen AI S2ORC dataset. With structure-tags, our HAN system reaches 28.5% explained variance, an improvement of 1.8% over our reimplementation of the BiLSTM-based model and of 1.0% over plain HANs.
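As a minimal sketch of the idea described above (not the authors' released code), the snippet below shows one plausible way to attach structure-tags to sentences before they are passed to a HAN's word-level encoder. The tag tokens (`<TITLE>`, `<ABSTRACT>`, `<BODY>`) and the helper `add_structure_tags` are hypothetical names introduced here for illustration.

```python
# Sketch: prepend a structure-tag token to each sentence so the HAN can learn
# to exploit the sentence's role (title, abstract, or main body text).
# Assumption: tags are treated as ordinary vocabulary tokens.

from typing import List, Tuple

# Hypothetical tag vocabulary marking the role of each sentence in the document.
STRUCTURE_TAGS = {
    "title": "<TITLE>",
    "abstract": "<ABSTRACT>",
    "body": "<BODY>",
}

def add_structure_tags(sections: List[Tuple[str, List[str]]]) -> List[List[str]]:
    """Prepend a structure-tag token to the token list of every sentence.

    `sections` is a list of (role, sentences) pairs, e.g.
    [("title", [...]), ("abstract", [...]), ("body", [...])].
    Returns tokenized sentences ready for the HAN's word-level encoder.
    """
    tagged_sentences = []
    for role, sentences in sections:
        tag = STRUCTURE_TAGS[role]
        for sentence in sentences:
            # The tag becomes the first token of the sentence; word-level
            # attention can then use it as a cue for the sentence's role.
            tagged_sentences.append([tag] + sentence.split())
    return tagged_sentences

if __name__ == "__main__":
    doc = [
        ("title", ["Structure-tags improve scholarly document quality prediction"]),
        ("abstract", ["We add role tags to every sentence ."]),
        ("body", ["The HAN encodes words , then sentences ."]),
    ]
    for sent in add_structure_tags(doc):
        print(sent)
```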