Recently, the structural reading comprehension (SRC) task on web pages has attracted increasing research interest. Although previous SRC work has leveraged extra information such as HTML tags or XPaths, the informative topology of web pages has not been effectively exploited. In this work, we propose a Topological Information Enhanced model (TIE), which transforms the token-level task into a tag-level task by introducing a two-stage process (i.e., node locating and answer refining). Building on this, TIE integrates a Graph Attention Network (GAT) and a Pre-trained Language Model (PLM) to leverage the topological information of both logical and spatial structures. Experimental results demonstrate that our model outperforms strong baselines and, at the time of writing, achieves state-of-the-art performance on the web-based SRC benchmark WebSRC. The code for TIE will be made publicly available at https://github.com/X-LANCE/TIE.
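To make the tag-level view concrete, the following is a minimal sketch of how the logical structure of a page could be turned into a graph whose nodes are HTML tags and whose edges are parent-child relations. It is an illustration only and not the TIE implementation: the names `DomGraphBuilder` and `build_dom_graph` are hypothetical, only the Python standard library is used, and void elements written without a closing slash (e.g. `<br>`) are assumed not to occur.

```python
from html.parser import HTMLParser


class DomGraphBuilder(HTMLParser):
    """Hypothetical helper: collects tag nodes and parent-child edges."""

    def __init__(self):
        super().__init__()
        self.tags = []    # tags[i] is the tag name of node i
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        index = len(self.tags)
        self.tags.append(tag)
        if self._stack:
            # The innermost open element is the parent of the new node.
            self.edges.append((self._stack[-1], index))
        self._stack.append(index)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, together with
        # anything left open inside it.
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[pos]] == tag:
                del self._stack[pos:]
                break


def build_dom_graph(html):
    """Return (tags, edges) describing the tag tree of an HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.tags, builder.edges


tags, edges = build_dom_graph(
    "<div><h1>Title</h1><p>Price: <span>$5</span></p></div>"
)
print(tags)
print(edges)
```

An edge list of this kind is the sort of input a graph attention layer would consume. The spatial structure mentioned above would require rendered layout information, which this sketch does not attempt to model.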