In a longer document, the topic often shifts slightly from one passage to the next, and topic boundaries are usually indicated by semantically coherent segments. Discovering this latent structure in a document improves readability and is essential for passage retrieval and summarization tasks. We formulate text segmentation as an independent supervised prediction task, making it suitable for training Transformer-based language models. By fine-tuning on paragraphs of similar sections, we show that the learned features encode topic information, which can be used to find section boundaries and divide the text into coherent segments. Unlike previous approaches, which mostly operate at the sentence level, we consistently use the broader context of an entire paragraph and assume topical independence of the preceding and succeeding text. Finally, we introduce a novel large-scale dataset constructed from online Terms-of-Service documents, on which we compare against various traditional and deep learning baselines, showing significantly better performance of Transformer-based methods.
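To make the formulation concrete, the following is a minimal sketch of paragraph-level segmentation cast as a supervised pair-prediction task: a sequence-pair classifier judges whether two adjacent paragraphs belong to the same section, and a boundary is placed wherever it predicts otherwise. This is not the authors' released code; the checkpoint name `bert-base-uncased` is a placeholder standing in for a model fine-tuned for this task, and the label convention (1 = same section) is an assumption.

```python
# Sketch only: assumes a sequence-pair classifier fine-tuned so that
# label 1 means "both paragraphs belong to the same section".
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder for a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def segment(paragraphs):
    """Return indices where a new segment starts, by classifying
    each pair of consecutive paragraphs independently."""
    boundaries = []
    for i in range(len(paragraphs) - 1):
        # Encode the paragraph pair; each paragraph serves as one sequence.
        enc = tokenizer(paragraphs[i], paragraphs[i + 1],
                        truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits
        same_section = logits.argmax(dim=-1).item() == 1
        if not same_section:
            boundaries.append(i + 1)  # paragraph i+1 opens a new segment
    return boundaries
```

Because each pair is scored independently (matching the assumption of topical independence of the surrounding text), the procedure scales linearly with document length and requires no sentence-level preprocessing.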