Prior approaches to text segmentation mostly operate at the token level. Despite their adequacy, this design limits their potential to capture long-term dependencies among segments. In this work, we propose a novel framework that incrementally segments natural language sentences at the segment level. At each segmentation step, it recognizes the leftmost segment of the remaining sequence. The implementation uses the LSTM-minus technique to construct phrase representations and recurrent neural networks (RNNs) to model the iterative determination of leftmost segments. We have conducted extensive experiments on syntactic chunking and Chinese part-of-speech (POS) tagging across three datasets, demonstrating that our method significantly outperforms all previous baselines and achieves new state-of-the-art results. Moreover, qualitative analysis and a study on segmenting long sentences verify its effectiveness in modeling long-term dependencies.
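The LSTM-minus technique mentioned above represents a contiguous span by subtracting boundary hidden states of a sequence encoder, so any span's representation is obtained in constant time from precomputed states. The sketch below illustrates this idea only; the hidden states are random stand-ins (in the actual model they would come from a trained BiLSTM encoder), and the function name `span_rep` is an assumed, illustrative helper, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4  # sentence length, hidden size

# fwd[t]: forward LSTM state after reading tokens 0..t-1 (fwd[0] is the initial state).
# bwd[t]: backward LSTM state after reading tokens t..T-1 (bwd[T] is its initial state).
fwd = rng.standard_normal((T + 1, d))
bwd = rng.standard_normal((T + 1, d))

def span_rep(i, j):
    """LSTM-minus representation of the span covering tokens i..j (inclusive, 0-based)."""
    f = fwd[j + 1] - fwd[i]   # forward difference: contribution of tokens i..j
    b = bwd[i] - bwd[j + 1]   # backward difference, symmetric to the forward one
    return np.concatenate([f, b])

# Representation of the span over tokens 1..3; shape is twice the hidden size.
r = span_rep(1, 3)
```

Because the subtraction reuses the same precomputed state sequences, all O(T^2) candidate spans of a sentence can be scored without re-running the encoder per span.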