We investigate a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters together with all potential words in the sentence that match a lexicon. Compared with character-based methods, our model explicitly leverages word and word-sequence information. Compared with word-based methods, the lattice LSTM does not suffer from segmentation errors. Gated recurrent cells allow the model to choose the most relevant characters and words from a sentence for better NER results. Experiments on various datasets show that the lattice LSTM outperforms both word-based and character-based LSTM baselines, achieving the best results.
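To make the lattice construction concrete, the following is a minimal Python sketch of how "all potential words that match a lexicon" can be enumerated for a character sequence. The function name `find_lattice_words` and the set-based lexicon are illustrative assumptions, not the paper's implementation; the toy sentence 南京市长江大桥 is the classic segmentation-ambiguity example.

```python
def find_lattice_words(chars, lexicon):
    """Enumerate every (start, end, word) span whose characters form a
    lexicon entry; these spans become the word paths of the lattice,
    running alongside the underlying character sequence."""
    spans = []
    n = len(chars)
    for start in range(n):
        for end in range(start + 2, n + 1):  # consider words of length >= 2
            word = "".join(chars[start:end])
            if word in lexicon:
                spans.append((start, end, word))
    return spans

# Toy usage: 南京市长江大桥 matches entries such as 南京 (Nanjing),
# 南京市 (Nanjing City), 市长 (mayor), 长江 (Yangtze River),
# 长江大桥 (Yangtze River Bridge), and 大桥 (bridge).
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
print(find_lattice_words(list("南京市长江大桥"), lexicon))
```

Because all matching spans are kept, the model is never forced to commit to a single (possibly wrong) segmentation; the gating mechanism decides downstream which paths matter.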
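The claim that gated recurrent cells select the most relevant characters and words can be illustrated with a small NumPy sketch: each character cell blends its own candidate state with the cell states of lexicon words ending at that character, using gate values normalized to sum to one. The names, shapes, and softmax normalization here are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def softmax(xs):
    e = np.exp(xs - np.max(xs))
    return e / e.sum()

def merge_cells(char_candidate, word_cells, gate_scores):
    """Blend a character's candidate cell with the cells of all lexicon
    words ending at this character. gate_scores holds one raw score per
    source (candidate first); normalizing them lets the model weight the
    most relevant character/word evidence."""
    weights = softmax(gate_scores)            # normalized gate values
    sources = [char_candidate] + word_cells   # character path + word paths
    return sum(w * c for w, c in zip(weights, sources))

# Toy usage: one 4-dim character candidate and two word cells.
rng = np.random.default_rng(0)
cand = rng.standard_normal(4)
words = [rng.standard_normal(4), rng.standard_normal(4)]
print(merge_cells(cand, words, np.array([0.2, 1.5, -0.3])))
```

In the toy call, the second source gets the largest gate score, so the merged cell is dominated by that word path; this is the mechanism by which irrelevant segmentations are suppressed rather than hard-pruned.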