Ancient Chinese is the essence of Chinese culture. Several natural language processing tasks arise in the ancient Chinese domain, such as ancient-modern Chinese translation, poem generation, and couplet generation. Previous studies usually adopt supervised models that rely heavily on parallel data. However, large-scale parallel data for ancient Chinese are difficult to obtain. To make full use of the more readily available monolingual ancient Chinese corpora, we release AnchiBERT, a pre-trained language model based on the BERT architecture and trained on large-scale ancient Chinese corpora. We evaluate AnchiBERT on both language understanding and generation tasks, including poem classification, ancient-modern Chinese translation, poem generation, and couplet generation. The experimental results show that AnchiBERT outperforms BERT as well as non-pretrained models, achieving state-of-the-art results in all cases.
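To make the evaluation setup concrete, the following is a minimal sketch of how a BERT-style pretrained checkpoint such as AnchiBERT could be fine-tuned for a downstream understanding task like poem classification, using the HuggingFace Transformers library. The checkpoint path "anchibert-base" and the number of labels are hypothetical placeholders, not values confirmed by the paper; substitute the actually released AnchiBERT weights and task setup.

```python
# Hedged sketch: fine-tuning a BERT-architecture checkpoint (e.g. AnchiBERT)
# for poem classification. "anchibert-base" is a hypothetical checkpoint path.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("anchibert-base")  # hypothetical path
model = BertForSequenceClassification.from_pretrained(
    "anchibert-base",
    num_labels=4,  # assumed number of poem categories, for illustration only
)

# Encode an ancient Chinese poem line and run a forward pass.
inputs = tokenizer("床前明月光，疑是地上霜。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)
```

Because AnchiBERT keeps the BERT architecture, the same checkpoint can also initialize the encoder of a sequence-to-sequence model for the generation tasks (translation, poem, and couplet generation) evaluated in the paper.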