Although there is a large body of recent work on language modeling (LM) for high-resource languages such as English and Chinese, the area is still unexplored for low-resource languages like Bengali and Hindi. We propose an end-to-end trainable, memory-efficient CNN architecture named CoCNN to handle specific characteristics of Bengali and Hindi: high inflection, morphological richness, flexible word order, and phonetic spelling errors. In particular, we introduce two learnable convolutional sub-models, one at the word level and one at the sentence level, both of which are end-to-end trainable. We show that state-of-the-art (SOTA) Transformer models, including pretrained BERT, do not necessarily yield the best performance for Bengali and Hindi. CoCNN outperforms pretrained BERT with 16X fewer parameters, and it achieves much better performance than SOTA LSTM models on multiple real-world datasets. This is the first study of the effectiveness of different architectures drawn from three deep learning paradigms (convolutional, recurrent, and Transformer neural networks) for modeling two widely used languages, Bengali and Hindi.
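The abstract does not give layer details, so the following is only a minimal sketch of the general idea of stacking a word-level convolution (over characters) under a sentence-level convolution (over the resulting word vectors). It is not the CoCNN implementation: every name, dimension, kernel width, the use of ReLU with max-pooling, and the toy romanized vocabulary are assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)  # fixed seed so the toy (untrained) weights are reproducible

# Hypothetical sizes and kernel widths; the abstract does not specify any of these.
CHAR_DIM, WORD_DIM, CTX_DIM = 8, 16, 12
CHAR_WIDTH, WORD_WIDTH = 3, 2


def rand_matrix(rows, cols):
    """Small random weights; a real model would learn these end to end."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def conv1d_maxpool(seq, filters, width, dim):
    """Apply each filter to every window of `width` vectors, then ReLU + max-pool.

    seq: list of vectors of length `dim`; filters: flat weight lists of length width*dim.
    Returns one non-negative value per filter.
    """
    # Zero-pad short (or empty) sequences so that at least one window exists.
    padded = seq + [[0.0] * dim] * max(0, width - len(seq))
    out = []
    for f in filters:
        best = 0.0  # starting at 0 makes the max equivalent to ReLU followed by max-pooling
        for i in range(len(padded) - width + 1):
            window = [x for vec in padded[i:i + width] for x in vec]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        out.append(best)
    return out


char_table = {}  # character -> embedding, created lazily for this toy example


def char_vec(ch):
    if ch not in char_table:
        char_table[ch] = [random.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
    return char_table[ch]


word_filters = rand_matrix(WORD_DIM, CHAR_WIDTH * CHAR_DIM)
sent_filters = rand_matrix(CTX_DIM, WORD_WIDTH * WORD_DIM)


def encode_word(word):
    """Word-level sub-model: convolve over the characters of one word."""
    chars = [char_vec(c) for c in word]
    return conv1d_maxpool(chars, word_filters, CHAR_WIDTH, CHAR_DIM)


def encode_context(words):
    """Sentence-level sub-model: convolve over the word vectors of the context."""
    vecs = [encode_word(w) for w in words]
    return conv1d_maxpool(vecs, sent_filters, WORD_WIDTH, WORD_DIM)


def next_word_probs(words, vocab, out_weights):
    """Softmax over a toy vocabulary, given the encoded context."""
    ctx = encode_context(words)
    logits = [sum(w * c for w, c in zip(row, ctx)) for row in out_weights]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {v: e / total for v, e in zip(vocab, exps)}


vocab = ["ami", "tumi", "bhalo", "achi"]  # toy romanized tokens, purely illustrative
out_weights = rand_matrix(len(vocab), CTX_DIM)
probs = next_word_probs(["ami", "bhalo"], vocab, out_weights)
```

Because the word vector is built from character windows, a misspelled or inflected form that shares most of its character n-grams with a known word yields a similar vector in this sketch, which is one plausible reading of why a character-aware design suits the spelling and inflection issues the abstract lists.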