Natural Language Processing (NLP), and especially natural language text analysis, has seen great advances in recent times. The use of deep learning has revolutionized text processing techniques and achieved remarkable results. Deep learning architectures such as CNNs, LSTMs, and the more recent Transformer have been used to achieve state-of-the-art results on a variety of NLP tasks. In this work, we survey a host of deep learning architectures for text classification tasks, focusing specifically on the classification of Hindi text. Research on classifying the morphologically rich, low-resource Hindi language, written in the Devanagari script, has been limited by the absence of large labeled corpora. In this work, we use translated versions of English datasets to evaluate models based on CNN, LSTM, and Attention. Multilingual pre-trained sentence embeddings based on BERT and LASER are also compared to evaluate their effectiveness for Hindi. The paper also serves as a tutorial on popular text classification techniques.