To the best of our knowledge, this paper made the first attempt to answer whether word segmentation is necessary for Vietnamese sentiment classification. To do this, we presented five pre-trained monolingual S4-based language models for Vietnamese: one model trained without word segmentation, and four models that use the RDRsegmenter, uitnlp, pyvi, or underthesea toolkit in the data pre-processing phase. Based on comprehensive experimental results on two corpora, the VLSP2016-SA corpus of technical article reviews from news and social media and the UIT-VSFC corpus of educational surveys, we have two suggestions. First, with traditional classifiers such as Naive Bayes or Support Vector Machines, word segmentation may not be necessary for Vietnamese sentiment classification corpora that come from the social domain. Second, word segmentation is necessary for Vietnamese sentiment classification when it is applied before the BPE method and the result is fed into a deep learning model. In that setting, RDRsegmenter is the most stable word segmentation toolkit when compared with uitnlp, pyvi, and underthesea.
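To make concrete what "with" and "without" word segmentation means for the input of a classifier or a BPE tokenizer, the following is a minimal sketch. It is not the algorithm used by RDRsegmenter, uitnlp, pyvi, or underthesea: the greedy longest-match rule, the `TOY_LEXICON` entries, and the `segment` function are assumptions introduced purely for illustration. It only shows the common convention of joining the syllables of one word with an underscore, so that downstream steps see word-level units rather than syllable-level units.

```python
# Hypothetical toy lexicon of multi-syllable Vietnamese words.
# A real toolkit learns or ships a far larger resource; this is illustration only.
TOY_LEXICON = {
    ("sinh", "viên"),
    ("giảng", "viên"),
    ("nhiệt", "tình"),
}


def segment(sentence, lexicon=TOY_LEXICON):
    """Greedy longest-match segmentation (an assumed stand-in for a real toolkit).

    Syllables that form a lexicon word are joined with '_';
    every other syllable is kept as a single-syllable word.
    """
    syllables = sentence.split()
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then fall back to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(s.lower() for s in syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append("_".join(syllables[i:i + n]))
                i += n
                break
    return " ".join(words)


raw = "giảng viên rất nhiệt tình"
segmented = segment(raw)

# Without word segmentation: whitespace splitting yields syllables.
print(raw.split())
# With word segmentation: whitespace splitting yields words.
print(segmented.split())
```

In the setting described above, the segmented string rather than the raw one would be passed on to the BPE step, so subword merges cannot be learned across word boundaries that the segmenter has marked.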