Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers

Text classification is a natural language processing (NLP) task relevant to many commercial applications, such as e-commerce and customer service. Classifying such texts accurately is often challenging because of intrinsic aspects of language, such as irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding is now a key area of NLP and has advanced significantly over the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, the literature on generating embeddings for Brazilian Portuguese texts is scarce, especially for commercial user reviews. This work therefore provides a comprehensive experimental study of embedding approaches for binary sentiment classification of user reviews in Brazilian Portuguese. The study covers models ranging from classical (Bag-of-Words) to state-of-the-art (Transformer-based). The methods are evaluated on five open-source databases with pre-defined data partitions, made available in an open digital repository to encourage reproducibility. Fine-tuned TLMs achieved the best results in all cases, followed by the feature-based TLM, LSTM, and CNN, whose relative ranks varied with the database under analysis.
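
As a rough illustration of the classical end of that range, the sketch below builds a Bag-of-Words count vector for a few short reviews using only the Python standard library. It is not the pipeline used in the study: the example reviews, the whitespace tokenizer, and the helper names `build_vocabulary` and `bag_of_words` are all assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def bag_of_words(document, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical reviews, invented for this sketch (not taken from the study's data).
reviews = [
    "produto muito bom",
    "produto muito ruim",
    "entrega lenta e produto ruim",
]

vocabulary = build_vocabulary(reviews)
vectors = [bag_of_words(review, vocabulary) for review in reviews]

for review, vector in zip(reviews, vectors):
    print(review, vector)
```

Each review becomes a fixed-length vector of token counts that a standard classifier can consume. Word order and context are discarded, which is one reason such representations can struggle with irony and nuance.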
