Style analysis, a relatively under-explored topic, enables several interesting applications. For instance, it allows authors to adjust their writing style to produce a more coherent document when collaborating. Style analysis can also serve as a first step in document provenance and authentication. In this paper, we propose an ensemble-based text-processing framework for classifying documents as single-authored or multi-authored, one of the key tasks in style analysis. The proposed framework incorporates several state-of-the-art text classification algorithms, including classical Machine Learning (ML) algorithms, transformers, and deep learning algorithms, both individually and in merit-based late fusion. For the merit-based late fusion, we employ several weight optimization and selection methods to assign merit-based weights to the individual text classification algorithms. We also analyze the impact on the task of characters that are usually removed during pre-processing in NLP applications by conducting experiments on both cleaned and uncleaned data. The proposed framework is evaluated on a large-scale benchmark dataset, where it significantly improves performance over existing solutions.
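To make the idea of merit-based late fusion concrete, the following is a minimal sketch and not the framework's actual implementation. It assumes each classifier outputs class probabilities and that the fusion is a weighted average of those probabilities. The abstract does not name the weight optimization and selection methods, so the coarse grid search over validation accuracy used here is a stand-in assumption, and the function names `late_fusion` and `grid_search_weights` are hypothetical.

```python
import itertools

import numpy as np


def late_fusion(probs, weights):
    """Weighted average of per-model class probabilities.

    probs:   shape (n_models, n_samples, n_classes)
    weights: shape (n_models,), non-negative, not all zero
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    # Contract the model axis: result has shape (n_samples, n_classes).
    return np.tensordot(weights, probs, axes=1)


def accuracy(fused, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float(np.mean(fused.argmax(axis=1) == labels))


def grid_search_weights(val_probs, val_labels, step=0.25):
    """Assumed stand-in for weight optimization: exhaustive coarse grid search.

    Returns the normalised weights with the best validation accuracy.
    """
    n_models = len(val_probs)
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=n_models):
        if sum(w) == 0:
            continue  # an all-zero weight vector cannot be normalised
        acc = accuracy(late_fusion(val_probs, w), val_labels)
        if acc > best_acc:
            best_w, best_acc = np.array(w) / sum(w), acc
    return best_w, best_acc


# Toy validation data (made up for illustration): 3 models, 4 documents,
# 2 classes (0 = single-authored, 1 = multi-authored).
val_probs = np.array([
    [[0.8, 0.2], [0.4, 0.6], [0.6, 0.4], [0.3, 0.7]],
    [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]],
    [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.6, 0.4]],
])
val_labels = np.array([0, 1, 1, 1])

weights, val_acc = grid_search_weights(val_probs, val_labels)
print("selected weights:", weights, "validation accuracy:", val_acc)
```

Because the grid includes the weight vectors that select a single model, the fused validation accuracy in this sketch can never fall below that of the best individual classifier. Whether that carries over to held-out data depends on how well the validation set represents it.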