In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, so generic text tokenization cannot recognize different forms of the same word as related. The resulting abundance of inflected word forms makes frequencies sparse, which complicates most statistical procedures. Presumably, applying NLP techniques such as lemmatization and/or parsing might improve classification performance. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks to compare the classification accuracy of different types of lexical and syntactic style-markers. Although POS-tags and lemmatized forms performed consistently worse than lexical markers, the difference was not substantial and never exceeded ca. 15%.
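To make the feature types concrete, the following is a minimal sketch of how relative frequencies of the most frequent features could be computed from a sequence of words, lemmas, or POS tags. It is an illustration under stated assumptions, not the pipeline used in the paper: the function names (`ngrams`, `most_frequent`, `relative_frequencies`) and the toy tag sequences are hypothetical, and tagging or lemmatization is assumed to have been done beforehand by an external tool.

```python
from collections import Counter
from typing import Iterable, Sequence


def ngrams(items: Sequence[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of a token, lemma, or tag sequence."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def most_frequent(texts: Iterable[Iterable], k: int) -> list:
    """Return the k most frequent features across the whole corpus."""
    corpus_counts = Counter()
    for features in texts:
        corpus_counts.update(features)
    return [feature for feature, _ in corpus_counts.most_common(k)]


def relative_frequencies(features: Iterable, vocabulary: Sequence) -> list[float]:
    """Relative frequency of each vocabulary item within a single text."""
    counts = Counter(features)
    total = sum(counts.values())
    return [counts[v] / total if total else 0.0 for v in vocabulary]


# Toy input: hypothetical POS tags for two very short "texts".
tags_a = ["NOUN", "VERB", "NOUN", "VERB", "ADJ"]
tags_b = ["ADJ", "NOUN", "VERB", "ADJ", "NOUN"]

# POS-tag bigrams play the role of the grammatical style-markers.
bigrams_a = ngrams(tags_a, 2)
bigrams_b = ngrams(tags_b, 2)

# Shared feature list, then one frequency vector per text.
vocab = most_frequent([bigrams_a, bigrams_b], k=2)
vector_a = relative_frequencies(bigrams_a, vocab)
vector_b = relative_frequencies(bigrams_b, vocab)
```

The same functions apply unchanged to raw word forms or lemmas (with `n=1` for single-item frequencies), so the resulting vectors for each marker type can be passed to any supervised classifier and compared on equal terms.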