Authors imprint identifying information in the documents they write: vocabulary, register, punctuation, misspellings, or even emoji usage. Finding these details is highly relevant for profiling authors, since they relate to gender, occupation, age, and so on. Most importantly, recurring writing patterns can help attribute authorship to a text. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. A better approach to this task is to learn stylometric representations, but this is itself an open research challenge. In this paper, we propose PART: a contrastively trained model fit to learn \textbf{authorship embeddings} instead of semantics. By comparing pairs of documents written by the same author, we are able to determine the authorship of a text by evaluating the cosine similarity of the evaluated documents, which yields a zero-shot generalization to authorship identification. To this end, a pre-trained Transformer with an LSTM head is trained contrastively. We train our model on a diverse set of authors drawn from literature, anonymous blogs, and corporate emails; a heterogeneous set with distinct and identifiable writing styles. The model is evaluated on these datasets, achieving a zero-shot accuracy of 72.39\% and a top-5 accuracy of 86.73\% on the joint evaluation dataset when determining authorship among a set of 250 different authors. We qualitatively assess the representations with different data visualizations on the available datasets, profiling features such as book type, gender, age, or occupation of the author.
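As an illustrative sketch only (not the PART implementation), attribution by cosine similarity can be pictured as ranking candidate authors by how close their reference embedding is to the embedding of a query document. The function names `cosine_similarity` and `rank_authors`, the author labels, and the three-dimensional toy vectors are all assumptions made for this example; in practice the embeddings would come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_authors(query, references, k=5):
    """Return the k candidate authors most similar to the query embedding."""
    scored = [(cosine_similarity(query, emb), author)
              for author, emb in references.items()]
    # Highest similarity first; the first entry is the top-1 prediction.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [author for _, author in scored[:k]]

# Toy embeddings, made up for illustration (not model outputs).
references = {
    "author_a": [0.9, 0.1, 0.0],
    "author_b": [0.0, 1.0, 0.2],
    "author_c": [0.1, 0.2, 0.95],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of an unattributed text

top = rank_authors(query, references, k=2)
```

Under this reading, top-1 accuracy counts a query as correct when the true author is first in the ranking, and top-5 accuracy when the true author appears anywhere among the first five candidates.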