Native language identification (NLI) is the task of training, via supervised machine learning, a classifier that guesses the native language (L1) of the author of a text. This task has been extensively researched over the last decade, and the performance of NLI systems has steadily improved. We focus on a different facet of NLI: analysing the internals of an NLI classifier trained by an \emph{explainable} machine learning algorithm in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena ``give a speaker's native language away''. We adopt this perspective to tackle both NLI and a much less researched companion task, i.e., guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners' essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, i.e., are most indicative of a speaker's L1. We also present two case studies, one on Spanish and one on Italian learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s. Overall, our study shows that the use of explainable machine learning can be a valuable tool for th