Measuring a document's complexity level is an open challenge, particularly when working with a diverse corpus of documents rather than comparing several documents on a similar topic, or when working with a language other than English. In this paper, we define a methodology to measure the complexity of French documents, using a new general and diversified corpus of texts, the "French Canadian complexity level corpus", and a wide range of metrics. We compare different learning algorithms on this task and contrast their performance and what each indicates about which characteristics of the texts contribute most to their complexity. Our results show that our methodology gives a general-purpose measurement of text complexity in French.