Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories (compression-based, lexical structure, readability, neologism detection, and distance features) to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above the random baselines of 20% and 2.3%, respectively. Under relaxed temporal precision, performance rises to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. The final model ranks candidate periods well, with AUCROC up to 94.8% and AUPRC up to 83.3%, and keeps errors controlled, with mean absolute deviations of 27 years and 30 years, respectively. For authentication-style tasks, binary models around key thresholds (e.g., 1850-1900) reach 85-98% accuracy. Feature importance analysis identifies distance features and lexical structure as the most informative categories, with compression-based features providing complementary signals. SHAP explainability reveals systematic patterns of linguistic evolution, with the 19th century emerging as a pivot point across feature domains. Cross-dataset evaluation on Project Gutenberg highlights domain adaptation challenges, with accuracy dropping by 26.4 percentage points. Even so, the computational efficiency and interpretability of tree-based models offer a scalable, explainable alternative to neural architectures.
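To make the relaxed-precision metric concrete, the following is a minimal sketch of how top-k accuracy and a uniform random baseline can be computed for a five-class century task. The function name `top_k_accuracy` and the toy probability table are illustrative assumptions; they are not the evaluation code or data used in this work.

```python
from typing import Sequence


def top_k_accuracy(
    probs: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true class is among the k highest-scored classes."""
    hits = 0
    for row, true_class in zip(probs, labels):
        # Rank class indices by descending score and keep the first k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += true_class in ranked
    return hits / len(labels)


# Hypothetical toy data: 3 texts scored over 5 century classes (not real results).
probs = [
    [0.10, 0.60, 0.20, 0.05, 0.05],  # true class 1: correct at top-1
    [0.05, 0.15, 0.45, 0.30, 0.05],  # true class 3: correct only at top-2
    [0.50, 0.30, 0.10, 0.05, 0.05],  # true class 4: missed at top-1 and top-2
]
labels = [1, 3, 4]

top1 = top_k_accuracy(probs, labels, k=1)
top2 = top_k_accuracy(probs, labels, k=2)

# Uniform random guessing over 5 classes gives the 20% baseline quoted above.
random_baseline = 1 / 5

print(f"top-1: {top1:.3f}, top-2: {top2:.3f}, random: {random_baseline:.3f}")
```

With k = 1 this reduces to ordinary accuracy; larger k credits predictions that rank the true period near the top, which is the sense in which the top-2 and top-10 figures relax temporal precision.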