This study is a technical report on various tasks performed on the materials collected and published by the Finnish ethnographer and linguist Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time several research groups have turned their attention to these materials. We discuss the workflows and technical infrastructure used, and consider how datasets that benefit different computational tasks could be created, both to improve the usability of these materials and to aid the processing of similar archived collections. We focus specifically on the parts of the collections that have been processed in ways that improve their usability in more technical applications, complementing earlier work on the cultural and linguistic aspects of these materials. Most of these datasets are openly available on Zenodo. The study points to specific areas where further research is needed, and provides benchmarks for text recognition tasks.