Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

Argument mining is a subfield of natural language processing that identifies and extracts argument components, such as premises and conclusions, within a text and recognizes the relations between them. It reveals the logical structure of texts, which can then be used in tasks such as knowledge extraction. This paper applies a cross-lingual approach to argument mining for low-resource languages by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models on the English Microtext corpus \citep{PeldszusStede2015} and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely on the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2\% on the English test set and 50.7\% on the Persian test set. The LLM-based augmentation model improves performance to 59.2\% on English and 69.3\% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant, achieving an F1 of 74.8\%. The results indicate that a lightweight cross-lingual blend can considerably outperform the more resource-intensive augmentation pipelines, and that it offers a practical way for argument mining to overcome data scarcity in low-resource languages.

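The three training scenarios described in the abstract differ only in which data enters the training set, so they can be sketched as a small data-assembly step. The following Python sketch is illustrative only: the `Example` type, the `build_training_set` and `persian_gain` helpers, and the scenario names are hypothetical and not taken from the paper, and it assumes the LLM-generated examples are simply appended to the English training data. The F1 values are the ones reported in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical container for one labelled argument unit."""
    text: str
    label: str   # e.g. "premise" or "conclusion"
    lang: str    # "en" or "fa"
    source: str  # "original", "llm_synthetic", or "human_translation"


def build_training_set(scenario, english, synthetic=(), persian=()):
    """Return the training examples for one of the three scenarios."""
    if scenario == "zero_shot":
        # (i) English data only; Persian is seen only at test time.
        return list(english)
    if scenario == "llm_augmented":
        # (ii) English data plus LLM-generated synthetic examples
        # (assumption: plain concatenation).
        return list(english) + list(synthetic)
    if scenario == "cross_lingual":
        # (iii) English data plus manually translated Persian sentences.
        return list(english) + list(persian)
    raise ValueError(f"unknown scenario: {scenario!r}")


# F1 scores in percent, as reported in the abstract.
REPORTED_F1 = {
    "zero_shot": {"en": 50.2, "fa": 50.7},
    "llm_augmented": {"en": 59.2, "fa": 69.3},
    "cross_lingual": {"fa": 74.8},
}


def persian_gain(scenario, baseline="zero_shot"):
    """Absolute F1 difference on the Persian test set, in percentage points."""
    diff = REPORTED_F1[scenario]["fa"] - REPORTED_F1[baseline]["fa"]
    return round(diff, 1)
```

With these numbers, `persian_gain("cross_lingual", "llm_augmented")` gives the margin by which the cross-lingual model exceeds the LLM-augmented one on Persian, and `persian_gain("cross_lingual")` gives its margin over zero-shot transfer.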