Biomedical data and benchmarks are highly valuable yet scarce in low-resource languages other than English, such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretraining and supervised data in the biomedical domain. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two different biomedical benchmarks: summarization and acronym disambiguation. Further, we release ViMedNLI, a new Vietnamese NLP benchmark translated from MedNLI using the recently released English-Vietnamese translation model and carefully refined by human experts, and we evaluate existing methods against ViPubmedT5 on it.