Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates SimCLR-style contrastive learning with gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman's rho = 0.60), demonstrating its effectiveness in modeling both discrete senses and graded semantic relations. Our code, models, and data are available at https://github.com/tkhangg0910/ViConBERT.
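The abstract does not specify the exact training objective, but SimCLR-style contrastive learning is conventionally implemented with the NT-Xent (normalized temperature-scaled cross-entropy) loss. The sketch below is a generic, hypothetical illustration of that loss over paired embeddings (e.g. two contextual views of the same word sense); the function name, batch layout, and temperature value are assumptions, not taken from the paper.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """Generic SimCLR-style NT-Xent loss (illustrative sketch, not the
    paper's implementation).

    z1, z2 : (N, d) arrays of embeddings for two views of the same N items;
             row i of z1 and row i of z2 form a positive pair.
    """
    # Cosine similarity requires unit-norm embeddings.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)          # (2N, d)
    sim = z @ z.T / temperature                   # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                # exclude self-similarity
    n = len(z1)
    # Positive for row i is row i+N (and vice versa).
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Row-wise cross-entropy: -log softmax at the positive index.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(2 * n), targets].mean()
```

Under this objective, matching views are pulled together while all other items in the batch act as negatives; a lower temperature sharpens the distribution over negatives.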