Both in the scientific literature and in industry, semantic and context-aware Natural Language Processing-based solutions have been gaining importance in recent years. The possibilities and performance shown by these models when dealing with complex Language Understanding tasks are unquestionable, from conversational agents to the fight against disinformation in social networks. In addition, considerable attention is also being paid to developing multilingual models to tackle the language bottleneck. The growing need to provide more complex models implementing all these features has been accompanied by an increase in their size, with little regard for the number of dimensions required. This paper aims to give a comprehensive account of the impact of a wide variety of dimensionality reduction techniques on the performance of different state-of-the-art multilingual Siamese Transformers, including unsupervised dimensionality reduction techniques such as linear and nonlinear feature extraction, feature selection, and manifold techniques. In order to evaluate the effects of these techniques, we considered the multilingual extended version of the Semantic Textual Similarity Benchmark (mSTSb) and two different baseline approaches, one using the pre-trained version of several models and another using their fine-tuned STS version. The results evidence that it is possible to achieve an average reduction in the number of dimensions of $91.58\% \pm 2.59\%$ and $54.65\% \pm 32.20\%$, respectively. This work has also considered the consequences of dimensionality reduction for visualization purposes. The results of this study will significantly contribute to the understanding of how different tuning approaches affect performance on semantic-aware tasks and how dimensionality reduction techniques deal with the high-dimensional embeddings computed for the STS task and their potential for highly demanding NLP tasks.
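As a minimal sketch (not the paper's code), the kind of unsupervised linear feature extraction evaluated here can be illustrated with PCA applied to sentence embeddings; the embeddings below are simulated with random vectors, whereas in the study they would come from a multilingual Siamese Transformer, and the 768/64 dimensions are assumptions for illustration only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Simulated stand-in for 768-dim transformer sentence embeddings
# (e.g. 1000 sentences); a real pipeline would encode mSTSb sentences.
embeddings = rng.normal(size=(1000, 768))

# Unsupervised linear dimensionality reduction to a much smaller space.
pca = PCA(n_components=64)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)  # (1000, 64)
print(f"{1 - 64 / 768:.2%} of dimensions removed")
```

The reduced vectors can then be fed to the same cosine-similarity STS evaluation as the full-size embeddings, which is how the performance impact of each technique can be measured.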