The domain of natural language processing (NLP), which has evolved greatly over recent years, has benefited substantially from developments in word and sentence embeddings. Such embeddings transform complex NLP tasks, like semantic similarity or Question Answering (Q&A), into much simpler vector comparisons. However, this problem transformation raises new challenges, such as the efficient comparison and manipulation of embeddings. In this work, we discuss various word and sentence embedding algorithms, select a sentence embedding algorithm, BERT, as our algorithm of choice, and evaluate the performance of two vector comparison approaches, FAISS and Elasticsearch, on the specific problem of sentence embeddings. According to the results, FAISS outperforms Elasticsearch when used in a centralized environment with only one node, especially on large datasets.
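To illustrate the kind of vector comparison that both FAISS and Elasticsearch accelerate, the core operation can be sketched as a brute-force nearest-neighbour lookup over embedding vectors. The sketch below uses NumPy only; the random 768-dimensional vectors are stand-ins for real BERT sentence embeddings, and the function name is illustrative, not part of either library's API.

```python
import numpy as np

def nearest_neighbors(query, corpus, k=3):
    """Return indices of the k corpus vectors most similar to query (cosine)."""
    # Normalize so that the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                    # similarity of query to every corpus vector
    return np.argsort(-scores)[:k]   # highest-scoring indices first

# Random 768-dim vectors stand in for BERT sentence embeddings.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 768)).astype("float32")
# A slightly perturbed copy of one corpus vector acts as the query.
query = corpus[42] + 0.01 * rng.standard_normal(768).astype("float32")

print(nearest_neighbors(query, corpus))  # index 42 ranks first
```

FAISS replaces this exhaustive scan with optimized (and optionally approximate) index structures, while Elasticsearch exposes a comparable dense-vector scoring mechanism; the paper compares the two on exactly this workload.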