Sentence embedding methods offer a powerful approach to working with short texts or word sequences. By representing sentences as dense numerical vectors, they have improved the performance of many natural language processing (NLP) applications. However, relatively little is understood about the latent structure of sentence embeddings; in particular, prior research has not examined whether sentence length and structure affect the topology of the embedding space. This paper reports a set of comprehensive clustering and network analyses of sentence and sub-sentence embedding spaces. The results show that one of the embedding methods studied yields the most clusterable embeddings, and that, in general, embeddings of sub-sentence spans have better clustering properties than those of the original full sentences. These findings have implications for future sentence embedding models and applications.
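To make the notion of "clusterability" concrete, the following is a minimal toy sketch, not the paper's actual method or data: it generates synthetic vectors standing in for sentence and sub-sentence embeddings (the real embeddings, models, and measures are those reported in the paper) and compares them with the silhouette coefficient, one standard measure of how well-separated clusters are.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient over all points; higher = more clusterable.

    For each point i: a = mean distance to other points in its own cluster,
    b = smallest mean distance to any other cluster; s_i = (b - a) / max(a, b).
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    scores = []
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 8))          # 3 hypothetical "topic" centroids in a toy 8-d space
labels = np.repeat(np.arange(3), 20)       # 20 points per topic

# Stand-in "sentence" embeddings: high within-cluster noise, clusters overlap more.
sentence_emb = centers[labels] + rng.normal(scale=1.0, size=(60, 8))
# Stand-in "sub-sentence" embeddings: tighter around their centroids.
subsentence_emb = centers[labels] + rng.normal(scale=0.3, size=(60, 8))

print("sentence clusterability:    ", round(silhouette(sentence_emb, labels), 3))
print("sub-sentence clusterability:", round(silhouette(subsentence_emb, labels), 3))
```

Under these synthetic assumptions, the tighter "sub-sentence" vectors score higher, which is the qualitative pattern the abstract describes; the paper's claim rests on real embedding models and analyses, not on this illustration.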