A fundamental issue in natural language processing is the robustness of models with respect to changes in the input. One critical step in this process is the embedding of documents, which transforms sequences of words or tokens into vector representations. Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the H\"older or Lipschitz sense with respect to the Hamming distance. We provide quantitative bounds for these schemes and demonstrate how the constants involved are affected by the length of the document. These findings are exemplified through a series of numerical examples.
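The Lipschitz-style robustness described above can be illustrated with a minimal numerical sketch. Here we use raw bag-of-words count vectors as a simplified stand-in for TF-IDF (the actual schemes and constants in the paper differ): substituting one token in a document changes its count vector by at most $\sqrt{2}$ in Euclidean norm, so the embedding distance is bounded by $\sqrt{2}$ times the Hamming distance. All names and documents below are illustrative.

```python
from collections import Counter
import math

def hamming(x, y):
    # Hamming distance between two equal-length token sequences
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def count_embed(doc, vocab):
    # bag-of-words count vector (a simplified stand-in for TF-IDF)
    c = Counter(doc)
    return [c[w] for w in vocab]

def l2(u, v):
    # Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = "the cat sat on the mat".split()
y = "the dog sat on the rug".split()
vocab = sorted(set(x) | set(y))

d_h = hamming(x, y)  # two token substitutions
d_e = l2(count_embed(x, vocab), count_embed(y, vocab))

# Lipschitz-type bound: one substitution moves the count vector
# by at most sqrt(2) in L2, so d_e <= sqrt(2) * d_h
print(d_h, d_e, math.sqrt(2) * d_h)
```

Here the two documents differ in two positions, the embedding distance is 2, and the bound $\sqrt{2}\cdot 2 \approx 2.83$ holds; the constant in such bounds generally depends on the document length, which the numerical examples in the paper explore.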