Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of training the subword tokenization model. Evaluation on XQuAD demonstrates the advantages of STF-IDF: information retrieval accuracy of 85.4% for English and over 80% for 10 other languages, without any heuristics-based preprocessing. The software to reproduce these results is open-sourced as part of Text2Text: https://github.com/artitw/text2text
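The idea of computing TF-IDF over subword units rather than curated word tokens can be illustrated with a minimal, standard-library-only sketch. It is not the Text2Text implementation: as a stand-in for a trained subword tokenization model, it uses overlapping character n-grams, and the function names (`subword_tokens`, `build_index`, `search`), the n-gram size, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Stand-in tokenizer (assumption): overlapping character n-grams.

    A real STF-IDF setup would use a trained subword model instead.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 norm; leave empty vectors as-is."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def build_index(docs, n=3):
    """Return (idf, vectors): IDF weights and one unit-length TF-IDF vector per doc."""
    counts_per_doc = [Counter(subword_tokens(d, n)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    num_docs = len(docs)
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + num_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in counts.items()})
               for counts in counts_per_doc]
    return idf, vectors


def search(query, idf, vectors, n=3):
    """Rank document indices by cosine similarity to the query, best first."""
    counts = Counter(subword_tokens(query, n))
    # Subwords never seen in the corpus carry no weight and are skipped.
    q = normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})
    scores = [sum(w * doc.get(t, 0.0) for t, w in q.items()) for doc in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


docs = [
    "The cat sat on the mat.",
    "El gato se sentó en la alfombra.",
    "Stock markets fell sharply today.",
]
idf, vectors = build_index(docs)
print(search("markets falling", idf, vectors))
```

No stop-word list, stemmer, or language-specific tokenizer appears anywhere in the sketch: the query "markets falling" can still match a document containing "markets fell" because the two share subword units, and the same code path handles the Spanish document unchanged.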