It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives than on faulty CBOW implementations in standard software libraries such as the official word2vec.c implementation and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train. We release our implementation, kōan, at https://github.com/bloomberg/koan.
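The abstract does not spell out the implementation flaw. A common account of the discrepancy, which this paper examines, is that CBOW averages the context vectors in the forward pass, so the backward pass must divide the context-vector gradient by the number of context words; the faulty implementations apply the full gradient to each context vector instead. The NumPy sketch below illustrates one corrected CBOW negative-sampling update under that assumption; all names (`cbow_update`, `W_in`, `W_out`, `lr`) are illustrative and are not the kōan API.

```python
# Minimal sketch (not the paper's code) of one CBOW negative-sampling
# SGD step, with the context gradient scaled by 1/|context| to match
# the averaging done in the forward pass.
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cbow_update(W_in, W_out, context_ids, target_id, negative_ids, lr=0.025):
    """One SGD step for CBOW with negative sampling.

    W_in:  (vocab, dim) input/context embedding matrix
    W_out: (vocab, dim) output embedding matrix
    """
    C = len(context_ids)
    h = W_in[context_ids].mean(axis=0)  # forward pass: average of context vectors

    # Accumulate the gradient w.r.t. the averaged hidden vector h over the
    # positive target and the sampled negatives.
    grad_h = np.zeros_like(h)
    for out_id, label in [(target_id, 1.0)] + [(n, 0.0) for n in negative_ids]:
        g = sigmoid(W_out[out_id] @ h) - label  # d(loss)/d(score)
        grad_h += g * W_out[out_id]             # use pre-update output vector
        W_out[out_id] -= lr * g * h

    # Corrected backward pass: because h is an average, each context vector
    # receives grad_h / C. The faulty versions omit the division by C.
    # (Assumes context_ids contains no duplicates; NumPy fancy-index updates
    # would apply only once per repeated index.)
    W_in[context_ids] -= lr * grad_h / C


# Tiny usage example with random embeddings.
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(1000, 100))
W_out = np.zeros((1000, 100))
cbow_update(W_in, W_out, context_ids=[3, 17, 42, 8],
            target_id=5, negative_ids=[9, 200, 77])
```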