Word2Vec is a prominent model for natural language processing (NLP) tasks. Similar inspiration underlies the distributed embeddings used in new state-of-the-art (SotA) deep neural networks. However, a wrong combination of hyper-parameters can produce poor-quality vectors. The objective of this work is to show empirically that an optimal combination of hyper-parameters exists and to evaluate various combinations. We compare them with the publicly released, pre-trained original word2vec model. Both intrinsic and extrinsic (downstream) evaluations were carried out, including named entity recognition (NER) and sentiment analysis (SA). The downstream tasks reveal that the best model is usually task-specific, that high analogy scores do not necessarily correlate positively with F1 scores, and that the same holds for a focus on data alone. Increasing the vector dimension size beyond a certain point degrades quality or performance. If ethical considerations of saving time, energy and the environment are taken into account, then reasonably smaller corpora may do just as well, or even better in some cases. Moreover, using a small corpus, we obtain better human-assigned WordSim scores, corresponding Spearman correlations and better downstream performance (with significance tests) compared to the original model, which was trained on a 100-billion-word corpus.
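To make the notion of a hyper-parameter combination concrete, the sketch below illustrates one plausible way to sweep a small grid of word2vec settings (architecture, training algorithm, dimension, window) and score each model intrinsically. It assumes the gensim library and a hypothetical `corpus.txt`; it is not the authors' released code, and the grid values are illustrative only.

```python
# Minimal sketch (assumption: gensim is used; corpus.txt is a hypothetical
# one-sentence-per-line corpus) of sweeping a few word2vec hyper-parameter
# combinations and scoring each model on analogies and WordSim-353.
from itertools import product

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.test.utils import datapath

corpus = LineSentence("corpus.txt")  # hypothetical corpus path

# Illustrative grid: skip-gram vs. CBOW (sg), hierarchical softmax vs.
# negative sampling (hs), vector dimension, and context window size.
grid = product([0, 1],      # sg
               [0, 1],      # hs (negative sampling is used when hs=0)
               [100, 300],  # vector_size
               [4, 8])      # window

for sg, hs, size, window in grid:
    model = Word2Vec(sentences=corpus, sg=sg, hs=hs,
                     negative=0 if hs else 5,
                     vector_size=size, window=window,
                     min_count=5, epochs=5)
    # Intrinsic checks bundled with gensim's test data:
    # Google analogy accuracy and WordSim-353 Spearman correlation.
    analogy_score, _ = model.wv.evaluate_word_analogies(
        datapath("questions-words.txt"))
    pearson, spearman, oov = model.wv.evaluate_word_pairs(
        datapath("wordsim353.tsv"))
    print(f"sg={sg} hs={hs} size={size} window={window} "
          f"analogy={analogy_score:.3f} "
          f"spearman={spearman.correlation:.3f}")
```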