In this work, we present a naive initialization scheme for word vectors based on a dense, independent co-occurrence model, and we provide preliminary results suggesting that it is competitive and warrants further investigation. Specifically, we demonstrate through information-theoretic minimum description length (MDL) probing that our model, EigenNoise, can approach the performance of empirically trained GloVe despite using no pre-training data. We present these preliminary results to set the stage for further investigation into how this initialization remains competitive without pre-training data, and to invite the exploration of more intelligent initialization schemes informed by the theory of harmonic linguistic structure. Our application of this theory likewise contributes a novel (and effective) interpretation of recent discoveries that have elucidated the underlying distributional information that linguistic representations capture from data and from contrast distributions.
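To make the phrase "dense, independent co-occurrence model" concrete, the following sketch writes down one such model; it is an illustrative assumption rather than the exact EigenNoise construction. The symbols $N$ (total number of co-occurrence pairs), $p_i$ (unigram probability of the word at frequency rank $i$), and $H_{|V|}$ (the harmonic number of the vocabulary size $|V|$) are introduced only for this example: word marginals are taken to be Zipfian, i.e. harmonic, and expected co-occurrence counts factor under independence.

% Illustrative assumption, not necessarily the EigenNoise construction:
% Zipfian (harmonic) unigram marginals and independent co-occurrence counts.
\begin{align}
  p_i = \frac{1}{i\,H_{|V|}}, \qquad
  H_{|V|} = \sum_{r=1}^{|V|} \frac{1}{r}, \qquad
  \mathbb{E}[C_{ij}] = N\,p_i\,p_j .
\end{align}

Under these assumptions, the log of the expected count matrix, $\log N + \log p_i + \log p_j$, has rank at most three, so dense word vectors can be written down in closed form without consulting any corpus; this is one reading of how an initialization can exist in the absence of pre-training data.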