In this paper we explore new representations for encoding language data. The standard method of one-hot encoding grows linearly in space complexity with the size of the word corpus. We address this by using Random Indexing (RI) of context vectors with a small number of nonzero entries. We propose a novel RI representation that exploits the effect of imposing a probability distribution on the number of randomized entries, which leads to a class of RI representations. We also propose an algorithm to track the semantic relationship of a keyword to other words, and hence an algorithm for suggesting events relevant to the word in question. Finally, we run simulations on the novel RI representations using the proposed algorithms on tweets relevant to the word ``iPhone'' and present results. The RI representation is shown to be faster and more space-efficient than BoW embeddings.
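The core idea of Random Indexing can be sketched as follows: each word is assigned a fixed, sparse random index vector (a few +/-1 entries in an otherwise zero vector), and a word's context vector is accumulated as the sum of the index vectors of its neighbors. This is a minimal illustrative sketch, not the paper's implementation; the dimensionality, number of nonzero entries, and window size below are assumed values chosen for demonstration only.

```python
import numpy as np

def random_index_vector(dim=1000, nnz=10, rng=None):
    """Sparse ternary index vector: nnz randomly placed +/-1 entries, rest zero.
    (dim and nnz are illustrative parameters, not the paper's settings.)"""
    rng = rng if rng is not None else np.random.default_rng()
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nnz, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=nnz)
    return v

def build_context_vectors(tokens, dim=1000, nnz=10, window=2, seed=0):
    """Accumulate each word's context vector as the sum of the index
    vectors of words appearing within `window` positions of it."""
    rng = np.random.default_rng(seed)
    index = {w: random_index_vector(dim, nnz, rng) for w in set(tokens)}
    context = {w: np.zeros(dim) for w in set(tokens)}
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                context[w] += index[tokens[j]]
    return context

def cos(a, b):
    """Cosine similarity; used as a proxy for semantic relatedness."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

tokens = "the new iphone ships today the new phone ships soon".split()
ctx = build_context_vectors(tokens)
```

Note the space advantage over one-hot: the dimensionality `dim` is fixed in advance, so adding new words to the vocabulary does not grow the vector length, unlike a one-hot encoding whose length equals the vocabulary size.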