The present paper explores a novel variant of Random Indexing (RI) representations for encoding language data, intended for dynamic scenarios in which events occur continuously. Because the size of one-hot encodings grows linearly with the size of the vocabulary, they do not scale to online settings with high volumes of dynamic data. Existing pre-trained embedding models, on the other hand, are ill-suited to detecting new events because of the dynamic nature of the text data. The present work addresses this issue by imposing a probability distribution on the number of randomized entries, which yields a class of RI representations. It also provides a rigorous analysis, in terms of the probability of orthogonality, of how well these representations encode semantic information. Building on these ideas, we propose an algorithm that is log-linear in the size of the vocabulary and tracks the semantic relationship of a query word to other words in order to suggest the events relevant to that word. We ran simulations with the proposed algorithm on tweet data specific to three different events and present our findings. The proposed probabilistic RI representations are found to be much faster and more scalable than Bag of Words (BoW) embeddings while maintaining accuracy in depicting semantic relationships.
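To make the idea of a randomized number of non-zero entries concrete, the following is a minimal sketch of generic Random Indexing, not the algorithm proposed in the paper. It assumes, purely for illustration, that the number of non-zero entries in each index vector follows a Binomial(dim, p) distribution and that each non-zero entry is +1 or -1 with equal probability; the abstract does not state which distribution the authors use. The function names (`random_index_vector`, `estimate_orthogonality`, `build_context_vectors`, `cosine`) and the `window` parameter are hypothetical.

```python
import numpy as np


def random_index_vector(dim, p, rng):
    """Sparse ternary index vector.

    Assumption: the count of non-zero entries is drawn from Binomial(dim, p).
    """
    k = rng.binomial(dim, p)
    vec = np.zeros(dim, dtype=np.int8)
    positions = rng.choice(dim, size=k, replace=False)
    vec[positions] = rng.choice(np.array([-1, 1], dtype=np.int8), size=k)
    return vec


def estimate_orthogonality(dim, p, trials, seed=0):
    """Monte Carlo estimate of P(two independent index vectors are orthogonal)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        a = random_index_vector(dim, p, rng).astype(np.int32)
        b = random_index_vector(dim, p, rng).astype(np.int32)
        if int(np.dot(a, b)) == 0:
            hits += 1
    return hits / trials


def build_context_vectors(docs, dim, p, window=2, seed=0):
    """Standard RI accumulation: a word's context vector is the sum of the
    index vectors of its neighbours within `window` tokens."""
    rng = np.random.default_rng(seed)
    index, context = {}, {}
    for tokens in docs:
        # Assign an index vector the first time a word is seen (online-friendly).
        for tok in tokens:
            if tok not in index:
                index[tok] = random_index_vector(dim, p, rng)
                context[tok] = np.zeros(dim, dtype=np.int64)
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[tok] += index[tokens[j]]
    return index, context


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))


if __name__ == "__main__":
    # Toy tokenized "tweets"; the data are made up for illustration only.
    docs = [["flood", "warning", "river"], ["flood", "alert", "river"]]
    index, context = build_context_vectors(docs, dim=512, p=0.02)
    print(estimate_orthogonality(dim=512, p=0.02, trials=500))
    print(cosine(context["warning"], context["alert"]))
```

In this sketch each word costs `dim` entries regardless of vocabulary size, which is the property that makes RI attractive compared with one-hot or BoW vectors. The empirical orthogonality estimate is only a stand-in for the closed-form analysis the paper refers to.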