Twitch chats pose a unique problem in natural language understanding due to their heavy use of neologisms, specifically emotes. There are a total of 8.06 million emotes, over 400k of which were used in the week studied. There is virtually no information on the meaning or sentiment of emotes, and with a constant influx of new emotes and drift in their frequencies, it becomes impossible to maintain an up-to-date manually labeled dataset. Our paper makes a twofold contribution. First, we establish a new baseline for sentiment analysis on Twitch data, outperforming the previous supervised benchmark by 7.9 percentage points. Second, we introduce a simple but powerful unsupervised framework based on word embeddings and k-NN to enrich existing models with out-of-vocabulary knowledge. This framework allows us to auto-generate a pseudo-dictionary of emotes, and we show that we can nearly match the supervised benchmark above even when injecting such emote knowledge into sentiment classifiers trained on extraneous datasets such as movie reviews or Twitter.
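The embedding-plus-k-NN idea can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the value of k, or how neighbor labels are aggregated, so everything below is an assumption: cosine similarity, k = 3, and a plain average of neighbor scores. The function names (`knn_sentiment`, `build_pseudo_dictionary`), the two-dimensional vectors, and the six-word lexicon are hypothetical toy values, not data or code from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_sentiment(token, embeddings, lexicon, k=3):
    """Average the sentiment of the k labeled words nearest to `token`.

    Assumption: neighbors are ranked by cosine similarity and their
    scores are averaged without weighting.
    """
    target = embeddings[token]
    scored = sorted(
        ((cosine(target, embeddings[w]), s)
         for w, s in lexicon.items() if w in embeddings),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = scored[:k]
    if not top:
        return 0.0  # no labeled neighbor available: treat as neutral
    return sum(s for _, s in top) / len(top)

def build_pseudo_dictionary(emotes, embeddings, lexicon, k=3):
    """Map each emote that has an embedding to an inferred sentiment score."""
    return {e: knn_sentiment(e, embeddings, lexicon, k)
            for e in emotes if e in embeddings}

# Toy 2-D embeddings (made up for illustration); real ones would be
# trained on chat logs so that emotes and ordinary words share a space.
embeddings = {
    "great": [0.9, 0.1], "love": [0.8, 0.2], "fun": [0.85, 0.3],
    "awful": [-0.9, 0.1], "boring": [-0.8, 0.3], "trash": [-0.85, 0.2],
    "PogChamp": [0.88, 0.15], "ResidentSleeper": [-0.82, 0.25],
}
# Hypothetical seed lexicon: +1.0 is positive, -1.0 is negative.
lexicon = {"great": 1.0, "love": 1.0, "fun": 1.0,
           "awful": -1.0, "boring": -1.0, "trash": -1.0}

pseudo = build_pseudo_dictionary(
    ["PogChamp", "ResidentSleeper"], embeddings, lexicon, k=3)
print(pseudo)
```

With these toy vectors, the first emote inherits a positive score and the second a negative one, because each sits closest to labeled words of that polarity. A dictionary built this way could then be merged into the vocabulary of a classifier trained on another domain, which is the kind of knowledge injection the abstract describes.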