Keyword extraction is a crucial process in text mining. Extracting keywords together with their contextual events from Twitter data is a major challenge, mainly because of the informality of the language used: misspelled words, acronyms, and ambiguous terms. Current systems that extract keywords from informal language are pattern based or event based. In this paper, contextual keywords are extracted using thematic events with the help of data association. The proposed system identifies the thematic context of events using the uncertainty principle. The thematic contexts are weighted using vectors, called thematic context vectors, which signify whether an event is certain or uncertain. The system is tested on the Twitter COVID-19 dataset and proves to be effective. It extracts event-specific thematic context vectors from the test dataset and ranks them. The extracted thematic context vectors are then used to cluster contextual thematic vectors, which improves the silhouette coefficient by 0.5% over the state-of-the-art methods, namely TF and TF-IDF. The thematic context vector can also be used in other applications such as cyberbullying detection, sarcasm detection, and figurative language detection.
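For readers unfamiliar with the evaluation, the following is a minimal sketch of the TF-IDF baseline and the silhouette coefficient mentioned in the abstract. It does not implement the thematic context vectors themselves. The toy tweets, the cluster labels, the choice of cosine distance, the smoothed IDF formula, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (raw TF times smoothed IDF)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def silhouette(vectors, labels):
    """Mean silhouette coefficient; needs at least two clusters.

    Points that are alone in their cluster score 0.
    """
    scores = []
    for i, (v, li) in enumerate(zip(vectors, labels)):
        by_cluster = {}
        for j, (u, lj) in enumerate(zip(vectors, labels)):
            if j != i:
                by_cluster.setdefault(lj, []).append(cosine_distance(v, u))
        if li not in by_cluster:  # singleton cluster
            scores.append(0.0)
            continue
        a = sum(by_cluster[li]) / len(by_cluster[li])  # mean intra-cluster distance
        b = min(sum(d) / len(d) for l, d in by_cluster.items() if l != li)  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical, already-tokenized tweets about two different events.
tweets = [
    ["covid", "vaccine", "dose"],
    ["vaccine", "dose", "booster"],
    ["lockdown", "school", "closed"],
    ["school", "closed", "lockdown", "again"],
]
vectors = tfidf_vectors(tweets)
score = silhouette(vectors, [0, 0, 1, 1])      # clusters that follow the two events
bad_score = silhouette(vectors, [0, 1, 0, 1])  # clusters that mix the two events
print(score, bad_score)
```

A higher silhouette coefficient means tighter, better-separated clusters, which is why a clustering that follows the two events scores higher here than one that mixes them.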