TAG: Toward Accurate Social Media Content Tagging with a Concept Graph

Although conceptualization has been widely studied in semantics and knowledge representation, it remains challenging to find the most accurate concept phrases to characterize the main idea of a text snippet on fast-growing social media. This is partly because most knowledge bases contain general terms, such as trees and cars, which lack defining power or are not interesting enough to social media app users. Another reason is that the intricacy of natural language allows tense, negation and grammar to change the logic or emphasis of a sentence, thus conveying completely different meanings. In this paper, we present TAG, a high-quality concept matching dataset consisting of 10,000 labeled pairs of fine-grained concepts and web-style natural language sentences, mined from open-domain social media. The concepts we consider represent the trending interests of online users. Associated with TAG is a concept graph of these fine-grained concepts and entities, which provides structural context information. We evaluate a wide range of popular neural text matching models as well as pre-trained language models on TAG, and show that they fall short of tagging social media content with the most appropriate concept. We further propose a novel graph-graph matching method that demonstrates superior abstraction and generalization performance by better utilizing both the structural context in the concept graph and the logic interactions between semantic units in the sentence, obtained via syntactic dependency parsing. We open-source both the TAG dataset and the proposed methods to facilitate further research.
