Social media features substantial stylistic variation, raising new challenges for syntactic analysis of online writing. However, this variation is often aligned with author attributes such as age, gender, and geography, as well as more readily available social network metadata. In this paper, we report new evidence on the link between language and social networks in the task of part-of-speech tagging. We find that tagger error rates are correlated with network structure, with high accuracy in some parts of the network and lower accuracy elsewhere. As a result, tagger accuracy depends on training from a balanced sample of the network, rather than training on texts from a narrow subcommunity. We also describe our attempts to add robustness to stylistic variation by building a mixture-of-experts model in which each expert is associated with a region of the social network. While prior work found that similar approaches yield performance improvements in sentiment analysis and entity linking, we were unable to obtain performance improvements in part-of-speech tagging, despite strong evidence for the link between part-of-speech error rates and social network structure.
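The mixture-of-experts idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the function name `moe_tag_probs`, the use of a dense author embedding as a proxy for the author's region of the social network, the linear softmax experts, and all dimensions are hypothetical and are not the authors' implementation.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def moe_tag_probs(token_feats, author_emb, expert_W, gate_W):
    """Gate-weighted mixture of per-expert tag distributions.

    token_feats: (T, D) feature vector for each of T tokens
    author_emb:  (E,)   embedding of the author's network position (assumed input)
    expert_W:    (K, D, C) one linear tagger per expert, C tags
    gate_W:      (E, K) maps the author embedding to expert weights
    returns:     (T, C) tag distribution for each token
    """
    # Gating: how strongly each expert (network region) is trusted for this author.
    gate = softmax(author_emb @ gate_W)  # shape (K,)

    # Each expert produces its own distribution over tags for every token.
    expert_logits = np.einsum("td,kdc->ktc", token_feats, expert_W)
    expert_probs = softmax(expert_logits, axis=-1)  # shape (K, T, C)

    # Final prediction: convex combination of the expert distributions.
    return np.einsum("k,ktc->tc", gate, expert_probs)


# Toy usage with random parameters (illustrative sizes only).
rng = np.random.default_rng(0)
T, D, E, K, C = 5, 8, 4, 3, 12
probs = moe_tag_probs(
    rng.normal(size=(T, D)),
    rng.normal(size=E),
    rng.normal(size=(K, D, C)),
    rng.normal(size=(E, K)),
)
predicted_tags = probs.argmax(axis=1)  # one tag index per token
```

Because the gate depends only on the author, all tokens in a post share the same expert weights; this is one plausible way to tie experts to network regions, and other gating designs are equally consistent with the abstract.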