The massive popularity of online social media provides a unique opportunity for researchers to study the linguistic characteristics and patterns of users' interactions. In this paper, we provide an in-depth characterization of language usage across demographic groups on Twitter. In particular, we extract the gender and race of Twitter users located in the U.S. using advanced image processing algorithms from Face++. Then, we investigate how demographic groups (i.e., male/female and Asian/Black/White) differ in their linguistic styles and interests. We extract linguistic features from six categories (affective attributes, cognitive attributes, lexical density and awareness, temporal references, social and personal concerns, and interpersonal focus) to identify similarities and differences in specific sets of writing attributes. In addition, we compute the absolute ranking difference of top phrases between demographic groups. As a further dimension of diversity, we also use the topics of interest that we retrieve for each user. Our analysis unveils clear differences in the writing styles and topics of interest of different demographic groups, with variation seen along both gender and race lines. We hope our effort can stimulate new studies on demographic information in the online space.
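The absolute ranking difference of top phrases mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how phrases are ranked, so this example assumes that each group's phrases are ranked by raw frequency and that only phrases appearing in both groups' top-k lists are compared. The function names `rank_phrases` and `absolute_rank_difference` and the sample data are hypothetical and not taken from the paper.

```python
from collections import Counter


def rank_phrases(phrases, top_k):
    """Map each of the top_k most frequent phrases to its 1-based rank.

    Assumption: rank is determined by raw frequency; ties are broken
    alphabetically so the result is deterministic.
    """
    counts = Counter(phrases)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {phrase: rank
            for rank, (phrase, _) in enumerate(ordered[:top_k], start=1)}


def absolute_rank_difference(group_a, group_b, top_k=100):
    """Return |rank in A - rank in B| for phrases in both top_k lists."""
    ranks_a = rank_phrases(group_a, top_k)
    ranks_b = rank_phrases(group_b, top_k)
    shared = ranks_a.keys() & ranks_b.keys()
    return {p: abs(ranks_a[p] - ranks_b[p]) for p in shared}


# Hypothetical phrase occurrences for two demographic groups.
group_a = ["good morning"] * 3 + ["game day"] * 2 + ["new music"]
group_b = ["new music"] * 3 + ["good morning"] * 2 + ["game day"]

differences = absolute_rank_difference(group_a, group_b, top_k=3)
for phrase, diff in sorted(differences.items(), key=lambda kv: -kv[1]):
    print(phrase, diff)
```

A large difference marks a phrase that is prominent in one group but not in the other. Restricting the comparison to shared phrases is a design choice of this sketch; phrases that appear in only one group's top-k list would need a separate convention, such as assigning them rank `top_k + 1`.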