Natural language is generated by people, yet traditional language modeling views words or documents as if generated independently. Here, we propose human language modeling (HuLM), a hierarchical extension to the language modeling problem whereby a human level exists to connect sequences of documents (e.g., social media messages) and capture the notion that human language is moderated by changing human states. We introduce HaRT, a large-scale transformer model for the HuLM task, pre-trained on approximately 100,000 social media users, and demonstrate its effectiveness in terms of both language modeling (perplexity) for social media and fine-tuning for four downstream tasks spanning document and user levels: stance detection, sentiment classification, age estimation, and personality assessment. Results on all tasks meet or surpass the current state-of-the-art.
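To make the hierarchical conditioning concrete, the following LaTeX sketch contrasts standard language modeling with one plausible formalization of HuLM. The notation is an assumption for illustration, not necessarily the paper's own symbols: $d_1, \ldots, d_n$ are one person's documents in temporal order, $U_i$ is the latent human state after processing document $d_i$, and $f$ is a state-update function.

% Standard LM: each token is conditioned only on prior tokens,
% with documents treated as independent of one another.
\[
  p_{\mathrm{LM}}(w_t) = p\left(w_t \mid w_{1:t-1}\right)
\]
% HuLM (illustrative): each token is additionally conditioned on a
% latent human state that evolves across the person's sequence of
% documents, capturing that language is moderated by changing states.
\[
  p_{\mathrm{HuLM}}(w_t) = p\left(w_t \mid w_{1:t-1},\, U_{i-1}\right),
  \qquad
  U_i = f\left(U_{i-1},\, d_i\right)
\]

In a transformer realization such as HaRT, $f$ could be implemented as a recurrent update that folds each processed document into a user-state vector fed back as context for the next document; the precise mechanism belongs to the paper body and is not specified by this abstract.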