People convey their intentions and attitudes through the linguistic style of the text they write. In this study, we investigate lexical usage across styles through two lenses: human perception and machine word importance, since words differ in the strength of the stylistic cues they provide. To collect labels of human perception, we curate a new dataset, Hummingbird, on top of benchmark style datasets. We have crowd workers highlight the representative words in a text that make them think the text exhibits one of the following styles: politeness, sentiment, offensiveness, and five emotion types. We then compare these human word labels with word importance derived from a popular fine-tuned style classifier such as BERT. Our results show that BERT often treats content words unrelated to the target style as important for style prediction, whereas humans do not perceive them that way, even though for some styles (e.g., positive sentiment and joy) human- and machine-identified words overlap significantly.