English is the international standard language of social research, but scholars are increasingly conscious of their responsibility to provide scholarly insight into communication processes worldwide. This tension is as acute in computational methods as in any other area: revolutionary advances in tools for English-language texts have left most other languages far behind. In this paper, we leverage those very advances to demonstrate that multi-language analysis is already accessible to all computational scholars. We show that English-trained measures computed after translation to English have adequate-to-excellent accuracy compared to source-language measures computed on the original texts. We show this for three major analytics (sentiment analysis, topic analysis, and word embeddings) across 16 languages, including Spanish, Chinese, Hindi, and Arabic. We validate this claim by comparing predictions on original-language tweets with predictions on their backtranslations: double translations from the source language to English and back to the source language. Overall, our results suggest that Google Translate, a simple and widely accessible tool, is effective in preserving semantic content across languages and methods. Modern machine translation can thus help computational scholars make more inclusive and general claims about human communication.
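The backtranslation check described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `translate` callable, the toy word tables, the toy sentiment rule, and the use of a simple agreement rate as the comparison metric are all assumptions made for the example. In practice `translate` would wrap a real machine translation service and `predict` would be a source-language model.

```python
from typing import Callable, Sequence

# Hypothetical signature: translate(text, source_lang, target_lang) -> str
Translator = Callable[[str, str, str], str]


def backtranslate(text: str, translate: Translator,
                  source_lang: str, pivot_lang: str = "en") -> str:
    """Translate source -> pivot (English) -> source."""
    pivot_text = translate(text, source_lang, pivot_lang)
    return translate(pivot_text, pivot_lang, source_lang)


def agreement_rate(texts: Sequence[str], predict: Callable[[str], str],
                   translate: Translator, source_lang: str) -> float:
    """Share of texts whose predicted label is unchanged by backtranslation."""
    if not texts:
        raise ValueError("texts must not be empty")
    matches = sum(
        predict(t) == predict(backtranslate(t, translate, source_lang))
        for t in texts
    )
    return matches / len(texts)


# --- Toy stand-ins (assumptions, for illustration only) ---
ES_TO_EN = {"excelente": "good", "malo": "bad", "regular": "fine"}
EN_TO_ES = {"good": "bueno", "bad": "malo", "fine": "bien"}


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Word-by-word lookup; unknown words pass through unchanged."""
    table = ES_TO_EN if (source_lang, target_lang) == ("es", "en") else EN_TO_ES
    return " ".join(table.get(word, word) for word in text.split())


def toy_sentiment(text: str) -> str:
    """Keyword-based Spanish sentiment rule standing in for a real model."""
    words = set(text.split())
    if words & {"excelente", "bueno", "bien"}:
        return "positive"
    if "malo" in words:
        return "negative"
    return "neutral"


tweets = ["día excelente", "día malo", "día regular"]
rate = agreement_rate(tweets, toy_sentiment, toy_translate, "es")
print(f"agreement after backtranslation: {rate:.2f}")
```

In this toy setup the wording of the first tweet changes ("excelente" comes back as "bueno") while its label survives, whereas the third tweet drifts from neutral to positive. That is the kind of semantic loss such a check is meant to surface.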