The principle of independent causal mechanisms (ICM) states that the generative processes of real-world data consist of independent modules which do not influence or inform each other. While this idea has led to fruitful developments in the field of causal inference, it is not widely known in the NLP community. In this work, we argue that the causal direction of the data collection process bears nontrivial implications that can explain a number of published NLP findings, such as differences in semi-supervised learning (SSL) and domain adaptation (DA) performance across different settings. We categorize common NLP tasks according to their causal direction and empirically assay the validity of the ICM principle for text data using minimum description length. We conduct an extensive meta-analysis of over 100 published SSL studies and 30 DA studies, and find that the results are consistent with our expectations based on causal insights. This work presents the first attempt to analyze the ICM principle in NLP, and provides constructive suggestions for future modeling choices. Code is available at https://github.com/zhijing-jin/icm4nlp
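To make the MDL assay concrete, the following is a minimal sketch of how description lengths can compare the two factorizations of a joint distribution; it uses a toy discrete cause-effect pair and a simple count-based prequential coder, not the paper's actual models, and all function names and parameters here are illustrative assumptions.

```python
import math
import random
from collections import Counter, defaultdict

def marginal_bits(symbols, k):
    """Prequential (online) codelength in bits of a symbol stream under an
    add-one-smoothed adaptive categorical model with alphabet size k."""
    counts, bits = Counter(), 0.0
    for i, s in enumerate(symbols):
        bits += -math.log2((counts[s] + 1) / (i + k))
        counts[s] += 1
    return bits

def conditional_bits(ctx, target, k):
    """Prequential codelength of `target` given `ctx`: one adaptive
    categorical model per context value."""
    counts, totals, bits = defaultdict(Counter), Counter(), 0.0
    for c, t in zip(ctx, target):
        bits += -math.log2((counts[c][t] + 1) / (totals[c] + k))
        counts[c][t] += 1
        totals[c] += 1
    return bits

random.seed(0)
# Toy ICM setup: the cause X shifts its distribution halfway through
# (a domain change), while the mechanism X -> Y (add noise mod 4) is fixed.
xs = random.choices(range(4), weights=[7, 1, 1, 1], k=5000) \
   + random.choices(range(4), weights=[1, 1, 1, 7], k=5000)
ys = [(x + (random.random() < 0.1)) % 4 for x in xs]

causal     = marginal_bits(xs, 4) + conditional_bits(xs, ys, 4)  # L(X)+L(Y|X)
anticausal = marginal_bits(ys, 4) + conditional_bits(ys, xs, 4)  # L(Y)+L(X|Y)
print(f"causal     L(X)+L(Y|X) = {causal:,.0f} bits")
print(f"anticausal L(Y)+L(X|Y) = {anticausal:,.0f} bits")
```

In this toy setting, only the marginal L(X) pays for the shift in the causal factorization, whereas both L(Y) and L(X|Y) change in the anticausal one, so the causal direction compresses better; this mirrors the intuition that ICM-aligned modules change independently.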