Causal inference is one of the most fundamental problems across all domains of science. We address the problem of inferring a causal direction from two observed discrete symbolic sequences $X$ and $Y$. We present a framework that uses lossless compressors to infer context-free grammars (CFGs) from sequence pairs and quantifies the extent to which the grammar inferred from one sequence compresses the other. We infer that $X$ causes $Y$ if the grammar inferred from $X$ compresses $Y$ better than the grammar inferred from $Y$ compresses $X$. To put this notion into practice, we propose three models that use the Compression-Complexity Measures (CCMs), Lempel-Ziv (LZ) complexity and Effort-To-Compress (ETC), to infer CFGs and discover causal directions without requiring temporal structure. We evaluate these models on synthetic and real-world benchmarks and empirically observe performance competitive with current state-of-the-art methods. Lastly, we present two unique applications of the proposed models for causal inference directly from pairs of genome sequences of the SARS-CoV-2 virus. Using a large number of sequences, we show that our models capture directed causal information exchange between sequence pairs, which presents novel opportunities for addressing key issues such as contact tracing, motif discovery, and the evolution of virulence and pathogenicity in future applications.
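The decision rule above (infer $X \to Y$ when the structure learned from $X$ compresses $Y$ better than the reverse) can be illustrated with a toy sketch. This is not one of the three proposed models and does not implement LZ complexity, ETC, or CFG inference: it assumes an LZ78-style phrase dictionary as a crude stand-in for an inferred grammar, and the function names `lz78_phrases`, `cross_parse_cost`, and `infer_direction` are hypothetical, introduced only for illustration.

```python
def lz78_phrases(seq):
    """Collect the phrases produced by an LZ78-style parse of seq.

    Assumption: this phrase set is only a stand-in for an inferred grammar.
    """
    phrases = set()
    current = ""
    for symbol in seq:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            current = ""
    return phrases


def cross_parse_cost(source, target):
    """Count the pieces needed to cover target using phrases learned from source.

    Greedy longest match over phrases of length >= 2; a position where no
    such phrase matches is emitted as a single symbol and costs one piece.
    """
    phrases = lz78_phrases(source)
    max_len = max((len(p) for p in phrases), default=1)
    pieces = 0
    i = 0
    while i < len(target):
        step = 1
        for length in range(min(max_len, len(target) - i), 1, -1):
            if target[i:i + length] in phrases:
                step = length
                break
        i += step
        pieces += 1
    return pieces


def infer_direction(x, y):
    """Toy decision rule: the direction with the lower normalized cost wins."""
    cost_y_given_x = cross_parse_cost(x, y) / max(len(y), 1)
    cost_x_given_y = cross_parse_cost(y, x) / max(len(x), 1)
    if cost_y_given_x < cost_x_given_y:
        return "X -> Y"
    if cost_x_given_y < cost_y_given_x:
        return "Y -> X"
    return "undecided"


if __name__ == "__main__":
    print(infer_direction("abababab", "aaaabbbb"))
```

Normalizing each cost by the target length is a design choice of this sketch so that sequences of different lengths can be compared; the actual models may score compression differently.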