Complex systems such as life and language are governed by principles of evolution. The analogy and comparison between biology and linguistics\cite{alphafold2, RoseTTAFold, lang_virus, cell language, faculty1, language of gene, Protein linguistics, dictionary, Grammar of pro_dom, complexity, genomics_nlp, InterPro, language modeling, Protein language modeling} provide a computational foundation for characterizing and analyzing protein sequences, human corpora, and their evolution. However, no general mathematical formulation has yet been proposed to explain the origin of the quantitative hallmarks shared by life and language. Here we report several new statistical relationships shared by proteins and words, which motivate a general mechanism of evolution with explicit formulations that accounts for both previously known and newly observed characteristics. We find that natural selection can be quantified through an entropic formulation based on the principle of least effort, which determines the sequence variations that survive in evolution. In addition, introducing a function connection network explains the origin of power-law behavior and how changes in the environment stimulate the emergence of new proteins and words. Our results demonstrate not only the correspondence between genetics and linguistics across their different hierarchies but also new fundamental physical properties of the evolution of complex adaptive systems. We anticipate that our statistical tests can serve as quantitative criteria for examining whether a theory of sequence evolution is consistent with the regularities of real data. Meanwhile, this correspondence broadens the bridge for exchanging existing knowledge, spurs new interpretations, and opens a Pandora's box of potentially revolutionary challenges. For example, does linguistic arbitrariness conflict with the dogma that structure determines function?