Apparent parallels between natural language and biological sequences have led to a recent surge in the application of deep language models (LMs) to the analysis of antibody and other biological sequences. However, the lack of a rigorous linguistic formalization of biological sequence languages, which would define basic components such as the lexicon (i.e., the discrete units of the language) and the grammar (i.e., the rules that link sequence well-formedness, structure, and meaning), has led to largely domain-unspecific applications of LMs that do not take into account the underlying structure of the biological sequences studied. A linguistic formalization, on the other hand, would establish linguistically informed and thus domain-adapted components for LM applications. It would facilitate a better understanding of how differences and similarities between natural language and biological sequences influence the quality of LMs, which is crucial for the design of interpretable models with extractable sequence-function relationship rules, such as those underlying the antibody specificity prediction problem. Deciphering the rules of antibody specificity is crucial to accelerating rational and in silico biotherapeutic drug design. Here, we formalize the properties of the antibody language and thereby establish a foundation not only for the application of linguistic tools in adaptive immune receptor analysis but also for the systematic immunolinguistic study of immune receptor specificity in general.