Generating synthetic multi-dimensional molecular-mediator time series data for artificial intelligence-based disease trajectory forecasting and drug development digital twins: Considerations

March 16, 2023

Gary An, Chase Cockrell

From arXiv; 16 pages, 2 figures

The use of synthetic data is recognized as a crucial step in the development of neural network-based Artificial Intelligence (AI) systems. While the methods for generating synthetic data for AI applications in other domains have a role in certain biomedical AI systems, primarily related to image processing, there is a critical gap in the generation of time series data for AI tasks where it is necessary to know how the system works. This is most pronounced in the ability to generate synthetic multi-dimensional molecular time series data (SMMTSD); this is the type of data that underpins research into biomarkers and mediator signatures for forecasting various diseases and is an essential component of the drug development pipeline. We argue that the insufficiency of statistical and data-centric machine learning (ML) means of generating this type of synthetic data is due to a combination of factors: perpetual data sparsity due to the Curse of Dimensionality, the inapplicability of the Central Limit Theorem, and the limits imposed by the Causal Hierarchy Theorem. Alternatively, we present a rationale for using complex multi-scale mechanism-based simulation models, constructed and operated on to account for epistemic incompleteness and the need to provide maximal expansiveness in concordance with the Principle of Maximal Entropy. These procedures provide for the generation of SMMTSD that minimizes the known shortcomings associated with neural network AI systems, namely overfitting and lack of generalizability. The generation of synthetic data that accounts for the identified factors of multi-dimensional time series data is an essential capability for the development of mediator-biomarker-based AI forecasting systems, and therapeutic control development and optimization through systems like Drug Development Digital Twins.
