SMT-DTA: Improving Drug-Target Affinity Prediction with Semi-supervised Multi-task Training

Drug-Target Affinity (DTA) prediction is an essential task for drug discovery and pharmaceutical research. Accurate DTA predictions can greatly benefit the design of new drugs. Because wet-lab experiments are costly and time-consuming, the supervised data available for DTA prediction is extremely limited. This seriously hinders the application of deep-learning-based methods, which require large amounts of supervised data. To address this challenge and improve DTA prediction accuracy, we propose a framework with several simple yet effective strategies: (1) a multi-task training strategy, which trains jointly on the DTA prediction task and the masked language modeling (MLM) task using the paired drug-target dataset; (2) a semi-supervised training method that strengthens drug and target representation learning by leveraging large-scale unpaired molecules and proteins during training, which differs from previous pre-training and fine-tuning methods that use unpaired molecules or proteins only in pre-training; and (3) a cross-attention module to enhance the interaction between drug and target representations. Extensive experiments are conducted on three real-world benchmark datasets: BindingDB, DAVIS and KIBA. The results show that our framework significantly outperforms existing methods and achieves state-of-the-art performance, e.g., an RMSE of $0.712$ on the BindingDB IC$_{50}$ measurement, an improvement of more than $5\%$ over the previous best work. In addition, case studies on specific drug-target binding activities, drug feature visualizations, and real-world applications demonstrate the great potential of our work. The code and data are released at https://github.com/QizhiPei/SMT-DTA
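As a rough illustration of strategies (1) and (2), the sketch below combines a regression loss on paired affinity labels with MLM losses that could come from both paired and unpaired sequences. It is a minimal sketch under stated assumptions, not the released implementation: the function names, the use of mean squared error for the regression term, and the single weighting coefficient `mlm_weight` are all hypothetical, and all numbers passed to it would be toy values.

```python
import math

def mse_loss(predicted, target):
    """Mean squared error over paired affinity labels (assumed regression loss)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def rmse(predicted, target):
    """Root mean squared error, the metric reported in the abstract."""
    return math.sqrt(mse_loss(predicted, target))

def mlm_loss(token_probs, masked_positions, true_tokens):
    """Average negative log-likelihood of the true token at each masked position.

    token_probs[i] is a dict mapping token -> predicted probability at position i.
    """
    if not masked_positions:
        return 0.0
    nll = [-math.log(token_probs[i][true_tokens[i]]) for i in masked_positions]
    return sum(nll) / len(nll)

def multitask_loss(dta_loss, mlm_losses, mlm_weight=1.0):
    """Hypothetical combined objective: affinity loss plus weighted MLM terms.

    mlm_losses may hold MLM terms from paired drug-target data as well as
    from unpaired molecules and proteins (the semi-supervised part).
    """
    return dta_loss + mlm_weight * sum(mlm_losses)
```

For example, `multitask_loss(mse_loss([6.1, 7.4], [6.0, 7.0]), [paired_mlm, unpaired_mlm], mlm_weight=0.5)` would add half of each MLM term to the affinity loss, where `paired_mlm` and `unpaired_mlm` are placeholder values returned by `mlm_loss`. How the terms are actually weighted and batched is not stated in the abstract.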
