Searching large databases for potentially active compounds is a necessary step for reducing time and cost in modern drug discovery pipelines. Virtual screening methods provide predictions that allow the search space to be narrowed. Although cheminformatics has made great progress in exploiting available big data, caution is needed to avoid introducing bias and to ensure that predictions remain useful for new compounds. In this work, we propose the decision-support tool ALMERIA (Advanced Ligand Multiconformational Exploration with Robust Interpretable Artificial Intelligence) for estimating compound similarity and predicting activity based on pairwise molecular contrasts while accounting for conformational variability. The methodology covers the entire pipeline, from data preparation to model selection and hyperparameter optimization. It has been implemented using scalable software and methods that exploit large volumes of data (on the order of several terabytes) and respond very quickly even to large batches of queries. The implementation and experiments were carried out on a distributed computer cluster using the publicly available DUD-E benchmark database. In addition to cross-validation, detailed data split criteria were used to evaluate the models on different data partitions and to assess their true generalization ability on new compounds. Experiments show state-of-the-art performance for molecular activity prediction (ROC AUC: $0.99$, $0.96$, $0.87$), indicating that the chosen data representation and modeling generalize well. The effect of molecular conformations was also evaluated, in terms of both prediction performance and sensitivity analysis. Finally, an interpretability analysis was performed using the SHAP method.
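To illustrate why the choice of data split matters when assessing generalization to new compounds, the following minimal Python sketch holds out whole groups of compounds (for instance, all ligands of one target or one scaffold family) and scores predictions with ROC AUC. It is not the ALMERIA implementation: the function names `roc_auc` and `group_holdout_split`, the grouping key, and the 25% test fraction are assumptions made for illustration only, and the split criteria actually used in this work are not reproduced here.

```python
import random
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random active outscores a random decoy.

    Labels are 1 (active) or 0 (inactive). Ties between an active and an
    inactive count as 0.5. This pairwise form is O(n_pos * n_neg), which is
    fine for a sketch but not for terabyte-scale data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def group_holdout_split(groups, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) such that no group appears on both sides.

    `groups` holds one hypothetical group label per compound, e.g. a target
    name or a scaffold identifier.
    """
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    names = sorted(by_group)
    random.Random(seed).shuffle(names)
    # Hold out at least one whole group.
    n_test = max(1, round(len(names) * test_fraction))
    test_idx = sorted(i for g in names[:n_test] for i in by_group[g])
    train_idx = sorted(i for g in names[n_test:] for i in by_group[g])
    return train_idx, test_idx
```

Because every compound in a held-out group is unseen during training, a score measured on `test_idx` is a more conservative estimate of performance on new compounds than one obtained from a purely random split.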