Predicting protein targets for compounds is of high practical importance, for example for elucidating modes of action from phenotypic screens. It is, however, not trivial, in particular because the underlying data is highly biased (classes differ greatly in size, contain chemical structures of very different diversity, etc.) and because biochemical assay measurements carry experimental error. In this interview, we explore two recent publications that deal with precisely these two distinct (but related) aspects: how to normalize probabilities for class membership predictions on the output side, and how to deal with measurement error on the input side when generating target prediction models.
The interview was conducted with Lewis Mervin and Marianna Trapotsi, both of whom have been PhD students at the University of Cambridge, funded by AstraZeneca. Lewis Mervin is now working for AstraZeneca, while Marianna Trapotsi is in the last year of her degree.
Congratulations on your recent publications! Let’s focus first on the first aspect mentioned above, namely how to deal with dataset imbalances in drug-protein target prediction models, and the resulting difficulty of arriving at absolute target likelihoods. Could you please summarize what you have done in this work, and in which way it is relevant for drug discovery?
Lewis Mervin (LM): Thank you! Our recent studies have focused on protein-ligand prediction, and more specifically on the task of generating more reliable scores from machine learning algorithms to better represent the probability of compounds binding proteins of interest. This is a problem because of dataset bias (with respect to both size and chemical diversity of molecules in different activity classes; sometimes you have only a dozen active molecules for a protein, while in other cases you have thousands) and the associated practical difficulty of predicting absolute class membership likelihoods. In practice, the fraction of truly active molecules among those assigned a given output score is often higher or lower than the score suggests. This is something I have seen for many models I have trained, and in those circumstances the models can be considered poorly calibrated. It is a problem for chemogenomics modelling, since it means that the probabilities generated by these methods are not always sufficient for decision making when models are applied in practice. Hence the first method we evaluated to address this problem was Venn-ABERS, a probability calibration method, which we applied to protein-ligand prediction. We chose to explore this method because it was also an active area of research through our collaborations with Royal Holloway, and because it has been shown in various settings to outperform other calibration methods.
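As an editorial illustration (not taken from the publications), the notion of calibration described here can be checked with a simple reliability table: predictions are grouped by score, and the mean score in each group is compared with the fraction of molecules that are actually active. The function name `reliability_table`, the equal-width binning, and the toy data are all assumptions made for this sketch.

```python
from collections import defaultdict

def reliability_table(scores, labels, n_bins=5):
    """Group predictions into equal-width score bins and compare the mean
    predicted score with the observed fraction of actives in each bin."""
    bins = defaultdict(list)
    for score, label in zip(scores, labels):
        # Clamp so that a score of exactly 1.0 falls into the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))
    table = []
    for index in sorted(bins):
        members = bins[index]
        mean_score = sum(s for s, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((index, len(members), mean_score, observed))
    return table

# Hypothetical toy data: model scores and experimental labels (1 = active).
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

for index, count, mean_score, observed in reliability_table(scores, labels):
    print(f"bin {index}: n={count}, mean score={mean_score:.2f}, "
          f"observed actives={observed:.2f}")
```

For a well-calibrated model the last two columns would roughly agree in every bin; large gaps indicate the kind of miscalibration discussed above.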
So how does this go beyond what has been known before, how does it advance the field?
LM: In this study, Venn-ABERS predictors were able to combat the issues of data bias and algorithm limitations, recalibrating the base output of the target prediction models into more interpretable and actionable scores. We compared these scores to traditional methods such as Platt scaling and isotonic regression, and were excited to see that, for the ChEMBL dataset, the probability estimates for predicted targets of ligands were much closer to the ideal. We are also exploring how the multi-probability scores output by Venn-ABERS can be used for confidence estimation, since a larger interval between those scores is thought to indicate higher uncertainty.
This then seems to address one current problem in target prediction better than some preceding methods. But, if I am not mistaken, it did not yet consider experimental errors in the bioactivity data used. So why did you decide to explore “Probabilistic Random Forests” in the context of target prediction now, for the first time?
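To make the multi-probability output more concrete, here is a minimal pure-Python sketch of an inductive Venn-ABERS predictor. It is an illustration under stated assumptions, not the implementation used in the study or in PIDGIN: the helper names `isotonic_fit` and `venn_abers`, the toy calibration set, and the final merging formula p1 / (1 - p0 + p1) (a commonly used way to collapse the pair into one probability) are choices made for this example. The idea is to fit isotonic regression on held-out calibration scores twice, once with the test compound tentatively labelled inactive and once labelled active, which yields a lower estimate p0 and an upper estimate p1.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: return the sorted unique scores and the
    non-decreasing fitted value for each of them."""
    # Merge tied scores into one weighted point (label sum, count).
    grouped = {}
    for s, y in zip(scores, labels):
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, count + 1)
    xs = sorted(grouped)
    # Each block holds [label sum, weight, number of unique scores covered].
    blocks = []
    for x in xs:
        total, count = grouped[x]
        blocks.append([total, count, 1])
        # Merge backwards while block means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total2, count2, span2 = blocks.pop()
            blocks[-1][0] += total2
            blocks[-1][1] += count2
            blocks[-1][2] += span2
    fitted = []
    for total, count, span in blocks:
        fitted.extend([total / count] * span)
    return xs, fitted

def venn_abers(cal_scores, cal_labels, test_score):
    """Return (p0, p1): calibrated probabilities of activity obtained by
    adding the test score to the calibration set with label 0, then 1."""
    probs = []
    for assumed_label in (0, 1):
        xs, fitted = isotonic_fit(list(cal_scores) + [test_score],
                                  list(cal_labels) + [assumed_label])
        probs.append(fitted[xs.index(test_score)])
    return probs[0], probs[1]

# Hypothetical calibration set: raw model scores and known labels.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 0, 1, 0, 1, 1, 1]

p0, p1 = venn_abers(cal_scores, cal_labels, 0.5)
merged = p1 / (1.0 - p0 + p1)  # one common way to merge into one value
print(f"p0={p0:.3f}, p1={p1:.3f}, width={p1 - p0:.3f}, merged={merged:.3f}")
```

The width p1 - p0 is the interval referred to above: with little or conflicting calibration data near the test score the two fits disagree more, which is why a wide interval is read as a sign of uncertainty.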
Marianna Trapotsi (MT): The main caveat to binary classification approaches is that they weight minority cases close to the threshold boundary equivalently in distinguishing between activity classes. For example, pXC50 activity values of 5.1 and 4.9 would be put into different activity classes (at a classification threshold of 5), even though the experimental error may not afford such discriminatory accuracy. This is particularly important since previous studies on databases such as ChEMBL have shown that there are significant deviations, in the range of 0.4 to 0.7 log units, between intra- and inter-laboratory replicates. Hence we concluded that this is an issue that needs to be addressed in our field of protein target prediction for ligands.
We then came across the Probabilistic Random Forest, PRF, which had previously been used for large-scale, highly noisy, astronomical data. We quickly realized the applicability of such an approach also in the target prediction area, and so this was exactly what we were looking for in order to address the issue we had identified.
LM: This algorithm is particularly interesting since it takes into account uncertainties in the assigned class labels and represents activity in a framework in between classification and regression, and we were hence able to combine the best of both worlds. Compared to regression, we were able to utilize all data, even censored data far from a cut-off, which is usually only possible in classification models. At the same time, we were able to take into account the granularity around the cut-off in a numerical manner, which is usually only the case for regression models. The combination of both has resulted in better probability estimates, particularly for those testing instances close to the decision boundary. Hence our hypothesis has been proven right, in that PRFs generate more realistic probabilities than traditional random forests for testing instances with higher experimental uncertainty.
So when would you recommend using Probabilistic Random Forest rather than other algorithms, such as classic Random Forest?
MT: By applying Probabilistic Random Forest in a target prediction setting and comparing it with the classic Random Forest, we identified cases where the former outperformed the latter, especially for marginal points close to the classification threshold. Therefore, the decision to use Probabilistic Random Forest over classic Random Forest should be made by taking into account parameters of training data quality, such as the experimental uncertainty and whether the values are distributed close to the classification threshold.
How can the results you have obtained be used by others, and how do you plan to continue from here?
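The idea of uncertain class labels can be illustrated with a short sketch. One plausible way to turn a measured pXC50 into a probabilistic label is to assume normally distributed experimental error and take the probability that the true value lies above the classification threshold; this is an assumption for illustration and not necessarily the exact transformation used in the publication. The function name `activity_probability` and the default error of 0.6 log units (chosen from within the 0.4 to 0.7 range quoted earlier) are likewise hypothetical.

```python
import math

def activity_probability(pxc50, threshold=5.0, sigma=0.6):
    """Probability that the true activity lies above the threshold, assuming
    normally distributed experimental error with standard deviation sigma."""
    z = (pxc50 - threshold) / sigma
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for value in (4.0, 4.9, 5.1, 6.0):
    hard_label = int(value >= 5.0)  # conventional binary label
    soft_label = activity_probability(value)
    print(f"pXC50 {value}: hard label {hard_label}, soft label {soft_label:.3f}")
```

Under this assumption, values of 4.9 and 5.1 receive soft labels only slightly below and above 0.5, while measurements far from the cut-off approach 0 or 1, so marginal data points carry less certainty than clear-cut ones instead of being forced into opposite classes.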
MT: The processed bioactivity datasets are available online, and a public version of the models implementing the Venn-ABERS predictors is also available at https://github.com/BenderGroup/PIDGINv4. The methods are maintained by members of the Bender group and are still under active development and improvement. We have recently updated the dataset to ChEMBL version 28. The evaluation and models for the probabilistic random forests are also available at https://github.com/BenderGroup/PRF.
LM: As an outlook, other exciting work is also evaluating how to use Venn-ABERS estimates as a type of uncertainty estimation for molecule property prediction. This is something that we are considering as a way to assess the extension of the applicability domain for various projects.
Thank you for the interview, and all the best for your future research in the target prediction field!
- Comparison of Scaling Methods to Obtain Calibrated Probabilities of Activity for Protein–Ligand Predictions https://pubs.acs.org/doi/abs/10.1021/acs.jcim.0c00476 / A Comparison of Scaling Methods to Obtain Calibrated Probabilities of Activity for Ligand-Target Predictions https://chemrxiv.org/engage/chemrxiv/article-details/60c74b150f50dbf17c396b93
- Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty https://chemrxiv.org/engage/chemrxiv/article-details/60c75862bdbb897fa4a3ad59
- Probabilistic Random Forest: A machine learning algorithm for noisy datasets https://arxiv.org/abs/1811.05994