Why AI and Drug Discovery are no match made in heaven

Artificial Intelligence (AI) has been described as the ‘fourth industrial revolution’ that will lead us to self-driving cars, computers understanding human language, and automated drug discovery.

I believe the parts about self-driving cars – legal and societal barriers will likely be more important than scientific and technical ones in due course. Also, while there is debate about the quality of translations, Google Translate is good enough to be useful for me already, so I’d entirely buy into that.

The problem is: drug discovery is a very different beast.

So what’s the problem with drug discovery, then? (Note that the following is an extension of the article and subsequent discussion by Al Dossetter on LinkedIn recently.)

Let’s briefly outline the ‘drug discovery process’ first (which is only a crude generalization anyway, but it may be useful here as an overview):

Desired outcome at each successive stage:

  • Compound active on target/in cellular assay
  • Suitable in vitro properties (selectivity, solubility, …)
  • Efficacy in animal model, tolerable toxicity
  • Efficacy in man, tolerable toxicity, better than standard of care
  • Commercially viable (market size, market need, pricing)

So what we really care about when doing drug discovery are the in vivo results – we don’t want to treat a protein with a drug, or a cell line, or a rat; we want to achieve efficacy, with tolerable toxicity, in humans.

So in which way can AI now support the different phases of drug discovery?

Conceptually, it can be useful in any of the above steps, leading to an increased ability to discover hits, optimize in vitro properties, … and thereby providing an increased likelihood of in vivo efficacy, at tolerable toxicity.

BUT – AI needs data, and this is the weak point when trying to apply ‘AI’ to the drug discovery field. This is not object or speech recognition, where we have a huge amount of both labelled and unlabeled data.

Let’s hence now examine three criteria for data in the drug discovery process we discussed in a previous post:

  • Amount of data available,
  • Reliable labeling of data, and
  • Problem relevance (here for the in vivo situation).

Let’s see which data we have available related to the different phases above – leaving out market considerations here, and sticking to the scientific goal of finding a bioactive entity that is able to cure disease:

Phase | Key data available | Amount of data | Consistent data labelling | Problem (in vivo) relevance
Hit discovery | Bioactivity, solubility, … | ++ | + | o
Lead Optimization | Solubility, permeability, off-target activities, simple DMPK | + | + | o
Animal Studies | Efficacy and toxicity data in animals, animal PK | o+
Clinical Studies | Human endpoint efficacy data, human safety, human PK | ++

So what we see is: In early phases we have more data, which is more clearly labelled – but it is less relevant to in vivo outcomes, such as efficacy. In late phases we have data that is more relevant to in vivo outcomes, but we have very little data available in general.

To support the above statements with some facts – in databases such as ChEMBL, ExCAPE or PubChem we have millions of bioactivity datapoints, linking compound structures to protein targets. But activity against a target does not make a drug, far from it. So we have lots of data that is insufficient to understand and anticipate the in vivo situation.

On the other hand, in databases such as ToxRefDB, DrugMatrix or Open TG-GATEs we have on the order of (a low number of) thousands of compounds covered with animal toxicity data – in a chemical space that comprises 10^33 (or so) compounds in total. So we likely have more relevant data at hand – but for very few compounds, since generating such data is costly (e.g., the DrugMatrix data generation has cost on the order of $100m).
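
To make the scale of that mismatch concrete, here is a back-of-the-envelope sketch. It reads the chemical space estimate as 10^33 compounds and uses 5,000 as an assumed stand-in for ‘a low number of thousands’; both numbers are order-of-magnitude placeholders, not measured values.

```python
# Rough coverage arithmetic; both figures are order-of-magnitude estimates.
compounds_with_animal_tox_data = 5_000  # assumed stand-in for "a low number of thousands"
chemical_space_size = 10**33            # chemical space estimate quoted in the text

# Fraction of the estimated chemical space with animal toxicity data.
coverage = compounds_with_animal_tox_data / chemical_space_size

print(f"Fraction of chemical space covered: {coverage:.1e}")
```

Whatever the exact inputs, the covered fraction stays vanishingly small, which is the point of the comparison.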

What is now meant by ‘Consistent data labeling’?

Imagine a consumer clicks on an Internet link and buys a product – here you have clear data points, unambiguously connecting the dots between clicking on a link, and buying a product. However, whether a drug shows efficacy in a disease (or toxic side effects) depends, at the very least, on dose, route of delivery, and individual genetic setup of the organism and the disease (i.e., the endotype), among many other variables. So there is no clear label one can assign, such as ‘drug X treats disease Y’ – yes, sometimes, but sometimes not, depending on how and in which context the drug is applied to a particular organism. Hence labels in the biological domain are generally much more ambiguous, and context-dependent, than in other domains.
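
The labelling problem can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Observation` fields, the helper functions and the three records are made up for illustration and are not real experimental data. It shows that once dose, route and endotype are dropped, the same (drug, disease) pair ends up with contradictory labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One hypothetical in vivo result, including the context it was seen in."""
    drug: str
    disease: str
    dose_mg_per_kg: float
    route: str
    endotype: str
    efficacious: bool


# Entirely made-up records for illustration; not real experimental data.
observations = [
    Observation("drug_X", "disease_Y", 1.0, "oral", "endotype_A", False),
    Observation("drug_X", "disease_Y", 10.0, "oral", "endotype_A", True),
    Observation("drug_X", "disease_Y", 10.0, "oral", "endotype_B", False),
]


def context_free_labels(records):
    """Collect the set of outcomes seen for each (drug, disease) pair."""
    labels = {}
    for r in records:
        labels.setdefault((r.drug, r.disease), set()).add(r.efficacious)
    return labels


def ambiguous_pairs(records):
    """Return pairs whose label flips once the context is ignored."""
    return [pair for pair, outcomes in context_free_labels(records).items()
            if len(outcomes) > 1]


print(ambiguous_pairs(observations))
```

A click-and-buy record would never land in the ambiguous list, since the outcome is the label; here, ‘drug X treats disease Y’ is both true and false until the context is put back.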

(In many cases we simply also don’t know which early-stage data are predictive of in vivo effects – an article published just last month concluded for example that “Chemical in vitro bioactivity profiles are not informative about the long-term in vivo endocrine mediated toxicity”.)

Of course there have been notable successes around ‘AI in drug discovery’, for example in the areas of synthesis prediction, automated chemistry, bioactivity modelling, or using image recognition to analyze phenotypic screening data. These are all important areas to work on – however, they are also a good number of steps away from the more difficult biological and in vivo stages, where efficacy and toxicity in living organisms decide the fate of drugs waiting to be discovered. Hence, there is still a gap that needs to be bridged, in an area that needs progress most, namely in vivo efficacy and toxicity.

So AI and ‘drug discovery’ may not be a match made in heaven – but that’s not necessarily a problem, since we live on earth anyway. There is obviously ample data around in the drug discovery process, the amounts available will increase, and we need to analyze them, so much is clear. Quite possibly, from what I can see, AI will be used more for deselection (rather than positive selection), to increase the odds of success. But we certainly need to learn which models matter for the in vivo situation, instead of just ‘plugging data into the machine’, no matter their relevance for the human setting, and hoping to get the right answer out.

The question is hence where we have data that are, at the same time, sufficient in amount and sufficiently relevant to predict the efficacy- and toxicity-related properties of potential therapies that matter in the in vivo situation. We will explore concrete examples in future posts.


DrugDiscovery.NET – AI and Machine Learning in Drug Discovery, in Practice

Machine Learning and Artificial Intelligence are currently increasing in importance – due to significantly increased data availability, the development of new methods, and our understanding of how best to apply those methods.

However, using ‘data’ in the drug discovery field, be it early stage data (e.g., for discovering a compound active against a target) or later stage data (from preclinical and clinical phases), differs significantly from using data in other domains, which tend to

  • Be more information-rich with respect to the number of data points – think of video or text data, for example, which are available at scale, whereas data in the drug discovery context need to be generated experimentally, which is particularly costly at the clinical end of the scale;
  • Have more clearly labelled data – think about a customer who clicks on a link and then buys or does not buy a product, vs a drug which causes a particular effect in a particular human, but only in the context of this particular dose, interactions with other medications, the particular genetic setup, etc.; and/or
  • Have data that represent what we are actually interested in – if a customer buys a product then he or she buys the product, but in drug discovery we often use proxy variables (say, PAMPA or Caco-2 assays for permeability, in vitro toxicity assays, animal studies to predict human response, etc.) where the value of the data for the property of actual interest is often unclear or disputed.

Hence, while there clearly will be value in analyzing data using AI/ML in the wider field of drug discovery, in my experience some very relevant questions, often relating to the points above, are currently not being asked. This website will aim to discuss developments in the field, and to provide a critical context to it – since only if we question what we do will we end up with methods that work in practice.