‘-Omics’ Data – So where is the signal, please?

Biology has over recent decades moved to finer and finer levels of detail, be it in the type of readout, or in better resolution in the spatial (eg single-cell) or time domain. And, understandably, there is considerable excitement every time we are able to generate data in a technologically novel way.

The question is, though: In which way is this data practically useful, be it to understand biology (and patient subtypes/endotypes), or to discover new drugs?

Especially now, with the majority of ‘AI in drug discovery’ startups focused on generating novel drugs, there needs to be data that links disease biology (genes, mutations, …) to potential therapy (be it small molecules or biologics) – and this link can only be as strong as the data is (since even the fanciest algorithm will not make up for poor data!).

Which types of ‘-omics’ data are around currently?

Some types of biological data we are currently able to generate relatively easily in this context are summarized in the table below and further described in the text (with their current practical utility in drug discovery discussed, in the personal opinion of the author, further below):

Technique | Information provided | (Potential) benefits
Genome sequencing | DNA sequence of an organism (human, pathogen, etc.) | Understanding the ‘building blocks’ of life; variations associated with disease; identifying drug targets
Single cell sequencing | Sequence/expression levels at the single-cell level | Understanding heterogeneous cell populations (cells that drive disease, contribute to eg drug resistance in cancer, etc.)
Gene expression | Expression levels of genes | Identifying activity of genes related to cellular function, disease, drug efficacy/resistance, …
Cellular imaging | Geometry (morphology) of the cell and its organelles | Understanding visually (via markers) changes in cellular organization
  • Genome sequencing data – data available about both human and other DNA has increased particularly since the Human Genome Project in the 1990s, with the expectation at the time that we would learn more both about human biology, and in turn also about potential drug targets;
  • Single cell sequencing – a concept that has become significantly more popular in the last decade, with the realization that cell populations are highly heterogeneous; eg the Sanger Institute in Hinxton has invested heavily in the area recently. This type of data hence allows for the generation of spatially better resolved data, which is important eg for understanding heterogeneous cancer cell populations;
  • Epigenetics information – this describes heritable traits beyond modifications of the DNA sequence, a concept which interestingly goes back all the way to 1942 (before even actual ‘genes’ were known) and the work of Waddington;
  • Gene expression data – capturing not the sequence of genes but rather their expression levels, where larger-scale work from the disease side goes back to around the early 1980s. This field grew significantly in popularity with Affymetrix GeneChips in the 90s, and with more recent data from the compound side becoming available eg via the Connectivity Map and LINCS within the last 10 years or so. During this time other techniques such as RNA-Seq, and its experimentally simpler and more affordable cousins such as RASL-seq, TempO-Seq, DRUG-Seq (and others), have also been established;
  • Proteomics data – which describes a biological system not on the gene but on the protein level, thereby also accounting for the fact that gene and protein levels are often only weakly correlated. In this area experimental approaches have made significant leaps in recent years, but generally both the experimental setup and the data analysis remain significantly more difficult than on the transcriptomic level;
  • Metabolomic information – which identifies and quantifies metabolites in a living system, with claims to be ‘closer to the phenotype’ than even the proteomics level. Aspects such as metabolite identification remain tricky, but structure elucidation and experimental techniques are continuing to evolve significantly;
  • Imaging data – which can refer to data on rather different levels, from the cellular to the organ and organism level. On the cellular level the field was driven eg by developments in confocal microscopy (interestingly, the first confocal microscopy patents date back to 1957!), but it also comprises eg 3D imaging methods for tumors. While some of the underlying physical principles for data generation have been known for longer, efficient data analysis methods only emerged much more recently, in the last 15-20 years. In the context of this discussion the focus will be on cellular imaging and its use in the drug discovery context, in particular on High-Content readouts, such as those available via the recent Cell Painting assay/datasets and similar formats;
  • And others, which are not included here for now

We have data, great – and now?

To preempt the conclusion of this piece somewhat: I entirely share the excitement about generating data on a finer and finer level of detail, both from the purely technological side and, to a good extent, for its implications for understanding fundamental biology. What is less clear to me in many cases, however, is that we actually know what to do with all that data subsequently, in the rather practical context of discovering safe and efficacious drugs at reasonable expense and pace, and/or patient subtyping for personalized medicine. My point is not that this is never possible – my point is that we have a huge amount of data, and compared to that the practical utility is comparatively small.

Personally, I first encountered this conundrum when analyzing High-Content Screens myself during my postdoc at Novartis more than 10 years ago, where the cellular parameters determined from automated microscopy seemed (and after 10 years still seem!) rather cryptic to me with respect to biological interpretation and utility (though some approaches to eg rationalize high-content readouts mechanistically have been published recently). Our (relative) lack of understanding doesn’t seem to be limited to cellular microscopy readouts – I would argue that it is common to the majority of ‘-omics’ types of data generated. We need to observe the response of biology to compound application – fully agreed, so high-dimensional readouts can in principle tell us more than eg target-based assays alone. But my point is that, without proper hypotheses in the first place, the data generated is often difficult to handle subsequently – either for statistical reasons (eg weak signals in cases where we have few samples and a high-dimensional readout space), and/or because the experimental setup is simply irrelevant for any in vivo situation (say, due to using single-cell systems, or physiologically irrelevant doses or time points), or for many other possible reasons.
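The statistical problem of few samples in a high-dimensional readout space can be made concrete with a toy simulation (pure noise and invented dimensions, not real data): with a handful of samples and a thousand random features, some feature will almost always correlate strongly with any phenotype, by chance alone.

```python
import random

# Toy illustration: with few samples and many features, pure noise
# can look like a strong 'signal'. All numbers here are simulated.
random.seed(0)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

n_samples, n_features = 8, 1000          # eg 8 patients, 1000 'genes'
phenotype = [random.gauss(0, 1) for _ in range(n_samples)]
features = [[random.gauss(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

# The best-correlating noise feature looks like a real hit
best = max(abs(pearson(f, phenotype)) for f in features)
print(f"best |correlation| among pure-noise features: {best:.2f}")
```

Without multiple-testing correction and validation samples, such spurious 'hits' are indistinguishable from biology – which is exactly the hypothesis-free-data problem discussed here.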

So we can generate all this data – but what does it mean, and how can it be used? In the following I will (very) briefly – and admittedly subjectively, though not without evidence – shed some glimpses of light on the impact that genome sequencing, single cell sequencing, gene expression data, and cellular imaging have had so far on both aspects: understanding disease as well as drug discovery.

So what did – in brief – genome sequencing, single cell sequencing, gene expression data, and cellular imaging data contribute to drug discovery today?

Sequencing: The sequencing of the human genome was an ambitious project, compared at the time to bringing the first human onto the moon (a picture that has been used rather frequently, also more recently, in this context). At the time it was expected that there would be “More drug targets… 3000–10 000 targets compared with 483” (luckily this work didn’t state a particular time frame for that to happen). It appears that, at least 20 years later, we didn’t really get there – recent (2017) estimates put the number of drug targets at around 667. On the other hand, with CRISPR and related techniques, will we be able to expand the number of drug targets in the near future, and isn’t this based on previous projects, such as the Human Genome Project? In addition, maybe a focus on small molecules held us back – and will new chemical modalities/biologics help in the future? So I wouldn’t say the impact of sequencing the human genome extended to ‘3,000-10,000 drug targets’ – that hasn’t materialized yet. But we certainly can annotate genomes, and hence proteins, more systematically than we were able to before. So in a way, genome sequencing by itself didn’t really unravel the dynamic interactions of living systems or expand druggable targets on a huge scale. But it helped catalog biology better, which is crucial for storing and annotating data in the future. Sequencing also helped for practical purposes, such as understanding the heterogeneity of cancers (and that eg two cell lines in the NCI 60 screening set are actually the same, which wasn’t known before), thereby providing a basis for defining what we actually deal with.
On the patient level also genetic drivers for other diseases have been identified, as an example I will pick Pulmonary Arterial Hypertension (PAH) here given recent local work here in Cambridge – where, inch by inch, the authors were able to tease out genetic factors which contribute to PAH, given suitable datasets, methods, and making a dedicated effort.

VERDICT: Sequencing data has helped us greatly to catalog and annotate biology. Some advances have been made in identifying genetic drivers for disease as well. However, its impact on developing new drugs has been limited in my personal opinion. Reasons include that few diseases (except some genetic diseases) are purely defined by DNA sequence; even if there is a genetic contributing factor, other contributing factors will be required for a disease to develop; and on the methodological side, sample sizes have often been small and mathematical methods are trickier to use (eg when it comes to biases) than a simple ‘data in – knowledge out’ would suggest.

Single cell sequencing data: The core argument for spatially resolved single-cell sequencing data is that transcription events are not uniformly distributed across cells, and hence for understanding cellular populations across different areas (say, development, understanding disease, and drug response) a finer level of detail than cellular population averages is needed. On the fundamental level this is intuitively true – and some research exists that underlines the practical usefulness of this type of data, such as when understanding developmental processes and the heterogeneity of brain tumors, with practical implications for drug response. Other studies used single-cell data to describe determinants of drug response in cancer immunotherapy. It seems to me that single-cell sequencing is currently somewhere near the peak of the hype cycle, with articles such as “A Project to Map All Human Cells Will Change How Disease Is Cured” – what is this claim really based on though? When reading the article I cannot say – lots of ‘might’ and ‘could’ feature in it. Practically I wonder if the finer and finer level of spatial and temporal resolution will really lead us to a more unified picture of disease biology that will be useful for practical purposes – since we need to generate the data in the first place, store and analyze it, and then, importantly, identify common patterns in it – which, given the larger number of variables, will be more and more tricky the finer the level of data we generate in the first place. Beyond the first studies on development and disease which have already appeared, I think we need to see whether the insight gained really justifies the (rather large) investments in the area.

VERDICT: ‘Proof of concept’ has certainly been established, eg for understanding development, or understanding drug response in patients, so there is a clear scientific rationale for this work to further our understanding of biology. Where does it help in drug discovery though, or practical patient subtyping (in the clinic)? I think this is where the level of detail generated can be tricky to handle (see also below) – even ‘conventional’ sequencing often isn’t really used in cancer clinics right now. So from the viewpoint of ‘value for money’ this might be difficult to justify… but maybe fundamental research never is?

Gene Expression data: Firstly a personal disclaimer: I have used gene expression data quite a lot (both personally and in my research group), and I love it! That being said: Maybe it’s more a love/hate relationship. There have been many successful examples of using gene expression data eg for repurposing, mode of action analysis, understanding drug efficacy and understanding the toxicity of potential new drugs, so from the empirical angle transcriptomics data seems to be very valuable. On the other hand, would I claim that we really understand gene expression data? We can do Gene Set Enrichment Analysis (GSEA), Weighted Gene Co-Expression Network Analysis (WGCNA), etc. … but it seems the outcome of such analysis is very often that a rather large number of genes are differentially expressed, heavily dependent on the precise method and parameters used, leading to a number of colorful pathway annotations being modulated… and drawing concrete conclusions from the finding, a proper interpretation, is rather tricky. Even beyond the method used – does the data even come from the right disease state/tissue, and has it been taken at the right time point/compound dose/etc.? Quite often the answer to those questions is simply – ‘we don’t know!’. In addition, much of current (compound-derived) gene expression data has been generated in cell lines – how does this extrapolate to decision making in patients? So, while empirically gene expression data – which is quite cheap and fast to generate – has turned out to be very useful, understanding gene expression data, in my personal opinion, is – often – tricky. But maybe a signal is good enough – eg looking at the most up- and downregulated genes, and applying it for signal detection and repurposing, without even understanding the data in every detail? It seems that this works rather well, as studies of other groups as well as our own have shown.
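The ‘maybe a signal is good enough’ idea can be sketched as a minimal connectivity-style score in the spirit of the Connectivity Map: compare the most up- and downregulated genes of a disease signature against a compound profile, and look for reversal. Gene names and fold-changes below are invented for illustration:

```python
# Minimal, illustrative connectivity-style matching: a compound whose
# expression changes oppose a disease signature is a repurposing
# candidate. All gene names and log-fold-changes are made up.
disease_logfc = {"G1": 2.5, "G2": 2.1, "G3": -1.8, "G4": -2.2, "G5": 0.1}
compound_logfc = {"G1": -1.9, "G2": -1.5, "G3": 1.2, "G4": 2.0, "G5": 0.0}

def signature(logfc, n=2):
    """Top-n up- and down-regulated genes, as two sets."""
    ranked = sorted(logfc, key=logfc.get)
    return set(ranked[-n:]), set(ranked[:n])   # (up, down)

def connectivity(disease, compound):
    """+1 per concordant gene, -1 per discordant; negative = reversal."""
    d_up, d_down = signature(disease)
    c_up, c_down = signature(compound)
    score = len(d_up & c_up) + len(d_down & c_down)
    score -= len(d_up & c_down) + len(d_down & c_up)
    return score

print(connectivity(disease_logfc, compound_logfc))  # → -4 (reversal)
```

Real implementations use rank-based enrichment statistics over thousands of genes rather than set overlaps, but the principle – acting on the extremes of the signature without mechanistically interpreting every gene – is the same.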

VERDICT: I would put gene expression data into the category ‘we don’t quite understand what we do, but often it’s rather useful in practice’ – we in many cases do not really model and understand the data, far from it; but for a variety of practical applications transcriptomics data has turned out to be useful. And it’s cheap and easy to generate, which is a plus – though care has to be taken precisely how to set up a biological system to be predictive for a given question one asks.

Cellular Imaging data – While DNA sequencing and transcriptomics have been around for a while and are now rather established techniques, cellular imaging data seems to be in its ‘second spring’ currently, after the first practical demonstration of general principles in the late 1990s and the more recent standardization of readouts, such as in the CellPainting assay. To me personally it was rather surprising to see how long it took to establish standards for data generation and handling in the field (processes still ongoing) – but now that standards emerge, organizations such as the EPA (and also many pharmaceutical companies) are rather swift adopters of the readout, and companies such as Recursion Pharmaceuticals are banking heavily on this type of readout, along with some of the big pharma companies. Some first examples of drug repurposing applications (with somewhat different formats) do exist, so there seems to be inherent information content in cell morphology-based readouts. The same holds true for compound target prediction using imaging data. But what is the biological setup I need (in particular when it comes to the ‘ugly siblings’: cell line, time point and dose), and how do I need to analyze data for a given purpose? Do I really need imaging data – is it worthwhile using such data compared to other tools, such as ligand structure-based target prediction? This may very well be the case, in particular given the reusability of images for different endpoints, but at least according to the information available in the public domain we probably still need to wait for further applications and comparative studies to emerge. In my very personal opinion, given the research we also perform in the group, we have frequently encountered situations where trivial signals were easy to detect in CellPainting readouts – but where subsequent, finer-grained information was much more difficult or even impossible to tease out.
Is this due to the data, or rather the analysis method and endpoint used by us? Very difficult to say at this stage.

VERDICT: While image-based cellular morphology readouts have now been around for more than 20 years it is, in my opinion, still difficult to say what they can precisely be used for, and what the best setup for a given purpose is. It seems that cellular imaging, from the data generation, storage/handling, as well as the analysis side – with respect to practical impact – has been in the ‘establishing best practice’ stage for quite a while now, in particular when it comes to hypothesis free/general data generation (obviously looking for particular markers is very different!). Hence I would say that no final verdict is possible at this stage, but in order to establish practical value eg of the CellPainting assays we would ideally need to move on to ‘production phase’ shortly, and likely this needs to involve multiple partners and larger consortia.

So what now – is there a point in using ‘-omics’ data, or not?

Overall I think the picture, looking at those four different types of readouts as a sample, is rather mixed – and for very different reasons in each case. Sequencing allows us to catalog – with little direct, but much indirect, impact on understanding disease and on drug discovery. Single cell sequencing allows us to understand healthy and disease biology better – but its cost will likely be prohibitive for some time for practical applications in the clinic, such as patient subtyping. Gene expression data is practically tremendously useful in different areas, such as repurposing and others – without us really understanding what is happening (which may or may not be a problem in particular cases). Cellular imaging has recently arrived at standards for data generation, which is good to see – but the readouts are still rather cryptic, and practical utility still remains to be shown, at least in the public domain.

Technique | Claim | Contribution | Not yet realized
Sequencing | ‘Understanding the building blocks of life’ | Significant contributions to systematically cataloging biology, some impact on patient subtyping | Little direct impact on drug discovery itself (certainly compared to original claims)
Single cell sequencing | Understanding heterogeneity in disease, processes in development | Case studies show first suitability for understanding eg developmental processes, cell heterogeneity, and drug response | Applications for drug discovery & in the clinic (not the lab!) still to be established; the amount of data generated might be prohibitive for some (many?) applications
Gene expression data | Understanding cellular states and processes | Various successful applications eg in repurposing, understanding modes of action, and understanding and predicting modes of toxicity | Signals often difficult to interpret overall (but genes can be interpreted and eg put on pathways individually); analyses are heavily method- and parameter-dependent
Imaging | Understanding cellular processes on the morphological level | Allows generation of standardized cell morphology data on a large scale, comparatively cheaply | Readouts are cryptic; practical utility beyond simple examples still needs to be demonstrated (at least in the public domain)

Problems with ‘-omics’ data

While every new technology has a certain period where it needs to show its value, some aspects can be commonly observed (especially in the current context of ‘AI’ in drug discovery, which doesn’t always consider the complexity, predictivity, and variance of biological data fully):

To a good extent, simplified model systems are still used – often simple cell lines (though there is considerable effort to move into 3D and heterogeneous cell cultures etc.). To what extent does this represent the patient, or a situation in the clinic? This uncertainty relates to any of the -omics readouts mentioned above where cell lines are used to generate data. (On the other hand, cells can of course simply be seen as ‘signal generators’ that respond to eg compound treatment – but in this case one needs to let go of the idea that measurements in cells have direct clinical applications in a then different biological context.)

We assume that ‘more is better’ – finer levels of spatial resolution, temporal resolution, different types of readouts, … until we can generate spatially and temporally resolved maps of our body on a cellular level. Apart from the practical problems of data handling and analysis, the problem that emerges is: How can we generalize this? How do we integrate this type of information, to go from data to understanding? This is not a trivial point – we generate more and more variables with the data we produce, which isn’t really matched by the number of data points we have – but to get to knowledge we need an underlying map to integrate this data into a unified concept. Otherwise we just – generate data. But generating data cannot be the end of it if we are to arrive at practically useful solutions.

Technology push vs scientific pull – I am old enough to have observed this in the 2000 biotech bubble, and again now – our human mindset in ‘the West’ is (usually) skewed towards (a) new is better (‘Artificial Intelligence is better than Machine Learning’ – although the latter has been around for decades), (b) analytical approaches (where ‘more data is better’ – although predictivity/signal should be the guideline instead), and (c) economic interests of companies, which have the (rather natural, from their perspective) incentive of pushing their own product into the market. If you have a start-up presenting on ‘AI using -omics data and deep learning’ then the VCs will flock to you – and this company will create a technology push, where potential customers, because of ‘Fear Of Missing Out’ or other reasons, will find it difficult to resist buying into the new offering. The skeptical voice will receive less attention – maybe he or she is simply behind the current state of the art! Scientific pull is, as far as I can observe, less often the reason for developments than technology push. Jumping on every new technology will never allow us to understand the capabilities of the tools we already have. Promoting a new hype, though, is sexier than issuing warnings – leading to article headlines such as ‘More is better’ when it comes to -omics data, with the contents being rather light on detail as to why ‘more’ should precisely be ‘better’, in particular across applications, and when taking the cost of generating such data into account as well.

Where is the signal in the data? This goes back to the question of hypothesis-free vs hypothesis-driven data generation. Based on my experience, hypothesis-free data might well seem more ‘universal’ at the start – but it harbors the problem of not being able to identify the signal one needs for decision making. In some cases (such as gene expression data) we know from experience that this is a useful type of data to generate – and it has the advantage that we can easily link it back to existing knowledge (genes, pathways, etc.). So if you can – generate hypothesis-driven data. This needs to be driven by a scientific question, cover suitable experimental design, and will likely require consortia in many cases (not just for sharing data, as is often the case, but also for generating data in the first place).

Data is crucial for use in AI – but in many currently published studies it isn’t always clear how data has been used precisely for decision-making, which methods were applied, and what the contribution of ‘AI’ was compared to a negative control (eg an established method). As one example, selecting a drug candidate for Wilson’s Disease was based on data from “more than 2,400 diseases and over 100,000 pathogenic mutations” – in my experience, though, this doesn’t give you a neat mutation, or a handful of drugs for testing, in return. You rather get lots of possible links back… and then you need a human to sift through the information (which is often the much trickier part!). As for ‘validation’, it is simply very easy to ‘re-discover’ with AI what you have known anyway – but without a negative control it is very difficult to assess what the added advantage of AI is.

Translation from research to the clinic is poor: As in: really poor. There are fancy things that appear on the research front, characterizing cancer landscapes etc. The wife of a friend of mine had breast cancer in Germany, and what was the genetic information used for decision making? Nada – standard approach for everyone (operation, radiation and subsequently Tamoxifen), and that was it. I have heard similar stories from Spain – lots going on in research, but in the clinic? Translation is meager, partially due to lack of validation of methods in the clinic, partially due to established workflows and difficulty of implementation, partially due to cost. Do we really get most ‘bang per buck’ for taxpayer’s money this way though? I am not sure about that.


We need to generate data, such as -omics data, for describing biology – but the devil is in the detail; which data do we generate, in which biological setup, and how do we analyze it for which purpose? Here, much remains to be done. We will likely run into problems with generating and understanding the data we produce if we always jump to finer levels of resolution in the spatial and temporal domain, both on the practical level (eg with respect to data storage and analysis) as well as on the scientific level (how to make sense of fine-grained data). Biological noise, and the inability to generate the numbers of data points needed, will likely set limits to what we can achieve with ever finer levels of resolution (not necessarily when characterizing systems, but probably when trying to draw actionable conclusions from the data).

We probably need to explore collaborative approaches for establishing the state of the art when it comes to data – the current focus on ‘proof by example’ is likely not useful; we need to have controls and established methods as a baseline, and studies where reproducibility is ensured. Hiding logistic regression baselines in the Supplementary Material to promote the apparent superiority of ‘deep learning’ is also not a good approach. A certain distrust in scientific publications is understandable – and ‘higher-ranked’ journals do not really seem to fare better than ‘lower-ranked’ ones – “only half of the articles [from machine learning in biology and medicine] shared software, 64% shared data and 81% applied any kind of evaluation. Although crucial for ensuring the validity of ML applications, these aspects were met more by publications in lower-ranked journals.”

Let’s see what we can discover in -omics data in the future – I am certainly very curious and hope that they can be put to good use when answering practical questions in the healthcare setting in the future.


So did ‘AI’ just discover its first drug? Comment on “Deep learning enables rapid identification of potent DDR1 kinase inhibitors”

The recent publication ‘Deep learning enables rapid identification of potent DDR1 kinase inhibitors’ by the team around Alex Zhavoronkov at In Silico Medicine, together with WuXi AppTec and the University of Toronto, has received quite some attention recently – so what is behind it?

Did it really happen that AI ‘discovered its first drug’? Let’s look at this work in more detail.

The authors used an implementation of ‘generative tensorial reinforcement learning (GENTRL)’, whose objective functions include information about on-target activity, synthetic feasibility, and novelty. Six compounds designed against the kinase DDR1 were synthesized and tested in biochemical assays, leading to four active compounds (below 10uM, with one compound reaching an IC50 of 10nM against DDR1), and two compounds being active in cellular assays. In addition, the pharmacokinetics of the compounds was determined in mice.

Having four, or even two, out of six compounds ‘active’ against the intended target is certainly not a bad ‘hit rate’, by any means. But how about novelty? And we don’t develop drugs against proteins, we intend to treat people – hence, how about the pharmacokinetics of the compound, efficacy, and safety?

To evaluate novelty of the compounds I just did a ChEMBL search of their most active compound, compound 1, at 75% similarity, in order to evaluate novelty in the public domain (or at least in this database):

Compound 1 from the publication, active at 10nM against DDR1, and two of the six most similar compounds retrieved from ChEMBL (at 75% similarity). The bottom left compound is active against ABL1 at 19nM (with known cross-reactivity with DDR1), while the bottom right compound is active against JAK1.
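Similarity searches such as the ChEMBL query above are typically based on the Tanimoto coefficient over molecular fingerprints; a minimal sketch with invented fingerprint bit sets follows (real searches use eg 2048-bit circular fingerprints computed from the structures, not hand-picked bits):

```python
# Tanimoto similarity between two fingerprints, represented here as
# sets of 'on' bits. The bit sets below are invented for illustration.
def tanimoto(fp_a, fp_b):
    """|A ∩ B| / |A ∪ B| – 1.0 means identical fingerprints."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

query = {1, 5, 9, 12, 33, 47, 58, 101}   # hypothetical query compound
hit = {1, 5, 9, 12, 33, 47, 77, 102}     # hypothetical database compound

sim = tanimoto(query, hit)
print(f"Tanimoto: {sim:.2f}")            # → Tanimoto: 0.60
```

A ‘75% similarity’ search then simply returns all database compounds whose Tanimoto coefficient against the query is at least 0.75 – so this hypothetical pair would fall just below the cutoff used above.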

We can see that the algorithm rearranges heterocyclic ring systems of known kinase inhibitors to come up with novel/rearranged structures. In this case activity against ABL was possibly extrapolated to DDR1, two kinases with known cross-reactivity. (Though, given that I cannot reproduce the workflow myself in detail, I also cannot give the origin of this structure with absolute certainty.) What is interesting is that the synthetic (and other) filters also prioritized substructures so similar to known compounds – maybe synthetic chemistry has rather strong biases and preferences (which wouldn’t be an entirely new observation, of course!).

Apart from on-target activity, the authors also evaluated pharmacokinetics of the compound, which they described as favourable – however, this seems to have been unrelated to the design hypothesis (ie, PK was not considered explicitly here).

What I appreciate about the article is that it does not claim to have ‘discovered drugs’ in any way, as opposed to some tweets which described this as ‘AI doing drug discovery’. However, drugs need to show their effect in vivo, and this has not (yet!) been done in this work. This would be the logical next step though – and even more so this is a crucial step, given that the majority of drugs in clinical development fail due to lack of efficacy, which is difficult to anticipate from early-stage data (the whole gamut of compound distribution, metabolism, target engagement, etc. comes into play beyond binding to an isolated target).

So given that the in vivo study only comprises PK and no efficacy (or extensive tox) components, I would probably not share the view, voiced by some readers on Twitter, that AI has performed ‘drug discovery’ – but it has certainly allowed the discovery of novel bioactive chemical matter in a short amount of time, I fully agree with that.

One thing that might be worth pointing out is that this paper uses quite a lot of information that isn’t really available in many other early-stage projects, such as crystal structure data and information about existing active compounds. How would the method perform on other targets with much less such information available? It would be interesting to see a ‘simple’ baseline method for comparison – so what would have happened with bog-standard, say, ligand-based virtual screening, docking, and proteochemometrics modelling as baselines? Given that all this ligand and structural information is available it could relatively easily have been used for comparison – and the less information we really need to use a method, the wider it will be applicable in practice.

I am looking forward to the next steps of this work, and in particular moving into the biological domain – tackling the biological steps of discovering drugs with computational methods would likely bring huge advantages when it comes to anticipating efficacy and safety, and hence reducing attrition in the clinic.


‘AI’ in Toxicology (In Silico Toxicology) – The Pieces Don’t Yet Fit Together

‘All happy molecules are alike; each unhappy molecule is unhappy in its own way.’

Anna Karenina Principle, adapted from Tolstoy

For a drug to be useful for treating a disease, it ‘only’ needs to fulfill two scientific criteria (we leave out commercial aspects here): It needs to be efficacious (improve disease state, which may also be symptomatic, at least for a subset of the patient population, and usually compared to standard of care); and this needs to happen at sufficient safety (or tolerable toxicity, which of course is in turn related to effective dose and indication).

Both of those aspects seem superficially to be mirror images of each other, but far from it – efficacy is the presence of one desirable property (or a small number of them) leading to the desired effect, such as up- or downregulating a particular biological pathway (or many other things). Safety, on the other hand, is the absence of a long list of undesirable properties, which by its very nature is more difficult to deal with in practice.

So both parts are fundamentally different in nature – efficacy is selection for the presence of one given property (which realistically can only be defined on an organism/in vivo level). Safety means the absence of many possible events, which by its very nature is very difficult to predict. How do you predict an unknown mode of toxicity that you haven’t seen before? Or the potential toxicity of an entirely novel structure?

Visually we can represent this as follows (based on a drug approved in 2019, the first one to treat postpartum depression, brexanolone):

Illustration that efficacy and safety are very different in nature, the former requiring the presence of one/a limited number of compound properties, while the latter requires the absence of a long list of undesirable properties. List adapted from Guengerich et al. Note that the link of a single target to efficacy is likely also simplified to a good extent, but this may still serve as an illustration of the conceptual difference between efficacy and safety.

(I should add that there is no claim that this particular drug is unsafe in any way – it is simply a recently approved drug which has been used here as an example for the many aspects of safety one needs to consider, and I could have picked any other example as well to underline this general principle.)

This is also what makes developing a drug so much more difficult than ‘finding a ligand’ (which, in particular in academic publications, is often equated with ‘drug discovery’) – it’s the in vivo behaviour that counts, and a whole balance of phenotypic endpoints. This is why we have millions of ligands in ChEMBL, but only a few thousand approved drugs – many compounds are active on a target, and some even achieve an efficacy-related endpoint; but tolerable safety thins out the field considerably.

Toxicity may also be either mechanism-related or not mechanism-related, which provides a very different context – eg early anti-cancer drugs with unspecific cytotoxicity could easily be assumed to have mechanism-related toxic effects. However, non-mechanism-related toxicities are more difficult to anticipate, since one doesn’t initially know what to look for.

Given the interest in ‘AI’, the field of in silico toxicology has also received considerably more attention in the recent past. So where do things stand currently – and how good are we at using computational algorithms to predict the toxicity of small molecules? This is what this post aims to summarize. (By its nature it will be rather brief; for more details the reader is referred to recent reviews on databases and approaches in the field.)

In the following, four approaches to toxicity prediction will be distinguished, which are based on different types of data:

  1. Target/Protein-based toxicity prediction (analogous to target-based drug discovery), which also forms the basis of eg safety pharmacology, and which is hence regularly employed in practice: You screen a compound against a dozen (or up to a hundred) proteins, and hope that the results will be informative. The ‘target’ of a drug may also be related to efficacy at the same time (‘on-target related toxicity‘), or it may be an ‘off-target’ related toxicity, but in any case some link is assumed to be present between ligand activity against a protein and a toxic effect of the compound. Basically, this part of safety assessment means: ‘We know where we are in bioactivity space, but we don’t quite know what this means for the in vivo situation‘ (how relevant the target really is, see below);
  2. Biological readout (‘omics’) based toxicity prediction, eg based on transcriptomics (‘toxicogenomics‘), imaging data, or other high-dimensional readouts derived from a more complex biological system than activity against an individual target. In this case, biological readouts are not on the target (direct ligand interactor) level, but they can be enumerated (genes, image-based readout variables), and some features may be more directly interpretable (genes), while others are less interpretable (morphological features). This corresponds to the situation ‘We are capturing biology on a large scale, along one or few readout types, and hope that those readouts tell us something relevant for toxicity’ – which always needs to be shown. (I have learned about this in the context of analyzing high-content screening data, which was generated in a hypothesis-free manner, and where interpretation and use of the data were far from trivial.)
  3. Empirical/phenotypic toxicity – In this case toxicities are measured ‘directly’ in a ‘relevant model system’, say the formation of micronuclei, hepatotoxicity etc. In this case (usually, but not always) cellular effects are, as opposed to the previous point, not mapped to annotated genes/pathways, and the toxicity/empirical endpoint derived from the system is taken as the readout variable. This represents the situation ‘We don’t really know where we are in bioactivity space and in biology, but we are more confident that the readout is relevant for compound toxicity in vivo’; and
  4. Systems models – Conceptually different from the above three categories, ‘systems models’, either on the experimental or the computational level, aim to model complex biological systems, such as ‘organs on a chip‘, or by using metabolic network models or cellular signalling models. They can conceptually be used for any aspect of understanding and predicting compound action in vivo (related both to efficacy and toxicity). While this category of models may not be fully functional in many areas currently, I have high hopes for the future, given they may well represent a useful trade-off between complexity and cost (ie, practically useful predictivity at reasonable cost).

So to summarize, the current types of data we can use for in silico toxicity prediction are visualized as follows:

Biological systems used to generate data for toxicity prediction (but also more generally biological assay systems) can either be more empirical (left), or abstract (right), or in between.
The level of abstraction, and our understanding of the system, need to be taken into account to choose meaningful readout parameters. Relevance for the in vivo situation then depends on the strength of the link between the readouts generated and their impact in man, as well as on quantitative aspects such as PK/exposure (in addition to practical assay setup questions, see below). Note that this is a conceptual figure, and details differ depending on the readout and the particular application area (drug discovery vs consumer safety etc.)

So how well is ‘AI’/in silico toxicology doing today, in the above four areas? Where can we predict safety, or toxicity, sufficiently well?

1. Target-based toxicity prediction aims to use protein target activities as predictors to anticipate toxicity. This is an appealing concept due to its simplicity (protein assays are generally easy to perform), hence its frequent use in safety profiling in pharmaceutical companies. However, when it comes to its relevance to the in vivo situation, and hence its practical relevance, two fundamental questions exist:

  • What is the link between activities against protein targets, and adverse reactions observed in vivo? Some recent reviews exist which link the two (Bowes et al., Lynch et al., Whitebread et al.), and ever since working as a postdoc at Novartis on links between protein targets and adverse reactions years ago I have wondered who picks the proteins to be measured in safety panels, and using which types of criteria. However, quantitative information linking protein activity to adverse reactions is hard to obtain, for many reasons (the biases in reporting drug adverse reactions being a particularly big contributor here, which has been the basis of some recent studies as well). If we look at the available data from our own analysis, based on relationships between protein activities and adverse reactions observed for marketed drugs, another problem emerges though (this is the result of work by Ines Smit in my lab; a publication is currently in preparation):
Conditional probability of a compound showing an adverse event given a hit against a safety panel protein (X axis), vs conditional probability of a compound hitting a safety profiling target given an association with/presence of an adverse event (Y axis). Basically, either you have a very promiscuous safety panel protein that is sensitive, but has low positive predictive value (top/left); or you have a very specific safety panel protein with low hit rate (bottom/right). Choose your devil!
OPRM1: Mu opioid receptor, HTR2A: Serotonin 2a (5-HT2a) receptor, DRD2: Dopamine D2 receptor, SLC6A4: Serotonin transporter, HRH1: Histamine H1 receptor, ADRB1: Beta-1 adrenergic receptor, CHRM3: Muscarinic acetylcholine receptor M3, KCNH2: HERG/Potassium voltage-gated channel subfamily H member 2 (Work by Ines Smit, to be published shortly)

It can be seen in the above figure that some proteins are rather good at detecting adverse events (high positive predictive value), but they do so at low hit rates (low sensitivity; bottom/right). Or you have proteins with high hit rate (sensitivity), but they have low positive predictive values (ie, in many cases activity against a protein does not lead to an adverse event; top/left). So – choose your devil! Quite possibly combinations of assays perform better such as rules learned on assay data, but the main point remains that protein-to-adverse reaction mappings are no simple 1:1 relationships, far from it. (I should add that the positive predictive values listed above are significantly larger than those one would expect by random, so there certainly is some signal in them. However, the key question is: how can we use this information for decision making, instead of just creating a ‘worrying machine’ that throws up warning signs all the time? For this we still need to understand better which information is contained in particular assay readouts, in their combinations, and how this translates to the human situation.)
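To make this trade-off concrete, here is a minimal sketch of how sensitivity and positive predictive value are computed from a confusion matrix of assay hits vs observed adverse events. All counts and the function name below are invented for illustration – they are not the actual panel data behind the figure:

```python
def sensitivity_and_ppv(tp, fp, fn, tn):
    """Sensitivity = P(assay hit | adverse event);
    PPV = P(adverse event | assay hit)."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return sensitivity, ppv

# A promiscuous panel protein: it rarely misses an event, but most hits
# never translate into an adverse reaction (high sensitivity, low PPV).
print(sensitivity_and_ppv(tp=80, fp=120, fn=20, tn=30))   # (0.8, 0.4)

# A specific panel protein: hits usually mean trouble, but most adverse
# events are missed entirely (low sensitivity, high PPV).
print(sensitivity_and_ppv(tp=15, fp=5, fn=85, tn=145))    # (0.15, 0.75)
```

Neither corner is satisfying on its own – which is exactly the ‘choose your devil’ situation described above.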

The other big problem of course is in vivo relevance, due to compound PK and exposure. What is the route of administration, metabolism, excretion etc. associated with a compound? We often simply don’t know (though efforts such as the High-Throughput Toxicokinetics Package try to bridge that gap). However, given that ‘the dose makes the poison’ this is something we need to know in order to make an informed decision.
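As a toy illustration of why ‘the dose makes the poison’, one can relate an in vitro potency to an expected plasma concentration as a simple margin. The function name and all numbers here are hypothetical, and real toxicokinetics tools such as the HTTK package do far more than compute this ratio:

```python
def margin_of_exposure(in_vitro_ac50_um, plasma_cmax_um):
    """Ratio of the in vitro point of departure (AC50, in uM) to the
    expected human plasma Cmax (in uM). Large margins suggest the
    in vitro hit is unlikely to matter in vivo; margins near (or
    below) 1 flag a potential concern."""
    return in_vitro_ac50_um / plasma_cmax_um

# The same assay AC50 leads to very different conclusions once the
# expected exposure is taken into account:
print(margin_of_exposure(in_vitro_ac50_um=10.0, plasma_cmax_um=0.05))  # 200.0
print(margin_of_exposure(in_vitro_ac50_um=10.0, plasma_cmax_um=8.0))   # 1.25
```

Without at least an estimate of exposure, the same in vitro readout simply cannot be placed in context – which is the gap reverse dosimetry efforts try to close.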

Hence, target-based ways of anticipating safety have severe limitations. In the context of ‘AI’ and computational methods this is even more of a limitation, since this is basically the only area where we have sufficient amounts of data (target-based activity data) – but for in vivo relevant toxicity prediction this is often simply not good enough; we don’t really know what a prediction in this space means for the in vivo situation.

One might argue that at least for some proteins a link between in vitro effects and adverse reactions is clearly given, as evidenced by the focus on, say, hERG activity and its link to arrhythmia over the last 15 years or so. However, this seems often to be driven by the need for simple, testable endpoints, rather than by extremely strong in vitro–in vivo links, as more recent studies have made clear.

So to summarize, when it comes to protein activities we have (relatively speaking) lots of data, of tolerable quality, but the relevance for the in vivo situation is not always warranted. This is due to unknown links between protein targets and adverse drug reactions, but also due to the immense uncertainty of quantitatively extrapolating in vitro readouts at a defined compound dose to the in vivo situation with often less defined PK and exposure.

2. Biological readout-based toxicity prediction – In this case, high-dimensional biological readouts are employed for the prediction of toxicity – in practice (mostly due to the ability to generate such data) often gene expression data, or more recently also imaging data. We have some experience in this area ourselves, such as in the QSTAR project with Janssen Pharmaceuticals, which aimed to use gene expression data in lead optimization. There are also studies from other groups, aiming eg to predict histopathology readouts based on gene expression data. In line with the generally increasing interest in ‘-omics’ data and high-dimensional biological readouts, the EPA also proposes to follow this route in their ‘Next Generation Blueprint of Computational Toxicology at the U.S. Environmental Protection Agency‘, and to use both gene expression and imaging data for profiling of chemicals, somewhat adjusting their earlier efforts with the ToxCast profiling datasets. And I do think they have a good point: Biological readouts generate, for relatively little cost (depending on the precise technology chosen), a lot of data, meaning they profile compound effects rather broadly (not just one endpoint, but everything that can be measured, under the given assay conditions, along the given slice of biology considered).

The question is, though: Is this also relevant for the prediction of toxicity? Is there a signal – and if so, where is it?

This is a tricky question to answer – a very tricky one actually, when one is trying to generate practically useful models for toxicity prediction based on -omics readouts. Some of the variables that need to be considered in this case include:

  • What is the dose of a compound to be tested? (At low dose there is often no signal; at high dose the result is not relevant for the in vivo situation! Also, identical concentrations are not actually ‘the same’, since eg drugs are given at different doses, and show different PK, so there is no objective way to choose consistent assay conditions.) Should a compound be given as a single dose, or with repeat dosing? What is relevant in humans?
  • What is the time point to be chosen for readouts? Early stage (eg 6 hours), or later stage (eg 24 or 48 hours)? How is this related to later dosing of the drug (which of course is often not known early on)?
  • What is the precise biological setup to be used, eg the cell line (or better spheroids, co-cultures etc)? Is the signal from one cell line relevant for others (and what is actually the distribution of the compound in vivo, across tissues and cells…)?
  • How do we normalize data – should we use housekeeping genes (or are those not so stably expressed after all)?
  • How is toxicity to be defined? In vivo endpoints are noisy and variable (but tend to be more relevant); lower-level endpoints (say, cellular cytotoxicity) are often less noisy (but tend to be also less relevant). If using eg histopathology data, do we use single endpoints or combinations thereof? (How can we summarize data? Are there actually ‘classes’ of toxicity, which we need for model training?)
  • How do we deal with variability between eg animal histopathology readouts? Control animals that show significant histopathology (which always happens in practice)?
  • Etc.

So as we can see, just ‘generating lots of data’ will not do – we need to generate relevant data. What this means, though, depends on the case (compound etc.).

Generally I believe using high-dimensional biological readouts is worthwhile, since one samples a large biological space at often suitable cost. We use them a lot as well – eg repurposing using gene expression data works rather well. However, on the one hand this relates to efficacy, not toxicity, which is very different in nature (see above). On the other hand it is unclear how to define the biological setup of the system in which data is measured (one could even claim that either one has a consistent setup, or one that is relevant for the in vivo situation, but not both at the same time…), how to define toxic endpoints, and hence how to generate sufficient data for such models. This is at odds with some of the literature from the ‘AI’ field, which eg happily discusses toxicity prediction with RNA-Seq data – but which signal in such data predicts precisely which in vivo relevant endpoint is not really clear to me, and I think this needs to be worked out in much more detail to be practically useful.
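As an aside, the repurposing idea mentioned above can be sketched in a few lines: a compound whose expression signature anti-correlates with a disease signature is a candidate for reversing that disease state (a connectivity-map-style score). The gene signatures and names below are entirely made up for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length signatures."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy log-fold-change signatures over the same five genes:
disease    = [2.0, -1.5, 0.5, 1.0, -2.0]
compound_a = [-1.8, 1.2, -0.4, -0.9, 1.9]   # roughly reverses the disease signature
compound_b = [1.9, -1.4, 0.6, 0.8, -1.7]    # roughly mimics it instead

print(pearson(disease, compound_a))  # strongly negative -> repurposing candidate
print(pearson(disease, compound_b))  # strongly positive -> not a reverser
```

Note that this illustrates the efficacy-related use of such readouts; as argued above, mapping the same kind of data to toxic endpoints is far less straightforward.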

So while the technical problems of generating high-dimensional biological readouts are partially solved, the practical aspects of how to generate such data, and which signal is predictive for which toxic endpoint, seem, on the whole, still largely work in progress. Inconsistently generated data will lead to problems when training computational toxicology models, though.

3. Empirical/phenotypic toxicity – in this case we have a model system to generate empirical toxicity readouts for compounds. Hence, this type of data is on the one hand based on a complex biological system (certainly more complex than isolated proteins); but on the other hand we do not care about individual biological parameters (be it binding to a receptor, or the up- and downregulation of a large number of genes). Instead, we rather link a compound structure to an empirical, phenotypic endpoint. Eg the micronucleus assay might be one such assay; animal toxicity studies which consider particular endpoints might be another. There has been lots of discussion on the predictivity of animal studies, generally concluding that in many cases, depending on the animal chosen and the toxic endpoint considered, animals are often able to indicate the presence of toxicity (ie, a compound toxic in animals is often toxic in humans), while they are less good at indicating the absence of toxicity (ie, a compound not toxic in animals is not necessarily safe in humans). See eg Bailey, as well as the reply to that work and related publications, for a more detailed discussion.

What is the problem with this group of endpoints in the context of in silico toxicity prediction? On the positive side, the endpoint is, if well-chosen, relevant for the in vivo situation, which is a big plus already (see the comments above on target-based data and biological readouts as a contrast). On the other hand, depending on the particular assay, we might still not have suitable translation to the in vivo situation, mostly due to a lack of proper PK/exposure estimates. In addition, as mentioned in the introduction, empirical toxicity readouts only consider what is explicitly examined – we only see the histopathological readouts we examine, we only detect the toxicity that we set up an assay to detect. But, since the list of possible toxicities a compound may have is very large, coverage of toxicity space is difficult to achieve. (I will leave out a more detailed discussion of using animal-based readouts for safety analysis at this point, which opens up a whole new set of questions due to the much more complex system used.)

That being said, one area of toxicology, namely (digital) pathology, has benefited greatly from the advent of machine learning, with the ability to classify readouts more quickly and more consistently than humans. This area is basically made for applying Convolutional Neural Networks (CNNs), given their performance in image recognition, although in some cases additional technical steps need to be taken to address eg the rotational invariance of cellular systems. While this addresses some issues related to consistency of data, the question of how to extrapolate from animal histopathology to human relevance of course remains, in addition to the question of how to define toxic endpoints in a relevant manner (based on which readouts and quantitative thresholds).
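One common technical step for the rotational-invariance issue mentioned above is simply to augment the training images with rotated copies, so the network sees cells in all orientations. A minimal pure-Python sketch (a real pipeline would use eg numpy.rot90 or framework-level image transforms; the helper names here are mine):

```python
def rot90(image):
    """Rotate a square 'image' (list of rows) 90 degrees counterclockwise."""
    return [list(row) for row in zip(*image)][::-1]

def augment_with_rotations(image):
    """Return the image in all four 90-degree orientations."""
    views = [image]
    for _ in range(3):
        views.append(rot90(views[-1]))
    return views

# A tiny 2x2 'tile' standing in for a microscopy image patch:
tile = [[1, 0],
        [0, 2]]
for view in augment_with_rotations(tile):
    print(view)
```

This only makes the data consistent across orientations – it does not, of course, address the separate question of whether the histopathology labels themselves translate to man.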

So what does this mean for predicting toxicity in silico, based on empirically measured toxicity data? We measure in systems which may at least partially be more representative of organisms, which is a plus. On the other hand, translating in vitro results (or those from animals) to the in vivo human situation is difficult, due to often little understanding of the PK of a new compound; and conceptually it requires an enumeration of explicitly chosen toxic endpoints – we only see what we explicitly look for.

4. Whole systems modelling – As opposed to the above three approaches, modelling a whole biological system, either in a simplified experimental or in a computational format, aims to combine both practical relevance of the readout or model, as well as being fast and cheap enough to be used practically to make decisions. This can mean experimentally eg the ‘organ on a chip models‘, or computationally cellular models or virtual organs such as intestines or the heart. Some of those models, such as the organ on a chip model linked above, have achieved remarkable things, such as ‘to establish a system for in vitro microfluidic ADME profiling and repeated dose systemic toxicity testing of drug candidates over 28 days’.

Simple models (such as activity against protein targets) often have little predictivity for the in vivo situation, and we are unable to exhaustively test all (or sufficient) chemical space in all (relevant) empirical tox models. Therefore, staying – either experimentally or computationally – at a level that is biologically sufficiently complex to be predictive and relevant, but at the same time practically applicable (with respect to time and cost), seems an appropriate thing to do. The question for this class of models is to what extent biology can be reduced while staying representative – be it by identifying which cells or cell systems, in which arrangement, need to be present and interact to be predictive; or, computationally, which genes and interactions (eg on the cellular level) need to be present to achieve this goal.

What hence remains to be done is to establish which parts are needed to model the system, and how to parameterize the experiment, be it truly experimental (cellular setup, etc.), or in silico (eg on the cellular level parameterizing interactions between genes etc.) which is no trivial exercise. Subsequently, such systems of course need to be validated properly – which has been done in some cases, eg in case of the heart, with remarkable success, where “Human In Silico Drug Trials Demonstrate Higher Accuracy than Animal Models in Predicting Clinical Pro-Arrhythmic Cardiotoxicity“.

Hence, while whole systems models are – generally speaking – probably still in their infancy, both from the experimental and the computational point of view, it appears to me that they may well be a suitable way to obtain both practically relevant and applicable models for in silico toxicity prediction in the more distant future. This is because such systems try to resemble the in vivo situation at a suitable level of biological complexity. The problem is that it will still take a while to validate such models, and consortia will likely be needed to generate data for in silico models on a sufficient scale, so this will not happen tomorrow in all areas.

So where do things stand now for ‘AI’/In Silico drug toxicity prediction?

Currently the situation can be summarized as follows: we have not yet found a suitable, biologically relevant abstraction of the system to generate sufficient amounts of data for many relevant toxic endpoints, and we are not yet properly able to extrapolate from model-based readouts to the in vivo situation (due to PK/exposure), though some steps, such as the HTTK package, point in that direction.

The following gives a table of which types of data can be used for in silico toxicology, and what their advantage and disadvantages are:

| Toxicity prediction based on… | Advantages | Disadvantages |
| --- | --- | --- |
| Protein-based bioactivity data | Easy to generate; lots of data | Does not consider exposure/PK (in vivo relevance unclear); often either low sensitivity or low positive predictive value for clinical adverse reactions |
| Biological readouts (gene expression, images, …) | Considers complexity of biology; broad readout | Setup for in vivo relevance not trivial (dose, time point, cell line, …); defining ‘toxic’ endpoints not trivial |
| Empirical toxicity modelling | Potentially in vivo relevant endpoint can be replicated in assay | Translation (PK/exposure) often unclear; needs explicit enumeration of toxic events |
| Systems approaches | Consider the whole system; combine a suitable level of complexity/predictivity with manageable cost | Currently in their infancy; extensive parametrization necessary |

Hence, we either have large amounts of data that is less relevant for the in vivo situation (eg target-based profiling data); medium amounts of likely more relevant but also hugely complex data (biological readouts), where both generating the data and finding a signal in it are not trivial; or very specific data for individual types of toxicities. This, naturally, makes the construction of reliable in silico toxicity models difficult at the current stage. (Though this should not be overgeneralized – there may very well be suitable computational models available in areas where it was possible to identify a suitable model system to generate sufficient amounts of relevant data, but I would doubt this is currently generally the case in the area of toxicity prediction.) This is also reflected in public databases – in databases such as ChEMBL we have significantly more information on efficacy-related endpoints than on those related to safety.

One topic we have left out of the discussion at this stage is metabolism – what actually happens to a compound in the body, and how can we anticipate the effects not only of the original drug (the ‘parent compound’), but also of its metabolites? Target-based activity profiling of course will not tell us anything about that, but empirical toxicity assays (or eg animal models) might, provided that they are metabolically competent in a way that translates to human metabolism. Likewise, even if toxicity occurs in animal models, this may be dose-dependent and only occur above the therapeutic dose – and animal models may then provide biomarkers to follow in later clinical stages. So while there is still a potential problem, it becomes far more manageable – at least we know what to measure, and what to look for.

Some of the recent computational challenges seem to contribute somewhat to an overly simplistic view of compound toxicity, say the Tox21 and CAMDA challenges, where, if one is not careful, toxicity prediction could be interpreted as a computational ‘label prediction problem’, where simply those win who get the better numbers from a computer. While those challenges have an honourable aim, toxic endpoints should in my opinion not be used without any attention to the biological context and in vivo relevance, since this is the situation the models will need to be used in. If eg the link of target activity to toxicity, the predictivity for novel chemical space (due to a focus only on overall model performance statistics), or PK/exposure or metabolism etc. are neglected, then any computational model will simply have less practical relevance. This can then lead to claims that a certain level of ‘AI predictivity’ has been achieved for toxic endpoints, where applicability to a practical drug discovery situation is not always warranted to the same extent. It is of course easy to get carried away by technology, and I can understand this, but it would likely advance the field if such challenges could also involve practitioners from the drug discovery and medical domains. (Note: We are in touch with the CAMDA organizers about exactly this point, and also participated in the challenge precisely to bring practical relevance to the table, which I believe is absolutely crucial to advance the field.)

Of course there is not only the scientific question – sometimes it is about access to data, or legal barriers. When these are overcome, as recent work by Hartung et al. makes clear, computational models are very well able to make use of existing data – as shown in a study presenting Read-Across Structure–Activity Relationships (RASARs) for a series of nine different hazards caused by chemicals.

Interestingly, it can be seen that in none of the above situations is ‘deep learning’ or AI the solution (with the possible exception of digital pathology, at least from the technical angle) – and even chemical space coverage is secondary, compared to the way of generating data for modelling, which needs to be appropriate for the problem in the first place.

I usually see this as ‘data > representation > algorithm’ – so an algorithm will not work if the representation is insufficient, and the representation will not solve problems with quantitatively insufficient data (or data unsuitable for the problem at hand) either.

The above are probably big problems, but somehow we need to proceed – so what can we do realistically in in silico toxicology?

The above discussion also gives us the answer of what we need to address in the different types of approaches – we need to work on extrapolating from model systems to the human situation in all of the above areas, meaning to improve our understanding of PK and exposure. In protein-based models an aim is to identify better where the predictive signal is; in biological readout (-omics) space the task is to identify relevant assay conditions for readouts and also the predictive signal for a given toxic endpoint. Empirical toxicity modeling requires a predictive assay endpoint in the first place, and then the model needs to be filled with sufficient data.

Some of the readout-to-toxicity links of course might come from current projects such as the Innovative Medicines Initiative (IMI) projects, eg eTox, eTRANSAFE and others. Let’s hope that the sharing of data, with as much annotation and on as large a scale as possible, and also where possible with academic/outside partners, will then help us find patterns in toxicity data better than in the past.

Currently much of the data we use in drug discovery is either sparse, noisy, or simply irrelevant to the problem at hand, and this article has aimed to describe the situation in the area of in silico toxicology, to provide a more realistic view of what is currently possible using ‘AI’ in this area. We are probably not limited by algorithms at this stage; we are limited by the data we have. This is not only due to the cost of generating data – it is partly due to more fundamental questions, such as translating exposure, and the question of how to generate data in a biological assay so that it can be used in different ways, which is far from trivial (see the section above on biological readouts). Addressing this will require working together, from the biological/toxicology/pharmacology domain to computer science and everything in between, and between industry, academia and government agencies – the better we understand what each other is doing, the better the outcome will be for the field of predicting compound toxicity using computers.

I would like to conclude by saying that developing drugs is, of course, about balancing benefit and risk, so safety cannot be looked at in isolation. To use the example provided by Chris Swain (thanks Chris!): "Is a drug that relieves a patient from a lifetime of chronic pain, but increases the lifetime risk of heart attack two-fold, of value?"


I would like to thank Graham Smith (AstraZeneca), Chris Swain and Anika Liu (both Cambridge) for input on this article, however all opinions expressed are solely my own.

P.S.: I have only realized after starting this article that the topic is very complex and very difficult to compress into a single blog post. I apologize for generalizations, and refer the reader to the reviews and articles cited above, and beyond, for more detailed information.

P.P.S.: It amazes me that, as a chemist, I am placing so much emphasis on biology these days. A drug has many angles, though – the more angles we see a compound from, the better we can understand it.

Why AI and Drug Discovery are no match made in heaven

Artificial Intelligence (AI) has been described as the 'fourth industrial revolution' that will lead us to self-driving cars, computers understanding human language, and automated drug discovery.

I believe the part about self-driving cars – legal and societal barriers will likely be more important than scientific and technical ones in due course. Also, while there is debate about the quality of its translations, Google Translate is already good enough to be useful for me, so I'd entirely buy into that as well.

The problem is: drug discovery is a very different beast.

So what’s the problem with drug discovery, then? (Note that the following is an extension of the article and subsequent discussion by Al Dossetter on LinkedIn recently.)

Let’s briefly outline the ‘drug discovery process’ first (which is only a crude generalization anyway, but it may be useful here as an overview):

Desired outcome at each stage:

  • Compound active on target/in cellular assay
  • Suitable in vitro properties (selectivity, solubility, …)
  • Efficacy in animal model, tolerable toxicity
  • Efficacy in man, tolerable toxicity, better than standard of care
  • Commercially viable (market size, market need, pricing)

So what we really care about when doing drug discovery are the in vivo results – we don't want to treat a protein with a drug, or a cell line, or a rat; we want to achieve efficacy, with tolerable toxicity, in humans.

So in which way can AI now support the different phases of drug discovery?

Conceptually, it can be useful in any of the above steps – improving our ability to discover hits, optimize in vitro properties, and so on – and thereby increasing the likelihood of achieving in vivo efficacy at tolerable toxicity.

BUT – AI needs data, and this is the weak point when trying to apply ‘AI’ to the drug discovery field. This is not object or speech recognition, where we have a huge amount of both labelled and unlabeled data.

Let's now examine the three criteria for data in the drug discovery process that we discussed in a previous post:

  • Amount of data available,
  • Reliable labeling of data, and
  • Problem relevance (here for the in vivo situation).
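These three criteria can be sketched as a toy scoring scheme. Note that the phase names and all scores below are illustrative assumptions on my part (loosely mirroring the table that follows), not measured quantities:

```python
# Toy sketch: rate each drug discovery phase on the three data criteria above.
# All scores are illustrative assumptions (0 = poor, 2 = good), not real data.
from dataclasses import dataclass


@dataclass
class PhaseData:
    phase: str
    amount: int      # how much data is available
    labelling: int   # how unambiguous the labels are
    relevance: int   # how relevant the data is to the in vivo situation


phases = [
    PhaseData("Hit discovery", 2, 1, 0),
    PhaseData("Lead optimization", 1, 1, 0),
    PhaseData("Animal studies", 0, 0, 1),
    PhaseData("Clinical studies", 0, 1, 1),
]

# The central tension: no phase scores highly on all three criteria at once.
best = max(phases, key=lambda p: p.amount + p.labelling + p.relevance)
print(best.phase)
```

Even the "best" phase under this toy scoring wins on data volume and labelling, not on in vivo relevance – which is exactly the trade-off discussed below.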

Let’s see which data we have available related to the different phases above – leaving out market considerations here, and sticking to the scientific goal of finding a bioactive entity that is able to cure disease:

| Phase | Key data available | Amount of data | Consistent data labelling | Problem (in vivo) relevance |
|---|---|---|---|---|
| Hit discovery | Bioactivity, solubility, … | ++ | + | o |
| Lead Optimization | Solubility, permeability, off-target activities, simple DMPK | + | + | o |
| Animal Studies | Efficacy and toxicity data in animals, animal PK | o | | + |
| Clinical Studies | Human endpoint efficacy data, human safety, human PK | | + | + |

So what we see is: in early phases we have more data, and it is more clearly labelled – but it is less relevant to in vivo outcomes such as efficacy. In late phases we have data that is more relevant to in vivo outcomes, but very little of it.

To support the above statements with some facts – in databases such as ChEMBL, ExCAPE or PubChem we have millions of bioactivity datapoints, linking compound structures to protein targets. But activity against a target does not make a drug, far from it. So we have lots of data that is insufficient to understand and anticipate the in vivo situation.
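As an aside on what such bioactivity datapoints typically look like: databases of this kind commonly store potency values such as IC50, and a standard normalization before modelling is conversion to a negative log scale (pIC50). A minimal sketch of that conversion:

```python
# Convert an IC50 given in nanomolar to pIC50 (-log10 of the molar value),
# a common normalization for bioactivity data before modelling.
import math


def pic50(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nmol/L)."""
    return 9.0 - math.log10(ic50_nm)


print(pic50(100.0))  # a 100 nM compound → pIC50 of 7.0
```

This puts compounds of very different potencies onto one additive scale, which is why most bioactivity modelling works in pIC50 (or pKi) space rather than raw concentrations.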

On the other hand, in databases such as ToxRefDB, DrugMatrix or Open TG-GATEs we have in the order of (a low number of) thousands of compounds covered with animal toxicity data – in a chemical space that comprises on the order of 10³³ compounds in total. So we likely have more relevant data at hand – but for very few compounds, since generating such data is costly (eg the DrugMatrix data generation cost in the order of $100m).
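To put that sparsity into numbers, here is a back-of-the-envelope sketch, generously assuming ~10,000 toxicity-annotated compounds against the commonly cited ~10³³-compound estimate of drug-like chemical space:

```python
# Back-of-the-envelope: what fraction of chemical space has animal toxicity data?
covered = 1e4          # generous assumption: ~10,000 annotated compounds
chemical_space = 1e33  # commonly cited estimate of drug-like chemical space

fraction = covered / chemical_space
print(f"{fraction:.0e}")  # → 1e-29
```

In other words, the annotated fraction is around one part in 10²⁹ – any model built on it is extrapolating almost everywhere.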

What is now meant by ‘Consistent data labeling’?

Imagine a consumer clicks on an Internet link and buys a product – here you have clear data points, unambiguously connecting the dots between clicking on a link and buying a product. However, whether a drug shows efficacy in a disease (or toxic side effects) depends, at the very least, on dose, route of delivery, and the individual genetic setup of the organism and the disease (i.e., the endotype), among many other variables. So there is no single clear label one can assign, such as 'drug X treats disease Y' – sometimes it does, and sometimes it does not, depending on how and in which context the drug is applied to a particular organism. Hence labels in the biological domain are generally much more ambiguous, and context-dependent, than in other domains.
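The contrast can be made concrete with two hypothetical record types (all field names and values below are invented for illustration):

```python
# Hypothetical sketch: an e-commerce label stands on its own, while a drug
# discovery outcome label is only meaningful together with its context.
from typing import NamedTuple


class ClickRecord(NamedTuple):     # e-commerce: the label is unambiguous
    user: str
    clicked_link: bool
    bought_product: bool


class EfficacyRecord(NamedTuple):  # drug discovery: the label is context-bound
    drug: str
    disease: str
    dose_mg: float
    route: str        # e.g. "oral", "i.v."
    endotype: str     # genetic/disease subtype of the patient
    efficacious: bool  # only meaningful given ALL of the fields above


# The same drug/disease pair can carry opposite labels in different contexts:
a = EfficacyRecord("drug X", "disease Y", 10.0, "oral", "endotype A", True)
b = EfficacyRecord("drug X", "disease Y", 10.0, "oral", "endotype B", False)
print(a.efficacious != b.efficacious)  # → True
```

A statement like 'drug X treats disease Y' corresponds to collapsing away the dose, route and endotype fields – which is exactly where the ambiguity comes from.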

(In many cases we simply don't know which early-stage data are predictive of in vivo effects – an article published just last month concluded, for example, that “Chemical in vitro bioactivity profiles are not informative about the long-term in vivo endocrine mediated toxicity”.)

Of course there have been notable successes around 'AI in drug discovery', for example in the areas of synthesis prediction, automated chemistry, bioactivity modelling, or using image recognition to analyze phenotypic screening data. These are all important areas to work on; however, they are also a good number of steps away from the more difficult biological and in vivo stages, where efficacy and toxicity in living organisms decide the fate of drugs waiting to be discovered. Hence there is still a gap to be bridged in the area that needs progress most: in vivo efficacy and toxicity.

So AI and ‘drug discovery’ may not be a match made in heaven – but that’s not necessarily a problem, since we live on earth anyway. There is obviously ample data around in the drug discovery process, the amounts available will increase, and we need to analyze them, so much is clear. Quite possibly, from what I can see, AI will be used more for deselection (rather than positive selection), to increase the odds of success. But we certainly need to learn which models matter for the in vivo situation, instead of just ‘plugging data into the machine’, no matter their relevance for the human setting, and hoping to get the right answer out.
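The deselection idea can be sketched in a few lines. The compound names, risk scores and cutoff below are all invented for illustration – the point is only the direction of use: removing likely failures rather than picking winners.

```python
# Illustrative sketch of 'deselection': a model's toxicity risk score is used
# to remove likely failures, enriching (not guaranteeing) the remainder.
# All compound names and scores are invented for illustration.
compounds = {
    "cpd-001": 0.92,  # predicted toxicity risk (0 = likely safe, 1 = likely toxic)
    "cpd-002": 0.15,
    "cpd-003": 0.78,
    "cpd-004": 0.05,
}

RISK_CUTOFF = 0.5  # deselect anything the model flags as likely toxic
surviving = sorted(c for c, risk in compounds.items() if risk < RISK_CUTOFF)
print(surviving)  # → ['cpd-002', 'cpd-004']
```

Note the asymmetry: the filter makes no claim that the survivors will succeed, only that the discarded ones were the poorer bets – which is a much weaker, and therefore more defensible, use of a model of limited relevance.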

The question is hence where we have, at the same time, sufficient and sufficiently relevant data to predict those properties of potential therapies that matter for the in vivo situation – that is, efficacy- and toxicity-relevant endpoints. We will explore concrete examples in future posts.


DrugDiscovery.NET – AI and Machine Learning in Drug Discovery, in Practice

Machine Learning and Artificial Intelligence are currently growing in importance – due to significantly increased data availability, the development of new methods, and our improved understanding of how to apply those methods best.

However, using 'data' in the drug discovery field, be it early-stage data (eg for discovering a compound active against a target) or later-stage data (from preclinical and clinical phases), differs significantly from other domains, which typically

  • Are more information-rich with respect to the number of data points – think of video or text data, which is available at scale, compared to data in the drug discovery context, which needs to be generated experimentally and is particularly costly at the clinical end of the scale;
  • Have more clearly labelled data – think of a customer who clicks on a link and then buys or does not buy a product, vs a drug which causes a particular effect in a particular human, but only in the context of a particular dose, interactions with other medications, a particular genetic setup, etc.; and/or
  • Have data that represents what we are actually interested in – if a customer buys a product, then he or she buys the product, whereas in drug discovery we often use proxy variables (say, PAMPA or Caco-2 assays for permeability, in vitro toxicity assays, or animal studies to predict human response) where the value of the data for the property of actual interest is often unclear or disputed.

Hence, while there will clearly be value in analyzing data using AI/ML in the wider field of drug discovery, in my experience some very relevant questions – often relating to the points above – do not currently seem to be asked. This website will aim to discuss developments in the field and to provide critical context for them, since only if we question what we do will we end up with methods that work in practice.