
The ‘Google Co-Scientist’ Hasn’t (Yet) Led to ‘Breakthroughs’ – A Closer Look at Its Scientific Validation

The ‘Google Co-Scientist’ has attracted a lot of attention since its publication only a short time ago. Here I would like to add a few comments to put this work into context.

Firstly: I love tech, I love science, I love progress!

Secondly: This is precisely why I am writing this post (to make sure that the combination of tech and science really does contribute to drug discovery as much as possible in the future).

Brief summary of methods

  • This is an agentic system; fine, the proof of the pudding is in the eating (… and in the ingredients, and in whether you can actually make a pudding out of them in the first place; we will come back to that topic of life science data later)
  • Accuracy was assessed as follows: ‘Seven domain experts curated 15 open research goals and best guess solutions in their field of expertise’, and the AI Co-Scientist was compared to other methods. The authors found that accuracy was overall related to Elo rating, which forms part of a plausible technical validation of the system. (One caveat, though: the solutions against which accuracy was assessed were apparently already available at model training time; see ‘Pretraining on the Test Set Is All You Need‘ for the possible problems with that.)
  • Elo rating increases with ‘time in computation’, which one would probably expect (see the brief sketch of the Elo update rule after this list)
  • In various ranking/rating schemes the ‘AI co-scientist’ comes out best (as is usual in publications, given selection bias). Technical validation was hence presented as successful.
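
For readers less familiar with the Elo scheme referenced above, here is a minimal sketch of the standard (chess-style) Elo expected-score and update rule; this is the generic textbook formula, shown purely for orientation, and not necessarily the exact implementation used in the paper.

```python
# Minimal sketch of a standard Elo rating update (generic formula, shown for
# orientation only; not necessarily the paper's exact auto-evaluation setup).

def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the Elo model (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if A 'wins' the comparison, 0.5 for a tie, 0.0 if A loses.
    """
    new_a = rating_a + k * (score_a - elo_expected(rating_a, rating_b))
    new_b = rating_b + k * ((1.0 - score_a) - elo_expected(rating_b, rating_a))
    return new_a, new_b

# Example: a system rated 1400 beats one rated 1200; the gap widens slightly.
print(elo_update(1400, 1200, score_a=1.0))  # approx. (1407.7, 1192.3)
```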

Comments on scientific validation

Three validations in the life science domain (‘scientific validations’) were conducted, termed by the authors ‘drug repurposing’, ‘novel treatment target discovery’ and ‘explain mechanism of gene transfer evolution’.

Unfortunately, these validations contain mismatches between the wording chosen for the claims and the experiments and data presented in support of those claims. Let me explain:

In the section ‘Drug repurposing for acute myeloid leukaemia’ the validation is summarized as ‘Notably, the AI co-scientist proposed novel repurposing candidates for acute myeloid leukemia (AML). Subsequent experiments validated these proposals, confirming that the suggested drugs inhibit tumor viability at clinically relevant concentrations in multiple AML cell lines.’ However, looking at the experiment that was actually conducted, the validation was performed in AML cell lines, so there is no validation of tumor viability whatsoever, contrary to what the summary suggests. What about the ability to permeate the tumor, considering tumor heterogeneity, resistance, tumor-microenvironment (TME) interactions, and so on? One would need to conduct at least in vivo (or, as a somewhat plausible proxy, e.g. organoid) experiments to arrive at the conclusion given by the authors. Furthermore, when it comes to compound selection, in the full paper the authors state that ‘Notably, Binimetinib, which is already approved for the treatment of metastatic melanoma, exhibited an IC50 as low as 7 nM in AML cell lines’. Given that the compound has previously been reported (see its PubChem entry) to be active at the single-digit nanomolar level in very closely related biological systems, it is somewhat unclear what is ‘notable’ about this fact.

Verdict: The claimed validation, of ‘inhibiting tumor viability’, has not been demonstrated in this work. What was performed is not ‘drug repurposing’ (which would imply in vivo efficacy and safety), but rather the testing of a known compound on cell lines, for which very similar activity had already been reported previously.

In the section ‘Advancing target discovery for liver fibrosis’ the experimental validation appears to be only incompletely described at this point in time: ‘These findings will be detailed in an upcoming report led by collaborators at Stanford University’. In addition, judging from the data shown, the validation that was performed also seems somewhat disconnected from the claim: ‘drug effects on fibroblast activity’ are plotted, so which targets have now been discovered, given the claim the authors make? How have those targets been validated, for example genetically? Are they actually novel targets? The reader is simply not told any of this. This section hence suffers from a lack of transparency and a lack of data availability to support the claims that have been made. There is a real mismatch between the ‘headline’ and the experimental validation.

Verdict: Claims not verifiable, based on data provided.

In the section ‘Explaining mechanisms of antimicrobial resistance’ the authors focus on ‘generating hypotheses to explain bacterial gene transfer evolution mechanisms related to antimicrobial resistance (AMR)’. A worthwhile goal, given the importance of antimicrobial resistance worldwide. Here, ‘expert researchers instructed the AI co-scientist to explore a topic that had already been subject to novel discovery in their group, but had not yet been revealed in the public domain, namely, to explain how capsid-forming phage-inducible chromosomal islands (cf-PICIs) exist across multiple bacterial species.’ The simultaneously published papers on the experimental discovery of these findings and on their ‘re-discovery’ by AI are, in my opinion, correctly titled, with the latter called ‘AI mirrors experimental science…’. At the same time, however, this very wording makes the lack of novelty of this ‘discovery’ clear. Technically, I am on the one hand glad to see that an external test set was apparently used; on the other hand, it is unclear (see the point above) whether this test set was already available at training time.

Verdict: Here, the ‘AI system’ re-discovers what is already known, which is not novel discovery. A truly novel (out-of-domain) prediction that is subsequently validated would be beneficial. In addition, there is a general trend in science to ‘select what works’: we do not know how many wrong hypotheses (‘false positives’) were generated; only this single ‘true positive’ prediction has been reported. Hence, we are unable to assess the ‘precision’ of the predictions that were made. This selective reporting makes evaluating AI systems from the information provided in publications challenging.
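
To make the precision point concrete: assessing precision would require knowing how many generated hypotheses failed validation, not only the one that succeeded. A minimal sketch follows, with purely illustrative counts that are not taken from the paper.

```python
# Minimal sketch: precision = true positives / (true positives + false positives).
# The counts below are purely illustrative; the publication reports only one
# validated hypothesis and not how many generated hypotheses were wrong.

def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of generated hypotheses that turned out to be correct."""
    return true_positives / (true_positives + false_positives)

print(precision(1, 0))    # 1.00 -- what selective reporting implicitly suggests
print(precision(1, 19))   # 0.05 -- equally consistent with the single reported success
```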

Conclusion

Hence, overall, this is technically fascinating work. However, validation in the drug discovery domain, and validation of what really matters in drug discovery, falls short, as it so often does.

To come back to the above point about ‘data’ in the life sciences domain (and whether we actually have the right ‘ingredients’ for our pudding in the first place): in this domain we deal with data that

  • (a) is generated in very different biological systems, with often conflicting readouts and hence often (b) insufficient predictive validity;
  • (c) often lacks the metadata needed for fit-for-purpose model training;
  • (d) offers insufficient coverage (e.g. of chemical space, mode-of-action space, etc.);
  • (e) is biased, as well as (f) conditional (depending on dose, genotype, etc.);
  • (g) is annotated with labels that are often not meaningful (especially for more complex endpoints, say in vivo endpoints);
  • and describes biology that is heterogeneous, both (h) inter- and (i) intra-individually.

Using any form of ‘prediction’ on such data is hence just not trivial. For further details on those points please see our previous publications on using life science data in the drug discovery domain, ‘Artificial Intelligence in Drug Discovery – What is Realistic, What are Illusions?’, Part 1 and Part 2.

Very often, in the experience of the author, ‘everything can be linked to everything’ in life science data space via some data source or other, which in particular tends to decrease the precision of the hypotheses to be tested, that is, the fraction of predictions which turn out to be true. Given the billions (or more) of links that can be established from data, paying attention to the precision of a predictive system is a formidable challenge, but one we need to address, along with meaningful, domain-specific validation.
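
To illustrate why this matters at scale, here is a back-of-the-envelope sketch; every number in it is an assumption chosen for illustration only, not an estimate from the paper or from any real dataset.

```python
# Back-of-the-envelope sketch of precision at scale. All numbers below are
# illustrative assumptions, not measurements.

candidate_links = 1e9        # candidate 'A is linked to B' hypotheses derivable from data
true_link_rate = 1e-5        # assumed fraction of candidate links that are actually meaningful
sensitivity = 0.9            # assumed chance the system flags a true link
false_positive_rate = 0.01   # assumed chance the system flags a spurious link

true_links = candidate_links * true_link_rate
flagged_true = true_links * sensitivity
flagged_false = (candidate_links - true_links) * false_positive_rate

precision = flagged_true / (flagged_true + flagged_false)
print(f"Hypotheses flagged: {flagged_true + flagged_false:,.0f}")  # ~10 million
print(f"Precision: {precision:.4f}")  # ~0.0009: the vast majority are false leads
```

Even with a seemingly good hypothesis generator, the tiny base rate of true links among billions of candidates means that most flagged hypotheses would be false leads, which is exactly why precision, and not only a handful of reported successes, needs to be part of the validation.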

/Andreas
