Emilie Henault, together with Maria Rasmussen and Jan Jensen at the University of Copenhagen, have just attempted to answer the question how genetic algorithms are able to explore chemical space in a practically useful way, given the immense size of the latter (though to comprise on the order of 1060 possible small molecules). Their recent publication Chemical Space Exploration: How Genetic Algorithms Find the Needle in the Haystack has just appeared on ChemRxiv, and we have visited one of the authors, Jan Jensen, to learn more about the background of their work.
‘Chemical space’ has been the subject of numerous studies; but few of them have yet given reasons why chemical space can actually be traversed in any efficient way, given its size. So how did this study came about? What made you think about navigating chemical space in novel ways?
Jan Jensen (JJ):This study started out as Emilie’s bachelor project. I had previously written a graph-based genetic algorithm (GA) code that could be used to search chemical space (the roughly 1060 possible small molecules) for those molecules with particular properties. In this context, ‘graph-based’ means that the molecular changes are done by making and breaking chemical bonds (i.e. the vertices of a graph, where the atoms are the nodes). I was curious how this approach compared to simpler string-based approaches, where a molecule is represented by a string of characters such as SMILES or new alternatives such as DeepSMILES and SELFIES. So I suggested this project to Emilie when she was looking for bachelor projects.
By the time I started working on turning Emilie’s project report into a manuscript, I had had a major grant proposal rejected and one of the comments from the panel was that they failed so see how methods like GAs could cover such a vast search space. I wasn’t able to find an answer in the literature, so I decided to look at a very simple string-based search problem I had used in teaching where analytical solutions should be possible. I was stuck on the math when my PhD student Maria came by the office with a question, and she figured it out in short order.
So the main point of the paper became how GAs are able to find particular molecules among 1060 possible. The main idea is formulated using the simple string-based search problem that is not related to chemistry, followed by similar string-based search problems related to chemistry with a comparison to the graph-based approach more familiar to chemists (i.e. Emilie’s project).
So what did you find, and in which way did distinguishing between string-based and graph-based representations help?
JJ: We show that, while chemical space is vast, the number of potential paths (i.e. sequences of molecular changes) that the search algorithms can follow to the target is equally vast. So the probability of randomly finding a molecule that is on one of these paths is quite high and from here a search algorithm can follow the path to the target molecule. For GAs it seems to be more efficient to follow these paths using a graph-based molecular representation than a string-based representation. One reason for that is that changes to the string-based representations often result in strings that cannot be translated to actual molecules because the syntax is messed up.
This certainly underlines the importance of considering the representation of structures when thinking about a problem, and which algorithm to apply!
In which way can the results of this work been used by others, how can it influence the wider field?
JJ: At least for me, the study has changed my thinking about methods that search chemical space from ‘it’s amazing that these methods actually work sometimes’ to ‘why aren’t these methods more efficient?’. If the ‘paths’ picture is correct, these methods should succeed every time and they should do so by testing 100s of molecules – not 10,000-100,000 as they currently do. Why is that?
Thank you for this conversation, and all the best for your future research!
Chemical Space Exploration: How Genetic Algorithms Find the Needle in the Haystack, ChemRxiv, https://doi.org/10.1101/2020.06.12.135939; published in journal format in PeerJ Physical Chemistry, https://doi.org/10.7717/peerj-pchem.11.