Nikolaus Fortelny recently completed his postdoctoral research at the CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, where he worked on the integration of multi-omics and single-cell data using deep learning, funded by EMBO. He has now started his own group at the University of Salzburg. His publication on “knowledge-primed neural networks” recently appeared in Genome Biology, and we interviewed him about the significance of this work.
Congratulations on your recent publication! Could you summarize what you have done in this work, and in what way it is relevant for drug discovery?
Nikolaus Fortelny (NF): Thank you! In my postdoc, I tested ways to model signaling processes within and between cells based on multi-omics and single-cell data. Our ability to profile biological systems has grown dramatically over the last few years, but it is difficult to use these data to understand how things function, to dissect molecular mechanisms – which is what we need to understand biology and treat disease. At the same time, machine learning with neural networks and deep learning has really taken off, enabling machines to learn highly complex relationships. The trained algorithms are basically networks of functions, and we can look at and interpret these functions individually. But once we combine them in the full network, which has hundreds or thousands of these functions, it is difficult to extract any human-readable information – the trained network is a black box.
In knowledge-primed neural networks, we wanted to see whether we could use knowledge of biological mechanisms to guide artificial neural networks. Neural networks are networks, and our knowledge of molecular mechanisms is also often expressed as networks. Elements (nodes) in biological networks have a real-world meaning, for example corresponding to genes or proteins. But these networks are still often hard to interpret, as they tend to be very large, abstract, and complex. Neural networks, on the other hand, lack real-world meaning but learn weights during training. These weights are the parameters used in the functions described above. We can therefore use the weights to identify the parts of the network that are more relevant than others.
When we now train neural networks that look like biological networks, we combine the most useful parts of both: neural network training prioritizes relevant parts of the biological network, revealing interesting biology, and the biological network adds an interpretable label to each function. In the end we have a network that is both prioritized and interpretable at the biological level.
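The core idea of shaping a neural network like a biological network can be illustrated with a small sketch (an editorial illustration, not code from the paper; all gene and protein names, and the mask-based construction, are hypothetical simplifications): weights are only allowed where the biological network has an edge, so every hidden node keeps a biological identity.

```python
import numpy as np

# Hypothetical toy "biological network": edges from input genes to named
# intermediate proteins, and from proteins to a cell-state output.
gene_to_protein = [("geneA", "kinase1"), ("geneB", "kinase1"), ("geneC", "kinase2")]
protein_to_output = [("kinase1", "state"), ("kinase2", "state")]

genes = ["geneA", "geneB", "geneC"]
proteins = ["kinase1", "kinase2"]

# Binary masks: a weight may only be nonzero where the biological network
# has an edge; all other connections are fixed at zero.
mask1 = np.zeros((len(genes), len(proteins)))
for g, p in gene_to_protein:
    mask1[genes.index(g), proteins.index(p)] = 1.0
mask2 = np.zeros((len(proteins), 1))
for p, _ in protein_to_output:
    mask2[proteins.index(p), 0] = 1.0

rng = np.random.default_rng(0)
W1 = rng.normal(size=mask1.shape) * mask1   # masked weights, genes -> proteins
W2 = rng.normal(size=mask2.shape) * mask2   # masked weights, proteins -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    # Each hidden activation corresponds to a named protein, so it can be
    # read out and interpreted after training.
    hidden = sigmoid(x @ (W1 * mask1))
    return sigmoid(hidden @ (W2 * mask2))

x = np.array([[1.0, 0.5, 0.2]])  # expression of geneA, geneB, geneC in one cell
y = forward(x)
print(y.shape)  # (1, 1): one prediction for one cell
```

During training the masks would be reapplied after every update, so the learned weights stay confined to biologically meaningful edges; the magnitude of those weights can then point to the relevant parts of the biological network.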
So how does this go beyond what has been known before, how does it advance the field?
NF: Most approaches for interpretable machine learning focus on the input level, for example showing us what a machine sees when it sees a cat (methods that have also been discussed controversially – ed.). Our approach enables interpretation of the network itself, which is closer to “understanding” the algorithm on a deeper level. Others have built neural networks based on prior knowledge, for example using the Gene Ontology tree. In contrast, we used biological networks that specifically mirror biological signaling, connecting receptors to signaling proteins (for example kinases), which are further linked to transcription factors and to changes in gene expression. To train these networks, we start from gene expression data (as the input) to predict cell states (as the output), for example predicting receptor-stimulated cells or cell types. As a result of the built-in biological knowledge, model interpretation reveals the intermediate regulatory proteins that are relevant for signal transduction, as part of the underlying full biological network. In other words, this unique type of interpretability suggests specific molecular mechanisms for a particular observed effect or cellular state. This is very useful for follow-up experiments, for example when identifying proteins that are associated with a particular disease state and could hence represent new drug targets.
In addition to the specific structure of knowledge-primed neural networks, we also identified several modifications of the learning method that are required to achieve high interpretability. One problem is that neural networks start from a random initialization of all weights and can thus yield very different results every time the network is trained. Also, biological networks have very uneven connectivity, with some very highly connected nodes (or proteins) that bias the results: they come up in the study of any biological system, lacking specificity. We developed a training methodology that addresses both problems effectively.
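Two of the stabilizing ideas mentioned here can be sketched in a simplified form (an editorial illustration, not the exact KPNN procedure; the weight matrices below are random stand-ins for the results of repeated trainings): averaging node importance over many networks trained from different random initializations, and normalizing by a node's connectivity so that hub nodes are not ranked highly merely because they have many edges.

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs, n_inputs, n_nodes = 20, 5, 3

# Edge mask: node 0 is a "hub" with 5 incoming edges; nodes 1 and 2 have 1 each.
mask = np.zeros((n_inputs, n_nodes))
mask[:, 0] = 1.0
mask[0, 1] = 1.0
mask[1, 2] = 1.0

# Stand-in for trained weights from n_runs independent random initializations.
weights = [rng.normal(size=mask.shape) * mask for _ in range(n_runs)]

# Raw importance: summed absolute incoming weight, averaged over runs to
# smooth out run-to-run variability from random initialization.
raw = np.mean([np.abs(W).sum(axis=0) for W in weights], axis=0)

# Connectivity control: divide by in-degree, so a node's score reflects
# the strength of its connections rather than their number.
degree = mask.sum(axis=0)
normalized = raw / degree

print(raw)         # the hub dominates purely because of its degree
print(normalized)  # after normalization the scores are on a comparable scale
```

The actual KPNN method uses more elaborate controls (for example, dropout-based techniques during training), but the principle is the same: robustness comes from aggregating over repeated trainings, and specificity from correcting for connectivity.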
What was the most difficult part of this work, and what did you do to address the problem?
NF: While the accuracy of predictions is well defined and easily testable in machine learning, it is less clear what a good interpretation should look like – one cannot assign a single number to it for comparison, and there is no real ‘gold standard’ annotation data available for this in the first place. The goal of our interpretability was the identification of relevant regulatory proteins, which are rarely defined or known for most biological systems. This means that our interpretations are highly relevant and interesting, but not straightforward to validate. We therefore developed an entire series of validation experiments. In simulated networks with a ground truth, we were able to develop and validate the interpretation methodology. We then increased biological relevance by testing our approach in biological systems with partial knowledge of known regulators, where we also showed that network shuffling results in a loss of interpretability. These results clearly demonstrated the impact of our design choices and the biological relevance of our approach. Finally, we employed our approach in less studied biological systems, where we used literature searches to test the relevance of our interpretations. Taken together, our validation experiments span very artificial and very real-world problems, enabling us to demonstrate both the technical and the biological relevance of our interpretations.
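The network-shuffling control mentioned above can be sketched as follows (an editorial illustration with hypothetical edge names, not the paper's exact procedure): edges are randomly rewired while each node's overall degree is preserved, so any loss of interpretability on the shuffled network can be attributed to the loss of biologically meaningful structure rather than to network size or connectivity.

```python
import random
from collections import Counter

def shuffle_edges(edges, seed=0):
    """Permute edge targets: every node keeps its total in- and out-degree,
    but the specific source-target pairings are randomized (duplicate edges
    may occur in this simple sketch)."""
    rng = random.Random(seed)
    sources = [s for s, _ in edges]
    targets = [t for _, t in edges]
    rng.shuffle(targets)
    return list(zip(sources, targets))

# Hypothetical gene-to-regulator edges.
edges = [("geneA", "tf1"), ("geneB", "tf1"), ("geneC", "tf2"), ("geneD", "tf3")]
shuffled = shuffle_edges(edges, seed=42)

# Degree distributions are unchanged...
assert Counter(t for _, t in edges) == Counter(t for _, t in shuffled)
# ...but the specific gene-to-regulator assignments are randomized.
print(shuffled)
```

Training the same model on the shuffled network and observing that the interpretations degrade is what ties the interpretability to the biological knowledge itself.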
A second difficulty was method optimization, in particular to ensure robustness and to control for the uneven connectivity I referred to above. Identifying these aspects required exploratory analyses, without really knowing what to look for at the outset. We could have trained a single network without controlling for uneven connectivity, and we would have gotten some results. But, as our validation experiments showed, the interpretations become much more reliable and relevant with our optimized methodology, so in the end normalizing for connectivity turned out to be of crucial importance for the study.
How can the results you have obtained be used by others?
NF: The publication and the code are open access and open source, and the method is readily available to anyone who would like to try it out (https://github.com/epigen/KPNN). Our approach is broadly applicable beyond single-cell sequencing data and biological networks. There are many fields where prior knowledge can be represented as networks amenable to neural network training, for example brain circuits, or diseased versus healthy (or drug-treated) cell states. We expect that this research will thus further the applicability of neural networks and deep learning in the life sciences, where the black-box character and lack of interpretability are often a limiting factor.
Thank you for this conversation, and all the best for your new research group in Salzburg!
Fortelny, N., Bock, C. Knowledge-primed neural networks enable biologically interpretable deep learning on single-cell sequencing data. Genome Biology 21, 190 (2020). https://doi.org/10.1186/s13059-020-02100-5