Zachary del Rosario has just finished his PhD at Stanford and has begun a visiting professorship at Olin College. His recent publication “Assessing the frontier: Active learning, model accuracy, and multi-objective candidate discovery and optimization” just appeared in The Journal of Chemical Physics, and we spoke with him about the significance of his work.
Congratulations on your recent publication! Could you please summarize what you have done in this work and in your PhD, and explain how this is relevant to the way we evaluate computational models used in the drug discovery process?
Zachary del Rosario (ZR): Thanks! The present work is a bit of a departure from my PhD research: I’m an aerospace engineer by training. My thesis highlighted reliability issues in current structural design practice and presented provably safe alternatives. The “Assessing the Frontier” paper is similar in spirit in that it questions some current practices in cheminformatics and materials informatics.
I work as a consultant with Citrine Informatics: Citrine provides an AI and data platform to help some of the biggest materials and chemicals companies (e.g. Lanxess, Panasonic) develop novel materials faster. The “Assessing the Frontier” paper came out of my work at Citrine: We have internal rules-of-thumb for assessing when a machine learning model is appropriate for ranking candidates for synthesis, but we know that a good R-squared or Mean Squared Error (MSE) is far from a guarantee of predicting the next top superconductor or therapeutic.
One of the really surprising findings from the “Assessing the Frontier” paper is that globally scoped error metrics, such as R-squared and MSE, can be actively misleading when it comes to choosing a model to identify top-performing candidates. The explanation is fairly intuitive: Globally scoped metrics tacitly assume that future observations will be drawn from the same distribution as the original data, but we use these models to extrapolate, looking for candidates that outperform anything already in the dataset. In a sense we’re turning the usual ML paradigm on its head; rather than filtering outliers before training, we’re explicitly trying to understand and find more top performers. These globally scoped error metrics don’t reflect that use case.
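To make the point concrete, here is a minimal sketch with a single objective. The ten property values and both sets of predictions are made up for illustration and do not come from the paper; `top_k_mse` is a hypothetical helper name. Model A is exact on the bulk of the data but underestimates the two best candidates, while Model B is noisier on the bulk but exact on the two best.

```python
def mse(y_true, y_pred):
    """Mean squared error over all supplied points."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def top_k_mse(y_true, y_pred, k):
    """MSE restricted to the k candidates with the highest true value."""
    top = sorted(range(len(y_true)), key=lambda i: y_true[i], reverse=True)[:k]
    return mse([y_true[i] for i in top], [y_pred[i] for i in top])


# Hypothetical property values for ten candidates (higher is better).
y_true = [1.0, 1.25, 1.0, 0.75, 1.5, 1.0, 1.25, 1.0, 3.0, 3.5]

# Model A (assumed): exact on the bulk, underestimates the two best by 1.0.
pred_a = [1.0, 1.25, 1.0, 0.75, 1.5, 1.0, 1.25, 1.0, 2.0, 2.5]

# Model B (assumed): off by 0.75 on every bulk point, exact on the two best.
pred_b = [1.75, 0.5, 1.75, 0.0, 2.25, 0.25, 2.0, 0.25, 3.0, 3.5]

if __name__ == "__main__":
    print("global MSE   A:", mse(y_true, pred_a), " B:", mse(y_true, pred_b))
    print("top-2 MSE    A:", top_k_mse(y_true, pred_a, 2),
          " B:", top_k_mse(y_true, pred_b, 2))
```

On this toy data the global MSE favors Model A, yet Model A is the one that misjudges the two candidates we would most want to synthesize. Restricting the error to the top performers reverses the ranking of the two models.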
How does this go beyond what was known before, and how does it advance the field?
ZR: We introduce the notion of Pareto shell error as a metric more closely associated with identifying top-performing candidates. This is a concept that builds on ideas from multi-objective optimization and the concept of strata from the database literature. In the abstract, the idea is simple: Instead of computing model error over all of the data, we compute error only on the top performers in our dataset. This helps us to select models that perform well in the use case we care about: Identifying top-performing candidates.
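The idea can be sketched in a few lines. The code below is an illustrative approximation, not the paper's exact definition: it assumes every objective is to be maximized, peels the data into successive non-dominated sets (shells), and then averages squared error over the first few shells only. The function names (`dominates`, `pareto_shells`, `shell_error`) and the sample data are hypothetical, and a real analysis would likely rescale objectives with different units before averaging.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly
    better in at least one (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_shells(points):
    """Peel off successive non-dominated sets; return lists of indices.
    Shell 0 is the Pareto frontier of the full dataset."""
    remaining = list(range(len(points)))
    shells = []
    while remaining:
        shell = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        shells.append(shell)
        remaining = [i for i in remaining if i not in shell]
    return shells


def shell_error(y_true, y_pred, n_shells=1):
    """Mean squared error over points in the first n_shells Pareto shells,
    where shells are computed from the true objective values."""
    idx = [i for shell in pareto_shells(y_true)[:n_shells] for i in shell]
    sq = [(t - p) ** 2
          for i in idx
          for t, p in zip(y_true[i], y_pred[i])]
    return sum(sq) / len(sq)


# Hypothetical two-objective data: six candidates, both objectives maximized.
y_true = [(1, 5), (5, 1), (3, 3), (2, 2), (1, 1), (2.5, 2.5)]
# Assumed predictions: only the first candidate, a frontier point, is wrong.
y_pred = [(1, 4), (5, 1), (3, 3), (2, 2), (1, 1), (2.5, 2.5)]

if __name__ == "__main__":
    print("shells:", pareto_shells(y_true))
    print("error on frontier shell:", shell_error(y_true, y_pred, n_shells=1))
    print("error on all shells:    ", shell_error(y_true, y_pred, n_shells=4))
```

In this toy case the single mistake sits on the frontier, so the error over the first shell is twice the error over the whole dataset: averaging over every point dilutes exactly the mistakes that matter for candidate selection.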
What was the most difficult part of this work, and what did you do to address the problem?
ZR: Honestly, a lot of the difficulty arose from the many small pieces of the problem. There’s no one ‘glorious proof’ in this work; rather, we’re dealing with myriad issues arising from the complexities of multi-objective optimization. For instance, we had to think carefully about how to measure the performance of simulated candidate discovery; typical approaches in the current literature focus on counting the number of Pareto points found. But what if there’s a fabulous candidate that’s just a tiny bit off the Pareto frontier? What if one approach quickly finds a few Pareto points, but then starts to pick worse candidates? We had to develop a new measure of performance to help account for these kinds of edge-cases.
How can the results you have obtained be used by others?
ZR: I’m a big fan of John Tukey, who once said ‘An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question’. I think the currently-used error metrics give a precise answer to the wrong question, at least when it comes to using machine learning to support cases like drug discovery and materials discovery. I hope that more practitioners in this space start regarding these off-the-shelf metrics with the skepticism they deserve, and start using model assessment techniques more closely tailored to the questions they actually care about. Our Pareto shell error is another metric analysts can use to assess model suitability for predicting frontier performance.
Thank you for this conversation, and all the best for the next steps after your PhD!
Further reading: “Assessing the frontier: Active learning, model accuracy, and multi-objective candidate discovery and optimization”, https://firstname.lastname@example.org.MACH2020.issue-1