Workshop recap from “How Machine Learning Can Be Used to Solve Plant Biology Problems” at the Plant Biology 2020 Worldwide Summit
We live in a time of substantial technological innovation. Developments such as next-generation sequencing and high-throughput phenotyping mean that data are now generated at an immense rate. In modern interdisciplinary biology, we frequently hear the term “machine learning”. Yet for many of us who are primarily wet-lab plant biologists, the field has long been obscure.
Midway through the Plant Biology 2020 virtual conference, Shin-Han Shiu and Serena Lotreck of Michigan State University hosted a workshop to show the wider plant science community how broadly machine learning applies to the plant sciences. We were first introduced to the basics of machine learning, and a short hands-on example then demonstrated its relevance to solving a plant biology problem. Several live Q&A sessions deepened the audience’s understanding of the potential uses of machine learning in plant science.
Machine learning is used in all areas of daily life, from Spotify to self-driving cars. While the number of machine learning papers is at an all-time high, plant science is very poorly represented: fewer than 3% of machine learning papers are on plants!
What is the big deal about machine learning?
As plant scientists we generate huge amounts of data, be it molecular, ecological, or morphological; the list goes on. With all this data at our fingertips, how can we integrate it all into a model? The key is a machine learning algorithm. The outputs of such models can be highly useful: they allow one to identify issues in the data, address previously unanswerable questions, and provide testable hypotheses. It is even possible to transfer a model to areas and species that lack sufficient data.
But wait, what exactly is machine learning?
Humans learn by experience. When faced with a problem or task, our “input” is our previous experience and our “output” is a mental model that makes predictions based on these experiences. If the answer is not good enough, we repeat the process by generating more experiences and thinking more. When the answers are eventually good enough, we possess new knowledge.
Machine learning replaces the human brain in this process with a machine-based algorithm. Shin-Han used the excellent example of Spotify to communicate the concept and demonstrated how biological problems, such as assigning genes to pathways, can be addressed in the same way. Machine learning can be used to answer various kinds of questions, such as finding patterns in data or predicting categorical or numeric values.
A “typical” machine learning workflow
Shin-Han then brought us through the typical workflow of a machine learning analysis, using the biological question “What kinds of paralogous gene pairs tend to have strong fitness effects when mutated?” He used a Jupyter notebook (available on GitHub) to take us through the tutorial step by step.
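For readers who would like a feel for what such a workflow looks like in code, here is a minimal sketch using scikit-learn. It is not the workshop notebook: the data are randomly generated, and the feature names (sequence similarity, expression correlation, duplication age), the label, and the function `run_workflow` are all hypothetical, chosen only to echo the gene-pair question. It walks through one common sequence of steps: build a feature table, hold out a test set, train a classifier, evaluate it, and inspect which features the model relied on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical feature names; the real workshop data may look quite different.
FEATURES = ["sequence_similarity", "expression_correlation", "duplication_age"]


def run_workflow(seed=0):
    rng = np.random.default_rng(seed)
    n_pairs = 400

    # Step 1: a feature table, one row per gene pair (synthetic values).
    X = rng.normal(size=(n_pairs, len(FEATURES)))

    # Synthetic label: 1 = "strong fitness effect", 0 = otherwise.
    # By construction it depends on the first two features plus noise;
    # the third feature carries no signal.
    signal = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n_pairs)
    y = (signal > 0).astype(int)

    # Step 2: hold out a test set that the model never sees during training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )

    # Step 3: train a model (a random forest is one of many possible choices).
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)

    # Step 4: evaluate on the held-out pairs.
    score = f1_score(y_test, model.predict(X_test))

    # Step 5: ask which features the model found informative.
    return score, model.feature_importances_


if __name__ == "__main__":
    score, importances = run_workflow()
    print(f"F1 on held-out gene pairs: {score:.2f}")
    for name, importance in zip(FEATURES, importances):
        print(f"{name}: {importance:.2f}")
```

Because the synthetic label is built from the first two features only, the third should come out as the least important, which is a handy sanity check that the pipeline behaves as expected before real data go in.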
Finally, the limitations of machine learning were discussed, such as low model reusability and unknown biases in the data, illustrated with an apt quote from Cathy O’Neil’s book Weapons of Math Destruction.
In response to a great question (“I am working on completing my PhD in wet lab research. What can I also do to study machine learning in an efficient but effective way?”), Shin-Han highlighted some excellent resources (including free ones!) for getting started in machine learning. His key advice to the audience was:
(a) find a problem that you are interested in
(b) take the data relevant to the problem of interest and apply the concepts that you have just learned from the course.
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Recommended if you want to learn how to use Python to do machine learning)
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (quoted on the limitations of machine learning slide)