References – Machine Learning for Biologists

Key Points

Introduction	Machine learning algorithms recognize patterns from example data. Supervised learning involves predicting labels from features.
Classifying T-cells	The ml4bio software supports interactively exploring different classifiers and hyperparameters on a dataset. The machine learning workflow is split into data preprocessing and selection, training and model selection, and evaluation stages. Splitting a dataset into training, validation, and testing sets is key to being able to properly evaluate a machine learning method.
Evaluating a Model	The choice of evaluation metric depends on the relative proportions of different classes in the data, and what we want the model to succeed at. Comparing performance on the validation set with the right metric is an effective way to select a classifier and hyperparameter settings.
Decision Trees, Random Forests, and Overfitting	Decision trees require less effort to visualize and interpret than other models. Decision trees are prone to overfitting. Random forests solve many of the problems of decision trees but are more difficult to interpret.
Logistic Regression, Artificial Neural Networks, and Linear Separability	Logistic regression is a linear classifier. The output of logistic regression is the probability of a certain class. Artificial neural networks can be viewed as an extension of logistic regression. Artificial neural networks can have nonlinear decision boundaries.
Conclusion and next steps	You are now prepared to consider how machine learning may benefit your research. There are many excellent introductory and intermediate resources to help you continue to learn about machine learning.
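
Several of the key points above (splitting data into training, validation, and testing sets, choosing a metric that suits the class proportions, and the tendency of decision trees to overfit) can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the workshop's own code: the synthetic dataset from `make_classification`, the 60/20/20 split, and the choice of F1 as a metric for imbalanced classes are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for a real dataset (an assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.8, 0.2], flip_y=0.05, random_state=0,
)

# Hold out 20% for testing, then split the remainder 75/25 into
# training and validation sets (60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare training and validation performance. The test set stays
# untouched until a final model has been selected.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_pred = model.predict(X_val)
    scores[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "val_accuracy": accuracy_score(y_val, val_pred),
        "val_f1": f1_score(y_val, val_pred),
    }
    print(name, scores[name])
```

A fully grown decision tree fits the training set perfectly here, so a gap between its training and validation scores is a sign of overfitting. With one class making up about 80% of the samples, accuracy alone can look good for a model that rarely predicts the minority class, which is why the sketch also reports F1.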

Glossary and other resources

The Google machine learning glossary and ML4Bio guides define common machine learning terms.

The scikit-learn tutorials provide a Python-based introduction to machine learning. There is also a third-party scikit-learn tutorial and a Carpentries lesson.

The book Python Machine Learning has machine learning example code.

The Elements of AI course presents general introductory material on machine learning and related topics.

Galaxy ML provides access to classification and regression workflows through the Galaxy interface.

The workshop organizers track additional resources for beginners and intermediate users.

Training classifiers for a research project typically requires training many models and tuning their hyperparameters on a validation dataset. Writing scripts helps automate this process, document the training and tuning decisions, and improve reproducibility. Software Carpentry introduces strategies for script-driven research. A computing cluster helps train and evaluate many machine learning models in parallel.
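
As a rough sketch of what such a script might look like, the example below trains logistic regression models over a small grid of regularization strengths, scores each on a validation set, and prints the results as CSV so the tuning decisions are recorded. The `tune` function, the grid of `C` values, the synthetic data, and the use of AUROC as the selection metric are all hypothetical choices for illustration.

```python
import csv
import sys

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def tune(c_values, seed=0):
    """Train one model per C value and return validation AUROC for each."""
    # Synthetic data stands in for a real dataset (an assumption).
    X, y = make_classification(n_samples=500, n_features=10, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)

    results = []
    for c in c_values:
        model = LogisticRegression(C=c, max_iter=1000)
        model.fit(X_train, y_train)
        # predict_proba returns one column per class; column 1 holds the
        # predicted probability of the positive class.
        val_prob = model.predict_proba(X_val)[:, 1]
        results.append({"C": c, "val_auroc": roc_auc_score(y_val, val_prob)})
    return results


results = tune([0.01, 0.1, 1.0, 10.0])
best = max(results, key=lambda row: row["val_auroc"])

# Write the full tuning record as CSV; redirect standard output to a file
# to keep it alongside the analysis.
writer = csv.DictWriter(sys.stdout, fieldnames=["C", "val_auroc"])
writer.writeheader()
writer.writerows(results)
print("selected C:", best["C"])
```

Fixing the random seed and recording every setting that was tried makes it possible to rerun the script later and see why a particular model was chosen. On a computing cluster, each hyperparameter setting could be submitted as a separate job instead of looping in one process.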

Jupyter notebook example

You can run an example Jupyter notebook in Binder to see how a machine learning workflow looks in Python code using scikit-learn. The notebook will load an executable Python environment in your web browser. After it loads, you can inspect the code and output or rerun it yourself.
