Machine learning is a hot topic in data science, but few people understand the concepts behind them. You may be fascinated by how people get high paying jobs because they know how to execute machine learning, and decide to delve deeper into the topic, only to be quickly intimidated by the sophisticated theorems and mathematics behind machine learning. While I am no machine learning expert, I hope to provide some basics about machine learning and how you can potentially use Python to perform machine learning in some of my future blog entries.
With all the available machine learning tools available at your fingertips, it is often tempting to jump straight into solving a data-related problem by running your favourite algorithm. However, this is usually a bad way to begin your analysis. In fact, executing machine learning algorithms plays only a small part of data analysis and the decision making process. To make proper use of machine learning, I would recommend you to take a step back and look at the research question from a helicopter view. What are you trying to answer? Are you trying to classify severe and mild dengue cases based on cytokine levels? Or are you predicting antibody responses based on gene expression levels? Or do you have an end-goal in mind?
Often, the research question will define the machine learning model to execute, and this is highly dependent on your explanatory variables and test variables. It is also dependent on whether you prefer a supervised or unsupervised learning of your dataset. Another point to consider is to define the impact of using machine learning. Does it help the company to save money? Would it have a positive impact on clinical practice? If you cannot envision an impact, then it may not be worth the trouble to use machine learning after all.
Once you have an end-goal in mind, you are now ready to use machine learning tools to analyse your data. However, before that, it is critical to ensure that your data is of good quality. As the saying goes, garbage in = garbage out. Your machine learning will not be able to resurrect and learn a poor quality dataset. Next, it is also important to have a reasonable sensing of your dataset. Specifically, are there any outliers in your dataset? If there are, is it worth removing or changing them before performing machine learning? Does your dataset have high variance and require a Z-score transformation for better prediction? Based on these questions, it is often a good idea to use data visualisation tools to understand your dataset before performing machine learning. A quick way to obtain a sensing of the data is to type in the following code in your dataframe (assigned as df in my example):
Another way to have a sensing of the data is to use the correlation matrix using the command (but do note that correlation does not capture non-linear relationships):
Histograms can be used for quick sensing of the distribution of data. To plot histograms for each variable, I recommend using the Lux package, which I have previously updated in my blog.
After preparing your dataset, then it is time to use the machine learning algorithms. At this point, you hope to find a model that can best represent your data. A reasonable solution can be to split your data into two sets: the training set and the test set (see figure below). Usually, approximately 70% – 80% of the data is used for training and the other 20% – 30% will be used as test to evaluate the accuracy of the model. If you have several models that perform equally well, you may even consider doing a validation set or cross-validation set to ascertain which model would best describe the data.
This article provides the fundamentals of machine learning. Subsequently, I will expand on the different machine learning models, how to execute them effectively and the advantages and limitations in each of these models. Excited to learn more and share with all of you. Meanwhile, stay tuned! 🙂