The idea of using machine learning models to make predictions is attractive. But a challenging question remains: there are so many machine learning models out there, so which one is most suitable? Which model should I start exploring? Should I fit every single model I have learnt, try my luck, and see which works best for me?
Indeed, these questions cannot be easily answered, especially for beginners. While I am no expert in machine learning, there are some packages that can make your life simpler. One such package is LazyPredict, which ranks the machine learning models that are most likely to be suitable. LazyPredict provides both a LazyClassifier and a LazyRegressor, which let you predict categorical and continuous response variables respectively.
To install the lazypredict package, execute the following code:
pip install lazypredict
The lazypredict package requires the XGBoost and LightGBM libraries to be functional. Execute the following commands in the terminal:
conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm
Now you are ready to analyse with machine learning. First, import the essential packages:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import display
We will import the iris dataset as an example. In this exercise, we will predict sepal width from the iris dataset using machine learning:
df = px.data.iris()
df.head()
Output is as follows. The dataframe contains values for sepal length/width, petal length/width, species (setosa, versicolor, virginica) and species ID (where setosa = 1, versicolor = 2, virginica = 3):
Before doing machine learning, it is good practice to “sense” your data to know which machine learning model will likely fit it. You can perform a pair plot in this case to understand the distribution of your dataset. Some critical considerations include (i) detecting the presence of outliers, (ii) whether the relationships between variables are linear or non-linear, and (iii) whether the variables follow a Gaussian or skewed distribution. Plotting a pair plot in Python is easy:
from pandas.plotting import scatter_matrix

attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
scatter_matrix(df[attributes], figsize=(10, 8))
Output is as follows:
The data suggest that there are no big anomalies, and most of the variables follow a roughly linear relationship. At this point, it is not obvious which variable best predicts “sepal width”. Notice that we have not yet considered the “species” variable, which contains three categories: setosa, versicolor and virginica. One way to process the categorical data is to assign a different number to each category, as in the “species ID” column, where setosa = 1, versicolor = 2 and virginica = 3. However, this encoding may not be appropriate here, because it implies an ordering: virginica would appear “greater” than versicolor, and versicolor “greater” than setosa. To circumvent this issue, we can use one-hot encoding, which assigns a binary value (1 or 0) to each individual category. Fortunately, the conversion is easy to execute in Python:
df_cat_to_array = pd.get_dummies(df)
df_cat_to_array = df_cat_to_array.drop('species_id', axis=1)
df_cat_to_array
Output is as follows:
Each category now receives equal weightage: each species column contains 1 (“yes”) or 0 (“no”). We also drop the species ID column, since the categories are already encoded by the get_dummies function.
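As a minimal illustration of the difference between the two encodings, here is a toy column (hypothetical values, not the iris dataframe itself) run through integer codes and through get_dummies:

```python
import pandas as pd

# Toy categorical column for illustration only
s = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="species")

# Integer encoding imposes an artificial ordering on the categories
codes = s.astype("category").cat.codes
print(codes.tolist())  # e.g. [0, 1, 2, 0] — suggests virginica > versicolor > setosa

# One-hot encoding gives each category its own equally weighted 1/0 column
one_hot = pd.get_dummies(s)
print(one_hot)
```

Each row of `one_hot` has exactly one 1, so no category is ranked above another.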
Now that all columns are converted to numerical values, we are ready to use the lazypredict package to see which machine learning model is best in predicting sepal width. We first import the required packages:
import lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor
Next, we use lazypredict to identify the machine learning regressor model that can potentially predict sepal width:
X = df_cat_to_array.drop(['sepal_width'], axis=1)
Y = df_cat_to_array['sepal_width']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=64)
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models
The above commands put the explanatory variables in X and the response variable (sepal width) in Y. Hence, we drop the sepal width column for X and select only that column for Y. The test size here is 20% of the data, leaving 80% for training. It is a good habit to specify the random state so that the same analysis is reproducible if we rerun it on a different day. The output is as follows:
[Table: regressor models ranked, with columns Model, Adjusted R-Squared, R-Squared, RMSE and Time Taken]
Here we go. We have identified SVR, NuSVR, K-Neighbors Regressor, Random Forest Regressor and Gradient Boosting Regressor as the top 5 models (with R² values of ~0.7), which is not bad for a start! Of course, we can refine these models by hyperparameter tuning to explore whether we can improve our predictions further.
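As a minimal sketch of what that hyperparameter tuning could look like for one of the top models, here is scikit-learn's GridSearchCV applied to SVR. For self-containedness it loads iris from scikit-learn and uses only the numeric features; the parameter grid is illustrative, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Load iris via scikit-learn so this sketch is self-contained
iris = load_iris(as_frame=True).frame
X = iris.drop(columns=["sepal width (cm)", "target"])
y = iris["sepal width (cm)"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=64)

# Illustrative grid; adjust the ranges for your own data
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
grid = GridSearchCV(SVR(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # R² on the held-out test set
```

GridSearchCV refits the best parameter combination on the full training set, so `grid` can be used directly for prediction afterwards.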
To predict categorical variables, you can use the LazyClassifier from the lazypredict package. Let’s assume this time that we are interested in whether the sepal and petal length/width can be used to predict the species versicolor. First, import the necessary packages:
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split
The LazyClassifier can then be executed with the following commands:
X = df_cat_to_array.drop(['species_setosa', 'species_versicolor', 'species_virginica'], axis=1)
Y = df_cat_to_array['species_versicolor']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=55)
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models
Note that I have dropped the columns for the other species, since we are interested in predicting versicolor from the petal and sepal length/width alone. This shows that the research question is critical for machine learning! The output is as follows:
[Table: classifier models ranked, with columns Model, Accuracy, Balanced Accuracy, ROC AUC, F1 Score and Time Taken]
This output suggests that K-Neighbors Classifier, SVC, Ada Boost Classifier, XGB Classifier and Random Forest Classifier can potentially be explored as machine learning models for predicting the versicolor species. You can further refine the models by hyperparameter tuning to improve the accuracy.
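As with the regression case, tuning one of these candidates is straightforward with scikit-learn. The sketch below tunes the number of neighbours for a K-Neighbors Classifier on the same binary question (is the sample versicolor?), again loading iris from scikit-learn for self-containedness; the grid values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True).frame
X = iris.drop(columns=["target"])
# Binary target: 1 if the sample is versicolor (class 1 in scikit-learn's iris), else 0
y = (iris["target"] == 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

# Illustrative grid of neighbour counts
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # accuracy on the held-out test set
```

The same pattern extends to the other shortlisted classifiers by swapping in their estimator and a suitable parameter grid.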
Overall, LazyPredict can be a handy tool for selecting which of its 36 machine learning models is most suitable for predicting your response variable, before moving on to hyperparameter tuning. Certainly a good starting point for beginners venturing into machine learning!