
Choosing the best machine learning model with LazyPredict

Predicting flower species with machine learning. Image source: https://weheartit.com/entry/14173501

The idea of using machine learning models to make predictions is attractive. But a challenging question remains: with so many machine learning models out there, which model is most suitable? Which model should I start exploring? Should I fit every single model I have learnt and simply hope that one of them works?

Indeed, these questions cannot be easily answered, especially for beginners. While I am no expert in machine learning, there are packages that can make your life simpler. One such package is LazyPredict, which ranks the machine learning models that are most likely to be suitable. LazyPredict provides both a Lazy Classifier and a Lazy Regressor, which allow you to predict categorical and continuous variables respectively.

To install the lazypredict package, execute the following code:

pip install lazypredict

The lazypredict package requires the XGBoost and LightGBM libraries to be functional. Execute the following commands in the terminal:

conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm

Now you are ready to begin the machine learning analysis. First, import the essential packages:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import display

We will use the iris dataset as an example. In this exercise, we will predict sepal width from the other variables using machine learning:

df = px.data.iris()
df.head()

Output is as follows. The dataframe contains values for sepal length/width, petal length/width, species (setosa, versicolor, virginica) and species ID (where setosa = 1, versicolor = 2, virginica = 3):

   sepal_length  sepal_width  petal_length  petal_width  species  species_id
0           5.1          3.5           1.4          0.2   setosa           1
1           4.9          3.0           1.4          0.2   setosa           1
2           4.7          3.2           1.3          0.2   setosa           1
3           4.6          3.1           1.5          0.2   setosa           1
4           5.0          3.6           1.4          0.2   setosa           1

Before doing machine learning, it is good practice to “sense” your data to see which machine learning model is likely to fit it. A pair plot is useful here for understanding the distribution of your dataset. Some critical considerations include: (i) the presence of outliers, (ii) whether the relationships between variables are linear or non-linear, and (iii) whether the distribution of each variable is Gaussian or skewed. Plotting a pair plot in Python is easy:

from pandas.plotting import scatter_matrix

# Plot pairwise scatter plots of the four numerical variables;
# the diagonal shows each variable's distribution
attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
scatter_matrix(df[attributes], figsize=(10, 8))
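
As an aside, if you prefer the points coloured by species (which makes group structure easier to spot), the seaborn library offers a one-line alternative. This is an optional sketch, assuming seaborn is installed:

import seaborn as sns

# Pair plot with one colour per species; the diagonal shows
# each variable's distribution within each species
sns.pairplot(df[attributes + ["species"]], hue="species")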

The output of the scatter_matrix plot is as follows:

The data suggest that there are no big anomalies, and most of the variables are related roughly linearly. At this point, it is not obvious which variable best predicts “sepal width”. Notice that we have not yet considered the “species” variable, which contains three categories: setosa, versicolor and virginica. One way to process categorical data is to assign a different number to each category, as in the “species_id” column, where setosa = 1, versicolor = 2 and virginica = 3. However, this encoding is not appropriate here, as it implies an ordering: that virginica is somehow greater than versicolor, and versicolor greater than setosa. To circumvent this issue, we can one-hot encode the data, which assigns a binary indicator (1 or 0) to each individual category. Fortunately, the conversion is easy to execute in Python:

# One-hot encode the "species" column into separate 0/1 columns
df_cat_to_array = pd.get_dummies(df)
# Drop "species_id" since the one-hot columns now encode the species
df_cat_to_array = df_cat_to_array.drop('species_id', axis=1)
df_cat_to_array

Output is as follows:

     sepal_length  sepal_width  petal_length  petal_width  species_setosa  species_versicolor  species_virginica
0             5.1          3.5           1.4          0.2               1                   0                  0
1             4.9          3.0           1.4          0.2               1                   0                  0
2             4.7          3.2           1.3          0.2               1                   0                  0
3             4.6          3.1           1.5          0.2               1                   0                  0
4             5.0          3.6           1.4          0.2               1                   0                  0
...           ...          ...           ...          ...             ...                 ...                ...
145           6.7          3.0           5.2          2.3               0                   0                  1
146           6.3          2.5           5.0          1.9               0                   0                  1
147           6.5          3.0           5.2          2.0               0                   0                  1
148           6.2          3.4           5.4          2.3               0                   0                  1
149           5.9          3.0           5.1          1.8               0                   0                  1

Each category is given equal weightage: the species columns contain 1 or 0, where 1 indicates “yes” and 0 indicates “no”. We also drop the species_id column, since the species are now encoded by the get_dummies function.

Now that all columns are converted to numerical values, we are ready to use the lazypredict package to see which machine learning model is best at predicting sepal width. We first import the required packages:

import lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor

Next, we use lazypredict to identify the machine learning regressor model that can potentially predict sepal width:

# Explanatory variables: everything except sepal width
X = df_cat_to_array.drop(['sepal_width'], axis=1)
# Response variable: sepal width
Y = df_cat_to_array['sepal_width']

# 80/20 train/test split; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=64)

reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models

The above commands place the explanatory variables in X and the response variable (sepal width) in Y. Hence, we drop the sepal-width column for X and keep only that column for Y. The test set here is 20% of the data and the training set is 80%. It is a good habit to specify the random state so that the same analysis can be reproduced on a different day. The output is as follows:

Model                           Adjusted R-Squared  R-Squared  RMSE  Time Taken
SVR                                           0.70       0.76  0.19        0.01
NuSVR                                         0.70       0.76  0.19        0.02
KNeighborsRegressor                           0.64       0.71  0.20        0.02
RandomForestRegressor                         0.62       0.70  0.21        0.19
GradientBoostingRegressor                     0.62       0.70  0.21        0.05
LGBMRegressor                                 0.60       0.69  0.21        0.04
HistGradientBoostingRegressor                 0.60       0.68  0.21        0.12
HuberRegressor                                0.60       0.68  0.21        0.04
Ridge                                         0.60       0.68  0.22        0.01
RidgeCV                                       0.60       0.68  0.22        0.01
BayesianRidge                                 0.59       0.68  0.22        0.02
ElasticNetCV                                  0.59       0.68  0.22        0.08
LassoCV                                       0.59       0.68  0.22        0.11
LassoLarsIC                                   0.59       0.67  0.22        0.01
LassoLarsCV                                   0.59       0.67  0.22        0.04
TransformedTargetRegressor                    0.59       0.67  0.22        0.01
LinearRegression                              0.59       0.67  0.22        0.01
OrthogonalMatchingPursuitCV                   0.59       0.67  0.22        0.02
LinearSVR                                     0.59       0.67  0.22        0.01
BaggingRegressor                              0.59       0.67  0.22        0.03
AdaBoostRegressor                             0.57       0.66  0.22        0.10
Lars                                          0.53       0.62  0.23        0.02
LarsCV                                        0.53       0.62  0.23        0.04
SGDRegressor                                  0.47       0.58  0.25        0.02
ExtraTreesRegressor                           0.47       0.58  0.25        0.13
PoissonRegressor                              0.45       0.56  0.25        0.02
XGBRegressor                                  0.35       0.48  0.27        0.12
ExtraTreeRegressor                            0.34       0.47  0.28        0.02
GeneralizedLinearRegressor                    0.29       0.44  0.28        0.02
TweedieRegressor                              0.29       0.44  0.28        0.01
DecisionTreeRegressor                         0.29       0.44  0.29        0.02
MLPRegressor                                  0.29       0.44  0.29        0.18
GammaRegressor                                0.29       0.43  0.29        0.01
OrthogonalMatchingPursuit                     0.28       0.43  0.29        0.02
RANSACRegressor                               0.27       0.42  0.29        0.07
PassiveAggressiveRegressor                    0.03       0.23  0.33        0.01
DummyRegressor                               -0.31      -0.04  0.39        0.01
Lasso                                        -0.31      -0.04  0.39        0.02
ElasticNet                                   -0.31      -0.04  0.39        0.01
LassoLars                                    -0.31      -0.04  0.39        0.02
KernelRidge                                 -82.96     -65.59  3.11        0.02
GaussianProcessRegressor                   -483.87    -383.55  7.47        0.02

Here we go. We have identified SVR, NuSVR, K-Neighbors Regressor, Random Forest Regressor and Gradient Boosting Regressor as the top five models (with R-squared values of ~0.7), which is not bad for a start! Of course, we can further refine these models by hyperparameter tuning to explore whether we can improve our predictions.
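
For instance, a grid search over the top-ranked SVR model could look like the sketch below. Note that the parameter grid here is an arbitrary illustration of mine, not a recommendation from LazyPredict:

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters for SVR (an illustrative grid, not exhaustive)
param_grid = {'C': [0.1, 1, 10],
              'gamma': ['scale', 'auto'],
              'kernel': ['rbf', 'linear']}

# 5-fold cross-validated grid search on the training data
grid = GridSearchCV(SVR(), param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # R-squared on the held-out test set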

To predict categorical variables, you can use the Lazy Classifier from the LazyPredict package. Let’s assume this time we are interested in whether the sepal and petal length/width can be used to predict the species versicolor. First, import the necessary packages:

from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

The Lazy Classifier can then be executed with the following command:

# Explanatory variables: sepal and petal length/width only
X = df_cat_to_array.drop(['species_setosa', 'species_versicolor', 'species_virginica'], axis=1)
# Response variable: whether a sample is versicolor (1) or not (0)
Y = df_cat_to_array['species_versicolor']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=55)
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

Note that I have dropped the columns for the other species, as we are only interested in predicting versicolor from the petal and sepal length/width. This shows that framing the research question is critical in machine learning! The output is as follows:

Model                           Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
KNeighborsClassifier                1.00               1.00     1.00      1.00        0.02
SVC                                 1.00               1.00     1.00      1.00        0.02
AdaBoostClassifier                  0.97               0.95     0.95      0.97        0.13
XGBClassifier                       0.97               0.95     0.95      0.97        0.05
RandomForestClassifier              0.97               0.95     0.95      0.97        0.32
NuSVC                               0.97               0.95     0.95      0.97        0.06
BaggingClassifier                   0.97               0.95     0.95      0.97        0.04
LGBMClassifier                      0.97               0.95     0.95      0.97        0.05
DecisionTreeClassifier              0.97               0.95     0.95      0.97        0.02
ExtraTreesClassifier                0.97               0.95     0.95      0.97        0.14
LabelPropagation                    0.93               0.93     0.92      0.93        0.02
LabelSpreading                      0.93               0.93     0.92      0.93        0.02
GaussianNB                          0.93               0.90     0.90      0.93        0.02
ExtraTreeClassifier                 0.93               0.90     0.90      0.93        0.01
QuadraticDiscriminantAnalysis       0.93               0.90     0.90      0.93        0.03
NearestCentroid                     0.60               0.65     0.65      0.61        0.03
BernoulliNB                         0.67               0.65     0.65      0.67        0.02
CalibratedClassifierCV              0.67               0.60     0.60      0.66        0.04
RidgeClassifierCV                   0.67               0.60     0.60      0.66        0.02
RidgeClassifier                     0.67               0.60     0.60      0.66        0.03
LinearSVC                           0.67               0.60     0.60      0.66        0.02
LogisticRegression                  0.67               0.60     0.60      0.66        0.04
LinearDiscriminantAnalysis          0.67               0.60     0.60      0.66        0.02
SGDClassifier                       0.70               0.57     0.57      0.64        0.02
Perceptron                          0.63               0.55     0.55      0.61        0.06
PassiveAggressiveClassifier         0.63               0.55     0.55      0.61        0.05
DummyClassifier                     0.57               0.47     0.47      0.54        0.02

This output suggests that the K-Neighbors Classifier, SVC, AdaBoost Classifier, XGB Classifier and Random Forest Classifier could be explored as machine learning models for predicting the versicolor species. You can further refine these models by hyperparameter tuning to improve accuracy.
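
As a quick sanity check on the ranking, you could also fit the top classifier directly and inspect its performance on the test set. A minimal sketch:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Fit the top-ranked model on the training split
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))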

Overall, LazyPredict can be a handy tool for screening dozens of machine learning models to find those most suitable for predicting your response variable, before investing time in hyperparameter tuning. Certainly a good starting point for beginners venturing into machine learning!
