Hands-on End-to-End Automated Machine Learning

We start with a basic pipeline approach in Python, which contains no AutoML at all, and then move on to well-known AutoML libraries. We also compare trending AutoML SaaS solutions from OptiWisdom, such as OptiScorer, with the classical approaches.
This is a very fast introduction to the end-to-end automated machine learning process, covering several libraries from different perspectives, so that you can compare them and make strategic decisions about which AutoML library to use.
If you have no idea what AutoML is or what it is for, you can also start with the article below.
Introduction to Pipelines and End-to-End solution for machine learning in Python.
The steps followed in this article are shown in the figure below:
In Python, pipelining is the standard way of implementing a system and holding all of its steps together. As a hands-on coding exercise, it is possible to write a pipelined approach in Python as below:
1. Loading the Data Set: This step is the data connection layer; for this very simple prototype we keep it easy and load a data set from the sklearn library:
"""
@author: sadievrenseker
"""
import pandas as pd
import numpy as np

from sklearn import datasets
data = datasets.load_iris()
2. DataFrames: The code above simply loads a dataset from the sklearn datasets. Loading is not finished until the data is placed into a data layer object. In this example we use pandas as the data access layer, and loading the dataset into a pandas object can be done with the code below:
df = pd.DataFrame(data=np.column_stack((data.data, data.target)),
                  columns=data.feature_names + ['Species'])
df.head()
The above code will return the following output:
DataFrame output for the data loading
The image above shows that the data frame loaded is the original version from the sklearn datasets. In some copies of this dataset, the Species column instead contains the names of the species, such as ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-virginica’, rather than numeric codes.
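As a quick illustration of that variant, the sketch below rebuilds a string-labeled copy by mapping the numeric targets back to the species names. The df_str name and the 'Iris-' prefix are assumptions for this example, not part of the original data:
# hypothetical string-labeled copy of the data frame, for illustration only
df_str = df.copy()
df_str['Species'] = ['Iris-' + data.target_names[int(t)] for t in data.target]
df_str.head()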
3. Exploratory Data Analysis (EDA): If that is the case, we will handle the string types later, in the data preprocessing step. From this point on, a data scientist would usually go through exploratory data analysis (EDA) if the dataset is new to them, which is the situation in most real-life problems. So the code following the loading step might produce some visualization or summary information, as below:
import seaborn as sns
sns.pairplot(df, hue='Species', height=3)  # 'height' replaced the deprecated 'size' argument in newer seaborn versions
The above code will return the following output:
EDA Output for Iris Data Set
The seaborn matrix of plots above:
· shows every column-by-column pairing and the data distribution in all possible 2D combinations.
· For the pipelining, the whole system is inserted into a single pipeline; a regular AutoML approach consists of data preprocessing and machine learning steps, as already explained in the first section of this chapter.
· For this hands-on exercise, we will use K-NN together with the standard scaler and the label encoder.
· The standard scaler and the label encoder are therefore the preprocessing phase, and K-NN is the machine learning phase.
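Before moving on to preprocessing, a quick numeric summary is usually part of the same EDA pass. The lines below are a minimal sketch using standard pandas calls:
df.info()                             # column types and non-null counts
print(df.describe())                  # basic statistics per feature
print(df['Species'].value_counts())   # class balance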
4. Encoding: If the loaded data set has string labels in the Species column, the label encoder handles the transformation from strings to numerical values. In our case this is not strictly necessary, because the original data set already holds numerical values in the Species column, but the same call works either way:
from sklearn.preprocessing import LabelEncoder
df['Species'] = LabelEncoder().fit_transform(df['Species'])
The above code will return the following output:
Encoded DataFrame
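If the Species column really did contain strings, the very same encoder would map them to integers, sorted alphabetically. A small standalone sketch with made-up labels:
labels = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa']
encoder = LabelEncoder()
print(encoder.fit_transform(labels))  # [0 2 1 0]
print(encoder.classes_)               # the original labels in encoded order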
5. Normalization: After the label encoding we can continue with normalization. Normalization transforms the features of the data set into the same range; here we standardize all features to zero mean and unit variance:
from sklearn.preprocessing import StandardScaler
df.iloc[:,:4] = StandardScaler().fit_transform(df.iloc[:,:4])
The above code will return the following output:
Normalized version of DataFrame
The scaled version of the data is displayed above, and the data is now ready for classification.
6. Test and Train Sets: We will use the K-NN algorithm for classification. In regular coding we first split the data into train and test sets and then run the classification, as below:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1].values,
                                                    df['Species'],
                                                    test_size=0.4,
                                                    random_state=123)
7. Machine Learning: After splitting the data into train and test sets with a 0.6 / 0.4 ratio, we can now apply the K-NN algorithm.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)
Notice in the code above that the n_neighbors parameter is set to 3.
8. Hyper-Parameter Optimisation: One solution for parameter optimization is grid search, but before going into grid search, let's measure the success of the algorithm with a confusion matrix and an accuracy score, as below:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,predictions)
print(cm)
The confusion matrix we get after executing the code is as below:
[[22 0 0]
 [ 0 15 1]
 [ 0 2 20]]
9. Evaluation: For readers already familiar with the confusion matrix, it is clear that we have only 3 errors in 60 data points. Experienced data scientists who have played with the iris data set before will also recognize the clean separation of the first class and the confusion between the second and third classes. To put the success into numbers, let's calculate the accuracy score as shown in the code below:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,predictions)
print(score)
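If you want the per-class picture behind that single number, scikit-learn's classification_report prints precision and recall for each class; a minimal sketch:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))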
10. Score: The accuracy for our case is 0.95, and now we can try to optimize it by using grid search.
from sklearn.model_selection import GridSearchCV
k_range = list(range(1, 31))
print(k_range)
param_grid = dict(n_neighbors=k_range)
print(param_grid)
grid = GridSearchCV(knn, param_grid, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)
The code above uses the GridSearchCV class from the sklearn library, which uses cross-validation to find the best parameter; we give it a search space for k from 1 to 30. Grid search basically starts with k = 1 and increases the value of k by 1 at each step. The GridSearchCV class also reports the best parameter, which for our case is the following:
{'n_neighbors': 5}
So, the optimized parameter for the accuracy of K-NN model for iris data set is k=5.
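As a sanity check, one would normally refit K-NN with the optimized parameter and score it on the held-out test set. The lines below are a small sketch of that step, not part of the original walkthrough:
best_knn = KNeighborsClassifier(n_neighbors=grid.best_params_['n_neighbors'])
best_knn.fit(X_train, y_train)
print(accuracy_score(y_test, best_knn.predict(X_test)))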
11. Pipelining: Now, we can put all the above code into a pipeline as below:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
 ('normalizer', StandardScaler()), #Step1 - normalize data
 ('clf', KNeighborsClassifier()) #step2 - classifier
])
print(pipeline.steps)
The output of the pipeline steps is as below:
[('normalizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
 weights='uniform'))]
12. Cross-Validation: Now the pipeline is open for any setup, and the parameters of the normalizer or of the classification algorithm can be passed into their constructors when they are added to the pipeline, as sketched after the cross-validation output below.
from sklearn.model_selection import cross_validate
scores = cross_validate(pipeline, X_train, y_train)
print(scores)
The piece of code above trains the whole pipeline using the training data and labels. After execution, the scores variable holds the detailed scores for each fold of the cross-validation, as shown below.
{'fit_time': array([0.00163412, 0.0012331 , 0.00207829]), 'score_time': array([0.00192475, 0.00164199, 0.00256586]), 'test_score': array([0.96875   , 1.        , 0.93103448]), 'train_score': array([0.96551724, 0.98360656, 1.        ])}
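The pipeline can also be tuned as a single unit: GridSearchCV accepts parameter names of the form step__parameter, so the grid search from step 10 can be repeated over the whole pipeline. A minimal sketch reusing the objects defined above:
pipe_grid = GridSearchCV(pipeline,
                         param_grid={'clf__n_neighbors': list(range(1, 31))},
                         scoring='accuracy')
pipe_grid.fit(X_train, y_train)
print(pipe_grid.best_params_)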

Summary

Once more, remember that our steps can be summarized as below:
Steps of operations in this article
Please note that our approach is very similar to the CRISP-DM steps:
CRISP-DM Steps
This article is a very basic introduction to the AutoML process; we have only implemented the pipelining step, for a better understanding of what comes next.
