Continue with Machine Learning - Try with Multiple Algorithms

In this post, we test several algorithms on the same dataset and then choose the best one.

Data

The data is from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/data

Purpose

Our machine learning model predicts whether a case is diagnosed as benign or malignant (B or M). Let's look at the data:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import time

df = pd.read_csv('data.csv')
df.head()

#Output
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

      ...       texture_worst  perimeter_worst  area_worst  smoothness_worst  \
0     ...               17.33           184.60      2019.0            0.1622   
1     ...               23.41           158.80      1956.0            0.1238   
2     ...               25.53           152.50      1709.0            0.1444   
3     ...               26.50            98.87       567.7            0.2098   
4     ...               16.67           152.20      1575.0            0.1374   

   compactness_worst  concavity_worst  concave points_worst  symmetry_worst  \
0             0.6656           0.7119                0.2654          0.4601   
1             0.1866           0.2416                0.1860          0.2750   
2             0.4245           0.4504                0.2430          0.3613   
3             0.8663           0.6869                0.2575          0.6638   
4             0.2050           0.4000                0.1625          0.2364   

   fractal_dimension_worst  Unnamed: 32  
0                  0.11890          NaN  
1                  0.08902          NaN  
2                  0.08758          NaN  
3                  0.17300          NaN  
4                  0.07678          NaN  

[5 rows x 33 columns]

Data Description

There are 10 real-valued features computed for each cell nucleus, each measured in 3 ways (mean, standard error, and worst), which gives 30 feature columns:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)

The target is the diagnosis: malignant or benign (M, B).
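The 30 feature columns follow directly from this description: each of the 10 measurements appears once per statistic. Here is a minimal sketch of how the column names are built. The `_se` suffix for the standard-error columns is an assumption based on the Kaggle file, since those columns are hidden behind the `...` in the output above:

```python
# The 10 base measurements listed above, spelled as in the CSV header.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# "_mean" and "_worst" are visible in df.head(); "_se" is an assumption.
suffixes = ["_mean", "_se", "_worst"]

feature_columns = [name + suffix for suffix in suffixes for name in base_features]
print(len(feature_columns))   # 10 measurements x 3 statistics
print(feature_columns[:3])
```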

Data Exploration

Let's separate the data into features and the class label (what we want to predict):

y = df['diagnosis']
x = df.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
x.head()

#Output
   radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   compactness_mean  concavity_mean  concave points_mean  symmetry_mean  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   fractal_dimension_mean           ...             radius_worst  \
0                 0.07871           ...                    25.38   
1                 0.05667           ...                    24.99   
2                 0.05999           ...                    23.57   
3                 0.09744           ...                    14.91   
4                 0.05883           ...                    22.54   

   texture_worst  perimeter_worst  area_worst  smoothness_worst  \
0          17.33           184.60      2019.0            0.1622   
1          23.41           158.80      1956.0            0.1238   
2          25.53           152.50      1709.0            0.1444   
3          26.50            98.87       567.7            0.2098   
4          16.67           152.20      1575.0            0.1374   

   compactness_worst  concavity_worst  concave points_worst  symmetry_worst  \
0             0.6656           0.7119                0.2654          0.4601   
1             0.1866           0.2416                0.1860          0.2750   
2             0.4245           0.4504                0.2430          0.3613   
3             0.8663           0.6869                0.2575          0.6638   
4             0.2050           0.4000                0.1625          0.2364   

   fractal_dimension_worst  
0                  0.11890  
1                  0.08902  
2                  0.08758  
3                  0.17300  
4                  0.07678  
[5 rows x 30 columns]
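The `Unnamed: 32` column is dropped because it holds only NaN in the rows shown by `df.head()`. A small sketch of how such empty columns could be detected rather than hard-coded, using a toy frame built from a few of the values above (the frame itself is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the first rows of data.csv (values from df.head()).
toy = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "diagnosis": ["M", "M", "M"],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing carry no information.
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)

# Drop them together with the identifier and the label to get the features.
features = toy.drop(columns=["id", "diagnosis"] + empty_columns)
print(list(features.columns))
```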

Let's check the distribution of our data with a density plot for each feature:

x.plot(kind='density', subplots=True, layout=(6,6), sharex=False, legend=False, fontsize=1)
plt.show()

Most of the features roughly follow a Gaussian distribution. Let's check the number of benign and malignant cases:

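# Pass the labels by keyword; recent seaborn versions no longer accept them positionally
ax = sns.countplot(x=y)
# value_counts() sorts by frequency, so benign (the larger class) comes first
b, m = y.value_counts()
print("Number of Benign: ", b)
print("Number of Malignant: ", m)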
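Unpacking `value_counts()` positionally works here because the result is sorted by frequency and benign is the larger class. Here is a small sketch of a lookup by label instead, which does not depend on that ordering (the toy label list is made up for illustration):

```python
from collections import Counter

# Hypothetical labels; in the post this would be the `diagnosis` column.
labels = ["M", "B", "B", "M", "B", "B", "B"]

counts = Counter(labels)
print("Number of Benign: ", counts["B"])
print("Number of Malignant: ", counts["M"])

# Share of the majority class: the accuracy of always predicting it.
majority_share = max(counts.values()) / len(labels)
print("Majority share: %f" % majority_share)
```

The majority share is a useful baseline: a model is only interesting if its accuracy is clearly above it.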

Let's check data statistics:

x.describe()

#Output
       radius_mean  texture_mean  perimeter_mean    area_mean  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   
min           0.052630          0.019380        0.000000             0.000000   
25%           0.086370          0.064920        0.029560             0.020310   
50%           0.095870          0.092630        0.061540             0.033500   
75%           0.105300          0.130400        0.130700             0.074000   
max           0.163400          0.345400        0.426800             0.201200   

       symmetry_mean  fractal_dimension_mean           ...             \
count     569.000000              569.000000           ...              
mean        0.181162                0.062798           ...              
std         0.027414                0.007060           ...              
min         0.106000                0.049960           ...              
25%         0.161900                0.057700           ...              
50%         0.179200                0.061540           ...              
75%         0.195700                0.066120           ...              
max         0.304000                0.097440           ...              

       radius_worst  texture_worst  perimeter_worst   area_worst  \
count    569.000000     569.000000       569.000000   569.000000   
mean      16.269190      25.677223       107.261213   880.583128   
std        4.833242       6.146258        33.602542   569.356993   
min        7.930000      12.020000        50.410000   185.200000   
25%       13.010000      21.080000        84.110000   515.300000   
50%       14.970000      25.410000        97.660000   686.500000   
75%       18.790000      29.720000       125.400000  1084.000000   
max       36.040000      49.540000       251.200000  4254.000000   

       smoothness_worst  compactness_worst  concavity_worst  \
count        569.000000         569.000000       569.000000   
mean           0.132369           0.254265         0.272188   
std            0.022832           0.157336         0.208624   
min            0.071170           0.027290         0.000000   
25%            0.116600           0.147200         0.114500   
50%            0.131300           0.211900         0.226700   
75%            0.146000           0.339100         0.382900   
max            0.222600           1.058000         1.252000   

       concave points_worst  symmetry_worst  fractal_dimension_worst  
count            569.000000      569.000000               569.000000  
mean               0.114606        0.290076                 0.083946  
std                0.065732        0.061867                 0.018061  
min                0.000000        0.156500                 0.055040  
25%                0.064930        0.250400                 0.071460  
50%                0.099930        0.282200                 0.080040  
75%                0.161400        0.317900                 0.092080  
max                0.291000        0.663800                 0.207500  

[8 rows x 30 columns]

Let's check the correlation between features:

f, ax = plt.subplots(figsize=(18,18))
sns.heatmap(x.corr(), annot=True, linewidths=0.5, fmt='.1f', ax=ax)
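Each cell of the heatmap is a Pearson correlation coefficient between two columns. As a rough illustration, here it is computed by hand for `radius_mean` and `perimeter_mean`, using only the five rows shown in `df.head()` (so the value is indicative, not the one from the full dataset):

```python
import math

# First five rows of radius_mean and perimeter_mean from df.head().
radius = [17.99, 20.57, 19.69, 11.42, 20.29]
perimeter = [122.80, 132.90, 130.00, 77.58, 135.10]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

r = pearson(radius, perimeter)
print("correlation: %.3f" % r)
```

Radius and perimeter are geometrically tied, so a value close to 1 is expected. Pairs like this are largely redundant as model inputs.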

Training model

There are several algorithms that work well for binary classification. We will test 5 of them and see which one performs best: Classification and Regression Trees (CART), Support Vector Machines (SVM), Gaussian Naive Bayes (NB), k-Nearest Neighbors (KNN), and Random Forest (RF). We hold out part of the data as a test set and run cross-validation on the training part only.

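from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Hold out a test set. The split is not shown in the original post: test_size=0.3
# is inferred from the 171 test samples in the final report, and random_state is
# an arbitrary assumption, so your numbers will differ slightly from the output below.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

models = []
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
models.append(('NB', GaussianNB()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('RF', RandomForestClassifier()))

num_folds = 10
results = []
names = []
# Recent scikit-learn versions require shuffle=True when random_state is set
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=123)
for name, model in models:
    start = time.time()
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
    end = time.time()
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f) (run time: %f)" % (name, cv_results.mean(), cv_results.std(), end-start))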

#Output
CART: 0.919551 (0.024681) (run time: 0.069104)
SVM: 0.625769 (0.074918) (run time: 0.569782)
NB: 0.921987 (0.034719) (run time: 0.039054)
KNN: 0.901859 (0.044437) (run time: 0.046674)
RF: 0.934679 (0.032022) (run time: 0.326259)
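For reference, 10-fold cross-validation splits the training set into 10 parts and uses each part once for validation while training on the other nine. Here is a minimal pure-Python sketch of that partitioning, assuming 398 training samples (inferred from the 171 test samples out of 569) and no shuffling. `kfold_indices` is a hypothetical helper, not the scikit-learn implementation:

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

folds = list(kfold_indices(398, 10))
print([len(valid) for _, valid in folds])
```

The mean and the value in parentheses printed for each model are the average and standard deviation of the 10 per-fold accuracies.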

Let's make a graph of the performance

fig = plt.figure()
fig.suptitle('Performance Comparison')
ax= fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

We find that the SVM performance is not so good. This may be because the data has not been scaled yet. Let's scale the data before training and check the performance again.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import warnings

pipelines = []

pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()), ('CART', DecisionTreeClassifier())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', StandardScaler()), ('SVM', SVC())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', StandardScaler()), ('NB', GaussianNB())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsClassifier())])))
pipelines.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()), ('RF', RandomForestClassifier())])))

results = []
names = []

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # Recent scikit-learn versions require shuffle=True when random_state is set
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=123)
    for name, model in pipelines:
        start = time.time()
        cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
        end = time.time()
        results.append(cv_results)
        names.append(name)
        print("%s: %f (%f) (%f)" % (name, cv_results.mean(), cv_results.std(), end-start))

#Output
ScaledCART: 0.937179 (0.025657) (0.154313)
ScaledSVM: 0.969744 (0.027240) (0.134548)
ScaledNB: 0.937051 (0.039612) (0.058743)
ScaledKNN: 0.952115 (0.043058) (0.091208)
ScaledRF: 0.949744 (0.031627) (0.405433)
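To see why scaling matters for distance-based and margin-based models such as KNN and SVM, compare the raw ranges of two columns: `area_mean` is in the hundreds while `smoothness_mean` is around 0.1, so unscaled distances are dominated by area. Here is a small sketch of the standardization that `StandardScaler` applies (subtract the mean, divide by the population standard deviation), using the five rows from `df.head()`:

```python
import statistics

area = [1001.0, 1326.0, 1203.0, 386.1, 1297.0]
smoothness = [0.11840, 0.08474, 0.10960, 0.14250, 0.10030]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

for name, column in [("area_mean", area), ("smoothness_mean", smoothness)]:
    scaled = standardize(column)
    print(name, [round(v, 2) for v in scaled])
```

After this step both columns live on the same scale, so neither one dominates the other in a distance or kernel computation.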

Scaling gives a big improvement, and SVM is now the best. Here is the crux of this post: we will use GridSearchCV from model_selection to try each combination of the important parameters and find the best ones.

from sklearn.model_selection import GridSearchCV

scaler = StandardScaler().fit(x_train)
scaledX = scaler.transform(x_train)
c_values = [round(0.1 * (i+1), 1) for i in range(20)]
kernel_values = ['linear', 'poly', 'rbf', 'sigmoid']
params_grid = dict(C=c_values, kernel=kernel_values)
# Recent scikit-learn versions require shuffle=True when random_state is set
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=121)
grid = GridSearchCV(estimator=SVC(), param_grid=params_grid, scoring='accuracy', cv=kfold)
grid_result = grid.fit(scaledX, y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, std, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, std, param))
#Output
Best: 0.972362 using {'C': 0.1, 'kernel': 'linear'}
0.972362 (0.026491) with: {'C': 0.1, 'kernel': 'linear'}
0.841709 (0.053980) with: {'C': 0.1, 'kernel': 'poly'}
0.932161 (0.039436) with: {'C': 0.1, 'kernel': 'rbf'}
0.939698 (0.020594) with: {'C': 0.1, 'kernel': 'sigmoid'}
0.964824 (0.036358) with: {'C': 0.2, 'kernel': 'linear'}
0.861809 (0.040516) with: {'C': 0.2, 'kernel': 'poly'}
0.947236 (0.030812) with: {'C': 0.2, 'kernel': 'rbf'}
0.944724 (0.022233) with: {'C': 0.2, 'kernel': 'sigmoid'}
0.962312 (0.034665) with: {'C': 0.3, 'kernel': 'linear'}
0.866834 (0.043296) with: {'C': 0.3, 'kernel': 'poly'}
0.952261 (0.028829) with: {'C': 0.3, 'kernel': 'rbf'}
0.954774 (0.027544) with: {'C': 0.3, 'kernel': 'sigmoid'}
0.959799 (0.038022) with: {'C': 0.4, 'kernel': 'linear'}
0.869347 (0.042970) with: {'C': 0.4, 'kernel': 'poly'}
0.957286 (0.030066) with: {'C': 0.4, 'kernel': 'rbf'}
0.959799 (0.025934) with: {'C': 0.4, 'kernel': 'sigmoid'}
0.959799 (0.038022) with: {'C': 0.5, 'kernel': 'linear'}
0.871859 (0.046718) with: {'C': 0.5, 'kernel': 'poly'}
0.967337 (0.027764) with: {'C': 0.5, 'kernel': 'rbf'}
0.954774 (0.027398) with: {'C': 0.5, 'kernel': 'sigmoid'}
0.959799 (0.034560) with: {'C': 0.6, 'kernel': 'linear'}
0.876884 (0.042568) with: {'C': 0.6, 'kernel': 'poly'}
0.967337 (0.027764) with: {'C': 0.6, 'kernel': 'rbf'}
0.959799 (0.030474) with: {'C': 0.6, 'kernel': 'sigmoid'}
0.959799 (0.034560) with: {'C': 0.7, 'kernel': 'linear'}
0.884422 (0.046459) with: {'C': 0.7, 'kernel': 'poly'}
0.967337 (0.027764) with: {'C': 0.7, 'kernel': 'rbf'}
0.962312 (0.028466) with: {'C': 0.7, 'kernel': 'sigmoid'}
0.957286 (0.034271) with: {'C': 0.8, 'kernel': 'linear'}
0.894472 (0.043221) with: {'C': 0.8, 'kernel': 'poly'}
0.972362 (0.026400) with: {'C': 0.8, 'kernel': 'rbf'}
0.959799 (0.028338) with: {'C': 0.8, 'kernel': 'sigmoid'}
0.957286 (0.034271) with: {'C': 0.9, 'kernel': 'linear'}
0.896985 (0.041029) with: {'C': 0.9, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 0.9, 'kernel': 'rbf'}
0.959799 (0.028338) with: {'C': 0.9, 'kernel': 'sigmoid'}
0.959799 (0.034560) with: {'C': 1.0, 'kernel': 'linear'}
0.902010 (0.039532) with: {'C': 1.0, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 1.0, 'kernel': 'rbf'}
0.947236 (0.026665) with: {'C': 1.0, 'kernel': 'sigmoid'}
0.957286 (0.034271) with: {'C': 1.1, 'kernel': 'linear'}
0.902010 (0.039532) with: {'C': 1.1, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 1.1, 'kernel': 'rbf'}
0.962312 (0.030593) with: {'C': 1.1, 'kernel': 'sigmoid'}
0.957286 (0.034271) with: {'C': 1.2, 'kernel': 'linear'}
0.902010 (0.039532) with: {'C': 1.2, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 1.2, 'kernel': 'rbf'}
0.954774 (0.037240) with: {'C': 1.2, 'kernel': 'sigmoid'}
0.959799 (0.034560) with: {'C': 1.3, 'kernel': 'linear'}
0.902010 (0.039532) with: {'C': 1.3, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 1.3, 'kernel': 'rbf'}
0.947236 (0.030890) with: {'C': 1.3, 'kernel': 'sigmoid'}
0.957286 (0.032387) with: {'C': 1.4, 'kernel': 'linear'}
0.902010 (0.039532) with: {'C': 1.4, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 1.4, 'kernel': 'rbf'}
0.947236 (0.030812) with: {'C': 1.4, 'kernel': 'sigmoid'}
0.962312 (0.034665) with: {'C': 1.5, 'kernel': 'linear'}
0.907035 (0.042054) with: {'C': 1.5, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 1.5, 'kernel': 'rbf'}
0.939698 (0.030308) with: {'C': 1.5, 'kernel': 'sigmoid'}
0.962312 (0.034665) with: {'C': 1.6, 'kernel': 'linear'}
0.907035 (0.042054) with: {'C': 1.6, 'kernel': 'poly'}
0.967337 (0.025401) with: {'C': 1.6, 'kernel': 'rbf'}
0.942211 (0.030020) with: {'C': 1.6, 'kernel': 'sigmoid'}
0.962312 (0.034665) with: {'C': 1.7, 'kernel': 'linear'}
0.907035 (0.042054) with: {'C': 1.7, 'kernel': 'poly'}
0.967337 (0.025401) with: {'C': 1.7, 'kernel': 'rbf'}
0.934673 (0.037762) with: {'C': 1.7, 'kernel': 'sigmoid'}
0.962312 (0.034665) with: {'C': 1.8, 'kernel': 'linear'}
0.907035 (0.042054) with: {'C': 1.8, 'kernel': 'poly'}
0.967337 (0.025401) with: {'C': 1.8, 'kernel': 'rbf'}
0.934673 (0.039574) with: {'C': 1.8, 'kernel': 'sigmoid'}
0.962312 (0.034665) with: {'C': 1.9, 'kernel': 'linear'}
0.907035 (0.042054) with: {'C': 1.9, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 1.9, 'kernel': 'rbf'}
0.937186 (0.041050) with: {'C': 1.9, 'kernel': 'sigmoid'}
0.962312 (0.034665) with: {'C': 2.0, 'kernel': 'linear'}
0.909548 (0.037588) with: {'C': 2.0, 'kernel': 'poly'}
0.969849 (0.027207) with: {'C': 2.0, 'kernel': 'rbf'}
0.937186 (0.042723) with: {'C': 2.0, 'kernel': 'sigmoid'}
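One caveat about the search above: the scaler is fitted on all of `x_train` before cross-validation, so each validation fold has already influenced the scaling. Here is a hedged sketch of an alternative that puts the scaler inside a `Pipeline` so it is refit on each training fold. It uses scikit-learn's bundled copy of the Wisconsin breast cancer data as a stand-in for `data.csv`, a smaller grid to keep it quick, and an arbitrary `random_state`. All of these are assumptions, not the setup used in this post:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 569 samples, 30 features (labels are 0/1 instead of M/B).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# The scaler is a pipeline step, so it is fitted only on each training fold.
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "svm__C": [0.1, 0.5, 1.0, 2.0],
    "svm__kernel": ["linear", "rbf"],
}
kfold = KFold(n_splits=10, shuffle=True, random_state=121)
grid = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=kfold)
grid.fit(X_train, y_train)

print("Best: %f using %s" % (grid.best_score_, grid.best_params_))
print("Test accuracy: %f" % grid.score(X_test, y_test))
```

Because `GridSearchCV` refits the best pipeline on the whole training set by default, `grid.score` and `grid.predict` can be used on the test data directly, without a separate scaling step.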

We have found the best parameters. Now let's apply them and evaluate the model on our test data.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

scaler = StandardScaler().fit(x_train)
scaledx = scaler.transform(x_train)
model = SVC(C=0.1, kernel='linear')
start = time.time()
model.fit(scaledx, y_train)
end = time.time()
print("Run time: %f" % (end-start))

scaledx = scaler.transform(x_test)
y_predicted = model.predict(scaledx)
    
print("Accuracy score: %f" % accuracy_score(y_test, y_predicted))
print(classification_report(y_test, y_predicted))
print(confusion_matrix(y_test, y_predicted))

#Output
Run time: 0.003853
Accuracy score: 0.988304
             precision    recall  f1-score   support

          B       0.99      0.99      0.99       103
          M       0.99      0.99      0.99        68

avg / total       0.99      0.99      0.99       171

[[102   1]
 [  1  67]]
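The numbers in the report can be recomputed from the confusion matrix alone. scikit-learn puts true labels on the rows and predicted labels on the columns, in sorted label order (B, then M), so a short check looks like this:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
matrix = [[102, 1],   # true B: 102 predicted B, 1 predicted M
          [1, 67]]    # true M: 1 predicted B, 67 predicted M

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Precision and recall for the malignant class (M).
precision_m = matrix[1][1] / (matrix[0][1] + matrix[1][1])
recall_m = matrix[1][1] / (matrix[1][0] + matrix[1][1])

print("Accuracy: %f" % accuracy)
print("Precision (M): %.2f, Recall (M): %.2f" % (precision_m, recall_m))
```

The bottom-left cell, a malignant case predicted as benign, is usually the error to watch most closely in a diagnosis setting, and accuracy alone does not single it out.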

Reference

https://datavizcatalogue.com/methods/density_plot.html