Machine Learning with Random Forest Algorithm

What is Random Forest?

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
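
The aggregation step described above can be illustrated without any library. The sketch below uses hypothetical per-tree outputs (the labels and numbers are made up for illustration, not produced by a trained forest) and combines them by majority vote for classification and by averaging for regression. The helper names are assumptions, not part of any library.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the trees' votes (the mode).

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for a single observation.
class_votes = ["good", "bad", "good", "good", "bad"]
quality_predictions = [5.0, 6.0, 5.5, 6.5, 5.0]

forest_class = aggregate_classification(class_votes)
forest_value = aggregate_regression(quality_predictions)

print(forest_class)
print(forest_value)
```

A real forest does the same thing with hundreds of trees, each trained on a different bootstrap sample of the data.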

Advantages of using Random Forest

Here are some advantages of the random forest algorithm:

It applies to both classification and regression tasks
It can tolerate missing values and keep reasonable accuracy when part of the data is missing, depending on the implementation
It is less prone to overfitting than a single decision tree, because it averages many trees
It handles large data sets with high dimensionality

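Disadvantages of using Random Forest

It is good at classification but usually less good at regression, since its predictions are averages of training targets and cannot go beyond the range seen in training
You have very little control over what the model does; it behaves largely as a black box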

Two major beliefs

Most of the trees can provide a correct class prediction for most of the data
The trees make mistakes in different places

That is why, if we let the trees vote on each observation and then decide the class of the observation based on the poll result, the outcome is expected to be closer to the correct result.
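
The voting argument can be made concrete with a small calculation. The sketch below rests on idealized assumptions that real forests only approximate: a binary classification task, an odd number of trees, and trees whose errors are fully independent. The function name and the 0.6 per-tree accuracy are hypothetical and chosen only for illustration.

```python
from math import comb


def majority_vote_accuracy(n_trees, tree_accuracy):
    """Probability that more than half of n_trees independent trees are correct.

    Assumes a binary task and independent errors, so the number of correct
    trees follows a binomial distribution.
    """
    needed = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * tree_accuracy**k * (1 - tree_accuracy) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Each tree is right 60% of the time; the vote improves as trees are added.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

In practice the trees are correlated because they are trained on overlapping data, so the gain is smaller than this idealized figure. That is why random forests inject randomness (bootstrap samples and random feature subsets) to make the trees err in different places.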

Examples

Let's work through an example using RandomForestRegressor. We'll try to predict wine quality based on its attributes (features). You can get the code and data from this repo: https://github.com/kchivorn/wine

First of all, we need to set up the environment. I assume you have Python 3 and the necessary packages (numpy, pandas, scikit-learn, joblib) installed.

#Import numpy module
import numpy as np
#Import pandas
import pandas as pd
#Import sampling helper
from sklearn.model_selection import train_test_split
#Import preprocessing modules
from sklearn import preprocessing
#Import random forest model
from sklearn.ensemble import RandomForestRegressor
#Import cross-validation pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
#Import evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
#Import module for saving scikit-learn models
#(sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package)
import joblib

#Load red wine data
data = pd.read_csv('winequality-red.csv', sep=';')
#Output the first 5 rows of data
print(data.head())
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5  
3      9.8        6  
4      9.4        5 
print(data.shape)
(1599, 12)
print(data.describe())
       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000              6.000000     0.990070   
25%       0.070000             7.000000             22.000000     0.995600   
50%       0.079000            14.000000             38.000000     0.996750   
75%       0.090000            21.000000             62.000000     0.997835   
max       0.611000            72.000000            289.000000     1.003690   

                pH    sulphates      alcohol      quality  
count  1599.000000  1599.000000  1599.000000  1599.000000  
mean      3.311113     0.658149    10.422983     5.636023  
std       0.154386     0.169507     1.065668     0.807569  
min       2.740000     0.330000     8.400000     3.000000  
25%       3.210000     0.550000     9.500000     5.000000  
50%       3.310000     0.620000    10.200000     6.000000  
75%       3.400000     0.730000    11.100000     6.000000  
max       4.010000     2.000000    14.900000     8.000000

#Separate target from training features
y = data.quality
X = data.drop('quality', axis=1)

#Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=y)
                                                    
#Fit the scaler (Transformer API) on the training data only
scaler = preprocessing.StandardScaler().fit(X_train)

#Apply the scaler to the training data; means should be ~0 and standard deviations ~1
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

#Apply the same scaler to the test data; values will be close to, but not exactly, 0 and 1
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))

#Pipeline with preprocessing and model
pipeline = make_pipeline(preprocessing.StandardScaler(),
                          RandomForestRegressor(n_estimators=100))

#List tunable hyperparameters
print(pipeline.get_params())

#Declare hyperparameters to tune
#(max_features='auto' was removed in scikit-learn 1.3; 1.0 is its equivalent for regressors)
hyperparameters = { 'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
                  'randomforestregressor__max_depth': [None, 5, 3, 1] }

# Add sklearn cross-validation with pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
# Fit and tune model
clf.fit(X_train, y_train)

# print out the best params
print(clf.best_params_)
# {'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 
#Confirm model will be retrained
print(clf.refit)
# True
#Predict a new set of data
y_pred = clf.predict(X_test)
print(y_pred)
[ 6.35  5.77  4.86  5.36  6.31  5.47  5.    4.71  5.01  6.02  5.24  5.68
  5.87  5.08  5.85  5.7   6.48  5.75  5.67  7.    5.55  5.61  5.06  6.08
  6.    5.01  5.48  5.22  5.97  5.98  5.91  6.47  5.96  5.07  5.    5.95
  5.03  6.    4.97  5.97  4.92  5.91  6.71  5.17  6.17  5.27  5.52  5.64
  5.11  6.45  6.06  5.19  5.85  5.14  5.64  5.7   5.36  5.39  4.94  5.25
  5.38  5.21  5.04  5.83  6.04  5.28  6.38  5.04  5.19  6.66  5.74  5.76
  5.11  5.03  5.45  5.98  5.29  5.17  5.25  5.32  6.32  5.63  6.12  6.29
  5.12  6.01  6.39  6.38  5.69  5.92  5.93  5.28  6.45  5.79  5.59  5.86
  6.79  6.71  5.64  6.8   5.1   5.34  5.1   6.43  5.06  4.72  5.74  5.04
  5.63  6.1   5.8   5.51  6.06  5.44  5.19  5.21  5.92  5.06  5.02  6.04
  5.9   5.09  5.72  6.21  5.32  5.4   5.27  5.97  5.42  5.44  5.92  6.18
  5.18  5.34  5.03  6.44  5.    5.11  6.72  5.36  5.18  5.08  5.75  6.04
  5.38  5.31  5.11  6.47  5.84  5.03  5.58  5.11  4.9   5.01  5.2   5.94
  5.45  5.71  5.76  5.27  5.46  5.3   5.26  5.87  5.01  5.93  5.13  5.4
  5.49  5.02  5.9   5.08  5.71  5.08  5.57  5.5   5.02  5.36  5.55  5.1
  5.99  5.58  4.99  5.    5.16  6.16  5.29  5.63  5.32  4.93  5.28  6.63
  5.71  5.94  5.36  5.24  5.49  5.1   6.19  4.71  6.32  5.07  5.29  5.25
  6.81  6.08  5.14  5.22  5.39  5.92  5.78  6.04  5.96  6.3   5.71  5.95
  5.21  5.25  5.69  5.26  5.2   6.02  6.17  5.54  5.97  5.89  5.53  6.25
  5.37  6.04  5.45  5.52  6.22  5.7   4.91  4.39  6.74  6.43  6.28  5.36
  5.42  5.47  5.36  6.17  6.    5.14  5.12  5.36  5.15  6.32  5.22  5.04
  5.18  5.15  5.91  6.34  5.74  5.38  5.43  6.46  5.49  6.02  5.33  5.19
  5.74  5.88  5.8   5.54  5.41  5.05  5.71  5.41  6.55  6.15  5.69  4.96
  5.95  6.41  5.99  5.45  5.74  5.33  5.34  5.94  6.87  5.32  6.36  5.92
  5.35  5.52  5.64  5.14  5.12  6.3   5.81  5.92  5.95  5.91  5.31  5.66
  5.52  6.09  5.67  6.84  6.91  5.88  6.27  5.04  5.31  5.95  5.37  5.35
  5.97  6.58  6.48  5.27  5.56  5.69  6.13  5.46]

#Print evaluation metrics: R^2 score (higher is better) and mean squared error (lower is better)
print(r2_score(y_test, y_pred))
0.45044082571584243
print(mean_squared_error(y_test, y_pred))
0.35461593750000003
# Save model to a .pkl file
joblib.dump(clf, 'rf_regressor.pkl')
# Load model from .pkl file
clf2 = joblib.load('rf_regressor.pkl')
clf2.predict(X_test)
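
To make the two evaluation metrics printed above easier to interpret, here is a minimal sketch that computes them by hand. It follows the standard definitions that r2_score and mean_squared_error use by default; the function names and the toy values are hypothetical and are not taken from the wine data.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation in the targets."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total
    return 1 - ss_res / ss_tot


# Toy quality scores and predictions, made up for illustration.
toy_true = [5, 6, 5, 7]
toy_pred = [5.5, 6.0, 5.0, 6.5]

toy_mse = mean_squared_error_manual(toy_true, toy_pred)
toy_r2 = r2_score_manual(toy_true, toy_pred)
print(toy_mse)
print(toy_r2)
```

An R^2 of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets. Read this way, the score of about 0.45 above says the model explains a bit under half of the variation in quality on the test set.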

Conclusion

As a rule of thumb, your very first model probably won't be the best possible model. We recommend a combination of three strategies to decide whether you're satisfied with your model's performance. First, start with the goal of the model: if the model is tied to a business problem, have you successfully solved that problem? Second, look in the academic literature to get a sense of the current performance benchmarks for your type of data. Third, try to find low-hanging fruit in terms of ways to improve your model.