Machine Learning with Random Forest Algorithm
What is Random Forest?
Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
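As a quick sanity check of that definition, the following minimal sketch (using a synthetic dataset from scikit-learn, so the data itself is arbitrary) confirms that a RandomForestRegressor's prediction is just the average of its individual trees' predictions:
# Minimal sketch: a forest's regression output is the mean of its trees' outputs
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
print(np.allclose(forest.predict(X_demo), tree_preds.mean(axis=0)))  # expected: True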
Advantages of using Random Forest
Here are some advantages of the random forest algorithm:
- Works for both classification and regression tasks
- Handles missing values and maintains accuracy when data is missing
- Is far less prone to overfitting than a single decision tree
- Handles large datasets with high dimensionality
Disadvantages of using Random Forest
- Good at classification, but not as strong for regression
- You have very little control over what the model does (it is largely a black box)
Two major beliefs
- Most of the trees give a correct class prediction for most of the data
- The trees make their mistakes in different places. That is why, if we hold a vote for each observation and decide its class from the poll result, the outcome is expected to be closer to the correct answer, as illustrated in the sketch below.
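To make the voting idea concrete, here is a small self-contained sketch (the iris dataset and the number of trees are arbitrary illustrative choices) that trains several decision trees on bootstrap samples and combines them by majority vote. Scikit-learn's RandomForestClassifier refines this idea by also sampling features at each split and averaging class probabilities.
# Illustrative bagging + majority voting with plain decision trees
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    # Each tree sees a different bootstrap sample, so it makes mistakes in different places
    idx = rng.randint(0, len(X_tr), len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Majority vote across the individual trees
all_preds = np.stack([t.predict(X_te) for t in trees])                # shape: (n_trees, n_samples)
voted = np.array([np.bincount(col).argmax() for col in all_preds.T])  # most common class per sample
print((voted == y_te).mean())                                         # accuracy of the ensemble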
Examples
Let's work through an example using RandomForestRegressor. We'll try to predict wine quality based on its attributes (features). You can get the code and data from this repo: https://github.com/kchivorn/wine. First of all, we need to set up the environment. I assume you have Python 3 installed along with the necessary packages (numpy, pandas, and scikit-learn).
#Import numpy module
import numpy as np
#Import pandas
import pandas as pd
#Import sampling helper
from sklearn.model_selection import train_test_split
#Import preprocessing modules
from sklearn import preprocessing
#Import random forest model
from sklearn.ensemble import RandomForestRegressor
#Import cross-validation pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
#Import evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
#Import module for saving scikit-learn models
#(older scikit-learn used "from sklearn.externals import joblib", which has since been removed)
import joblib
#Load red wine data
data = pd.read_csv('winequality-red.csv', sep=';')
#Output the first 5 rows of data
print(data.head())
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
3 17.0 60.0 0.9980 3.16 0.58
4 11.0 34.0 0.9978 3.51 0.56
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
print(data.shape)
(1599, 12)
print(data.describe())
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free sulfur dioxide total sulfur dioxide density \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 0.087467 15.874922 46.467792 0.996747
std 0.047065 10.460157 32.895324 0.001887
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 22.000000 0.995600
50% 0.079000 14.000000 38.000000 0.996750
75% 0.090000 21.000000 62.000000 0.997835
max 0.611000 72.000000 289.000000 1.003690
pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 3.311113 0.658149 10.422983 5.636023
std 0.154386 0.169507 1.065668 0.807569
min 2.740000 0.330000 8.400000 3.000000
25% 3.210000 0.550000 9.500000 5.000000
50% 3.310000 0.620000 10.200000 6.000000
75% 3.400000 0.730000 11.100000 6.000000
max 4.010000 2.000000 14.900000 8.000000
#Separate target from training features
y = data.quality
X = data.drop('quality', axis=1)
#Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=123,
stratify=y)
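Because we passed stratify=y, the distribution of quality scores should be roughly the same in both splits; one quick way to confirm this is to compare the class proportions:
# Optional check: stratified splitting keeps the quality distribution similar in both splits
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())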
# Fit the scaler (Transformer API) on the training data only
scaler = preprocessing.StandardScaler().fit(X_train)
#Applying transformer to training data: the scaled features have mean ~0 and std ~1
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
#Applying the same train-fitted transformer to the test data (avoids data leakage)
X_test_scaled = scaler.transform(X_test)
#The test means and stds will be close to, but not exactly, 0 and 1
print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))
#Pipeline with preprocessing and model
pipeline = make_pipeline(preprocessing.StandardScaler(),
RandomForestRegressor(n_estimators=100))
#List tunable hyperparameters
print(pipeline.get_params())
#Declare hyperparameters to tune (keys follow the pipeline's 'stepname__parameter' naming convention)
#Note: 'auto' for max_features has been removed in newer scikit-learn releases; 'sqrt' and 'log2' still work
hyperparameters = { 'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                    'randomforestregressor__max_depth': [None, 5, 3, 1] }
# Add sklearn cross-validation with pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
# Fit and tune model
clf.fit(X_train, y_train)
# print out the best params
print(clf.best_params_)
# {'randomforestregressor__max_depth': None, 'randomforestregressor__max_features':
#Confirm model will be retrained
print(clf.refit)
# True
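Besides best_params_, GridSearchCV also exposes the mean cross-validated score of the winning combination and the pipeline that was refitted on the full training set:
# Mean cross-validated R^2 of the best hyperparameter combination
print(clf.best_score_)
# The pipeline refitted on all of the training data with the best hyperparameters
print(clf.best_estimator_)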
#Predict a new set of data
y_pred = clf.predict(X_test)
print(y_pred)
[ 6.35 5.77 4.86 5.36 6.31 5.47 5. 4.71 5.01 6.02 5.24 5.68
5.87 5.08 5.85 5.7 6.48 5.75 5.67 7. 5.55 5.61 5.06 6.08
6. 5.01 5.48 5.22 5.97 5.98 5.91 6.47 5.96 5.07 5. 5.95
5.03 6. 4.97 5.97 4.92 5.91 6.71 5.17 6.17 5.27 5.52 5.64
5.11 6.45 6.06 5.19 5.85 5.14 5.64 5.7 5.36 5.39 4.94 5.25
5.38 5.21 5.04 5.83 6.04 5.28 6.38 5.04 5.19 6.66 5.74 5.76
5.11 5.03 5.45 5.98 5.29 5.17 5.25 5.32 6.32 5.63 6.12 6.29
5.12 6.01 6.39 6.38 5.69 5.92 5.93 5.28 6.45 5.79 5.59 5.86
6.79 6.71 5.64 6.8 5.1 5.34 5.1 6.43 5.06 4.72 5.74 5.04
5.63 6.1 5.8 5.51 6.06 5.44 5.19 5.21 5.92 5.06 5.02 6.04
5.9 5.09 5.72 6.21 5.32 5.4 5.27 5.97 5.42 5.44 5.92 6.18
5.18 5.34 5.03 6.44 5. 5.11 6.72 5.36 5.18 5.08 5.75 6.04
5.38 5.31 5.11 6.47 5.84 5.03 5.58 5.11 4.9 5.01 5.2 5.94
5.45 5.71 5.76 5.27 5.46 5.3 5.26 5.87 5.01 5.93 5.13 5.4
5.49 5.02 5.9 5.08 5.71 5.08 5.57 5.5 5.02 5.36 5.55 5.1
5.99 5.58 4.99 5. 5.16 6.16 5.29 5.63 5.32 4.93 5.28 6.63
5.71 5.94 5.36 5.24 5.49 5.1 6.19 4.71 6.32 5.07 5.29 5.25
6.81 6.08 5.14 5.22 5.39 5.92 5.78 6.04 5.96 6.3 5.71 5.95
5.21 5.25 5.69 5.26 5.2 6.02 6.17 5.54 5.97 5.89 5.53 6.25
5.37 6.04 5.45 5.52 6.22 5.7 4.91 4.39 6.74 6.43 6.28 5.36
5.42 5.47 5.36 6.17 6. 5.14 5.12 5.36 5.15 6.32 5.22 5.04
5.18 5.15 5.91 6.34 5.74 5.38 5.43 6.46 5.49 6.02 5.33 5.19
5.74 5.88 5.8 5.54 5.41 5.05 5.71 5.41 6.55 6.15 5.69 4.96
5.95 6.41 5.99 5.45 5.74 5.33 5.34 5.94 6.87 5.32 6.36 5.92
5.35 5.52 5.64 5.14 5.12 6.3 5.81 5.92 5.95 5.91 5.31 5.66
5.52 6.09 5.67 6.84 6.91 5.88 6.27 5.04 5.31 5.95 5.37 5.35
5.97 6.58 6.48 5.27 5.56 5.69 6.13 5.46]
#Evaluate the model with R^2 and mean squared error on the test set
print(r2_score(y_test, y_pred))
0.45044082571584243
print(mean_squared_error(y_test, y_pred))
0.35461593750000003
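To put these numbers in context, it helps to compare against a trivial baseline that always predicts the mean quality of the training set (the DummyRegressor used here is a standard scikit-learn utility, not part of the original walkthrough); its R^2 should be close to zero and its error larger than the forest's:
# Baseline for comparison: always predict the mean of the training targets
from sklearn.dummy import DummyRegressor
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
print(r2_score(y_test, baseline_pred))            # close to 0 by construction
print(mean_squared_error(y_test, baseline_pred))  # expected to be larger than the forest's MSE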
# Save model to a .pkl file
joblib.dump(clf, 'rf_regressor.pkl')
# Load model from .pkl file
clf2 = joblib.load('rf_regressor.pkl')
clf2.predict(X_test)
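To verify that the reloaded model behaves identically, you can score its predictions the same way:
# The reloaded model should reproduce the same predictions and metrics
print(r2_score(y_test, clf2.predict(X_test)))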
Conclusion
Well, the rule of thumb is that your very first model probably won't be the best possible model. However, a combination of three strategies can help you decide whether you're satisfied with your model's performance:
- Start with the goal of the model. If the model is tied to a business problem, have you successfully solved that problem?
- Look in the academic literature to get a sense of the current performance benchmarks for this type of data.
- Try to find low-hanging fruit: easy ways to improve your model, for example by widening the hyperparameter grid as sketched below.
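As an illustration of that last point (the exact values below are arbitrary choices, not tuned recommendations), you could search over a slightly larger grid and let GridSearchCV pick the best combination:
# Illustrative wider grid for the same pipeline
wider_hyperparameters = {
    'randomforestregressor__n_estimators': [100, 300, 500],
    'randomforestregressor__max_features': ['sqrt', 'log2', None],
    'randomforestregressor__max_depth': [None, 5, 10],
    'randomforestregressor__min_samples_leaf': [1, 2, 5],
}
clf_wide = GridSearchCV(pipeline, wider_hyperparameters, cv=10, n_jobs=-1)
clf_wide.fit(X_train, y_train)
print(clf_wide.best_params_, clf_wide.best_score_)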