Logistic regression with Python
This will be my first post about machine learning using Python. The prediction model has already been built in https://github.com/datacamp/course-resources-ml-with-experts-budgets/blob/master/notebooks/1.0-full-model.ipynb , but that notebook can be overwhelming for most people, so this is my attempt to elaborate on the code it contains.
The dataset we are using can be obtained from https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/data/ . We'll run our prediction model on a local machine. The downloaded data is pretty large, so we have to find a way to randomly select a representative subset to use as the training set. First, let's download the data and load it into a DataFrame. Start by importing the necessary libraries: numpy (for working with numeric values) and pandas (for the DataFrame).
```Python
# ignore deprecation warnings in sklearn
import warnings
warnings.filterwarnings("ignore")

# matplotlib.pyplot is for plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# os and sys are needed for working with local paths
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# We will randomly select the most representative data for training.
# Each record in this dataset has multiple labels attached to it,
# so we need multilabel-aware sampling and train/test splitting.
from data.multilabel import multilabel_sample_dataframe, multilabel_train_test_split

# This transformer takes the interactions between features into account
# (see the sketch after this block).
from features.SparseInteractions import SparseInteractions

# This metric measures the error produced by our model,
# i.e. how accurate our model is.
from models.metrics import multi_multi_log_loss
```
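SparseInteractions is not part of scikit-learn; it is a small helper that lives in the repository's src/features folder and appends interaction terms to a sparse feature matrix. As a rough illustration of the idea only (made-up numbers, not the actual SparseInteractions implementation), degree-2 interactions add the product of every pair of features:

```Python
# a rough sketch of degree-2 feature interactions (illustration only,
# not the actual SparseInteractions implementation)
import numpy as np

row = np.array([2.0, 3.0, 5.0])             # original features x1, x2, x3

pairwise = [row[i] * row[j]                 # product of every pair of features
            for i in range(len(row))
            for j in range(i + 1, len(row))]

print(np.concatenate([row, pairwise]))      # [ 2.  3.  5.  6. 10. 15.]
```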
Next, read the data into a DataFrame.
```Python
path_to_training_data = os.path.join(os.pardir,
                                     'data',
                                     'TrainingSet.csv')

# set the first column as the index by which each row can be accessed
df = pd.read_csv(path_to_training_data, index_col=0)

# print the shape of the DataFrame
print(df.shape)
# (400277, 25)
# 400277 rows and 25 columns
```
This is too much data for our machine.
Resample the Data
After checking the data with EDA (Exploratory Data Analysis), we see that some features have numeric values and others have non-numeric values.
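As a quick illustration of that check (this snippet is not part of the original notebook), one way to see which columns pandas parsed as numbers and which as text is to look at the dtypes:

```Python
# a minimal EDA sketch: inspect column dtypes to separate numeric from text columns
print(df.dtypes.value_counts())

# columns pandas parsed as numbers
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# everything else: object columns holding the labels and the free-text features
text_cols = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```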
```Python
# the nine categorical label (target) columns
LABELS = ['Function',
          'Use',
          'Sharing',
          'Reporting',
          'Student_Type',
          'Position_Type',
          'Object_Type',
          'Pre_K',
          'Operating_Status']

# every other column is a feature (selected with a list comprehension)
NON_LABELS = [c for c in df.columns if c not in LABELS]

SAMPLE_SIZE = 40000

# encode the categorical labels as dummy (0/1) columns and draw a sample
# in which every label value appears at least min_count times
sampling = multilabel_sample_dataframe(df,
                                       pd.get_dummies(df[LABELS]),
                                       size=SAMPLE_SIZE,
                                       min_count=25,
                                       seed=43)

# get dummy variables from the sample we just created
dummy_labels = pd.get_dummies(sampling[LABELS])

# split the sampled data into a training set and a test set
X_train, X_test, y_train, y_test = multilabel_train_test_split(sampling[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               min_count=3,
                                                               seed=43)
```
Dummy encoding turns each category of a variable into a binary column of its own, and for a given row exactly one of those columns equals 1.
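As a small illustration (the values in this toy frame are invented, not taken from the dataset), this is what pd.get_dummies does to a single categorical column:

```Python
# a minimal sketch of dummy encoding on a made-up 'Pre_K' column
import pandas as pd

toy = pd.DataFrame({'Pre_K': ['PreK', 'Non PreK', 'PreK']})
print(pd.get_dummies(toy, dtype=int))
#    Pre_K_Non PreK  Pre_K_PreK
# 0               0           1
# 1               1           0
# 2               0           1
```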
Create preprocessing tools
```Python
NUMERIC_COLUMNS = ['FTE', 'Total']

def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.

        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)

    # replace nans with blanks
    text_data.fillna("", inplace=True)

    # joins all of the text items in a row (axis=1) with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)
```
Create Function transformer
```Python
from sklearn.preprocessing import FunctionTransformer
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)
```
Combine text columns into one column
```Python
get_text_data.fit_transform(sampling.head(5))
```
With output
```
38     OTHER PURCHASED SERVICES SCHOOL-WIDE SCHOOL P...
70     Extra Duty Pay/Overtime For Support Personnel ...
198    Supplemental * Operation and Maintenance of P...
209    REPAIR AND MAINTENANCE SERVICES PUPIL TRANSPO...
614    GENERAL EDUCATION LOCAL EDUCATIONAL AIDE,70 H...
dtype: object
```
There are 2 numeric columns: "FTE" and "Total"
```Python
get_numeric_data.fit_transform(sampling.head(5))
```
With output
```
      FTE         Total
38    NaN    653.460000
70    NaN   2153.530000
198   NaN  -8291.860000
209   NaN    618.290000
614  0.71  21747.666875
```
Create function to evaluate model
```Python
from sklearn.metrics.scorer import make_scorer

log_loss_scorer = make_scorer(multi_multi_log_loss)
```
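multi_multi_log_loss comes from the repository's models/metrics.py and, roughly speaking, averages a log loss of this form over the groups of dummy columns belonging to each original label. As a sketch of the underlying idea only (not the actual metric code), log loss penalizes confident wrong probabilities much more heavily than hesitant ones:

```Python
# a rough sketch of the idea behind log loss (not the actual multi_multi_log_loss code)
import numpy as np

def simple_log_loss(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(simple_log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.8])))  # small loss: good predictions
print(simple_log_loss(np.array([1, 0, 1]), np.array([0.1, 0.9, 0.2])))  # large loss: confident but wrong
```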
We use a pipeline so that the output of one step can be fed directly as the input of the next step.
```Python
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
```
Here we use logistic regression for classification, wrapped in OneVsRestClassifier so that each label column gets its own classifier. We also normalize the features using MaxAbsScaler.
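As a quick aside (the toy array below is made up), MaxAbsScaler simply divides each feature by its maximum absolute value, so every value ends up in [-1, 1] and sparse entries stay zero:

```Python
# a minimal sketch of MaxAbsScaler on a made-up array:
# each column is divided by its maximum absolute value
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

toy = np.array([[1.0, -200.0],
                [2.0,   50.0],
                [4.0,  100.0]])
print(MaxAbsScaler().fit_transform(toy))
# [[ 0.25 -1.  ]
#  [ 0.5   0.25]
#  [ 1.    0.5 ]]
```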
```Python
%%time

# set a reasonable number of features before adding interactions
chi_k = 300

# create the pipeline object
pl = Pipeline([
    ('union', FeatureUnion(  # use FeatureUnion to combine the 2 transformers
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,  # vectorize with alphanumeric tokens
                                                 non_negative=True, norm=None, binary=False,
                                                 ngram_range=(1, 2))),
                ('dim_red', SelectKBest(chi2, chi_k))
            ]))
        ]
    )),
    ('int', SparseInteractions(degree=2)),
    ('scale', MaxAbsScaler()),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])

# fit the pipeline to our training data
pl.fit(X_train, y_train.values)

# print the score of our trained pipeline on our test set
print("Logloss score of trained pipeline: ", log_loss_scorer(pl, X_test, y_test.values))
```
Predict holdout set and write submission
```Python
path_to_holdout_data = os.path.join(os.pardir,
                                    'data',
                                    'TestSet.csv')

# Load holdout data
holdout = pd.read_csv(path_to_holdout_data, index_col=0)

# Make predictions
predictions = pl.predict_proba(holdout)

# Format correctly in new DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)

# Save prediction_df to csv called "predictions.csv"
prediction_df.to_csv("predictions.csv")
```