Continue with Machine Learning - Linear Regression
This post has not been updated in 6 years.
In this post we'll use some financial data from Quandl to apply linear regression. Quandl describes itself as:
The premier source for financial, economic, and alternative datasets, serving investment professionals. Quandl’s platform is used by over 250,000 people, including analysts from the world’s top hedge funds, asset managers and investment banks.
Set up
pip install sklearn # machine learning library (installs scikit-learn)
pip install quandl  # library for loading financial data
pip install pandas  # library for working with dataframes
Definition
So what is linear regression?
In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. Linear regression maps continuous data from X to Y: first it learns from known data to build a model, then uses that model to predict an unknown y for a new X.
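As a minimal illustration of that learn-then-predict idea (the data here is synthetic, my own example rather than anything from the post), scikit-learn's `LinearRegression` can recover a known line from noisy samples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly 3x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=100)

model = LinearRegression()
model.fit(X, y)                      # learn from known (X, y) pairs
prediction = model.predict([[5.0]])  # predict y for a new X
print(model.coef_[0], model.intercept_, prediction[0])
```

The fitted slope and intercept come out close to the true 3 and 2, and the prediction for X = 5 is close to 17.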
Import Data
Let's create a file called main.py with the following code:
import pandas as pd
import quandl
df = quandl.get('WIKI/GOOGL')
print(df.head())
We get a nice data below:
Open High Low Close Volume Ex-Dividend \
Date
2004-08-19 100.01 104.06 95.96 100.335 44659000.0 0.0
2004-08-20 101.01 109.08 100.50 108.310 22834300.0 0.0
2004-08-23 110.76 113.48 109.05 109.400 18256100.0 0.0
2004-08-24 111.24 111.60 103.57 104.870 15247300.0 0.0
2004-08-25 104.76 108.00 103.88 106.000 9188600.0 0.0
Split Ratio Adj. Open Adj. High Adj. Low Adj. Close \
Date
2004-08-19 1.0 50.159839 52.191109 48.128568 50.322842
2004-08-20 1.0 50.661387 54.708881 50.405597 54.322689
2004-08-23 1.0 55.551482 56.915693 54.693835 54.869377
2004-08-24 1.0 55.792225 55.972783 51.945350 52.597363
2004-08-25 1.0 52.542193 54.167209 52.100830 53.164113
Adj. Volume
Date
2004-08-19 44659000.0
2004-08-20 22834300.0
2004-08-23 18256100.0
2004-08-24 15247300.0
2004-08-25 9188600.0
Close, Open, Volume, etc. are all features. We won't use all of them for machine learning; some features are more useful than others. With some experience and observation, we can choose a few features to start learning with.
import pandas as pd
import quandl
df = quandl.get('WIKI/GOOGL')
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())
Adj. Close HL_PCT PCT_change Adj. Volume
Date
2004-08-19 50.322842 3.712563 0.324968 44659000.0
2004-08-20 54.322689 0.710922 7.227007 22834300.0
2004-08-23 54.869377 3.729433 -1.227880 18256100.0
2004-08-24 52.597363 6.417469 -5.726357 15247300.0
2004-08-25 53.164113 1.886792 1.183658 9188600.0
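As a quick sanity check (the numbers are copied from the 2004-08-19 row of the tables above), the two percent features can be recomputed by hand:

```python
# Values from the 2004-08-19 row: Adj. High, Adj. Close, Adj. Open
adj_high, adj_close, adj_open = 52.191109, 50.322842, 50.159839

hl_pct = (adj_high - adj_close) / adj_close * 100.0     # high-to-close spread, a volatility proxy
pct_change = (adj_close - adj_open) / adj_open * 100.0  # open-to-close daily move
print(hl_pct, pct_change)  # approximately 3.712563 and 0.324968, matching the table
```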
What we are trying to predict here is the Adj. Close price.
forecast_col = 'Adj. Close'
Let's also fill missing data with a placeholder value, since scikit-learn cannot work with NaN values.
df.fillna(-99999, inplace=True)
Let's forecast out 1 percent of the dataframe's length; that is, for each row, the label will be the Adj. Close price that far into the future.
import math

df.fillna(-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))  # forecast 1% of the dataset into the future
df['label'] = df[forecast_col].shift(-forecast_out)
df.dropna(inplace=True)
print(df.tail())
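To see what the `shift(-forecast_out)` trick is doing, here is a tiny hypothetical example with a 5-row price series and a forecast horizon of 2:

```python
import pandas as pd

# Hypothetical 5-day price series; forecast_out = 2 for the demo
prices = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0, 14.0]})
prices['label'] = prices['close'].shift(-2)  # label = close price 2 rows in the future
prices.dropna(inplace=True)                  # the last 2 rows have no future price yet
print(prices)
```

Each remaining row pairs today's features with a known future price, which is exactly the supervised-learning setup linear regression needs.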
Adj. Close HL_PCT PCT_change Adj. Volume label
Date
2017-10-27 1033.67 2.897443 0.259944 5139945.0 1085.09
2017-10-30 1033.13 0.648515 0.385751 2245352.0 1079.78
2017-10-31 1033.04 0.770541 0.003872 1490660.0 1073.56
2017-11-01 1042.60 0.504508 0.605990 2105729.0 1070.85
2017-11-02 1042.97 0.244494 0.286541 1233333.0 1068.86
We can see that the Adj. Close and label prices are pretty close to each other.
Let's get to real training and prediction.
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = np.array(df.drop(columns=['label']))
y = np.array(df['label'])
X = preprocessing.scale(X)  # standardize features to zero mean and unit variance

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LinearRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # R^2 on the held-out test set
print(accuracy)
0.973621902095
A high score! Note that for regression, score returns the coefficient of determination (R²) on the test set, not classification accuracy.
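The natural next step (not shown above) is to use the trained model to predict the rows at the end of the dataframe, the ones whose labels lie in the future. A self-contained sketch with synthetic stand-in data, since running the real pipeline needs a Quandl key; the variable names mirror the post but the numbers are my own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the scaled feature matrix and labels
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -0.5, 2.0, 0.3]) + rng.normal(0, 0.1, 200)

forecast_out = 2
X_train, X_lately = X[:-forecast_out], X[-forecast_out:]  # last rows: labels unknown
y_train = y[:-forecast_out]

clf = LinearRegression()
clf.fit(X_train, y_train)
forecast_set = clf.predict(X_lately)  # these would be the next forecast_out closing prices
print(forecast_set)
```

In the real script, `X_lately` would be the final `forecast_out` rows of the scaled feature matrix, sliced off before training since they have no label yet.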