Continue with Machine Learning - Linear Regression

In this post we'll use some financial data to test and apply linear regression. The data comes from Quandl, which is described as:

The premier source for financial, economic, and alternative datasets, serving investment professionals. Quandl’s platform is used by over 250,000 people, including analysts from the world’s top hedge funds, asset managers and investment banks.

Set up

pip install scikit-learn # machine learning library
pip install quandl  # library for loading financial data
pip install pandas # library for working with DataFrames

Definition

So what is linear regression?

In statistics, linear regression is a linear approach to modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables, denoted X. Linear regression maps continuous data from X to y: it first learns from known data to build a model, then uses that model to predict the unknown y for a new X.
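
As a minimal sketch of that idea, the snippet below fits a straight line to a handful of made-up points with NumPy's least-squares solver and then predicts y for a new x. The data (points on y = 2x + 1) and the variable names are assumptions chosen for illustration; they are not part of the Quandl dataset.

```python
import numpy as np

# Made-up training data that lies exactly on the line y = 2x + 1
x_known = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_known = 2.0 * x_known + 1.0

# Design matrix with a column of ones so the model can learn an intercept
A = np.column_stack([x_known, np.ones_like(x_known)])

# Least-squares solution of A @ [slope, intercept] = y_known
(slope, intercept), *_ = np.linalg.lstsq(A, y_known, rcond=None)

# Use the fitted model to predict y for an x it has not seen
x_new = 10.0
y_pred = slope * x_new + intercept
print(slope, intercept, y_pred)
```

Because the toy points sit exactly on a line, the fit recovers a slope close to 2 and an intercept close to 1. Real data is noisy, so the fitted line only approximates it.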

Import Data

Let's create a file called main.py with the following code:

import pandas as pd
import quandl

df = quandl.get('WIKI/GOOGL')
print(df.head())

We get a nice table of data:

              Open    High     Low    Close      Volume  Ex-Dividend  \
Date                                                                   
2004-08-19  100.01  104.06   95.96  100.335  44659000.0          0.0   
2004-08-20  101.01  109.08  100.50  108.310  22834300.0          0.0   
2004-08-23  110.76  113.48  109.05  109.400  18256100.0          0.0   
2004-08-24  111.24  111.60  103.57  104.870  15247300.0          0.0   
2004-08-25  104.76  108.00  103.88  106.000   9188600.0          0.0   

            Split Ratio  Adj. Open  Adj. High   Adj. Low  Adj. Close  \
Date                                                                   
2004-08-19          1.0  50.159839  52.191109  48.128568   50.322842   
2004-08-20          1.0  50.661387  54.708881  50.405597   54.322689   
2004-08-23          1.0  55.551482  56.915693  54.693835   54.869377   
2004-08-24          1.0  55.792225  55.972783  51.945350   52.597363   
2004-08-25          1.0  52.542193  54.167209  52.100830   53.164113   

            Adj. Volume  
Date                     
2004-08-19   44659000.0  
2004-08-20   22834300.0  
2004-08-23   18256100.0  
2004-08-24   15247300.0  
2004-08-25    9188600.0 

Close, Open, Volume and so on are all features. We won't use every feature for machine learning, because some are more useful than others. With some experience and observation we can choose a few features to start with. Here we keep the adjusted prices and volume and derive two percentage features from them.

import pandas as pd
import quandl

df = quandl.get('WIKI/GOOGL')
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date                                                     
2004-08-19   50.322842  3.712563    0.324968   44659000.0
2004-08-20   54.322689  0.710922    7.227007   22834300.0
2004-08-23   54.869377  3.729433   -1.227880   18256100.0
2004-08-24   52.597363  6.417469   -5.726357   15247300.0
2004-08-25   53.164113  1.886792    1.183658    9188600.0
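
To make the two derived features concrete, here is a small sketch that applies the same formulas to a hand-made frame. The prices are hypothetical values picked so the arithmetic is easy to check, and the name toy is only for illustration.

```python
import pandas as pd

# Hypothetical prices for two trading days (not real market data)
toy = pd.DataFrame({
    'Adj. Open':  [100.0, 50.0],
    'Adj. High':  [110.0, 55.0],
    'Adj. Close': [105.0, 44.0],
})

# Distance from the close up to the day's high, as a percentage of the close
toy['HL_PCT'] = (toy['Adj. High'] - toy['Adj. Close']) / toy['Adj. Close'] * 100.0
# Move from open to close, as a percentage of the open
toy['PCT_change'] = (toy['Adj. Close'] - toy['Adj. Open']) / toy['Adj. Open'] * 100.0

print(toy[['HL_PCT', 'PCT_change']])
```

On the first day the close is 5 percent above the open. On the second day it is 12 percent below the open, and the high sits 25 percent above the close.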

What we are trying to predict here is the Adj. Close price.

forecast_col = 'Adj. Close'

Let's also replace any missing values with a sentinel number, since scikit-learn's estimators cannot work with NaN.

df.fillna(-99999, inplace=True)
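
A quick way to see what this call does is to run it on a tiny frame with one missing cell. The frame below is hypothetical; only the fillna call itself mirrors the line above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing volume value
toy = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0],
                    'Adj. Volume': [1000.0, np.nan, 3000.0]})

# Count missing cells before filling
missing_before = int(toy.isna().sum().sum())

# Replace every NaN with the sentinel value, modifying the frame in place
toy.fillna(-99999, inplace=True)

# Count missing cells after filling
missing_after = int(toy.isna().sum().sum())
print(missing_before, missing_after)
```

The sentinel is far outside the normal range of the data, so filled rows behave like extreme outliers rather than being silently dropped.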

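Next we decide how far ahead to forecast. forecast_out is 1 percent of the length of the dataframe, counted in rows (trading days), and the label for each row is the Adj. Close price forecast_out rows later.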

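import math

# Forecast horizon: 1% of the number of rows, rounded up
forecast_out = int(math.ceil(0.01 * len(df)))
# Each row's label is the Adj. Close price forecast_out rows later
df['label'] = df[forecast_col].shift(-forecast_out)
# The last forecast_out rows have no future price yet, so drop them
df.dropna(inplace=True)
print(df.tail())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume    label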
Date                                                              
2017-10-27     1033.67  2.897443    0.259944    5139945.0  1085.09
2017-10-30     1033.13  0.648515    0.385751    2245352.0  1079.78
2017-10-31     1033.04  0.770541    0.003872    1490660.0  1073.56
2017-11-01     1042.60  0.504508    0.605990    2105729.0  1070.85
2017-11-02     1042.97  0.244494    0.286541    1233333.0  1068.86

The label column holds the Adj. Close price forecast_out trading days after each row's date, so the two columns stay fairly close to each other.
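
The shift step is easier to follow on a few rows. The sketch below uses six hypothetical closing prices and an assumed horizon of 2 rows instead of 1 percent of the data.

```python
import pandas as pd

# Hypothetical closing prices for six days
toy = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
forecast_out = 2  # assumed horizon for this toy example

# Each row's label is the price forecast_out rows later
toy['label'] = toy['Adj. Close'].shift(-forecast_out)
# The last forecast_out rows have no future price, so their label is NaN
toy.dropna(inplace=True)
print(toy)
```

Four rows remain, and each label equals the Adj. Close value found two rows further down in the original frame.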

Let's get to the actual training and prediction.

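import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Features are every column except the label
X = np.array(df.drop(columns=['label']))
y = np.array(df['label'])

# Standardize each feature to zero mean and unit variance
X = preprocessing.scale(X)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LinearRegression()
clf.fit(X_train, y_train)
# For a regressor, score() returns R^2, the coefficient of determination
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.973621902095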

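That is a high score. For LinearRegression, score returns the coefficient of determination (R squared) rather than a classification accuracy, so a value near 1 means the model explains most of the variance in the test labels. The exact number changes from run to run because train_test_split shuffles the rows randomly.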
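
To show what the score measures, the sketch below fits LinearRegression on synthetic data and compares the value returned by score with R squared computed by hand. The data, the seed, and the names r2_manual, ss_res and ss_tot are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on x plus a little noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0.0, 1.0, size=200)

clf = LinearRegression().fit(X, y)
y_pred = clf.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, clf.score(X, y))
```

The two printed numbers should agree, which confirms that score is measuring explained variance and not the fraction of exactly correct predictions.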