# Continue with Machine Learning - Linear Regression

In this post we'll use some financial data to test and apply linear regression. We'll pull the data from Quandl, which is described as:

> The premier source for financial, economic, and alternative datasets, serving investment professionals. Quandl’s platform is used by over 250,000 people, including analysts from the world’s top hedge funds, asset managers and investment banks.

## Set up

``````
pip install scikit-learn  # machine learning library
pip install pandas        # DataFrame library
pip install quandl        # client for the Quandl data API
``````

## Definition

So what is linear regression?

In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. Linear regression maps continuous inputs X to a continuous output y. It first learns a model from known (X, y) pairs, then uses that model to predict the unknown y for a new X.
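As a minimal sketch of the "learn, then predict" idea, the snippet below fits a straight line to a handful of made-up points using the closed-form least-squares formulas for a single explanatory variable. The data and the function names `fit_line` and `predict` are assumptions chosen for illustration; the rest of this post uses scikit-learn instead.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Apply the fitted line to a new input."""
    return slope * x + intercept


# Known data: these made-up points lie exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
new_y = predict(slope, intercept, 5.0)
print(slope, intercept, new_y)
```

With more than one explanatory variable the same idea applies, but the fit is usually done with matrix algebra rather than these two scalar formulas.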

## Import Data

Let's create a file called `main.py` with the following code:

``````
import pandas as pd
import quandl

# Download the GOOGL price history from the WIKI dataset
df = quandl.get('WIKI/GOOGL')
print(df.head())
``````

The first five rows of the data look like this:

``````
              Open    High     Low    Close      Volume  Ex-Dividend  \
Date
2004-08-19  100.01  104.06   95.96  100.335  44659000.0          0.0
2004-08-20  101.01  109.08  100.50  108.310  22834300.0          0.0
2004-08-23  110.76  113.48  109.05  109.400  18256100.0          0.0
2004-08-24  111.24  111.60  103.57  104.870  15247300.0          0.0
2004-08-25  104.76  108.00  103.88  106.000   9188600.0          0.0

            Split Ratio  Adj. Open  Adj. High   Adj. Low  Adj. Close  \
Date
2004-08-19          1.0  50.159839  52.191109  48.128568   50.322842
2004-08-20          1.0  50.661387  54.708881  50.405597   54.322689
2004-08-23          1.0  55.551482  56.915693  54.693835   54.869377
2004-08-24          1.0  55.792225  55.972783  51.945350   52.597363
2004-08-25          1.0  52.542193  54.167209  52.100830   53.164113

            Adj. Volume
Date
2004-08-19   44659000.0
2004-08-20   22834300.0
2004-08-23   18256100.0
2004-08-24   15247300.0
2004-08-25    9188600.0
``````

`Close`, `Open`, `Volume` and the other columns are all features. We won't use every feature for machine learning, because some are more useful than others. With some experience and observation we can choose a few features to start with.

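``````
import pandas as pd
import quandl

df = quandl.get('WIKI/GOOGL')

# Derived features, reconstructed from the output below;
# the names HL_PCT and PCT_change are assumed
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

# Keep only the columns we will learn from
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date
2004-08-19   50.322842  3.712563    0.324968   44659000.0
2004-08-20   54.322689  0.710922    7.227007   22834300.0
2004-08-23   54.869377  3.729433   -1.227880   18256100.0
2004-08-24   52.597363  6.417469   -5.726357   15247300.0
2004-08-25   53.164113  1.886792    1.183658    9188600.0
``````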
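As a worked check, the two middle columns of this output are consistent with percentage spreads computed from the adjusted prices in the first table: the gap between the high and the close, and the change from open to close. The sketch below recomputes them for the first two rows. The names `HL_PCT` and `PCT_change` are illustrative labels, not something the data source defines, and the small frame `rows` is typed in by hand from the values printed earlier.

```python
import pandas as pd

# First two rows of adjusted prices, copied from the table printed earlier
rows = pd.DataFrame(
    {
        "Adj. Open": [50.159839, 50.661387],
        "Adj. High": [52.191109, 54.708881],
        "Adj. Close": [50.322842, 54.322689],
    },
    index=["2004-08-19", "2004-08-20"],
)

# Spread between the day's high and its close, as a percentage of the close
rows["HL_PCT"] = (rows["Adj. High"] - rows["Adj. Close"]) / rows["Adj. Close"] * 100.0

# Change from open to close, as a percentage of the open
rows["PCT_change"] = (rows["Adj. Close"] - rows["Adj. Open"]) / rows["Adj. Open"] * 100.0

print(rows[["HL_PCT", "PCT_change"]].round(4))
```

The results agree with the printed rows to about three decimal places; the small differences come from the inputs already being rounded.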

What we are trying to predict here is the `Adj. Close` price.

``````
forecast_col = 'Adj. Close'
``````

Let's also replace missing values with a placeholder, since scikit-learn's `LinearRegression` raises an error on input that contains `NaN`.

``````
df.fillna(-99999, inplace=True)
``````
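To see what `fillna` does in isolation, here is a small sketch on a hand-made frame. The frame `toy` and its values are hypothetical; only the placeholder value -99999 comes from the code above. This version returns a new frame instead of modifying the original in place, which makes the before and after easy to compare.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
toy = pd.DataFrame(
    {
        "Adj. Close": [50.3, np.nan, 54.9],
        "Adj. Volume": [100.0, 200.0, np.nan],
    }
)

# Replace every NaN with the placeholder; `toy` itself is left unchanged
filled = toy.fillna(-99999)

print(toy.isna().sum().sum())     # number of missing cells before
print(filled.isna().sum().sum())  # number of missing cells after
```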

Next we set the forecast horizon to 1 percent of the length of the DataFrame, and create a `label` column that holds the `Adj. Close` price that many rows into the future.

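``````
import math

df.fillna(-99999, inplace=True)

# Forecast horizon: 1% of the number of rows, rounded up
forecast_out = int(math.ceil(0.01 * len(df)))

# The label is the Adj. Close price forecast_out rows into the future
df['label'] = df[forecast_col].shift(-forecast_out)

# The last forecast_out rows have no future price yet, so drop them
df.dropna(inplace=True)
print(df.tail())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume    label
Date
2017-10-27     1033.67  2.897443    0.259944    5139945.0  1085.09
2017-10-30     1033.13  0.648515    0.385751    2245352.0  1079.78
2017-10-31     1033.04  0.770541    0.003872    1490660.0  1073.56
2017-11-01     1042.60  0.504508    0.605990    2105729.0  1070.85
2017-11-02     1042.97  0.244494    0.286541    1233333.0  1068.86
``````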
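The effect of `shift` with a negative argument is easier to see on a tiny example. In this sketch the six prices and the horizon of 2 rows are made up for illustration; the real code derives the horizon from the length of the DataFrame.

```python
import pandas as pd

# Hypothetical price series, one row per trading day
prices = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

forecast_out = 2  # assumed horizon, in rows

# Row i receives the price from row i + forecast_out
prices["label"] = prices["Adj. Close"].shift(-forecast_out)

# The last forecast_out rows have no future price, so their label is NaN
labeled = prices.dropna()
print(labeled)
```

Each remaining row now pairs today's price with the price two rows later, which is the value the model will learn to predict.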

We can see that the `Adj. Close` and `label` prices are fairly close to each other.

### Let's get to real training and prediction.

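``````
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Features are every column except the label; the label is the target
X = np.array(df.drop(columns=['label']))
y = np.array(df['label'])

# Standardize each feature to zero mean and unit variance
X = preprocessing.scale(X)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LinearRegression()
clf.fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) on the test set
accuracy = clf.score(X_test, y_test)
print(accuracy)
# 0.973621902095 (varies with the random split)
``````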
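The `preprocessing.scale` call standardizes each column so that features measured in very different units, such as a percentage and a share volume, end up on a comparable scale. The sketch below applies it to a small made-up matrix and checks the result against the textbook formula; the numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical feature matrix: two columns on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])

# Library version: per-column standardization
X_scaled = preprocessing.scale(X_demo)

# Same thing by hand: subtract the column mean, divide by the column standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```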

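The score is high. For a regressor, `score` returns the coefficient of determination (R²), so a value of about 0.97 means the model explains roughly 97% of the variance of the labels in the test set.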
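To make the meaning of that number concrete, the sketch below computes R² = 1 - SS_res / SS_tot by hand and compares it with what `score` returns. The synthetic data, the coefficients and the variable names are all assumptions for illustration; none of it comes from the stock data above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features, plus some noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=200)

# Simple split: first 160 rows for training, last 40 for testing
X_tr, X_te = X_syn[:160], X_syn[160:]
y_tr, y_te = y_syn[:160], y_syn[160:]

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_from_score = model.score(X_te, y_te)
print(r2_manual, r2_from_score)
```

A value near 1 means the predictions track the held-out labels closely, while a value near 0 means the model does no better than always predicting the mean.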