Machine Learning - Cleaning Data

We will explore stocks data and try to analyze them. The first piece of data we want to obtain is a snapshot of the stocks we want to analyze. One of the best ways to do this is to download data from one of the many stock screener applications that exist. But It is not possible to download it directly now. So I have try to scrape the data. Our favorite screener to download stock data from belongs to http://finviz.com.

Get the data

import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv
import pdb
import datetime
#pdb.set_trace() - python step by step debugger command
print datetime.datetime.now()
print "Finviz Overview Start"
url = "https://finviz.com/screener.ashx?v=152&c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
firstcount = soup.find_all('option')
lastnum = len(firstcount) - 1
lastpagenum = firstcount[lastnum].attrs['value']
currentpage = int(lastpagenum)

alldata = []
templist = []
# Overview = 111, Valuation = 121, Financial = 161, Ownership = 131, Performance = 141
#pagesarray = [111,121,161,131,141]
titleslist = soup.find_all('td',{"class" : "table-top"})
titleslisttickerid = soup.find_all('td',{"class" : "table-top-s"})
titleticker = titleslisttickerid[0].text
titlesarray = []
for title in titleslist:
    titlesarray.append(title.text)

titlesarray.insert(1,titleticker)
i = 0

while(currentpage > 0):
    i += 1
    print str(i) + " page(s) done"
    secondurl = "https://finviz.com/screener.ashx?v=152&c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68" + "&r=" + str(currentpage)
    secondresponse = requests.get(secondurl)
    secondhtml = secondresponse.content
    secondsoup = BeautifulSoup(secondhtml, "html.parser")
    stockdata = secondsoup.find_all('a', {"class" : "screener-link"})
    stockticker = secondsoup.find_all('a', {"class" : "screener-link-primary"})
    datalength = len(stockdata)
    #tickerdatalength = len(stockticker)

    j=0
    m=0
    while(j < datalength):
        templist = [stockdata[j+k].text for k in range(0, 68)]
        templist.insert(1, stockticker[m].text)
        alldata.append(templist)
        templist = []
        j += 68
        m += 1
    currentpage -= 20

with open('stockoverview.csv', 'wb') as csvfile:
    overview = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=titlesarray)
    overview.writeheader()

    for stock in alldata:
        df = dict()
        for i in range(0,69):
            df[titlesarray[i]] = stock[i]
        overview.writerow(df)

print datetime.datetime.now()
print "Finviz Overview Completed"

Explore and Clean Data

We can test and see the first 5 rows

import pandas as pd
import re

df = pd.read_csv('stockoverview.csv')
df = df.sort_values(df.columns[0])
df.to_csv('new_stockoverview.csv', sep=',', index=False)
df.head()

The result can look something like this:

No. Ticker                                 Company           Sector  \
0    1      A              Agilent Technologies, Inc.       Healthcare   
1    2     AA                       Alcoa Corporation  Basic Materials   
2    3   AAAP  Advanced Accelerator Applications S.A.       Healthcare   
3    4   AABA                             Altaba Inc.        Financial   
4    5    AAC                      AAC Holdings, Inc.       Healthcare   

                          Industry Country Market Cap    P/E Fwd P/E   PEG  \
0  Medical Laboratories & Research     USA     23.69B  35.12   25.71  3.07   
1                         Aluminum     USA      9.84B  45.99   14.14     -   
2            Diagnostic Substances  France      3.63B      -       -     -   
3                 Asset Management     USA     67.84B  41.57  146.81     -   
4      Specialized Health Services     USA    209.88M      -   16.85     -   

     ...       RSI from Open     Gap Recom Avg Volume Rel Volume  Price  \
0    ...     71.21    -0.11%   0.30%  1.70      1.85M       0.95  73.58   
1    ...     55.48     0.76%   0.78%  2.00      4.35M       0.58  53.30   
2    ...     69.52     0.45%  -0.09%  2.60    744.38K       0.01  82.32   
3    ...     67.92     0.88%   0.73%  2.00      9.34M       0.67  77.81   
4    ...     47.53    -1.32%   0.55%  1.70    169.23K       0.70   8.95   

   Change     Volume  Earnings  
0   0.19%  1,754,331  Nov 20/a  
1   1.54%  2,521,352  Jan 17/a  
2   0.37%      7,485  Feb 16/b  
3   1.62%  6,245,947         -  
4  -0.78%    117,818  Feb 26/a

As we can see the data is not so clean. Imported numeric data often contains special characters such as percentage signs, dollar signs, commas, and so on. Beside the data are all of type objects except the first column(int64). But in fact if we observe most values numeric.

print(new_df.info)
No.              7134 non-null int64
Ticker           7134 non-null object
Company          7134 non-null object
Sector           7134 non-null object
Industry         7134 non-null object
Country          7134 non-null object
Market Cap       7134 non-null object
P/E              7134 non-null object
Fwd P/E          7134 non-null object
PEG              7134 non-null object
P/S              7134 non-null object
P/B              7134 non-null object
P/C              7134 non-null object
P/FCF            7134 non-null object
Dividend         7134 non-null object
Payout Ratio     7134 non-null object
EPS              7134 non-null object
EPS this Y       7134 non-null object
EPS next Y       7134 non-null object
EPS past 5Y      7134 non-null object
EPS next 5Y      7134 non-null object
Sales past 5Y    7134 non-null object
EPS Q/Q          7134 non-null object
Sales Q/Q        7134 non-null object
Outstanding      7134 non-null object
Float            7134 non-null object
Insider Own      7134 non-null object
Insider Trans    7134 non-null object
Inst Own         7134 non-null object
Inst Trans       7134 non-null object
Float Short      7134 non-null object
Short Ratio      7134 non-null object
ROA              7134 non-null object
ROE              7134 non-null object
ROI              7134 non-null object
Curr R           7134 non-null object
Quick R          7134 non-null object
LTDebt/Eq        7134 non-null object
Debt/Eq          7134 non-null object
Gross M          7134 non-null object
Oper M           7134 non-null object
Profit M         7134 non-null object
Perf Week        7134 non-null object
Perf Month       7134 non-null object
Perf Quart       7134 non-null object
Perf Half        7134 non-null object
Perf Year        7134 non-null object
Perf YTD         7134 non-null object
Beta             7134 non-null object
ATR              7134 non-null object
Volatility W     7134 non-null object
Volatility M     7134 non-null object
SMA20            7134 non-null object
SMA50            7134 non-null object
SMA200           7134 non-null object
50D High         7134 non-null object
50D Low          7134 non-null object
52W High         7134 non-null object
52W Low          7134 non-null object
RSI              7134 non-null object
from Open        7134 non-null object
Gap              7134 non-null object
Recom            7134 non-null object
Avg Volume       7134 non-null object
Rel Volume       7134 non-null object
Price            7134 non-null object
Change           7134 non-null object
Volume           7134 non-null int64
Earnings         7134 non-null object
dtypes: int64(2), object(67)
memory usage: 3.8+ MB
None

So let also convert some values to its appropriate type. Let's create function to clean data

def clean_numeric(s):
  return pd.to_numeric(re.sub(r'\%|\$|,|\)|\(', '', s), errors='coerce')

for i in new_df.columns[7:68]:
  new_df[i] = new_df[i].apply(clean_numeric)

print(new_df.head())
print new_df.info()
print new_df.describe()

Now what we get is a cleaner data make further analysis:

[5 rows x 69 columns]
   No. Ticker                                 Company           Sector  \
0    1      A              Agilent Technologies, Inc.       Healthcare   
1    2     AA                       Alcoa Corporation  Basic Materials   
2    3   AAAP  Advanced Accelerator Applications S.A.       Healthcare   
3    4   AABA                             Altaba Inc.        Financial   
4    5    AAC                      AAC Holdings, Inc.       Healthcare   

                          Industry Country Market Cap    P/E  Fwd P/E   PEG  \
0  Medical Laboratories & Research     USA     23.69B  35.12    25.71  3.07   
1                         Aluminum     USA      9.84B  45.99    14.14   NaN   
2            Diagnostic Substances  France      3.63B    NaN      NaN   NaN   
3                 Asset Management     USA     67.84B  41.57   146.81   NaN   
4      Specialized Health Services     USA    209.88M    NaN    16.85   NaN   

     ...       RSI  from Open   Gap  Recom  Avg Volume  Rel Volume  Price  \
0    ...     71.21      -0.11  0.30    1.7         NaN        0.95  73.58   
1    ...     55.48       0.76  0.78    2.0         NaN        0.58  53.30   
2    ...     69.52       0.45 -0.09    2.6         NaN        0.01  82.32   
3    ...     67.92       0.88  0.73    2.0         NaN        0.67  77.81   
4    ...     47.53      -1.32  0.55    1.7         NaN        0.70   8.95   

   Change   Volume  Earnings  
0    0.19  1754331  Nov 20/a  
1    1.54  2521352  Jan 17/a  
2    0.37     7485  Feb 16/b  
3    1.62  6245947         -  
4   -0.78   117818  Feb 26/a

Let's also check print(df.info)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7134 entries, 0 to 7133
Data columns (total 69 columns):
No.              7134 non-null int64
Ticker           7134 non-null object
Company          7134 non-null object
Sector           7134 non-null object
Industry         7134 non-null object
Country          7134 non-null object
Market Cap       7134 non-null object
P/E              3040 non-null float64
Fwd P/E          3079 non-null float64
PEG              2252 non-null float64
P/S              4538 non-null float64
P/B              4440 non-null float64
P/C              4274 non-null float64
P/FCF            2684 non-null float64
Dividend         3714 non-null float64
Payout Ratio     2842 non-null float64
EPS              4732 non-null float64
EPS this Y       4560 non-null float64
EPS next Y       3833 non-null float64
EPS past 5Y      4304 non-null float64
EPS next 5Y      3236 non-null float64
Sales past 5Y    3859 non-null float64
EPS Q/Q          4469 non-null float64
Sales Q/Q        4450 non-null float64
Outstanding      0 non-null float64
Float            0 non-null float64
Insider Own      4548 non-null float64
Insider Trans    4246 non-null float64
Inst Own         4172 non-null float64
Inst Trans       4324 non-null float64
Float Short      4636 non-null float64
Short Ratio      6627 non-null float64
ROA              4331 non-null float64
ROE              4282 non-null float64
ROI              4244 non-null float64
Curr R           3523 non-null float64
Quick R          3452 non-null float64
LTDebt/Eq        4228 non-null float64
Debt/Eq          4228 non-null float64
Gross M          3464 non-null float64
Oper M           3841 non-null float64
Profit M         3817 non-null float64
Perf Week        7130 non-null float64
Perf Month       7110 non-null float64
Perf Quart       7011 non-null float64
Perf Half        6888 non-null float64
Perf Year        6671 non-null float64
Perf YTD         7131 non-null float64
Beta             3984 non-null float64
ATR              7131 non-null float64
Volatility W     7130 non-null float64
Volatility M     7110 non-null float64
SMA20            7131 non-null float64
SMA50            7131 non-null float64
SMA200           7131 non-null float64
50D High         7131 non-null float64
50D Low          7131 non-null float64
52W High         7131 non-null float64
52W Low          7131 non-null float64
RSI              7069 non-null float64
from Open        7131 non-null float64
Gap              7131 non-null float64
Recom            4131 non-null float64
Avg Volume       0 non-null float64
Rel Volume       7127 non-null float64
Price            7131 non-null float64
Change           7131 non-null float64
Volume           7134 non-null int64
Earnings         7134 non-null object
dtypes: float64(60), int64(2), object(7)
memory usage: 3.8+ MB

As we can see now there are many numeric types. Besides other disturbing symbols are removed. We can check the basic statistics of the data by using new_df.describe(). This methods only work on numeric column.

               No.           P/E       Fwd P/E          PEG            P/S  \
count  7134.000000   3040.000000   3079.000000  2252.000000    4538.000000   
mean   3567.500000     76.372526     40.255661     9.272615      69.915591   
std    2059.552743    632.983556    429.527497    74.322798    2049.459084   
min       1.000000      0.000000      1.070000     0.000000       0.000000   
25%    1784.250000     16.857500     13.550000     1.570000       1.010000   
50%    3567.500000     23.485000     18.150000     2.435000       2.465000   
75%    5350.750000     37.670000     26.160000     4.162500       5.470000   
max    7134.000000  22495.000000  23250.000000  2001.670000  132866.310000   

               P/B            P/C         P/FCF     Dividend  Payout Ratio  \
count  4440.000000    4274.000000   2684.000000  3714.000000   2842.000000   
mean      6.980770     125.173613     83.639676     2.948244     45.881175   
std      59.992237    2671.309330    938.556950     3.221843     83.578210   
min       0.000000       0.020000      0.050000     0.010000      0.000000   
25%       1.330000       4.532500     12.417500     1.190000      0.000000   
50%       2.180000      10.035000     22.415000     2.030000     23.100000   
75%       4.310000      27.540000     41.190000     3.410000     55.100000   
max    3028.000000  169596.620000  42543.750000    48.410000    979.100000   

           ...            52W Low          RSI    from Open          Gap  \
count      ...        7131.000000  7069.000000  7131.000000  7131.000000   
mean       ...          45.540217    58.054506    -0.464629     0.321847   
std        ...         148.317658    14.748561     2.131743     1.609517   
min        ...          -8.330000    11.580000   -23.530000   -26.350000   
25%        ...          13.850000    47.590000    -1.130000     0.000000   
50%        ...          27.670000    58.140000    -0.290000     0.240000   
75%        ...          49.580000    68.990000     0.110000     0.590000   
max        ...       11133.330000    98.110000    28.220000    70.000000   

             Recom  Avg Volume   Rel Volume          Price       Change  \
count  4131.000000         0.0  7127.000000    7131.000000  7131.000000   
mean      2.299371         NaN     1.150251      87.844182    -0.148512   
std       0.594775         NaN     2.038652    3831.215053     2.487509   
min       1.000000         NaN     0.000000       0.020000   -28.820000   
25%       2.000000         NaN     0.560000      11.915000    -0.910000   
50%       2.200000         NaN     0.860000      26.380000    -0.110000   
75%       2.700000         NaN     1.270000      50.190000     0.520000   
max       5.000000         NaN    81.610000  323480.000000    32.840000   

             Volume  
count  7.134000e+03  
mean   1.085803e+06  
std    4.414247e+06  
min    0.000000e+00  
25%    2.200200e+04  
50%    1.498810e+05  
75%    6.555612e+05  
max    1.671263e+08  

[8 rows x 62 columns]