Machine Learning - Cleaning Data
Bài đăng này đã không được cập nhật trong 6 năm
We will explore stocks data and try to analyze them. The first piece of data we want to obtain is a snapshot of the stocks we want to analyze. One of the best ways to do this is to download data from one of the many stock screener applications that exist. But It is not possible to download it directly now. So I have try to scrape the data. Our favorite screener to download stock data from belongs to http://finviz.com.
Get the data
import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv
import pdb
import datetime
#pdb.set_trace() - python step by step debugger command
print datetime.datetime.now()
print "Finviz Overview Start"
url = "https://finviz.com/screener.ashx?v=152&c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
firstcount = soup.find_all('option')
lastnum = len(firstcount) - 1
lastpagenum = firstcount[lastnum].attrs['value']
currentpage = int(lastpagenum)
alldata = []
templist = []
# Overview = 111, Valuation = 121, Financial = 161, Ownership = 131, Performance = 141
#pagesarray = [111,121,161,131,141]
titleslist = soup.find_all('td',{"class" : "table-top"})
titleslisttickerid = soup.find_all('td',{"class" : "table-top-s"})
titleticker = titleslisttickerid[0].text
titlesarray = []
for title in titleslist:
titlesarray.append(title.text)
titlesarray.insert(1,titleticker)
i = 0
while(currentpage > 0):
i += 1
print str(i) + " page(s) done"
secondurl = "https://finviz.com/screener.ashx?v=152&c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68" + "&r=" + str(currentpage)
secondresponse = requests.get(secondurl)
secondhtml = secondresponse.content
secondsoup = BeautifulSoup(secondhtml, "html.parser")
stockdata = secondsoup.find_all('a', {"class" : "screener-link"})
stockticker = secondsoup.find_all('a', {"class" : "screener-link-primary"})
datalength = len(stockdata)
#tickerdatalength = len(stockticker)
j=0
m=0
while(j < datalength):
templist = [stockdata[j+k].text for k in range(0, 68)]
templist.insert(1, stockticker[m].text)
alldata.append(templist)
templist = []
j += 68
m += 1
currentpage -= 20
with open('stockoverview.csv', 'wb') as csvfile:
overview = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=titlesarray)
overview.writeheader()
for stock in alldata:
df = dict()
for i in range(0,69):
df[titlesarray[i]] = stock[i]
overview.writerow(df)
print datetime.datetime.now()
print "Finviz Overview Completed"
Explore and Clean Data
We can test and see the first 5 rows
import pandas as pd
import re
df = pd.read_csv('stockoverview.csv')
df = df.sort_values(df.columns[0])
df.to_csv('new_stockoverview.csv', sep=',', index=False)
df.head()
The result can look something like this:
No. Ticker Company Sector \
0 1 A Agilent Technologies, Inc. Healthcare
1 2 AA Alcoa Corporation Basic Materials
2 3 AAAP Advanced Accelerator Applications S.A. Healthcare
3 4 AABA Altaba Inc. Financial
4 5 AAC AAC Holdings, Inc. Healthcare
Industry Country Market Cap P/E Fwd P/E PEG \
0 Medical Laboratories & Research USA 23.69B 35.12 25.71 3.07
1 Aluminum USA 9.84B 45.99 14.14 -
2 Diagnostic Substances France 3.63B - - -
3 Asset Management USA 67.84B 41.57 146.81 -
4 Specialized Health Services USA 209.88M - 16.85 -
... RSI from Open Gap Recom Avg Volume Rel Volume Price \
0 ... 71.21 -0.11% 0.30% 1.70 1.85M 0.95 73.58
1 ... 55.48 0.76% 0.78% 2.00 4.35M 0.58 53.30
2 ... 69.52 0.45% -0.09% 2.60 744.38K 0.01 82.32
3 ... 67.92 0.88% 0.73% 2.00 9.34M 0.67 77.81
4 ... 47.53 -1.32% 0.55% 1.70 169.23K 0.70 8.95
Change Volume Earnings
0 0.19% 1,754,331 Nov 20/a
1 1.54% 2,521,352 Jan 17/a
2 0.37% 7,485 Feb 16/b
3 1.62% 6,245,947 -
4 -0.78% 117,818 Feb 26/a
As we can see the data is not so clean. Imported numeric data often contains special characters such as percentage signs, dollar signs, commas, and so on. Beside the data are all of type objects except the first column(int64). But in fact if we observe most values numeric.
print(new_df.info)
No. 7134 non-null int64
Ticker 7134 non-null object
Company 7134 non-null object
Sector 7134 non-null object
Industry 7134 non-null object
Country 7134 non-null object
Market Cap 7134 non-null object
P/E 7134 non-null object
Fwd P/E 7134 non-null object
PEG 7134 non-null object
P/S 7134 non-null object
P/B 7134 non-null object
P/C 7134 non-null object
P/FCF 7134 non-null object
Dividend 7134 non-null object
Payout Ratio 7134 non-null object
EPS 7134 non-null object
EPS this Y 7134 non-null object
EPS next Y 7134 non-null object
EPS past 5Y 7134 non-null object
EPS next 5Y 7134 non-null object
Sales past 5Y 7134 non-null object
EPS Q/Q 7134 non-null object
Sales Q/Q 7134 non-null object
Outstanding 7134 non-null object
Float 7134 non-null object
Insider Own 7134 non-null object
Insider Trans 7134 non-null object
Inst Own 7134 non-null object
Inst Trans 7134 non-null object
Float Short 7134 non-null object
Short Ratio 7134 non-null object
ROA 7134 non-null object
ROE 7134 non-null object
ROI 7134 non-null object
Curr R 7134 non-null object
Quick R 7134 non-null object
LTDebt/Eq 7134 non-null object
Debt/Eq 7134 non-null object
Gross M 7134 non-null object
Oper M 7134 non-null object
Profit M 7134 non-null object
Perf Week 7134 non-null object
Perf Month 7134 non-null object
Perf Quart 7134 non-null object
Perf Half 7134 non-null object
Perf Year 7134 non-null object
Perf YTD 7134 non-null object
Beta 7134 non-null object
ATR 7134 non-null object
Volatility W 7134 non-null object
Volatility M 7134 non-null object
SMA20 7134 non-null object
SMA50 7134 non-null object
SMA200 7134 non-null object
50D High 7134 non-null object
50D Low 7134 non-null object
52W High 7134 non-null object
52W Low 7134 non-null object
RSI 7134 non-null object
from Open 7134 non-null object
Gap 7134 non-null object
Recom 7134 non-null object
Avg Volume 7134 non-null object
Rel Volume 7134 non-null object
Price 7134 non-null object
Change 7134 non-null object
Volume 7134 non-null int64
Earnings 7134 non-null object
dtypes: int64(2), object(67)
memory usage: 3.8+ MB
None
So let also convert some values to its appropriate type. Let's create function to clean data
def clean_numeric(s):
return pd.to_numeric(re.sub(r'\%|\$|,|\)|\(', '', s), errors='coerce')
for i in new_df.columns[7:68]:
new_df[i] = new_df[i].apply(clean_numeric)
print(new_df.head())
print new_df.info()
print new_df.describe()
Now what we get is a cleaner data make further analysis:
[5 rows x 69 columns]
No. Ticker Company Sector \
0 1 A Agilent Technologies, Inc. Healthcare
1 2 AA Alcoa Corporation Basic Materials
2 3 AAAP Advanced Accelerator Applications S.A. Healthcare
3 4 AABA Altaba Inc. Financial
4 5 AAC AAC Holdings, Inc. Healthcare
Industry Country Market Cap P/E Fwd P/E PEG \
0 Medical Laboratories & Research USA 23.69B 35.12 25.71 3.07
1 Aluminum USA 9.84B 45.99 14.14 NaN
2 Diagnostic Substances France 3.63B NaN NaN NaN
3 Asset Management USA 67.84B 41.57 146.81 NaN
4 Specialized Health Services USA 209.88M NaN 16.85 NaN
... RSI from Open Gap Recom Avg Volume Rel Volume Price \
0 ... 71.21 -0.11 0.30 1.7 NaN 0.95 73.58
1 ... 55.48 0.76 0.78 2.0 NaN 0.58 53.30
2 ... 69.52 0.45 -0.09 2.6 NaN 0.01 82.32
3 ... 67.92 0.88 0.73 2.0 NaN 0.67 77.81
4 ... 47.53 -1.32 0.55 1.7 NaN 0.70 8.95
Change Volume Earnings
0 0.19 1754331 Nov 20/a
1 1.54 2521352 Jan 17/a
2 0.37 7485 Feb 16/b
3 1.62 6245947 -
4 -0.78 117818 Feb 26/a
Let's also check print(df.info)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7134 entries, 0 to 7133
Data columns (total 69 columns):
No. 7134 non-null int64
Ticker 7134 non-null object
Company 7134 non-null object
Sector 7134 non-null object
Industry 7134 non-null object
Country 7134 non-null object
Market Cap 7134 non-null object
P/E 3040 non-null float64
Fwd P/E 3079 non-null float64
PEG 2252 non-null float64
P/S 4538 non-null float64
P/B 4440 non-null float64
P/C 4274 non-null float64
P/FCF 2684 non-null float64
Dividend 3714 non-null float64
Payout Ratio 2842 non-null float64
EPS 4732 non-null float64
EPS this Y 4560 non-null float64
EPS next Y 3833 non-null float64
EPS past 5Y 4304 non-null float64
EPS next 5Y 3236 non-null float64
Sales past 5Y 3859 non-null float64
EPS Q/Q 4469 non-null float64
Sales Q/Q 4450 non-null float64
Outstanding 0 non-null float64
Float 0 non-null float64
Insider Own 4548 non-null float64
Insider Trans 4246 non-null float64
Inst Own 4172 non-null float64
Inst Trans 4324 non-null float64
Float Short 4636 non-null float64
Short Ratio 6627 non-null float64
ROA 4331 non-null float64
ROE 4282 non-null float64
ROI 4244 non-null float64
Curr R 3523 non-null float64
Quick R 3452 non-null float64
LTDebt/Eq 4228 non-null float64
Debt/Eq 4228 non-null float64
Gross M 3464 non-null float64
Oper M 3841 non-null float64
Profit M 3817 non-null float64
Perf Week 7130 non-null float64
Perf Month 7110 non-null float64
Perf Quart 7011 non-null float64
Perf Half 6888 non-null float64
Perf Year 6671 non-null float64
Perf YTD 7131 non-null float64
Beta 3984 non-null float64
ATR 7131 non-null float64
Volatility W 7130 non-null float64
Volatility M 7110 non-null float64
SMA20 7131 non-null float64
SMA50 7131 non-null float64
SMA200 7131 non-null float64
50D High 7131 non-null float64
50D Low 7131 non-null float64
52W High 7131 non-null float64
52W Low 7131 non-null float64
RSI 7069 non-null float64
from Open 7131 non-null float64
Gap 7131 non-null float64
Recom 4131 non-null float64
Avg Volume 0 non-null float64
Rel Volume 7127 non-null float64
Price 7131 non-null float64
Change 7131 non-null float64
Volume 7134 non-null int64
Earnings 7134 non-null object
dtypes: float64(60), int64(2), object(7)
memory usage: 3.8+ MB
As we can see now there are many numeric types. Besides other disturbing symbols are removed.
We can check the basic statistics of the data by using new_df.describe()
. This methods only work on numeric column.
No. P/E Fwd P/E PEG P/S \
count 7134.000000 3040.000000 3079.000000 2252.000000 4538.000000
mean 3567.500000 76.372526 40.255661 9.272615 69.915591
std 2059.552743 632.983556 429.527497 74.322798 2049.459084
min 1.000000 0.000000 1.070000 0.000000 0.000000
25% 1784.250000 16.857500 13.550000 1.570000 1.010000
50% 3567.500000 23.485000 18.150000 2.435000 2.465000
75% 5350.750000 37.670000 26.160000 4.162500 5.470000
max 7134.000000 22495.000000 23250.000000 2001.670000 132866.310000
P/B P/C P/FCF Dividend Payout Ratio \
count 4440.000000 4274.000000 2684.000000 3714.000000 2842.000000
mean 6.980770 125.173613 83.639676 2.948244 45.881175
std 59.992237 2671.309330 938.556950 3.221843 83.578210
min 0.000000 0.020000 0.050000 0.010000 0.000000
25% 1.330000 4.532500 12.417500 1.190000 0.000000
50% 2.180000 10.035000 22.415000 2.030000 23.100000
75% 4.310000 27.540000 41.190000 3.410000 55.100000
max 3028.000000 169596.620000 42543.750000 48.410000 979.100000
... 52W Low RSI from Open Gap \
count ... 7131.000000 7069.000000 7131.000000 7131.000000
mean ... 45.540217 58.054506 -0.464629 0.321847
std ... 148.317658 14.748561 2.131743 1.609517
min ... -8.330000 11.580000 -23.530000 -26.350000
25% ... 13.850000 47.590000 -1.130000 0.000000
50% ... 27.670000 58.140000 -0.290000 0.240000
75% ... 49.580000 68.990000 0.110000 0.590000
max ... 11133.330000 98.110000 28.220000 70.000000
Recom Avg Volume Rel Volume Price Change \
count 4131.000000 0.0 7127.000000 7131.000000 7131.000000
mean 2.299371 NaN 1.150251 87.844182 -0.148512
std 0.594775 NaN 2.038652 3831.215053 2.487509
min 1.000000 NaN 0.000000 0.020000 -28.820000
25% 2.000000 NaN 0.560000 11.915000 -0.910000
50% 2.200000 NaN 0.860000 26.380000 -0.110000
75% 2.700000 NaN 1.270000 50.190000 0.520000
max 5.000000 NaN 81.610000 323480.000000 32.840000
Volume
count 7.134000e+03
mean 1.085803e+06
std 4.414247e+06
min 0.000000e+00
25% 2.200200e+04
50% 1.498810e+05
75% 6.555612e+05
max 1.671263e+08
[8 rows x 62 columns]
All rights reserved