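The previous post, "[Web Crawler] Taiwan Stock Exchange Historical Data Tutorial (1)", showed how to work out the request URL from the Taiwan Stock Exchange (TWSE) website and built a basic crawler for historical stock prices. The awkward part of the TWSE site is that each request returns only one month of data, so we have to walk back through the past month by month and then stitch the results together.
Creating the date sequence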
import pandas as pd
Dates = pd.date_range(start = '2000-01-01', end = '2020-09-01', freq = 'MS')
Output:
DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01', '2000-04-01',
'2000-05-01', '2000-06-01', '2000-07-01', '2000-08-01',
'2000-09-01', '2000-10-01',
...
'2019-12-01', '2020-01-01', '2020-02-01', '2020-03-01',
'2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01',
'2020-08-01', '2020-09-01'],
dtype='datetime64[ns]', length=249, freq='MS')
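The date has to be passed into the URL as a string, so first convert the date sequence to text. The URL expects the yyyymmdd format, while astype(str) produces yyyy-mm-dd, so the hyphens still have to be removed in the next step.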
Dates = Dates.astype(str)
Output:
Index(['2000-01-01', '2000-02-01', '2000-03-01', '2000-04-01', '2000-05-01',
'2000-06-01', '2000-07-01', '2000-08-01', '2000-09-01', '2000-10-01',
...
'2019-12-01', '2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01',
'2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01'],
dtype='object', length=249)
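As an alternative to converting with astype(str) and stripping the hyphens afterwards, the month-start dates can be formatted as yyyymmdd in one step. The sketch below uses `DatetimeIndex.strftime`; the variable names `months` and `date_strings` are only illustrative and do not appear in the original code.

```python
import pandas as pd

# Month-start dates over the same range used in this post.
months = pd.date_range(start="2000-01-01", end="2020-09-01", freq="MS")

# Format each date directly as yyyymmdd, the form the TWSE URL expects.
date_strings = months.strftime("%Y%m%d")

print(date_strings[:3].tolist())  # ['20000101', '20000201', '20000301']
```

With this approach the later `replace('-', '')` call becomes unnecessary, although the rest of this post keeps the original two-step version.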
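Passing the dates into the Get_StockPrice() function written last time
Once the date strings are generated, the "-" characters inside them have to be removed, so we call replace('-', '') to replace each hyphen with an empty string. We then use a for loop to request each month from the TWSE site. The time.sleep(1) call at the end of each iteration pauses the program for one second, so the server is less likely to treat the crawler as an attack on the site.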
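import time

Symbol = '2330'
Dates = pd.date_range(start = '2000-01-01', end = '2020-09-01', freq = 'MS').astype(str)
for Date in Dates:
    # Strip the hyphens so the date becomes yyyymmdd before it goes into the URL
    print(Get_StockPrice(Symbol, Date.replace('-','')))
    # Pause for one second between requests
    time.sleep(1)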
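The loop above only prints each month. A minimal sketch of how the monthly results could instead be gathered and combined in one step is shown here. The helper `collect_months` and the stand-in `fake_fetch` are hypothetical names introduced for illustration; `fake_fetch` returns placeholder numbers so the example runs without touching the network, and in practice Get_StockPrice would be passed in its place.

```python
import time
import pandas as pd


def collect_months(fetch, symbol, dates, delay_seconds=1.0):
    """Call fetch(symbol, date) for every date and stack the results.

    Months whose request raises are recorded in `failed` rather than
    being dropped silently.
    """
    frames, failed = [], []
    for i, date in enumerate(dates):
        if i > 0:
            time.sleep(delay_seconds)  # be polite between requests
        try:
            frames.append(fetch(symbol, date))
        except Exception as exc:
            failed.append((date, repr(exc)))
    combined = pd.concat(frames, axis=0) if frames else pd.DataFrame()
    return combined, failed


# Hypothetical stand-in for Get_StockPrice; the prices are placeholders.
def fake_fetch(symbol, date):
    if date == "20000201":
        raise ValueError("no data for this month")
    return pd.DataFrame({"Close": [100.0]}, index=[pd.Timestamp(date)])


prices, failed = collect_months(
    fake_fetch, "2330", ["20000101", "20000201", "20000301"], delay_seconds=0
)
print(prices)
print(failed)
```

Collecting the frames in a list and calling `pd.concat` once avoids re-copying the growing table on every iteration, and the `failed` list makes it easy to retry the months that did not come back.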
Complete code
import pandas as pd
import numpy as np
import json
import requests
import datetime
import time

def Get_StockPrice(Symbol, Date):
    # Date must be a yyyymmdd string; the response covers that whole month
    url = f'https://www.twse.com.tw/exchangeReport/STOCK_DAY?response=json&date={Date}&stockNo={Symbol}'
    print(url)
    data = requests.get(url).text
    json_data = json.loads(data)
    Stock_data = json_data['data']
    StockPrice = pd.DataFrame(Stock_data, columns = ['Date','Volume','Volume_Cash','Open','High','Low','Close','Change','Order'])
    # Dates come back in the ROC calendar (yyy/mm/dd); adding 19110000 gives Gregorian yyyymmdd
    StockPrice['Date'] = StockPrice['Date'].str.replace('/','').astype(int) + 19110000
    StockPrice['Date'] = pd.to_datetime(StockPrice['Date'].astype(str))
    # Remove thousands separators and convert to numbers; Volume is divided by 1000
    StockPrice['Volume'] = StockPrice['Volume'].str.replace(',','').astype(float)/1000
    StockPrice['Volume_Cash'] = StockPrice['Volume_Cash'].str.replace(',','').astype(float)
    StockPrice['Order'] = StockPrice['Order'].str.replace(',','').astype(float)
    StockPrice['Open'] = StockPrice['Open'].str.replace(',','').astype(float)
    StockPrice['High'] = StockPrice['High'].str.replace(',','').astype(float)
    StockPrice['Low'] = StockPrice['Low'].str.replace(',','').astype(float)
    StockPrice['Close'] = StockPrice['Close'].str.replace(',','').astype(float)
    StockPrice = StockPrice.set_index('Date', drop = True)
    StockPrice = StockPrice[['Open','High','Low','Close','Volume']]
    print(StockPrice)
    return StockPrice

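if __name__ == '__main__':
    Symbol = '2330'
    Dates = pd.date_range(start = '2010-01-01', end = '2020-09-01', freq = 'MS').astype(str)
    # Fetch the first month, then append each later month to it
    data = Get_StockPrice(Symbol, Dates[0].replace('-',''))
    for Date in Dates[1:]:
        print(Date)
        try:
            data = pd.concat([data,Get_StockPrice(Symbol, Date.replace('-',''))], axis = 0)
        except Exception as e:
            # Report the month that failed instead of silently skipping it
            print(f'Skipped {Date}: {e}')
        # Wait five seconds between requests, whether or not the last one succeeded
        time.sleep(5)
    print(data)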
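The date arithmetic inside Get_StockPrice is worth seeing in isolation. The dates in the response use the Republic of China calendar, whose year is the Gregorian year minus 1911, so removing the slashes and adding 19110000 turns yyy/mm/dd into Gregorian yyyymmdd. The sketch below applies the same steps to two hand-written sample strings; the helper name `roc_to_gregorian` is hypothetical and the samples are not taken from a real response.

```python
import pandas as pd


def roc_to_gregorian(roc_dates):
    """Convert ROC-calendar 'yyy/mm/dd' strings to pandas Timestamps."""
    # '109/09/01' -> 1090901 -> 20200901
    as_int = roc_dates.str.replace("/", "", regex=False).astype(int) + 19110000
    return pd.to_datetime(as_int.astype(str), format="%Y%m%d")


# Hand-written samples: ROC year 109 is 2020, ROC year 89 is 2000.
sample = pd.Series(["109/09/01", "89/01/04"])
converted = roc_to_gregorian(sample)
print(converted)
```

Passing an explicit `format` to `pd.to_datetime` is a small design choice that avoids relying on pandas guessing how to read the eight-digit string.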