Python股票分析系列——数据整合.p7
Python股票分析系列——数据整合.p7
欢迎来到Python for Finance教程系列的第7部分。 在之前的教程中,我们为整个标准普尔500强公司抓取了雅虎财经数据。 在本教程中,我们将把这些数据组合到一个DataFrame中。
到此为止的代码:
import bs4 as bs import datetime as dt import os import pandas_datareader.data as web import pickle import requests def save_sp500_tickers(): resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies') soup = bs.BeautifulSoup(resp.text, 'lxml') table = soup.find('table', {'class': 'wikitable sortable'}) tickers = [] for row in table.findAll('tr')[1:]: ticker = row.findAll('td')[0].text tickers.append(ticker) with open("sp500tickers.pickle", "wb") as f: pickle.dump(tickers, f) return tickers # save_sp500_tickers() def get_data_from_yahoo(reload_sp500=False): if reload_sp500: tickers = save_sp500_tickers() else: with open("sp500tickers.pickle", "rb") as f: tickers = pickle.load(f) if not os.path.exists('stock_dfs'): os.makedirs('stock_dfs') start = dt.datetime(2010, 1, 1) end = dt.datetime.now() for ticker in tickers: # just in case your connection breaks, we'd like to save our progress! if not os.path.exists('stock_dfs/{}.csv'.format(ticker)): df = web.DataReader(ticker, 'morningstar', start, end) df.reset_index(inplace=True) df.set_index("Date", inplace=True) df = df.drop("Symbol", axis=1) df.to_csv('stock_dfs/{}.csv'.format(ticker)) else: print('Already have {}'.format(ticker)) get_data_from_yahoo()
尽管我们掌握了所有数据,但我们可能想要一起评估数据。为此,我们将把所有的股票数据集合在一起。目前的每个股票文件都有:开盘价,最高价,最低价,收盘价,成交量和调整收盘价。至少要开始,我们现在大多只对调整后的收盘感兴趣。
def compile_data(): with open("sp500tickers.pickle","rb") as f: tickers = pickle.load(f) main_df = pd.DataFrame()
首先,我们拉取我们之前制作的代码列表,并从一个名为main_df的空数据框开始。现在,我们准备读取每个股票的数据集合:
for count,ticker in enumerate(tickers): df = pd.read_csv('stock_dfs/{}.csv'.format(ticker)) df.set_index('Date', inplace=True)
你不需要在这里使用Python的枚举,我只是使用它,所以我们知道我们在读取所有数据的过程中。你可以迭代代码。从这一点,我们*可以*生成有趣数据的额外列,如:
df ['{} _ HL_pct_diff'.format(ticker)] =(df ['High'] - df ['Low'])/ df ['Low'] df ['{} _ daily_pct_chng'.format(ticker)] =(df ['Close'] - df ['Open'])/ df ['Open']
但现在,我们不会因此而烦恼。只要知道这可能是一条追寻道路的道路。相反,我们真的只是对Adj Adj列感兴趣:
df.rename(columns={'Adj Close':ticker}, inplace=True) df.drop(['Open','High','Low','Close','Volume'],1,inplace=True)
现在我们已经有了这个专栏(或者像上面那样额外的......但是请记住,在这个例子中,我们没有做HL_pct_diff或daily_pct_chng)。请注意,我们已将Adj Adj列重命名为任何股票代码名称。我们开始构建共享数据框:
if main_df.empty: main_df = df else: main_df = main_df.join(df, how='outer')
如果main_df中没有任何内容,那么我们将从当前的df开始,否则我们将使用Pandas的加入。
仍然在这个for循环中,我们将再添加两行:
if count % 10 == 0: print(count)
这将只输出当前股票的数量,如果它可以被10整除。什么样的计数%10给我们的是余数,如果计数除以10.因此,如果我们问如果计数%10 == 0,我们是 只有看到if语句,如果当前计数除以10,余数为0,或者如果它完全可以被10整除,那么才会出现True。
当我们完成for循环时:
print(main_df.head()) main_df.to_csv('sp500_joined_closes.csv')
这个函数调用它到这一点:
with open("sp500tickers.pickle","rb") as f: tickers = pickle.load(f) main_df = pd.DataFrame() for count,ticker in enumerate(tickers): df = pd.read_csv('stock_dfs/{}.csv'.format(ticker)) df.set_index('Date', inplace=True) df.rename(columns={'Adj Close':ticker}, inplace=True) df.drop(['Open','High','Low','Close','Volume'],1,inplace=True) if main_df.empty: main_df = df else: main_df = main_df.join(df, how='outer') if count % 10 == 0: print(count) print(main_df.head()) main_df.to_csv('sp500_joined_closes.csv') compile_data()
当前完整的代码为:
import bs4 as bs import datetime as dt import os import pandas as pd import pandas_datareader.data as web import pickle import requests def save_sp500_tickers(): resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies') soup = bs.BeautifulSoup(resp.text, 'lxml') table = soup.find('table', {'class': 'wikitable sortable'}) tickers = [] for row in table.findAll('tr')[1:]: ticker = row.findAll('td')[0].text tickers.append(ticker) with open("sp500tickers.pickle", "wb") as f: pickle.dump(tickers, f) return tickers # save_sp500_tickers() def get_data_from_yahoo(reload_sp500=False): if reload_sp500: tickers = save_sp500_tickers() else: with open("sp500tickers.pickle", "rb") as f: tickers = pickle.load(f) if not os.path.exists('stock_dfs'): os.makedirs('stock_dfs') start = dt.datetime(2010, 1, 1) end = dt.datetime.now() for ticker in tickers: # just in case your connection breaks, we'd like to save our progress! if not os.path.exists('stock_dfs/{}.csv'.format(ticker)): df = web.DataReader(ticker, 'morningstar', start, end) df.reset_index(inplace=True) df.set_index("Date", inplace=True) df = df.drop("Symbol", axis=1) df.to_csv('stock_dfs/{}.csv'.format(ticker)) else: print('Already have {}'.format(ticker)) def compile_data(): with open("sp500tickers.pickle", "rb") as f: tickers = pickle.load(f) main_df = pd.DataFrame() for count, ticker in enumerate(tickers): df = pd.read_csv('stock_dfs/{}.csv'.format(ticker)) df.set_index('Date', inplace=True) df.rename(columns={'Adj Close': ticker}, inplace=True) df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], 1, inplace=True) if main_df.empty: main_df = df else: main_df = main_df.join(df, how='outer') if count % 10 == 0: print(count) print(main_df.head()) main_df.to_csv('sp500_joined_closes.csv') compile_data()
在下一个教程中,我们将试图查看我们是否能够快速找到数据中的任何关系。