[Machine Learning for Trading] {ud501} Lesson 5: 01-04 Statistical analysis of time series | Lesson 6: 01-05 Incomplete data

Lesson outline

Pandas makes it very convenient to compute various statistics on a dataframe:

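Global statistics: mean, median, std, sum, etc.
Rolling statistics: rolling mean, rolling standard deviation, etc. (pd.rolling_mean and pd.rolling_std in old pandas releases; .rolling(window).mean() and .rolling(window).std() in current ones)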

You will use these functions to analyze stock movement over time.

Specifically, you will compute:

Bollinger Bands: A way of quantifying how far stock price has deviated from some norm.
Daily returns: Day-to-day change in stock price.

Global statistics

Global statistics summarize each column over the entire date range: mean, median, std, sum, etc.
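
As a minimal sketch of what these calls return, the snippet below builds a tiny DataFrame and reduces each column to a single number. The tickers AAA and BBB and their prices are made up for illustration; they are not course data.

```python
import pandas as pd

# Hypothetical adjusted-close prices for two made-up tickers.
prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.0, 13.0], "BBB": [20.0, 18.0, 22.0, 20.0]},
    index=pd.date_range("2012-01-02", periods=4),
)

# Each call collapses every column to one value (a Series indexed by ticker).
mean = prices.mean()
median = prices.median()
std = prices.std()    # sample standard deviation (ddof=1) by default
total = prices.sum()

print(mean, median, std, total, sep="\n\n")
```

Each result is a Series with one entry per column, so a single ticker's value can be read with, for example, mean["AAA"].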

Rolling statistics

The rolling mean is one of the technical indicators.

One thing to look at is where the price crosses through the rolling average.

One hypothesis is that the rolling mean is a good representation of the true underlying price of a stock, and that a significant deviation from it is eventually followed by a return to the mean.

The challenge is deciding when a deviation is significant enough (see the quiz and the next slide).

Which statistic to use?

Bollinger Bands (1980s)

Computing rolling statistics

Rolling statistics are computed over a sliding window of the most recent days: rolling mean, rolling standard deviation, etc.

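With a 20-day window there are no rolling values at the start of the series: the first 19 days are NaN because a full window is not yet available.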
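
The empty start of a rolling series is easy to see on a toy example. The sketch below uses a window of 3 instead of 20 and made-up prices, and relies on the Series.rolling API available in current pandas.

```python
import pandas as pd

# Made-up prices; a short window keeps the output easy to read.
prices = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2012-01-02", periods=5),
)
window = 3

# The first (window - 1) entries are NaN: no full window exists yet.
rolling_mean = prices.rolling(window=window).mean()
rolling_std = prices.rolling(window=window).std()

print(pd.DataFrame({"price": prices, "rm": rolling_mean, "rstd": rolling_std}))
```

The same pattern scales to the 20-day case: only the window argument changes.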


"""Bollinger Bands."""

import os
import pandas as pd
import matplotlib.pyplot as plt

def symbol_to_path(symbol, base_dir="data"):
    """Return CSV file path given ticker symbol."""
    return os.path.join(base_dir, "{}.csv".format(str(symbol)))


def get_data(symbols, dates):
    """Read stock data (adjusted close) for given symbols from CSV files."""
    df = pd.DataFrame(index=dates)
    if 'SPY' not in symbols:  # add SPY for reference, if absent
        symbols.insert(0, 'SPY')

    for symbol in symbols:
        df_temp = pd.read_csv(symbol_to_path(symbol), index_col='Date',
                parse_dates=True, usecols=['Date', 'Adj Close'], na_values=['nan'])
        df_temp = df_temp.rename(columns={'Adj Close': symbol})
        df = df.join(df_temp)
        if symbol == 'SPY':  # drop dates SPY did not trade
            df = df.dropna(subset=["SPY"])

    return df


def plot_data(df, title="Stock prices"):
    """Plot stock prices with a custom title and meaningful axis labels."""
    ax = df.plot(title=title, fontsize=12)
    ax.set_xlabel("Date")
    ax.set_ylabel("Price")
    plt.show()


def get_rolling_mean(values, window):
    """Return rolling mean of given values, using specified window size."""
    # pd.rolling_mean was removed from pandas; use the .rolling() method instead
    return values.rolling(window=window).mean()


def get_rolling_std(values, window):
    """Return rolling standard deviation of given values, using specified window size."""
    # pd.rolling_std was removed from pandas; use the .rolling() method instead
    return values.rolling(window=window).std()


def get_bollinger_bands(rm, rstd):
    """Return upper and lower Bollinger Bands."""
    # Bands lie two rolling standard deviations above and below the rolling mean
    upper_band = rm + 2*rstd
    lower_band = rm - 2*rstd
    return upper_band, lower_band


def test_run():
    # Read data
    dates = pd.date_range('2012-01-01', '2012-12-31')
    symbols = ['SPY']
    df = get_data(symbols, dates)

    # Compute Bollinger Bands
    # 1. Compute rolling mean
    rm_SPY = get_rolling_mean(df['SPY'], window=20)

    # 2. Compute rolling standard deviation
    rstd_SPY = get_rolling_std(df['SPY'], window=20)

    # 3. Compute upper and lower bands
    upper_band, lower_band = get_bollinger_bands(rm_SPY, rstd_SPY)
    
    # Plot raw SPY values, rolling mean and Bollinger Bands
    ax = df['SPY'].plot(title="Bollinger Bands", label='SPY')
    rm_SPY.plot(label='Rolling mean', ax=ax)
    upper_band.plot(label='upper band', ax=ax)
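    lower_band.plot(label='lower band', ax=ax)

    # Add axis labels and legend, then display the figure
    ax.set_xlabel("Date")
    ax.set_ylabel("Price")
    ax.legend(loc='upper left')
    plt.show()


if __name__ == "__main__":
    test_run()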

Daily returns

Compute daily returns
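
A daily return is the day-to-day relative change in price: daily_ret[t] = price[t] / price[t-1] - 1. The following is a minimal sketch of that formula, not necessarily the course's own solution; the function name compute_daily_returns and the sample prices are assumptions made for illustration.

```python
import pandas as pd


def compute_daily_returns(df):
    """Return day-over-day relative price changes (hypothetical helper)."""
    # shift(1) moves each price down one row, so row t is divided by row t-1.
    daily_returns = df / df.shift(1) - 1
    # The first row has no previous day; set it to 0 instead of leaving NaN.
    daily_returns.iloc[0] = 0
    return daily_returns


# Made-up prices: up 10% on day 2, down 10% on day 3.
prices = pd.DataFrame(
    {"AAA": [100.0, 110.0, 99.0]},
    index=pd.date_range("2012-01-02", periods=3),
)
daily_returns = compute_daily_returns(prices)
print(daily_returns)
```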

Cumulative returns
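
A cumulative return measures the change relative to the first day of the period: cum_ret[t] = price[t] / price[0] - 1. The sketch below is one way to compute it, under the assumption that the first row holds the starting price; the function name and sample prices are illustrative only.

```python
import pandas as pd


def compute_cumulative_returns(df):
    """Return each day's price change relative to the first row (hypothetical helper)."""
    # df.iloc[0] is a Series of starting prices, aligned to df by column name.
    return df / df.iloc[0] - 1


# Made-up prices: 10% gain on each of two consecutive days.
prices = pd.DataFrame(
    {"AAA": [100.0, 110.0, 121.0]},
    index=pd.date_range("2012-01-02", periods=3),
)
cumulative_returns = compute_cumulative_returns(prices)
print(cumulative_returns)
```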

Pristine data

There is no single exact, correct price for a stock at a given point in time.

Why data goes missing

An ETF or Exchange-Traded Fund is a basket of equities allocated in such a way that the overall portfolio tracks the performance of a stock exchange index. ETFs can be bought and sold on the market like shares.

For example, SPY tracks the S&P 500 index (Standard & Poor's selection of 500 large publicly-traded companies).

You will learn more about ETFs and other funds in the second mini-course.

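Case 1: a symbol's data ends partway through the period.

Case 2: a symbol's data starts partway through the period.

Case 3: a symbol's data starts and stops occasionally, leaving gaps.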
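
One way to tell these cases apart is to look at where each column's valid data begins and ends and how many values are missing. The sketch below does that on made-up data; the column names ENDS, STARTS and GAPPY are hypothetical stand-ins for the three cases.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2012-01-02", periods=6)
df = pd.DataFrame(
    {
        "ENDS": [1.0, 2.0, 3.0, np.nan, np.nan, np.nan],    # case 1
        "STARTS": [np.nan, np.nan, 3.0, 4.0, 5.0, 6.0],     # case 2
        "GAPPY": [1.0, np.nan, 3.0, np.nan, np.nan, 6.0],   # case 3
    },
    index=dates,
)

# For each column: first and last date with a value, and the count of NaNs.
summary = pd.DataFrame(
    {
        "first_valid": pd.Series({c: df[c].first_valid_index() for c in df}),
        "last_valid": pd.Series({c: df[c].last_valid_index() for c in df}),
        "missing": df.isna().sum(),
    }
)
print(summary)
```

A late first_valid points to case 2, an early last_valid to case 1, and missing values between the two to case 3.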

Why this is bad - what can we do?

Quiz answer: option 3. Don't interpolate across the gaps; fill them instead (forward first, then backward).

Documentation: pandas

Documentation: pandas.DataFrame.fillna()

You could also use the 'pad' method, same as 'ffill': fillna(method='pad')
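
To make the fill order concrete, here is a small sketch on made-up data. It uses DataFrame.ffill() and DataFrame.bfill(), which do the same job as fillna(method="ffill") and fillna(method="bfill"); recent pandas releases deprecate the method argument of fillna, so the dedicated methods are the safer spelling there. The column name FAKE is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up series with a gap at the start, in the middle, and at the end.
prices = pd.DataFrame(
    {"FAKE": [np.nan, 10.0, np.nan, np.nan, 12.0, np.nan]},
    index=pd.date_range("2012-01-02", periods=6),
)

# Forward fill carries the last known price into later gaps;
# backward fill then covers only the leading gap that forward fill cannot reach.
filled = prices.ffill().bfill()
print(filled)
```

Filling forward first means a gap in the middle of the series is filled from the past, not from a later price.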

Using fillna()

Read about method, inplace and other parameters: pandas.DataFrame.fillna()

Correction: The inplace parameter accepts a boolean value (whereas method accepts a string), so the function call (in line 47) should look like:

df_data.fillna(method="ffill", inplace=True)

Boolean constants in Python are True and False (without quotes).

"""Fill missing values"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

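def fill_missing_values(df_data):
    """Fill missing values in data frame, in place."""
    # ffill/bfill are equivalent to fillna(method="ffill"/"bfill").
    # Forward-fill first, then back-fill what is still missing at the start.
    df_data.ffill(inplace=True)
    df_data.bfill(inplace=True)
    return df_data  # in-place calls return None, so return the frame itself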


def symbol_to_path(symbol, base_dir="data"):
    """Return CSV file path given ticker symbol."""
    return os.path.join(base_dir, "{}.csv".format(str(symbol)))


def get_data(symbols, dates):
    """Read stock data (adjusted close) for given symbols from CSV files."""
    df_final = pd.DataFrame(index=dates)
    if "SPY" not in symbols:  # add SPY for reference, if absent
        symbols.insert(0, "SPY")

    for symbol in symbols:
        file_path = symbol_to_path(symbol)
        df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
            usecols=["Date", "Adj Close"], na_values=["nan"])
        df_temp = df_temp.rename(columns={"Adj Close": symbol})
        df_final = df_final.join(df_temp)
        if symbol == "SPY":  # drop dates SPY did not trade
            df_final = df_final.dropna(subset=["SPY"])

    return df_final


def plot_data(df_data):
    """Plot stock data with appropriate axis labels."""
    ax = df_data.plot(title="Stock Data", fontsize=12)
    ax.set_xlabel("Date")
    ax.set_ylabel("Price")
    plt.show()


def test_run():
    """Function called by Test Run."""
    # Read data
    symbol_list = ["JAVA", "FAKE1", "FAKE2"]  # list of symbols
    start_date = "2005-12-31"
    end_date = "2014-12-07"
    dates = pd.date_range(start_date, end_date)  # date range as index
    df_data = get_data(symbol_list, dates)  # get data for each symbol

    # Fill missing values
    fill_missing_values(df_data)

    # Plot
    plot_data(df_data)


if __name__ == "__main__":
    test_run()

posted @ 2019-06-04 09:31 ecoflex
