[Machine Learning for Trading] {ud501} Lesson 5: 01-04 Statistical analysis of time series | Lesson 6: 01-05 Incomplete data

Lesson outline

Pandas makes it very convenient to compute various statistics on a dataframe:

You will use these functions to analyze stock movement over time.

Specifically, you will compute:

  • Bollinger Bands: A way of quantifying how far stock price has deviated from some norm.
  • Daily returns: Day-to-day change in stock price.

 

 




 

 

 

Global statistics

 

Global statistics: meanmedianstdsum, etc. [more]

 

 

 

 Rolling statistics

 

rolling mean is one of the technical indicators, 

one thing it looks at is the places where the proce crosses through the rolling average , eg: 

 

a hypothesis is that this rolling mean may be a good representation of sort of the true underlying price of a stock

and that significant deviation from that eventually result in a return to the mean

 

a challenge is when is that deviation significant enough: see the quiz and next slide

 

 

 

Which statistic to use? 

 

 

 

 

Bollinger Bands (1980s)

 

 

 

 Computing rolling statistics

 

 

 Rolling statistics: rolling_meanrolling_std, etc. [more]

no data in the first 20 days

 line 44 & 47

 

 

 

 

"""Bollinger Bands."""

import os
import pandas as pd
import matplotlib.pyplot as plt

def symbol_to_path(symbol, base_dir="data"):
    """Return CSV file path given ticker symbol."""
    return os.path.join(base_dir, "{}.csv".format(str(symbol)))


def get_data(symbols, dates):
    """Read stock data (adjusted close) for given symbols from CSV files."""
    df = pd.DataFrame(index=dates)
    if 'SPY' not in symbols:  # add SPY for reference, if absent
        symbols.insert(0, 'SPY')

    for symbol in symbols:
        df_temp = pd.read_csv(symbol_to_path(symbol), index_col='Date',
                parse_dates=True, usecols=['Date', 'Adj Close'], na_values=['nan'])
        df_temp = df_temp.rename(columns={'Adj Close': symbol})
        df = df.join(df_temp)
        if symbol == 'SPY':  # drop dates SPY did not trade
            df = df.dropna(subset=["SPY"])

    return df


def plot_data(df, title="Stock prices"):
    """Plot stock prices with a custom title and meaningful axis labels."""
    ax = df.plot(title=title, fontsize=12)
    ax.set_xlabel("Date")
    ax.set_ylabel("Price")
    plt.show()


def get_rolling_mean(values, window):
    """Return rolling mean of given values, using specified window size."""
    return pd.rolling_mean(values, window=window)


def get_rolling_std(values, window):
    """Return rolling standard deviation of given values, using specified window size."""
    # TODO: Compute and return rolling standard deviation
    return pd.rolling_std(values,window=window)


def get_bollinger_bands(rm, rstd):
    """Return upper and lower Bollinger Bands."""
    # TODO: Compute upper_band and lower_band
    upper_band = rm + 2*rstd
    lower_band = rm - 2*rstd
    return upper_band, lower_band


def test_run():
    # Read data
    dates = pd.date_range('2012-01-01', '2012-12-31')
    symbols = ['SPY']
    df = get_data(symbols, dates)

    # Compute Bollinger Bands
    # 1. Compute rolling mean
    rm_SPY = get_rolling_mean(df['SPY'], window=20)

    # 2. Compute rolling standard deviation
    rstd_SPY = get_rolling_std(df['SPY'], window=20)

    # 3. Compute upper and lower bands
    upper_band, lower_band = get_bollinger_bands(rm_SPY, rstd_SPY)
    
    # Plot raw SPY values, rolling mean and Bollinger Bands
    ax = df['SPY'].plot(title="Bollinger Bands", label='SPY')
    rm_SPY.plot(label='Rolling mean', ax=ax)
    upper_band.plot(label='upper band', ax=ax)
    lower_band.plot(label='lower band', ax=ax)

 

 

 

 

 Daily returns

 

 

 

 

 Compute daily returns

 

 

 

 

 

 Cumulative returns

 

 

 

 






 

 

 

Pristine data

 

no exact correct price for a stock at a time point

 

 

 

 Why data goes missing

 

 

An ETF or Exchange-Traded Fund is a basket of equities allocated in such a way that the overall portfolio tracks the performance of a stock exchange index. ETFs can be bought and sold on the market like shares.

For example, SPY tracks the S&P 500 index (Standard & Poor's selection of 500 large publicly-traded companies).

You will learn more about ETFs and other funds in the second mini-course.

 

 case1: ends

case2: starts

case3: starts and ends occasionally

 

 

 

 Why this is bad - what can we do?

 

3, dont interpolate

 

 

 

 

 

Documentation: pandas

Documentation: pandas.DataFrame.fillna()

You could also use the 'pad' method, same as 'ffill': fillna(method='pad')

 

 

 

Using fillna() 

 

 

Read about methodinplace and other parameters: pandas.DataFrame.fillna()

Correction: The inplace parameter accepts a boolean value (whereas method accepts a string), so the function call (in line 47) should look like:

df_data.fillna(method="ffill", inplace=True)

Boolean constants in Python are True and False (without quotes).

 

 

 

 

"""Fill missing values"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

def fill_missing_values(df_data):
    """Fill missing values in data frame, in place."""
    df_data.fillna(method="ffill",inplace=True)
    return df_data.fillna(method="bfill",inplace=True)


def symbol_to_path(symbol, base_dir="data"):
    """Return CSV file path given ticker symbol."""
    return os.path.join(base_dir, "{}.csv".format(str(symbol)))


def get_data(symbols, dates):
    """Read stock data (adjusted close) for given symbols from CSV files."""
    df_final = pd.DataFrame(index=dates)
    if "SPY" not in symbols:  # add SPY for reference, if absent
        symbols.insert(0, "SPY")

    for symbol in symbols:
        file_path = symbol_to_path(symbol)
        df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
            usecols=["Date", "Adj Close"], na_values=["nan"])
        df_temp = df_temp.rename(columns={"Adj Close": symbol})
        df_final = df_final.join(df_temp)
        if symbol == "SPY":  # drop dates SPY did not trade
            df_final = df_final.dropna(subset=["SPY"])

    return df_final


def plot_data(df_data):
    """Plot stock data with appropriate axis labels."""
    ax = df_data.plot(title="Stock Data", fontsize=2)
    ax.set_xlabel("Date")
    ax.set_ylabel("Price")
    plt.show()


def test_run():
    """Function called by Test Run."""
    # Read data
    symbol_list = ["JAVA", "FAKE1", "FAKE2"]  # list of symbols
    start_date = "2005-12-31"
    end_date = "2014-12-07"
    dates = pd.date_range(start_date, end_date)  # date range as index
    df_data = get_data(symbol_list, dates)  # get data for each symbol

    # Fill missing values
    fill_missing_values(df_data)

    # Plot
    plot_data(df_data)


if __name__ == "__main__":
    test_run()

 

 

posted @ 2019-06-04 09:31  ecoflex  阅读(249)  评论(0编辑  收藏  举报