Python量化投资——时间序列数据指数平滑移动平均值EMA的高效计算

Python量化投资——时间序列数据指数平滑移动平均值的高效计算

定义

在对股票的历史价格数据进行分析的过程中,不同的移动平均值是非常常用的技术手段。在多种移动平均值中,指数平滑移动平均(Exponentially Weighted Moving Average, EWMA或Exponential Moving Average, EMA)时比较常用的一种。很多常见均线指标如MACD或TRIX都是基于EMA的。

鉴于EMA的重要性,我们有必要建立高效的生成方法。首先来看看EMA的计算公式:
E M A t = α P r i c e + ( 1 − α ) E M A t − 1 {EMA}_t =\alpha Price + (1-\alpha) EMA_{t-1} EMAt=αPrice+(1α)EMAt1
根据EMA的定义,指数平滑移动平均值是从起点开始迄今为止所有历史价格的一个加权平均值,昨天的历史价格的权重最高,距离今天越远则价格权重越低。
公式中的 α \alpha α被称为衰减因子(Decay Factor),取值为 0 < α < 1 0<\alpha<1 0<α<1,代表权重衰减的速度。
EMA的值仅仅与 α \alpha α相关,但是我们可以推出另一个与 α \alpha α相关的参数: α = 2 s p a n + 1 \alpha = \frac2 {span+1} α=span+12
其中 s p a n span span被称为跨度,与标准移动平均值的 w i n d o w window window参数一样,代表今日的平均价与过去多少天的价格有关,不过在 E M A EMA EMA这里,今日的平均价只是大体上与 s p a n span span内的价格相关。通常在计算函数中,我们都会给定 s p a n span span的值而不是 α \alpha α的值。

EMA循环生成方法

既然了解了EMA的计算公式,很容易看出EMA是一个循环定义的数值,需要从第一天的价格开始每日往后推算,因此很自然,EMA的计算函数可以用循环来写:

In [610]: def ewm(arr, span): 
     ...:     n0 = arr[0] 
     ...:     res = [] 
     ...:     res.append(n0) 
     ...:     alpha = 2 / (span + 1) 
     ...:     prev = n0 
     ...:     for num in arr[1:]: 
     ...:         if not np.isnan(num): 
     ...:             new = (alpha * num + (1 - alpha) * prev) 
     ...:         res.append(new)         
     ...:         prev = new 
     ...:     return res 

在这个函数中使用了for-loop,因为是对ndarray对象操作,因此还可以尝试使用numpy自带的迭代器nditer,看看是否比for-loop更快:

In [614]: def ewm_nditer(arr, span): 
     ...:     n0 = arr[0] 
     ...:     res = [] 
     ...:     res.append(n0) 
     ...:     alpha = 2 / (span + 1) 
     ...:     prev = n0 
     ...:     it = np.nditer(arr[1:]) 
     ...:     for num in it: 
     ...:         if not np.isnan(num): 
     ...:             new = (alpha * num + (1 - alph
     ...: a) * prev) 
     ...:         res.append(new)         
     ...:         prev = new 
     ...:     return res 
     ...:               

Pandas提供的方法

其实,如果使用pandas,那么我们不需要自己写ema的函数。Pandas自带了一个ewm对象,可以方便地计算移动平均:

In [605]: ser.ewm(span = 10, adjust = False).mean() 

其中,ser是一个Series对象。
需要注意的是,Pandas的ewm对象在默认情况下会对生成的移动平均值进行调整,目的是为了缩小最初阶段数据与普通移动平均值之间的差距。对于这种调整,对金融和财经数据来说是没必要的,大家普遍采用的都是标准的公式计算,因此在调用ewm对象时应该显式声明adjust = False以关闭自动调整。否则会发现pandas的计算结果跟上面的公式算出来的不同。

基于Numpy的向量化方法

理论上讲Numpy的向量化实现方式应该比Pandas更快,可惜找遍numpy的文档,都没有发现原生的ema计算函数,因此,只能通过公式推导,找到一个向量化的计算公式,并且在Numpy中实现出来:

In [878]: def numpy_ewma_vectorized_v2(data, window): 
     ...:     alpha = 2 /(window + 1.0) 
     ...:     alpha_rev = 1-alpha 
     ...:     n = data.shape[0] 
     ...:     pows = alpha_rev**(np.arange(n+1)) 
     ...:     scale_arr = 1/pows[:-1] 
     ...:     offset = data[0]*pows[1:] 
     ...:     pw0 = alpha*alpha_rev**(n-1) 
     ...:     mult = data*pw0*scale_arr 
     ...:     cumsums = mult.cumsum() 
     ...:     out = offset + cumsums*scale_arr[::-1] 
     ...:     return out 

因为EMA的值受到全部历史数据的影响,因此在上述函数的计算过程中,需要计算 α ( n − 1 ) \alpha^{(n-1)} α(n1),(推导过程就不写了)。因此,当n非常大,且 α \alpha α足够大时,有可能会造成overflow的错误。
但不管怎么说,在小规模的数据上测试上述代码是可行的。

性能对比

现在我们有了四种不同的算法:按照公式的循环方法、nditer循环,Pandas的ewm()对象,以及我们的基于Numpy()的方法,下面使用包含2500元素的时序数据测试每种方法的性能:

In [611]: %timeit ewm(ser.values, 10)               
4.99 ms ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [616]: %timeit ewm_nditer(ser.values, 10)        
5.33 ms ± 29.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [605]: %timeit ser.ewm(span = 10, adjust = False).mean() 
497 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [606]: %timeit numpy_ewma_vectorized_v2(ser.values, 10) 
81.6 µs ± 854 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

可以看出,基于numpy的方法全面胜过循环方法,甚至大大优于Pandas的ewm对象方法。numpy方法相对循环法效率提升60倍,相对Pandas方法效率提升6倍左右。

Numpy方法的局限性及解决方案

如前所述,向量化的方法因为涉及到 α ( n − 1 ) \alpha^{(n-1)} α(n1)的计算,有overflow溢出的风险存在。我在StackOverflow上找到了一位名叫Jake Walden的大神,他给出了自己的解决方案。
首先,他根据计算所需要达到的精度,通过以下代码计算出了不同的span值下,系统能够计算的最大数据长度。在这个长度限制以内, α ( n − 1 ) \alpha^{(n-1)} α(n1)不会溢出,能够正常计算,否则就会导致溢出。

def get_max_row_size(alpha, dtype=float):
    assert 0. <= alpha < 1.
    epsilon = np.finfo(dtype).tiny
    return int(np.log(epsilon)/np.log(1-alpha)) + 1
span 跨度alpha 衰减系数max item number 最大允许数据量
20.6666667645
70.252463
120.1538464241
170.111111116015
220.0869567787
270.071428579559

其次,如果需要计算的数据量超过了上述限额,那么把数据切分成为几块,确保每一块都不超过限额,再分别计算后组合起来
这样就能完美解决overflow的问题了。
大神贴出了他的计算函数的全部完整代码,不敢独享,我也转帖如下。该部分的代码完全归功于Jake Walden1
以下是全部代码:

# tested with python3 & numpy 1.15.2
import numpy as np

def ewma_vectorized_safe(data, alpha, row_size=None, dtype=None, order='C', out=None):
    """
    Reshapes data before calculating EWMA, then iterates once over the rows
    to calculate the offset without precision issues
    :param data: Input data, will be flattened.
    :param alpha: scalar float in range (0,1)
        The alpha parameter for the moving average.
    :param row_size: int, optional
        The row size to use in the computation. High row sizes need higher precision,
        low values will impact performance. The optimal value depends on the
        platform and the alpha being used. Higher alpha values require lower
        row size. Default depends on dtype.
    :param dtype: optional
        Data type used for calculations. Defaults to float64 unless
        data.dtype is float32, then it will use float32.
    :param order: {'C', 'F', 'A'}, optional
        Order to use when flattening the data. Defaults to 'C'.
    :param out: ndarray, or None, optional
        A location into which the result is stored. If provided, it must have
        the same shape as the desired output. If not provided or `None`,
        a freshly-allocated array is returned.
    :return: The flattened result.
    """
    data = np.array(data, copy=False)

    if dtype is None:
        if data.dtype == np.float32:
            dtype = np.float32
        else:
            dtype = np.float
    else:
        dtype = np.dtype(dtype)

    if row_size is not None:
        row_size = int(row_size) 
    else:
        row_size = get_max_row_size(alpha, dtype)

    if data.size <= row_size:
        # The normal function can handle this input, use that
        return ewma_vectorized(data, alpha, dtype=dtype, order=order, out=out)

    if data.ndim > 1:
        # flatten input
        data = np.reshape(data, -1, order=order)

    if out is None:
        out = np.empty_like(data, dtype=dtype)
    else:
        assert out.shape == data.shape
        assert out.dtype == dtype

    row_n = int(data.size // row_size)  # the number of rows to use
    trailing_n = int(data.size % row_size)  # the amount of data leftover
    first_offset = data[0]

    if trailing_n > 0:
        # set temporary results to slice view of out parameter
        out_main_view = np.reshape(out[:-trailing_n], (row_n, row_size))
        data_main_view = np.reshape(data[:-trailing_n], (row_n, row_size))
    else:
        out_main_view = out
        data_main_view = data

    # get all the scaled cumulative sums with 0 offset
    ewma_vectorized_2d(data_main_view, alpha, axis=1, offset=0, dtype=dtype,
                       order='C', out=out_main_view)

    scaling_factors = (1 - alpha) ** np.arange(1, row_size + 1)
    last_scaling_factor = scaling_factors[-1]

    # create offset array
    offsets = np.empty(out_main_view.shape[0], dtype=dtype)
    offsets[0] = first_offset
    # iteratively calculate offset for each row
    for i in range(1, out_main_view.shape[0]):
        offsets[i] = offsets[i - 1] * last_scaling_factor + out_main_view[i - 1, -1]

    # add the offsets to the result
    out_main_view += offsets[:, np.newaxis] * scaling_factors[np.newaxis, :]

    if trailing_n > 0:
        # process trailing data in the 2nd slice of the out parameter
        ewma_vectorized(data[-trailing_n:], alpha, offset=out_main_view[-1, -1],
                        dtype=dtype, order='C', out=out[-trailing_n:])
    return out

def get_max_row_size(alpha, dtype=float):
    assert 0. <= alpha < 1.
    # This will return the maximum row size possible on 
    # your platform for the given dtype. I can find no impact on accuracy
    # at this value on my machine.
    # Might not be the optimal value for speed, which is hard to predict
    # due to numpy's optimizations
    # Use np.finfo(dtype).eps if you  are worried about accuracy
    # and want to be extra safe.
    epsilon = np.finfo(dtype).tiny
    # If this produces an OverflowError, make epsilon larger
    return int(np.log(epsilon)/np.log(1-alpha)) + 1

def ewma_vectorized(data, alpha, offset=None, dtype=None, order='C', out=None):
    """
    Calculates the exponential moving average over a vector.
    Will fail for large inputs.
    :param data: Input data
    :param alpha: scalar float in range (0,1)
        The alpha parameter for the moving average.
    :param offset: optional
        The offset for the moving average, scalar. Defaults to data[0].
    :param dtype: optional
        Data type used for calculations. Defaults to float64 unless
        data.dtype is float32, then it will use float32.
    :param order: {'C', 'F', 'A'}, optional
        Order to use when flattening the data. Defaults to 'C'.
    :param out: ndarray, or None, optional
        A location into which the result is stored. If provided, it must have
        the same shape as the input. If not provided or `None`,
        a freshly-allocated array is returned.
    """
    data = np.array(data, copy=False)

    if dtype is None:
        if data.dtype == np.float32:
            dtype = np.float32
        else:
            dtype = np.float64
    else:
        dtype = np.dtype(dtype)

    if data.ndim > 1:
        # flatten input
        data = data.reshape(-1, order)

    if out is None:
        out = np.empty_like(data, dtype=dtype)
    else:
        assert out.shape == data.shape
        assert out.dtype == dtype

    if data.size < 1:
        # empty input, return empty array
        return out

    if offset is None:
        offset = data[0]

    alpha = np.array(alpha, copy=False).astype(dtype, copy=False)

    # scaling_factors -> 0 as len(data) gets large
    # this leads to divide-by-zeros below
    scaling_factors = np.power(1. - alpha, np.arange(data.size + 1, dtype=dtype),
                               dtype=dtype)
    # create cumulative sum array
    np.multiply(data, (alpha * scaling_factors[-2]) / scaling_factors[:-1],
                dtype=dtype, out=out)
    np.cumsum(out, dtype=dtype, out=out)

    # cumsums / scaling
    out /= scaling_factors[-2::-1]

    if offset != 0:
        offset = np.array(offset, copy=False).astype(dtype, copy=False)
        # add offsets
        out += offset * scaling_factors[1:]

    return out

def ewma_vectorized_2d(data, alpha, axis=None, offset=None, dtype=None, order='C', out=None):
    """
    Calculates the exponential moving average over a given axis.
    :param data: Input data, must be 1D or 2D array.
    :param alpha: scalar float in range (0,1)
        The alpha parameter for the moving average.
    :param axis: The axis to apply the moving average on.
        If axis==None, the data is flattened.
    :param offset: optional
        The offset for the moving average. Must be scalar or a
        vector with one element for each row of data. If set to None,
        defaults to the first value of each row.
    :param dtype: optional
        Data type used for calculations. Defaults to float64 unless
        data.dtype is float32, then it will use float32.
    :param order: {'C', 'F', 'A'}, optional
        Order to use when flattening the data. Ignored if axis is not None.
    :param out: ndarray, or None, optional
        A location into which the result is stored. If provided, it must have
        the same shape as the desired output. If not provided or `None`,
        a freshly-allocated array is returned.
    """
    data = np.array(data, copy=False)

    assert data.ndim <= 2

    if dtype is None:
        if data.dtype == np.float32:
            dtype = np.float32
        else:
            dtype = np.float64
    else:
        dtype = np.dtype(dtype)

    if out is None:
        out = np.empty_like(data, dtype=dtype)
    else:
        assert out.shape == data.shape
        assert out.dtype == dtype

    if data.size < 1:
        # empty input, return empty array
        return out

    if axis is None or data.ndim < 2:
        # use 1D version
        if isinstance(offset, np.ndarray):
            offset = offset[0]
        return ewma_vectorized(data, alpha, offset, dtype=dtype, order=order,
                               out=out)

    assert -data.ndim <= axis < data.ndim

    # create reshaped data views
    out_view = out
    if axis < 0:
        axis = data.ndim - int(axis)

    if axis == 0:
        # transpose data views so columns are treated as rows
        data = data.T
        out_view = out_view.T

    if offset is None:
        # use the first element of each row as the offset
        offset = np.copy(data[:, 0])
    elif np.size(offset) == 1:
        offset = np.reshape(offset, (1,))

    alpha = np.array(alpha, copy=False).astype(dtype, copy=False)

    # calculate the moving average
    row_size = data.shape[1]
    row_n = data.shape[0]
    scaling_factors = np.power(1. - alpha, np.arange(row_size + 1, dtype=dtype),
                               dtype=dtype)
    # create a scaled cumulative sum array
    np.multiply(
        data,
        np.multiply(alpha * scaling_factors[-2], np.ones((row_n, 1), dtype=dtype),
                    dtype=dtype)
        / scaling_factors[np.newaxis, :-1],
        dtype=dtype, out=out_view
    )
    np.cumsum(out_view, axis=1, dtype=dtype, out=out_view)
    out_view /= scaling_factors[np.newaxis, -2::-1]

    if not (np.size(offset) == 1 and offset == 0):
        offset = offset.astype(dtype, copy=False)
        # add the offsets to the scaled cumulative sums
        out_view += offset[:, np.newaxis] * scaling_factors[np.newaxis, 1:]

    return out

  1. 参考文献:来源StackOverflow,作者Jake Walden:https://stackoverflow.com/questions/42869495/numpy-version-of-exponential-weighted-moving-average-equivalent-to-pandas-ewm ↩︎

posted @ 2020-01-31 01:47  JackiePENG  阅读(56)  评论(0编辑  收藏  举报  来源