Paper walkthrough

Flood prediction using Time Series Data Mining

http://www.sciencedirect.com/science/article/pii/S0022169406004331

Introduction

Hidden Markov Models (HMM) (Ayewah, 2003), Artificial Neural Networks (ANN) and Nonlinear Prediction (NLP) have been applied to discharge forecasting.
All of these methods have been tried before. The HMM, ANN and NLP methods predict future values of discharge, but flood forecasting requires predicting flood events, and Time Series Data Mining (TSDM) meets that need.

There are many algorithms for estimating the optimal embedding dimension and time delay.
For TSDM, accurate estimation of these two parameters is not important: TSDM does not exploit the predictive structure of the reconstructed phase space; instead, the reconstruction serves to provide a simplified representation of the nonlinear time series (huh, merely for simplification???)

The accurate calculation of time delay and number of embedding dimensions is not a requirement for TSDM. TSDM does not try to exploit the predictive structure of the reconstructed phase space. The purpose of a phase space reconstruction is to provide a simplified representation of the nonlinear time series.

Still, well-chosen parameters allow reconstructing a more simplified phase space that closely reflects the system's #dynamics# (this "dynamics" really just means how the state evolves with time).

However, selecting optimum time delay and embedding dimension would assist in reconstructing a more simplified phase space that closely reflects the true dynamics of the system. A simplified reconstruction should reduce the amount of computation to determine the optimal temporal pattern clusters as the number of dimensions to be searched is reduced.
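The delay embedding behind a phase space reconstruction is easy to sketch. A minimal illustration with NumPy; the series and the parameters m (embedding dimension) and tau (time delay) here are toy values, not the paper's:

```python
import numpy as np

def delay_embed(x, m, tau):
    """Reconstruct a phase space from series x: each row is the
    delay vector (x[t], x[t + tau], ..., x[t + (m - 1) * tau])."""
    n = len(x) - (m - 1) * tau          # number of reconstructed points
    return np.column_stack([x[i * tau:i * tau + n] for i in range(m)])

# toy example: a 10-point series with m=3, tau=2
pts = delay_embed(np.arange(10), m=3, tau=2)
print(pts.shape)   # (6, 3)
print(pts[0])      # [0 2 4]
```

Each row of `pts` is one point in the reconstructed phase space; TSDM then searches this space for temporal pattern clusters, which is why fewer dimensions means less search work.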

Application of TSDM to flood prediction

Before applying TSDM, the surrogate data method (Kantz and Schreiber, 1997) is used to confirm nonlinearity and rule out linearity; the authors also confirmed nonlinearity using the correlation dimension method in Sivakumar and Jayawardena (2002).

Threshold selection

This is important for accurate identification and prediction of flood events: a low threshold easily produces false positives, while a high threshold easily misses flood events.
The record covers April 1933 to September 2003 and contains 25,750 data points. Among the flood years (1943, 1944, 1947, 1973, 1993 and 1995), the lowest peak discharge is 780,000 ft³/s, which is chosen as the threshold. The first 15,000 points are used as the training set.
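The threshold rule amounts to: take the annual peak discharge of each known flood year, and use the minimum of those peaks, so every flood year still crosses the threshold while false positives stay low. A sketch with made-up peak values; only the 780,000 minimum is from the paper, the other figures are placeholders for what the USGS record would give:

```python
# hypothetical annual peak discharges (ft^3/s) for the flood years;
# the real peaks must be read from the USGS record
flood_year_peaks = {1943: 840_000, 1944: 844_000, 1947: 782_000,
                    1973: 852_000, 1993: 1_080_000, 1995: 780_000}

# threshold = lowest peak among the flood years
threshold = min(flood_year_peaks.values())
print(threshold)  # 780000
```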

Obtaining the data

The data mentioned in the paper:

flood forecasting at the St. Louis gauging station on the Mississippi River. The daily discharge time series is obtained from the United States Geological Survey website (http://www.usgs.gov), covering the period from April 1933 to September 2003, consisting of 25,750 data points.

After some digging, I found the data:

Search from this page: http://waterdata.usgs.gov/nwis/uv?site_no=07010000
Google keywords: daily discharge time series at saint louis
Final data page: http://waterdata.usgs.gov/nwis/dv?cb_00060=on&format=html&site_no=07010000&referred_module=sw&period=&begin_date=1933-04-01&end_date=2015-03-16

pandas' read_html cannot parse this page.
Looking at the page source, the data rows look like this:

<tr align="center"><td nowrap="nowrap"> 04/02/1933 </td><td nowrap="nowrap">245,000<sup>A&nbsp;&nbsp;</sup></td></tr>
<tr align="center"><td nowrap="nowrap"> 04/03/1933 </td><td nowrap="nowrap">264,000<sup>A&nbsp;&nbsp;</sup></td></tr>
<tr align="center"><td nowrap="nowrap"> 04/04/1933 </td><td nowrap="nowrap">280,000<sup>A&nbsp;&nbsp;</sup></td></tr>

So I had to write a regex instead; the code is as follows:

import re
import pandas as pd
import requests

url = r'http://waterdata.usgs.gov/nwis/dv?cb_00060=on&format=html&site_no=07010000&referred_module=sw&period=&begin_date=1933-04-01&end_date=2015-03-16'
r = requests.get(url)

# each data row looks like:
# <td nowrap="nowrap"> 04/02/1933 </td><td nowrap="nowrap">245,000<sup>...
p = re.compile(r"""
    <td\snowrap="nowrap">\s
    (?P<date>\d\d/\d\d/\d\d\d\d)
    \s</td><td\snowrap="nowrap">
    (?P<data>\d[\d,]*)
    """, re.VERBOSE)

date = []
data = []
for line in r.text.splitlines():    # iterate over the page text line by line
    m = p.search(line)
    if m:
        date.append(m.group('date'))
        data.append(m.group('data'))

df = pd.DataFrame({'Date': date, 'discharge': data})
df.to_csv('dataFlood.csv')
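To sanity-check the pattern without hitting the network, it can be run against one of the sample rows from the page source shown above (the pattern here is restated so the snippet is self-contained):

```python
import re

# same pattern as in the scraper, tested against a sample row
p = re.compile(r"""
    <td\snowrap="nowrap">\s
    (?P<date>\d\d/\d\d/\d\d\d\d)
    \s</td><td\snowrap="nowrap">
    (?P<data>\d[\d,]*)
    """, re.VERBOSE)

sample = ('<tr align="center"><td nowrap="nowrap"> 04/02/1933 </td>'
          '<td nowrap="nowrap">245,000<sup>A&nbsp;&nbsp;</sup></td></tr>')
m = p.search(sample)
print(m.group('date'), m.group('data'))  # 04/02/1933 245,000
```

Note that in re.VERBOSE mode unescaped whitespace in the pattern is ignored, which is why every literal space in the HTML is matched with an explicit \s.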

Data processing

df = pd.read_csv('dataFlood.csv')
df.info()
Data columns (total 3 columns):
Unnamed: 0    29935 non-null int64
Date          29935 non-null object
discharge     29935 non-null object
dtypes: int64(1), object(2)

The discharge column is string-typed, and df.discharge = df.discharge.astype('float64') fails directly because the values carry thousands separators, e.g. 245,000.

df.discharge = df.discharge.str.replace(',', '')
df.discharge = df.discharge.astype('float64')
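Alternatively, pandas can strip the separators at read time: read_csv accepts a thousands parameter, so the numbers parse directly on the round trip through the CSV. A small sketch with a simulated file (the real data would come from dataFlood.csv):

```python
import io
import pandas as pd

# simulate the saved CSV; comma-separated values are quoted
csv = io.StringIO('Date,discharge\n'
                  '04/02/1933,"245,000"\n'
                  '04/03/1933,"264,000"\n')
df2 = pd.read_csv(csv, thousands=',')
print(df2.discharge.dtype)   # int64
print(df2.discharge.sum())   # 509000
```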
Then save it back. By default to_csv also writes the index as a column; pass index=False to suppress it:
df.to_csv('dataFlood.csv', index=False)
Otherwise, after reading the file back you would have to drop the stray column with
df = df.drop([u'Unnamed: 0'], axis=1)
posted @ 2015-03-18 10:18 marquis