Lesson 3: Creating functions - Reading from Excel - Exporting to Excel - Outliers - Lambda functions - Slicing and dicing data

Lesson 3

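Get Data - Our data set will consist of an Excel file containing customer counts per date. We will learn how to read in the Excel file for processing.
Prepare Data - The data is an irregular time series with duplicate dates. We will be challenged with compressing the data and coming up with next year's forecasted customer count.
Analyze Data - We use graphs to visualize trends and spot outliers. Some built-in computational tools will be used to calculate next year's forecasted customer count.
Present Data - The results will be plotted.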

NOTE: Make sure you have looked through all previous lessons, as the knowledge learned in them will be needed for this exercise.

In [1]:

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np
import sys
import matplotlib

# IPython magic command: render plots inline in the notebook
%matplotlib inline

In [2]:

print('Python version ' + sys.version)
print('Pandas version: ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)

Python version 3.5.1 |Anaconda custom (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version: 0.20.1
Matplotlib version 1.5.1

We will be creating our own test data for analysis.

In [3]:

# set seed
np.seed(111)

# Function to generate test data
def CreateDataSet(Number=1):
    
    Output = []
    
    for i in range(Number):
        
        # Create a weekly (mondays) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
        
        # Create random data
        data = np.randint(low=25,high=1000,size=len(rng))
        
        # Status pool
        status = [1,2,3]
        
        # Make a random list of statuses
        random_status = [status[np.randint(low=0,high=len(status))] for i in range(len(rng))]
        
        # State pool
        states = ['GA','FL','fl','NY','NJ','TX']
        
        # Make a random list of states 
        random_states = [states[np.randint(low=0,high=len(states))] for i in range(len(rng))]
    
        Output.extend(zip(random_states, random_status, data, rng))
        
    return Output

Now that we have a function to generate our test data, let's create some data and stick it into a dataframe.

In [4]:

dataset = CreateDataSet(4)
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 836 entries, 0 to 835
Data columns (total 4 columns):
State            836 non-null object
Status           836 non-null int64
CustomerCount    836 non-null int64
StatusDate       836 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 26.2+ KB
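As a quick sanity check on the 836 entries reported by df.info(), the sketch below rebuilds only the weekly date range used inside CreateDataSet and counts its entries. The variable names (n_weeks, all_mondays, total_rows) are illustrative and not part of the lesson.

```python
import pandas as pd

# Same weekly (Monday-anchored) range that CreateDataSet builds on each pass
rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')

n_weeks = len(rng)                               # Mondays in the range
all_mondays = bool((rng.dayofweek == 0).all())   # dayofweek 0 means Monday
total_rows = 4 * n_weeks                         # CreateDataSet(4) repeats the range 4 times

print(n_weeks, all_mondays, total_rows)
```

Each pass through the loop contributes one row per Monday, so four passes give four times the length of the range.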

In [5]:

df.head()

Out[5]:

	State	Status	CustomerCount	StatusDate
0	GA	1	877	2009-01-05
1	FL	1	901	2009-01-12
2	fl	3	749	2009-01-19
3	FL	3	111	2009-01-26
4	GA	1	300	2009-02-02

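We are now going to save this dataframe to an Excel file and then read it back into a dataframe. We do this simply to show you how to read and write Excel files.

We do not write the index values of the dataframe to the Excel file, since they are not part of our initial test data set.

Note: in the pandas version used here, read_excel needs the xlrd library (version 0.9.0 or later), which you can install with pip install xlrd. Writing .xlsx files with to_excel additionally needs a writer engine such as openpyxl. Recent pandas releases read .xlsx files through openpyxl as well.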

In [6]:

# Save results to excel
df.to_excel('Lesson3.xlsx', index=False)
print('Done')

Done

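Grab Data from Excel

We will be using the read_excel function to read in data from an Excel file. The function allows you to read in specific sheets by name or position.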

In [7]:

pd.read_excel?

Note: unless specified otherwise, the Excel file is expected to be in the current working directory.

In [8]:

# Location of file
Location = r'C:\Users\david\notebooks\update\Lesson3.xlsx'

# Parse a specific sheet
df = pd.read_excel(Location, 0, index_col='StatusDate')
df.dtypes

Out[8]:

State            object
Status            int64
CustomerCount     int64
dtype: object

In [9]:

df.index

Out[9]:

DatetimeIndex(['2009-01-05', '2009-01-12', '2009-01-19', '2009-01-26',
               '2009-02-02', '2009-02-09', '2009-02-16', '2009-02-23',
               '2009-03-02', '2009-03-09',
               ...
               '2012-10-29', '2012-11-05', '2012-11-12', '2012-11-19',
               '2012-11-26', '2012-12-03', '2012-12-10', '2012-12-17',
               '2012-12-24', '2012-12-31'],
              dtype='datetime64[ns]', name='StatusDate', length=836, freq=None)

In [10]:

df.head()

Out[10]:

	State	Status	CustomerCount
StatusDate
2009-01-05	GA	1	877
2009-01-12	FL	1	901
2009-01-19	fl	3	749
2009-01-26	FL	3	111
2009-02-02	GA	1	300

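Prepare Data

This section attempts to clean up the data for analysis.

Make sure the State column is all in upper case
Only select records where the account status is equal to "1"
Merge (NJ and NY) into NY in the State column
Remove any outliers (any odd results in the data set)

Let's take a quick look: some State values are upper case and some are lower case.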

In [11]:

df['State'].unique()

Out[11]:

array(['GA', 'FL', 'fl', 'TX', 'NY', 'NJ'], dtype=object)

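To convert all the State values to upper case we will use the upper() string method and the apply method of the State column. The lambda function simply applies upper() to each value in the column.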

In [12]:

# Clean State Column, convert to upper case
df['State'] = df.State.apply(lambda x: x.upper())

In [13]:

df['State'].unique()

Out[13]:

array(['GA', 'FL', 'TX', 'NY', 'NJ'], dtype=object)
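For comparison, pandas also offers a vectorized string accessor that gives the same result as apply with a lambda. The sketch below uses a small hypothetical Series rather than the lesson's dataframe.

```python
import pandas as pd

# Hypothetical sample with mixed-case state codes, mirroring the lesson's pool
states = pd.Series(['GA', 'FL', 'fl', 'NY', 'NJ', 'TX'], name='State')

# Approach used in the lesson: call upper() on every value
via_apply = states.apply(lambda x: x.upper())

# Vectorized alternative through the .str accessor
via_str = states.str.upper()

unique_states = sorted(via_str.unique())
print(unique_states)
```

Both forms return a new Series; the .str accessor also passes missing values through instead of raising, which can matter on real data.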

In [14]:

# Only grab where Status == 1
mask = df['Status'] == 1
df = df[mask].copy()  # explicit copy, so later assignments change df itself

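To turn the NJ states into NY we simply...

df.State == 'NJ' - Find all records in the State column where the value is equal to NJ.
df.loc[df.State == 'NJ', 'State'] = 'NY' - For all records in the State column where the value is equal to NJ, replace it with NY. Assigning through .loc avoids the chained assignment df['State'][mask] = 'NY', which pandas flags with a SettingWithCopyWarning.

In [15]:

# Convert NJ to NY
mask = df.State == 'NJ'
df.loc[mask, 'State'] = 'NY'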

Let's look at the result.

In [16]:

df['State'].unique()

Out[16]:

array(['GA', 'FL', 'NY', 'TX'], dtype=object)
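The two cleaning steps above (keep only Status 1, then fold NJ into NY) can be sketched on a tiny hypothetical frame. The names toy and active and the sample values are illustrative assumptions, not data from the lesson.

```python
import pandas as pd

# Hypothetical miniature of the lesson's dataframe
toy = pd.DataFrame({
    'State': ['NJ', 'NY', 'GA', 'NJ'],
    'Status': [1, 1, 2, 1],
    'CustomerCount': [10, 20, 30, 40],
})

# Boolean mask: keep only rows where Status == 1 (copy so we own the result)
active = toy[toy['Status'] == 1].copy()

# Label-based assignment: rewrite State only on the rows where it equals 'NJ'
active.loc[active['State'] == 'NJ', 'State'] = 'NY'

print(active)
```

The original toy frame is left untouched because the filtered result was copied before the assignment.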

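At this point we may want to graph the data to check for any outliers or inconsistencies. We will be using the plot method of the dataframe. As you can see from the graph below, it is not very conclusive, which is probably a sign that we need to perform some more data preparation.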

In [17]:

df['CustomerCount'].plot(figsize=(15,5));

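If we take a look at the data, we begin to realize that there are multiple values for the same State, StatusDate, and Status combination. This may mean that the data you are working with is dirty, bad, or inaccurate, but we will assume otherwise. We can assume this data set is a subset of a bigger data set, and that if we simply add up the values in the CustomerCount column per State, StatusDate, and Status we will get the total customer count per day.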

In [18]:

sortdf = df[df['State']=='NY'].sort_index(axis=0)
sortdf.head(10)

Out[18]:

	State	Status	CustomerCount
StatusDate
2009-01-19	NY	1	522
2009-02-23	NY	1	710
2009-03-09	NY	1	992
2009-03-16	NY	1	355
2009-03-23	NY	1	728
2009-03-30	NY	1	863
2009-04-13	NY	1	520
2009-04-20	NY	1	820
2009-04-20	NY	1	937
2009-04-27	NY	1	447
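One way to surface such repeated dates programmatically, rather than by eye, is Index.duplicated. This is a minimal sketch on a hypothetical frame that reuses four of the NY rows shown above; the names ny and dupes are illustrative.

```python
import pandas as pd

# Four of the NY rows from the output above, with 2009-04-20 appearing twice
idx = pd.to_datetime(['2009-04-13', '2009-04-20', '2009-04-20', '2009-04-27'])
ny = pd.DataFrame({'State': 'NY', 'Status': 1,
                   'CustomerCount': [520, 820, 937, 447]}, index=idx)
ny.index.name = 'StatusDate'

# keep=False marks every member of a duplicated group, not just the repeats
dupes = ny[ny.index.duplicated(keep=False)]

print(dupes)
```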

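Our task is now to create a new dataframe that compresses the data so that we have daily customer counts per State and StatusDate. We can ignore the Status column, since all the values in this column are 1. To accomplish this we will use the dataframe's groupby() and sum() functions.

Note that we use reset_index so that both State and StatusDate are ordinary columns that groupby can take as keys. The reset_index function moves the StatusDate index back into a regular column of the dataframe.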

In [19]:

# Group by State and StatusDate
Daily = df.reset_index().groupby(['State','StatusDate']).sum()
Daily.head()

Out[19]:

		Status	CustomerCount
State	StatusDate
FL	2009-01-12	1	901
	2009-02-02	1	653
	2009-03-23	1	752
	2009-04-06	2	1086
	2009-06-08	1	649

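The State and StatusDate columns are automatically placed in the index of the Daily dataframe. You can think of the index as the primary key of a database table, but without the constraint of having unique values. Having columns in the index, as you will see, allows us to easily select, plot, and perform calculations on the data.

Below we delete the Status column, since every original record had a Status of 1 and the summed values are no longer needed.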

In [20]:

del Daily['Status']
Daily.head()

Out[20]:

		CustomerCount
State	StatusDate
FL	2009-01-12	901
	2009-02-02	653
	2009-03-23	752
	2009-04-06	1086
	2009-06-08	649

In [21]:

# What is the index of the dataframe
Daily.index

Out[21]:

MultiIndex(levels=[['FL', 'GA', 'NY', 'TX'], [2009-01-05 00:00:00, 2009-01-12 00:00:00, 2009-01-19 00:00:00, 2009-02-02 00:00:00, 2009-02-23 00:00:00, 2009-03-09 00:00:00, 2009-03-16 00:00:00, 2009-03-23 00:00:00, 2009-03-30 00:00:00, 2009-04-06 00:00:00, 2009-04-13 00:00:00, 2009-04-20 00:00:00, 2009-04-27 00:00:00, 2009-05-04 00:00:00, 2009-05-11 00:00:00, 2009-05-18 00:00:00, 2009-05-25 00:00:00, 2009-06-08 00:00:00, 2009-06-22 00:00:00, 2009-07-06 00:00:00, 2009-07-13 00:00:00, 2009-07-20 00:00:00, 2009-07-27 00:00:00, 2009-08-10 00:00:00, 2009-08-17 00:00:00, 2009-08-24 00:00:00, 2009-08-31 00:00:00, 2009-09-07 00:00:00, 2009-09-14 00:00:00, 2009-09-21 00:00:00, 2009-09-28 00:00:00, 2009-10-05 00:00:00, 2009-10-12 00:00:00, 2009-10-19 00:00:00, 2009-10-26 00:00:00, 2009-11-02 00:00:00, 2009-11-23 00:00:00, 2009-11-30 00:00:00, 2009-12-07 00:00:00, 2009-12-14 00:00:00, 2010-01-04 00:00:00, 2010-01-11 00:00:00, 2010-01-18 00:00:00, 2010-01-25 00:00:00, 2010-02-08 00:00:00, 2010-02-15 00:00:00, 2010-02-22 00:00:00, 2010-03-01 00:00:00, 2010-03-08 00:00:00, 2010-03-15 00:00:00, 2010-04-05 00:00:00, 2010-04-12 00:00:00, 2010-04-26 00:00:00, 2010-05-03 00:00:00, 2010-05-10 00:00:00, 2010-05-17 00:00:00, 2010-05-24 00:00:00, 2010-05-31 00:00:00, 2010-06-14 00:00:00, 2010-06-28 00:00:00, 2010-07-05 00:00:00, 2010-07-19 00:00:00, 2010-07-26 00:00:00, 2010-08-02 00:00:00, 2010-08-09 00:00:00, 2010-08-16 00:00:00, 2010-08-30 00:00:00, 2010-09-06 00:00:00, 2010-09-13 00:00:00, 2010-09-20 00:00:00, 2010-09-27 00:00:00, 2010-10-04 00:00:00, 2010-10-11 00:00:00, 2010-10-18 00:00:00, 2010-10-25 00:00:00, 2010-11-01 00:00:00, 2010-11-08 00:00:00, 2010-11-15 00:00:00, 2010-11-29 00:00:00, 2010-12-20 00:00:00, 2011-01-03 00:00:00, 2011-01-10 00:00:00, 2011-01-17 00:00:00, 2011-02-07 00:00:00, 2011-02-14 00:00:00, 2011-02-21 00:00:00, 2011-02-28 00:00:00, 2011-03-07 00:00:00, 2011-03-14 00:00:00, 2011-03-21 00:00:00, 2011-03-28 00:00:00, 2011-04-04 00:00:00, 2011-04-18 00:00:00, 
2011-04-25 00:00:00, 2011-05-02 00:00:00, 2011-05-09 00:00:00, 2011-05-16 00:00:00, 2011-05-23 00:00:00, 2011-05-30 00:00:00, 2011-06-06 00:00:00, 2011-06-20 00:00:00, 2011-06-27 00:00:00, 2011-07-04 00:00:00, 2011-07-11 00:00:00, 2011-07-25 00:00:00, 2011-08-01 00:00:00, 2011-08-08 00:00:00, 2011-08-15 00:00:00, 2011-08-29 00:00:00, 2011-09-05 00:00:00, 2011-09-12 00:00:00, 2011-09-26 00:00:00, 2011-10-03 00:00:00, 2011-10-24 00:00:00, 2011-10-31 00:00:00, 2011-11-07 00:00:00, 2011-11-14 00:00:00, 2011-11-28 00:00:00, 2011-12-05 00:00:00, 2011-12-12 00:00:00, 2011-12-19 00:00:00, 2011-12-26 00:00:00, 2012-01-02 00:00:00, 2012-01-09 00:00:00, 2012-01-16 00:00:00, 2012-02-06 00:00:00, 2012-02-13 00:00:00, 2012-02-20 00:00:00, 2012-02-27 00:00:00, 2012-03-05 00:00:00, 2012-03-12 00:00:00, 2012-03-19 00:00:00, 2012-04-02 00:00:00, 2012-04-09 00:00:00, 2012-04-23 00:00:00, 2012-04-30 00:00:00, 2012-05-07 00:00:00, 2012-05-14 00:00:00, 2012-05-28 00:00:00, 2012-06-04 00:00:00, 2012-06-18 00:00:00, 2012-07-02 00:00:00, 2012-07-09 00:00:00, 2012-07-16 00:00:00, 2012-07-30 00:00:00, 2012-08-06 00:00:00, 2012-08-20 00:00:00, 2012-08-27 00:00:00, 2012-09-03 00:00:00, 2012-09-10 00:00:00, 2012-09-17 00:00:00, 2012-09-24 00:00:00, 2012-10-01 00:00:00, 2012-10-08 00:00:00, 2012-10-22 00:00:00, 2012-10-29 00:00:00, 2012-11-05 00:00:00, 2012-11-12 00:00:00, 2012-11-19 00:00:00, 2012-11-26 00:00:00, 2012-12-10 00:00:00]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 3, 7, 9, 17, 19, 20, 21, 23, 25, 27, 28, 29, 30, 31, 35, 38, 40, 41, 44, 45, 46, 47, 48, 49, 52, 54, 56, 57, 59, 60, 62, 66, 68, 69, 70, 71, 72, 75, 76, 77, 78, 79, 85, 88, 89, 92, 96, 97, 99, 100, 101, 103, 104, 105, 108, 109, 110, 112, 114, 115, 117, 118, 119, 125, 126, 127, 128, 129, 131, 133, 134, 135, 136, 137, 140, 146, 150, 151, 152, 153, 157, 0, 3, 7, 22, 23, 24, 27, 28, 34, 37, 42, 47, 50, 55, 58, 66, 67, 69, 71, 73, 74, 75, 79, 82, 83, 84, 85, 91, 93, 95, 97, 106, 110, 120, 124, 125, 126, 127, 132, 133, 139, 143, 158, 159, 160, 2, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 19, 21, 22, 24, 26, 28, 29, 30, 31, 32, 33, 36, 39, 40, 42, 43, 51, 56, 61, 62, 63, 66, 67, 70, 71, 72, 73, 75, 78, 80, 81, 82, 83, 86, 87, 90, 91, 92, 94, 101, 102, 103, 105, 107, 108, 111, 113, 116, 118, 122, 125, 129, 130, 131, 132, 138, 139, 141, 142, 143, 144, 148, 149, 154, 156, 159, 160, 15, 16, 17, 18, 45, 47, 50, 53, 57, 61, 64, 65, 68, 84, 88, 94, 98, 107, 110, 112, 115, 121, 122, 123, 128, 130, 134, 135, 145, 146, 147, 148, 155]],
           names=['State', 'StatusDate'])

In [22]:

# Select the State index
Daily.index.levels[0]

Out[22]:

Index(['FL', 'GA', 'NY', 'TX'], dtype='object', name='State')

In [23]:

# Select the StatusDate index
Daily.index.levels[1]

Out[23]:

DatetimeIndex(['2009-01-05', '2009-01-12', '2009-01-19', '2009-02-02',
               '2009-02-23', '2009-03-09', '2009-03-16', '2009-03-23',
               '2009-03-30', '2009-04-06',
               ...
               '2012-09-24', '2012-10-01', '2012-10-08', '2012-10-22',
               '2012-10-29', '2012-11-05', '2012-11-12', '2012-11-19',
               '2012-11-26', '2012-12-10'],
              dtype='datetime64[ns]', name='StatusDate', length=161, freq=None)

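Let's now plot the data per State.

As you can see, breaking the graph up by the State column gives a much clearer picture of what the data looks like. Can you spot any outliers?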

In [24]:

Daily.loc['FL'].plot()
Daily.loc['GA'].plot()
Daily.loc['NY'].plot()
Daily.loc['TX'].plot();

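We can also plot the data from a specific date onward, such as 2012. Since the data consists of weekly customer counts, the variability of the data seems suspect. For this tutorial we will treat it as bad data and proceed to clean it.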

In [25]:

Daily.loc['FL']['2012':].plot()
Daily.loc['GA']['2012':].plot()
Daily.loc['NY']['2012':].plot()
Daily.loc['TX']['2012':].plot();
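The selections used in the plotting cells combine two ideas: picking one State from the outer index level, then slicing the remaining date index with a partial date string. A minimal sketch on a hypothetical three-row frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame indexed by (State, StatusDate), like Daily
idx = pd.MultiIndex.from_tuples(
    [('FL', pd.Timestamp('2011-12-26')),
     ('FL', pd.Timestamp('2012-01-02')),
     ('GA', pd.Timestamp('2012-01-09'))],
    names=['State', 'StatusDate'])
daily = pd.DataFrame({'CustomerCount': [100, 200, 300]}, index=idx)

# Selecting an outer-level label drops that level, leaving a date index
fl = daily.loc['FL']

# Partial string slicing: everything from the start of 2012 onward
fl_2012 = fl.loc['2012':]

state_labels = list(daily.index.levels[0])
print(fl_2012)
```

Writing the slice with .loc makes it explicit that rows are being selected by label.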

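We will assume that, per month, the customer count should remain relatively steady. Any data outside a specific range in that month will be removed from the data set. The final result should be smooth graphs with no spikes.

StateYearMonth - Here we group by State, year of StatusDate, and month of StatusDate.
Daily['Outlier'] - A boolean (True or False) value letting us know whether the value in the CustomerCount column is outside the acceptable range.

We use transform instead of apply. The reason is that transform keeps the shape (number of rows and columns) of the dataframe the same, while apply does not. Looking at the previous graphs, we can see that they do not resemble a Gaussian distribution, which means we cannot rely on summary statistics such as the mean and standard deviation. We use percentiles instead. Note that we run the risk of eliminating good data.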

In [26]:

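# Calculate Outliers
# Note: because of operator precedence, the bracketed term below evaluates to
# 1.5*Q3 - Q1 rather than the conventional 1.5*(Q3 - Q1). The outputs shown in
# this lesson were produced by the code exactly as written.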
StateYearMonth = Daily.groupby([Daily.index.get_level_values(0), Daily.index.get_level_values(1).year, Daily.index.get_level_values(1).month])
Daily['Lower'] = StateYearMonth['CustomerCount'].transform( lambda x: x.quantile(q=.25) - (1.5*x.quantile(q=.75)-x.quantile(q=.25)) )
Daily['Upper'] = StateYearMonth['CustomerCount'].transform( lambda x: x.quantile(q=.75) + (1.5*x.quantile(q=.75)-x.quantile(q=.25)) )
Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper']) 

# Remove Outliers
Daily = Daily[Daily['Outlier'] == False]

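The dataframe named Daily holds customer counts that have been aggregated per day. The original data (df) has multiple records per day. We are left with a data set that is indexed by both State and StatusDate. The Outlier column should be equal to False, signifying that the record is not an outlier.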

In [27]:

Daily.head()

Out[27]:

		CustomerCount	Lower	Upper	Outlier
State	StatusDate
FL	2009-01-12	901	450.5	1351.5	False
	2009-02-02	653	326.5	979.5	False
	2009-03-23	752	376.0	1128.0	False
	2009-04-06	1086	543.0	1629.0	False
	2009-06-08	649	324.5	973.5	False
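For reference, the conventional percentile-based rule (Tukey's fences) places the bounds 1.5 interquartile ranges outside the quartiles. The sketch below compares that rule with the expression as it is written in the cell above, on a small hypothetical sample, and then reproduces the first Lower/Upper pair from the output for a month that contains a single record. All variable names here are illustrative.

```python
import pandas as pd

# Hypothetical month of counts with one obvious spike
counts = pd.Series([100, 110, 120, 130, 1000])
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)

# Conventional fences: 1.5 * (Q3 - Q1) beyond each quartile
lower_std = q1 - 1.5 * (q3 - q1)
upper_std = q3 + 1.5 * (q3 - q1)

# Expression as written in the lesson: the bracket is 1.5*Q3 - Q1
lower_written = q1 - (1.5 * q3 - q1)
upper_written = q3 + (1.5 * q3 - q1)

flagged = counts[(counts < lower_std) | (counts > upper_std)]

# A month with a single record (Q1 == Q3 == 901), as in the first output row
single = pd.Series([901])
s_q1, s_q3 = single.quantile(0.25), single.quantile(0.75)
single_lower = s_q1 - (1.5 * s_q3 - s_q1)
single_upper = s_q3 + (1.5 * s_q3 - s_q1)

print(lower_std, upper_std, lower_written, upper_written)
```

With the expression as written, a single-record month gets bounds at half and one and a half times its value, which is why such rows are never flagged in the output above.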

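We create a separate dataframe named ALL, which groups the Daily dataframe by StatusDate. We are essentially getting rid of the State column. The Max column represents the maximum customer count per month. The Max column is used to smooth out the graph.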

In [28]:

# Combine all markets

# Get the total customer count by Date
ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())
ALL.columns = ['CustomerCount'] # rename column

# Group by Year and Month
YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])

# What is the max customer count per Year and Month
ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())
ALL.head()

Out[28]:

	CustomerCount	Max
StatusDate
2009-01-05	877	901
2009-01-12	901	901
2009-01-19	522	901
2009-02-02	953	953
2009-02-23	710	953

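As you can see from the ALL dataframe above, in January 2009 the maximum customer count was 901. If we had used apply, we would have gotten a dataframe with (Year and Month) as the index and just the Max column with the value of 901.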
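The difference in shape can be sketched with the five rows shown in the output above. Here the reducing side is written as a plain max() aggregation, which I am using as a stand-in for the apply call the text describes; the names all_markets, broadcast, and reduced are illustrative.

```python
import pandas as pd

# The five rows of ALL shown in the output above
idx = pd.to_datetime(['2009-01-05', '2009-01-12', '2009-01-19',
                      '2009-02-02', '2009-02-23'])
all_markets = pd.DataFrame({'CustomerCount': [877, 901, 522, 953, 710]},
                           index=idx)

# Group on year and month derived from the index, as in the lesson
grouped = all_markets.groupby([lambda x: x.year, lambda x: x.month])['CustomerCount']

# transform: one value per original row (the group's max repeated)
broadcast = grouped.transform('max')

# aggregation: one value per (year, month) group
reduced = grouped.max()

print(broadcast.tolist(), reduced.tolist())
```

Because transform returns a result aligned with the original rows, it can be assigned straight back as a new column, which is what the Max column relies on.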

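There is also an interest in gauging whether the current customer counts are reaching the goals the company has established. The task here is to show visually whether the current customer counts are meeting the goals listed below. We will call the goals BHAG (Big Hairy Annual Goal).

12/31/2011 - 1,000 customers
12/31/2012 - 2,000 customers
12/31/2013 - 3,000 customers

We will be using the date_range function to create our dates.

Definition: date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None)
Docstring: Return a fixed frequency datetime index, with day (calendar) as the default frequency

By choosing the frequency to be A or annual, we will be able to get the three target dates from above.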

In [29]:

pd.date_range?

In [30]:

# Create the BHAG dataframe
data = [1000,2000,3000]
idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])
BHAG

Out[30]:

	BHAG
2011-12-31	1000
2012-12-31	2000
2013-12-31	3000
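The same three year-end dates can also be produced with an explicit offset object instead of a frequency string. This is a sketch of an alternative spelling, useful mainly because, as far as I am aware, recent pandas releases deprecate the 'A' alias in favor of 'YE'; the lowercase name bhag is illustrative.

```python
import pandas as pd

# YearEnd offset: one timestamp on the last calendar day of each year
idx = pd.date_range(start='12/31/2011', end='12/31/2013',
                    freq=pd.offsets.YearEnd())

bhag = pd.DataFrame([1000, 2000, 3000], index=idx, columns=['BHAG'])

dates = [d.strftime('%Y-%m-%d') for d in idx]
print(dates)
```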

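Combining dataframes, as we learned in a previous lesson, is made simple using the concat function. Remember that when we choose axis = 0 we are appending row-wise.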

In [31]:

# Combine the BHAG and the ALL data set 
combined = pd.concat([ALL,BHAG], axis=0)
combined = combined.sort_index(axis=0)
combined.tail()

Out[31]:

	BHAG	CustomerCount	Max
2012-11-19	NaN	136.0	1115.0
2012-11-26	NaN	1115.0	1115.0
2012-12-10	NaN	1269.0	1269.0
2012-12-31	2000.0	NaN	NaN
2013-12-31	3000.0	NaN	NaN
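The NaN values in the output come from the two frames having no columns in common: concat with axis=0 stacks the rows and fills whatever a frame does not provide. A minimal sketch using a few of the values shown above (the lowercase frame names are illustrative):

```python
import pandas as pd

# Two rows from ALL and one row from BHAG, as seen in the output above
all_markets = pd.DataFrame({'CustomerCount': [1115, 1269], 'Max': [1115, 1269]},
                           index=pd.to_datetime(['2012-11-26', '2012-12-10']))
bhag = pd.DataFrame({'BHAG': [2000]},
                    index=pd.to_datetime(['2012-12-31']))

# Stack row-wise, then order by date
combined = pd.concat([all_markets, bhag], axis=0).sort_index()

missing_bhag = int(combined['BHAG'].isna().sum())
print(combined)
```

The column order in the result can differ between pandas versions, so it is safer to refer to columns by name than by position.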

In [32]:

fig, axes = plt.subplots(figsize=(12, 7))

# Forward-fill the BHAG targets; ffill() is equivalent to fillna(method='pad')
combined['BHAG'].ffill().plot(color='green', label='BHAG')
combined['Max'].plot(color='blue', label='All Markets')
plt.legend(loc='best');
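The padding step in the plot cell carries each BHAG target forward across the rows that have no target of their own, which is what turns three isolated points into a step line. A small sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical BHAG column after concat: targets separated by missing values
bhag = pd.Series([np.nan, 1000.0, np.nan, np.nan, 2000.0, np.nan])

# Forward fill: each gap takes the most recent non-missing value before it
filled = bhag.ffill()

print(filled.tolist())
```

Values before the first target stay missing, since there is nothing earlier to carry forward.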

There was also a need to forecast next year's customer count, and we can do this in a couple of simple steps. We first group the combined dataframe by year and take the maximum customer count for that year. This gives us one row per year.

In [33]:

# Group by Year and then get the max value per year
Year = combined.groupby(lambda x: x.year).max()
Year

Out[33]:

	BHAG	CustomerCount	Max
2009	NaN	2452.0	2452.0
2010	NaN	2065.0	2065.0
2011	1000.0	2711.0	2711.0
2012	2000.0	2061.0	2061.0
2013	3000.0	NaN	NaN

In [34]:

# Add a column representing the percent change per year
Year['YR_PCT_Change'] = Year['Max'].pct_change(periods=1)
Year

Out[34]:

	BHAG	CustomerCount	Max	YR_PCT_Change
2009	NaN	2452.0	2452.0	NaN
2010	NaN	2065.0	2065.0	-0.157830
2011	1000.0	2711.0	2711.0	0.312833
2012	2000.0	2061.0	2061.0	-0.239764
2013	3000.0	NaN	NaN	NaN

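To get next year's end customer count, we will assume our current growth rate remains constant. We then increase this year's customer count by that rate, and that will be our forecast for next year.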

In [35]:

(1 + Year.loc[2012,'YR_PCT_Change']) * Year.loc[2012,'Max']

Out[35]:

1566.8465510881595
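The arithmetic behind this forecast can be reproduced from the yearly maxima shown in the table above. The sketch below is self-contained and uses only those four numbers; the names yearly_max, pct, and forecast_2013 are illustrative.

```python
import pandas as pd

# Yearly maxima taken from the Max column of the table above
yearly_max = pd.Series([2452.0, 2065.0, 2711.0, 2061.0],
                       index=[2009, 2010, 2011, 2012])

# Year-over-year change: (this year / previous year) - 1
pct = yearly_max.pct_change(periods=1)

# Apply the latest growth rate to the latest maximum
forecast_2013 = (1 + pct.loc[2012]) * yearly_max.loc[2012]

print(round(forecast_2013, 4))
```

Since 2012 was lower than 2011, the growth rate is negative and the forecast falls below the 2012 maximum.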

Present Data

Create individual graphs per State.

In [36]:

# First Graph
ALL['Max'].plot(figsize=(10, 5));plt.title('ALL Markets')

# Last four Graphs
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
fig.subplots_adjust(hspace=1.0) ## Create space between plots

Daily.loc['FL']['CustomerCount']['2012':].ffill().plot(ax=axes[0,0])
Daily.loc['GA']['CustomerCount']['2012':].ffill().plot(ax=axes[0,1])
Daily.loc['TX']['CustomerCount']['2012':].ffill().plot(ax=axes[1,0])
Daily.loc['NY']['CustomerCount']['2012':].ffill().plot(ax=axes[1,1])

# Add titles
axes[0,0].set_title('Florida')
axes[0,1].set_title('Georgia')
axes[1,0].set_title('Texas')
axes[1,1].set_title('North East');

created by 六尺巷人

posted on 2018-05-19 23:03 by 六尺巷人
