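Data Analysis with Python
Original article: Step by step approach to perform data analysis using Python
Chinese translation: A step-by-step approach to data analysis with Python, by Michael翔
There are many excellent Python books and online courses, but I do not recommend some of them, because they are written for a general audience rather than for people who want to do data analysis. Likewise, many books cover "scientific programming with Python", but they are oriented toward assorted mathematical topics rather than data analysis and statistics. Do not waste your time on Python books written for a general audience.
Before going any further, set up your programming environment and learn how to use the IPython notebook.
Learning Path
Start with Codecademy and complete all of its exercises. At 3 hours a day, you should finish them within 20 days. Codecademy covers the basic concepts of Python. It is not project-oriented the way Udacity is, but that is fine, because your goal is to do data science, not to develop software in Python.
Once you have finished the Codecademy exercises, take a look at this IPython notebook:
Python Essentials Tutorial (I have provided the download link in the Summary section).
It covers some concepts that Codecademy does not mention. You can work through it in 1 to 2 hours.
You now know enough of the basics to start learning the Python libraries.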
Numpy
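Start with Numpy, because it is the fundamental package for scientific computing with Python. A good grasp of Numpy will help you use other tools, such as Pandas, effectively.
I have prepared an IPython notebook that covers the basic concepts of Numpy. The tutorial covers the most frequently used Numpy operations: N-dimensional arrays, indexing, array slicing, integer indexing, array conversion, universal functions, processing data with arrays, common statistical methods, and so on.
Numpy index: look up the usage of any unfamiliar Numpy function here. Recommended!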
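The operations listed above can be sketched in a few lines. This is only a minimal illustration, not the content of the notebook itself; the array `a` and the variable names are made up for the example.

```python
import numpy as np

# A small 2-D array (3 rows, 4 columns), used only for illustration.
a = np.arange(12).reshape(3, 4)

first_row = a[0]             # indexing: the first row
block = a[:2, 1:3]           # slicing: rows 0-1, columns 1-2
picked = a[[0, 2], [1, 3]]   # integer indexing: elements (0, 1) and (2, 3)
as_float = a.astype(float)   # array conversion to another dtype
roots = np.sqrt(as_float)    # universal function, applied element-wise
col_means = a.mean(axis=0)   # a common statistic, computed per column

print(block)
print(picked)
print(col_means)
```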
Pandas
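Pandas provides high-level data structures and manipulation tools that make data analysis in Python faster and easier.
The tutorial covers Series, DataFrames, dropping entries from an axis, handling missing data, and so on.
Pandas index: look up the usage of any unfamiliar Pandas function here. Recommended!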
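As a rough sketch of the topics named above (Series, DataFrames, dropping entries from an axis, missing data), something like the following should work. The data and variable names are invented for illustration and do not come from the tutorial notebook.

```python
import numpy as np
import pandas as pd

# A Series and a DataFrame that share the same row labels; NaN marks missing data.
s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, np.nan, 6.0]},
                  index=["a", "b", "c"])

without_row = df.drop("a")           # drop a label from the row axis
without_col = df.drop("y", axis=1)   # drop a label from the column axis
filled = s.fillna(0.0)               # replace missing values with 0
complete = df.dropna()               # keep only rows without missing values

print(without_row)
print(filled)
print(complete)
```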
Matplotlib
This is a four-part Matplotlib tutorial.
Part 1:
Part 1 introduces the basic features of Matplotlib and the basic figure types.
Simple Plotting example
%matplotlib inline
import matplotlib.pyplot as plt # import Matplotlib's pyplot interface
import numpy as np
x = range(100)
#print x, print and check what is x
y =[val**2 for val in x]
#print y
plt.plot(x,y) #plotting x and y
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in axes:
    ax.plot(x, y, 'r')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title('title')
fig.tight_layout()
Using Numpy
x = np.linspace(0, 2*np.pi, 100)
y =np.sin(x)
plt.plot(x,y)
x= np.linspace(-3,2, 200)
Y = x ** 2 - 2 * x + 1.
plt.plot(x,Y)
# plotting multiple plots
x =np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
z = np.cos(x)
plt.plot(x,y)
plt.plot(x,z)
plt.show()
# Matplot lib picks different colors for different plot.
cd C:\Users\tk\Desktop\Matplot
data = np.loadtxt('numpy.txt')
plt.plot(data[:,0], data[:,1]) # plotting column 1 vs column 2
# The text in the numpy.txt should look like this
# 0 0
# 1 1
# 2 4
# 4 16
# 5 25
# 6 36
data1 = np.loadtxt('scipy.txt') # load the file
print(data1.T)
for val in data1.T: # loop over each row of data1.T, i.e. each column of data1
    plt.plot(data1[:,0], val) # data1[:,0] is the first column of data1, i.e. the first row of data1.T
# data in scipy.txt looks like this:
# 0 0 6
# 1 1 5
# 2 4 4
# 4 16 3
# 5 25 2
# 6 36 1
Scatter Plots and Bar Graphs
sct = np.random.rand(20, 2)
print(sct)
plt.scatter(sct[:,0], sct[:,1]) # I am plotting a scatter plot.
ghj =[5, 10 ,15, 20, 25]
it =[ 1, 2, 3, 4, 5]
plt.bar(ghj, it) # simple bar graph
ghj =[5, 10 ,15, 20, 25]
it =[ 1, 2, 3, 4, 5]
plt.bar(ghj, it, width =5)# you can change the thickness of a bar, by default the bar will have a thickness of 0.8 units
ghj =[5, 10 ,15, 20, 25]
it =[ 1, 2, 3, 4, 5]
plt.barh(ghj, it) # barh is a horizontal bar graph
Multiple bar charts
new_list = [[5., 25., 50., 20.], [4., 23., 51., 17.], [6., 22., 52., 19.]]
x = np.arange(4)
plt.bar(x + 0.00, new_list[0], color ='b', width =0.25)
plt.bar(x + 0.25, new_list[1], color ='r', width =0.25)
plt.bar(x + 0.50, new_list[2], color ='g', width =0.25)
#plt.show()
#Stacked Bar charts
p = [5., 30., 45., 22.]
q = [5., 25., 50., 20.]
x =range(4)
plt.bar(x, p, color ='b')
plt.bar(x, q, color ='y', bottom =p)
# plotting more than 2 values
A = np.array([5., 30., 45., 22.])
B = np.array([5., 25., 50., 20.])
C = np.array([1., 2., 1., 1.])
X = np.arange(4)
plt.bar(X, A, color = 'b')
plt.bar(X, B, color = 'g', bottom = A)
plt.bar(X, C, color = 'r', bottom = A + B) # for the third argument, I use A+B
plt.show()
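The pattern in the three-layer example above (each layer's `bottom` is the sum of all layers below it) generalizes to any number of layers with a cumulative sum. The sketch below only computes the `bottom` arrays; the name `layers` is an assumption made for this illustration.

```python
import numpy as np

# One row per layer, using the same values as A, B and C above.
layers = np.array([[5., 30., 45., 22.],
                   [5., 25., 50., 20.],
                   [1., 2., 1., 1.]])

tops = np.cumsum(layers, axis=0)   # running total: the top edge of each layer
bottoms = tops - layers            # where each layer starts

print(bottoms)
```

Each layer could then be drawn with plt.bar(X, layers[i], bottom=bottoms[i]) inside a loop.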
black_money = np.array([5., 30., 45., 22.])
white_money = np.array([5., 25., 50., 20.])
z = np.arange(4)
plt.barh(z, black_money, color ='g')
plt.barh(z, -white_money, color ='r')# - notation is needed for generating, back to back charts
Other Plots
#Pie charts
y = [5, 25, 45, 65]
plt.pie(y)
#Histograms
d = np.random.randn(100)
plt.hist(d, bins = 20)
d = np.random.randn(100)
plt.boxplot(d)
#1) The red bar is the median of the distribution
#2) The blue box includes 50 percent of the data from the lower quartile to the upper quartile.
# Thus, the box is centered on the median of the data.
d = np.random.randn(100, 5) # generating multiple box plots
plt.boxplot(d)
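The quantities that the box plot comments describe (the median, and the box spanning the lower to the upper quartile) can also be computed directly. This is a small sketch using np.percentile; the fixed seed is an assumption added so the numbers are reproducible.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed, only for reproducibility
d = rng.randn(100)

# Lower quartile, median and upper quartile of the sample.
q1, median, q3 = np.percentile(d, [25, 50, 75])

# Fraction of the data that falls inside the box (between q1 and q3).
inside = np.mean((d >= q1) & (d <= q3))

print(q1, median, q3, inside)
```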
Part 2:
Part 2 covers how to adjust the style and color of a figure, for example markers, line thickness, line patterns, and color maps.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
p =np.random.standard_normal((50,2))
p += np.array((-1,1)) # center the distribution at (-1,1)
q =np.random.standard_normal((50,2))
q += np.array((1,1)) #center the distribution at (1,1)
plt.scatter(p[:,0], p[:,1], color ='.25')
plt.scatter(q[:,0], q[:,1], color = '.75')
dd =np.random.standard_normal((50,2))
plt.scatter(dd[:,0], dd[:,1], color ='1.0', edgecolor ='0.0') # edge color controls the color of the edge
Custom Color for Bar charts,Pie charts and box plots:
The bar graph below plots x (0 to 49) against y (50 random integers between 1 and 99). We want the color of each bar to depend on its value, so we create a list of four colors (color_set). The list comprehension then picks one of those four colors for each of the 50 values.
vals = np.random.randint(1, 100, size =50) # random_integers is deprecated in NumPy; randint's upper bound is exclusive
color_set = ['.00', '.25', '.50','.75']
color_lists = [color_set[(len(color_set)* val) // 100] for val in vals]
c = plt.bar(np.arange(50), vals, color = color_lists)
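To see what the list comprehension does, it helps to run it on a few hand-picked values instead of random ones. In this sketch the contents of `vals` are chosen for illustration; the expression itself is the one used above.

```python
import numpy as np

color_set = ['.00', '.25', '.50', '.75']

# Values at the edges of each quarter of the 0-99 range.
vals = np.array([0, 24, 25, 49, 50, 74, 75, 99])

# (4 * val) // 100 maps 0-24 to index 0, 25-49 to 1, 50-74 to 2 and 75-99 to 3.
color_lists = [color_set[(len(color_set) * val) // 100] for val in vals]

print(color_lists)
# ['.00', '.00', '.25', '.25', '.50', '.50', '.75', '.75']
```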
hi = np.random.randint(1, 9, size =10) # 10 random integers between 1 and 8
color_set =['.00', '.25', '.50', '.75']
plt.pie(hi, colors = color_set)# the colors argument accepts a sequence of colors
plt.show()
#If there are fewer colors than values, pyplot.pie() simply cycles through the color list. In the preceding
#example, we gave a list of four colors to a pie chart with ten values, so the colors repeat: the first two are used three times and the last two twice
values = np.random.randn(100)
w = plt.boxplot(values)
for att, lines in w.items(): # w is a dict; iteritems() exists only in Python 2
    for l in lines:
        l.set_color('k')
Color Maps
know more about hsv
# how to color scatter plots
#Colormaps are defined in the matplotlib.cm module. This module provides
#functions to create and use colormaps. It also provides an exhaustive choice of predefined color maps.
import matplotlib.cm as cm
N = 256
angle = np.linspace(0, 8 * 2 * np.pi, N)
radius = np.linspace(.5, 1., N)
X = radius * np.cos(angle)
Y = radius * np.sin(angle)
plt.scatter(X,Y, c=angle, cmap = cm.hsv)
#Color in bar graphs
import matplotlib.cm as cm
import matplotlib.colors as col # provides Normalize
vals = np.random.randint(1, 100, size =50) # random_integers is deprecated in NumPy
cmap = cm.ScalarMappable(col.Normalize(0,99), cm.binary)
plt.bar(np.arange(len(vals)),vals, color =cmap.to_rgba(vals))
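The ScalarMappable used above does two things: Normalize rescales the values to the range 0 to 1, and the colormap turns that number into an RGBA color. The sketch below checks both steps on a few fixed values; the names `norm`, `mapper` and `rgba` are chosen for this example only.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as col

norm = col.Normalize(0, 99)                    # maps 0 -> 0.0 and 99 -> 1.0
scaled = norm(np.array([0, 49.5, 99]))         # values rescaled to [0, 1]

mapper = cm.ScalarMappable(norm, cm.binary)    # binary runs from white to black
rgba = mapper.to_rgba(np.array([0, 99]))       # one (r, g, b, a) row per value

print(scaled)
print(rgba)
```

With the binary colormap, the smallest value maps to white and the largest to black, which is why taller bars appear darker.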
Line Styles
# I am creating 3 levels of gray plots, with different line shades
def pq(I, mu, sigma):
    a = 1. / (sigma * np.sqrt(2. * np.pi))
    b = -1. / (2. * sigma ** 2)
    return a * np.exp(b * (I - mu) ** 2)
I =np.linspace(-6,6, 1024)
plt.plot(I, pq(I, 0., 1.), color = 'k', linestyle ='solid')
plt.plot(I, pq(I, 0., .5), color = 'k', linestyle ='dashed')
plt.plot(I, pq(I, 0., .25), color = 'k', linestyle ='dashdot')
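The function pq is the normal (Gaussian) probability density with mean mu and standard deviation sigma, so the three curves differ only in width and height. As a quick sanity check, not part of the original tutorial, the area under each curve should be close to 1 and the peak should equal 1 / (sigma * sqrt(2 * pi)). The dictionary `areas` is a name made up for this sketch.

```python
import numpy as np

def pq(I, mu, sigma):
    # Normal probability density with mean mu and standard deviation sigma.
    a = 1. / (sigma * np.sqrt(2. * np.pi))
    b = -1. / (2. * sigma ** 2)
    return a * np.exp(b * (I - mu) ** 2)

I = np.linspace(-6, 6, 1024)
dx = I[1] - I[0]

# Approximate the area under each curve with a simple Riemann sum.
areas = {sigma: np.sum(pq(I, 0., sigma)) * dx for sigma in (1., .5, .25)}

for sigma, area in areas.items():
    print(sigma, round(area, 4), round(pq(0., 0., sigma), 4))
```

A smaller sigma gives a narrower and taller curve, since the total area stays the same.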
N = 15
A = np.random.random(N)
B= np.random.random(N)
X = np.arange(N)
plt.bar(X, A, color ='.75')
plt.bar(X, A+B , bottom = A, color ='w', edgecolor ='k', linestyle ='dashed') # white bars; the edge color makes the dashed outline visible
plt.show()
def gf(X, mu, sigma):
    a = 1. / (sigma * np.sqrt(2. * np.pi))
    b = -1. / (2. * sigma ** 2)
    return a * np.exp(b * (X - mu) ** 2)
X = np.linspace(-6, 6, 1024)
for i in range(64):
    samples = np.random.standard_normal(50)
    mu,sigma = np.mean(samples), np.std(samples)
    plt.plot(X, gf(X, mu, sigma), color = '.75', linewidth = .5)
plt.plot(X, gf(X, 0., 1.), color ='.00', linewidth = 3.)
Fill surfaces with pattern
N = 15
A = np.random.random(N)
B= np.random.random(N)
X = np.arange(N)
plt.bar(X, A, color ='w', hatch ='x')
plt.bar(X, A+B,bottom =A, color ='r', hatch ='/')
# some other hatch attributes are :
#/
#\
#|
#-
#+
#x
#o
#O
#.
#*
Marker styles
cd C:\Users\tk\Desktop\Matplot
Come back to this section later
X= np.linspace(-6,6,1024)
Ya =np.sinc(X)
Yb = np.sinc(X) +1
plt.plot(X, Ya, marker ='o', color ='.75')
plt.plot(X, Yb, marker ='^', color='.00', markevery= 32)# this one marks every 32 nd element
# Marker Size
A = np.random.standard_normal((50,2))
A += np.array((-1,1))
B = np.random.standard_normal((50,2))
B += np.array((1, 1))
plt.scatter(A[:,0], A[:,1], color ='k', s =25.0)
plt.scatter(B[:,0], B[:,1], color ='g', s = 100.0) # size of the marker is specified using 's' attribute
Own Marker Shapes- come back to this later
# more about markers
X =np.linspace(-6,6, 1024)
Y =np.sinc(X)
plt.plot(X,Y, color ='r', marker ='o', markersize =9, markevery = 30, markerfacecolor='w', linewidth = 3.0, markeredgecolor = 'b')
import matplotlib as mpl
mpl.rc('lines', linewidth =3)
mpl.rc('xtick', color ='w') # color of x axis numbers
mpl.rc('ytick', color = 'w') # color of y axis numbers
mpl.rc('axes', facecolor ='g', edgecolor ='y') # color of axes
mpl.rc('figure', facecolor ='.00',edgecolor ='w') # color of figure
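from cycler import cycler # color_cycle was replaced by prop_cycle in newer Matplotlib versions
mpl.rc('axes', prop_cycle = cycler(color = ['y','r'])) # color of plots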
x = np.linspace(0, 7, 1024)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
Part 3:
Annotating figures: covers working with several figures, controlling axis ranges, the aspect ratio, and the coordinate axes.
Annotation
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
X =np.linspace(-6,6, 1024)
Y =np.sinc(X)
plt.title('A simple marker exercise')# a title notation
plt.xlabel('array variables') # adding xlabel
plt.ylabel(' random variables') # adding ylabel
plt.text(-5, 0.4, 'Matplotlib') # -5 is the x value and 0.4 is y value
plt.plot(X,Y, color ='r', marker ='o', markersize =9, markevery = 30, markerfacecolor='w', linewidth = 3.0, markeredgecolor = 'b')
def pq(I, mu, sigma):
    a = 1. / (sigma * np.sqrt(2. * np.pi))
    b = -1. / (2. * sigma ** 2)
    return a * np.exp(b * (I - mu) ** 2)
I =np.linspace(-6,6, 1024)
plt.plot(I, pq(I, 0., 1.), color = 'k', linestyle ='solid')
plt.plot(I, pq(I, 0., .5), color = 'k', linestyle ='dashed')
plt.plot(I, pq(I, 0., .25), color = 'k', linestyle ='dashdot')
# I have created a dictionary of styles
design = {
    'facecolor' : 'y', # color used for the text box
    'edgecolor' : 'g',
    'boxstyle' : 'round'
}
plt.text(-4, 1.5, 'Matplot Lib', bbox = design)
plt.plot(X, Y, c='k')
plt.show()
#This sets the style of the box, which can either be 'round' or 'square'
#'pad': If 'boxstyle' is set to 'square', it defines the amount of padding between the text and the box's sides
Alignment Control
The text is bound by a box. This box is used to relatively align the text to the coordinates passed to pyplot.text(). Using the verticalalignment and horizontalalignment parameters (respective shortcut equivalents are va and ha), we can control how the alignment is done.
The vertical alignment options are as follows:
'center': This is relative to the center of the textbox
'top': This is relative to the upper side of the textbox
'bottom': This is relative to the lower side of the textbox
'baseline': This is relative to the text's baseline
[Figure: a line of text with guide lines marking the 'top', 'center', 'baseline' and 'bottom' vertical alignment positions]
cd C:\Users\tk\Desktop
from IPython.display import Image
Image(filename='text alignment.png')
#The horizontal alignment options are as follows:
#'center': This is relative to the center of the textbox
#'left': This is relative to the left side of the textbox
#'right': This is relative to the right-hand side of the textbox
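A small sketch of the alignment options described above: the same anchor height is used for four labels, and only the vertical alignment changes. The list `labels` and the coordinates are made up for this example, and the Agg backend line is an assumption for running outside a notebook (drop it when using %matplotlib inline).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axhline(0.5, color='.75')  # guide line at the anchor height y = 0.5

labels = []
positions = (0.15, 0.4, 0.65, 0.9)
options = ('top', 'center', 'baseline', 'bottom')
for x, va in zip(positions, options):
    # Every label is anchored at y = 0.5; va decides which part of the
    # text box sits on the guide line.
    labels.append(ax.text(x, 0.5, va, va=va, ha='center'))

fig.savefig('alignment_demo.png')
```

In the saved image, the 'top' label hangs below the guide line and the 'bottom' label sits above it.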
X = np.linspace(-4, 4, 1024)
Y = .25 * (X + 4.) * (X + 1.) * (X - 2.)
plt.annotate('Big Data',
             ha ='center', va ='bottom',
             xytext =(-1.5, 3.0), xy =(0.75, -2.7),
             arrowprops ={'facecolor': 'green', 'shrink':0.05, 'edgecolor': 'black'}) #arrow properties
plt.plot(X, Y)
#arrow styles are :
from IPython.display import Image
Image(filename='arrows.png')
Legend properties:
'loc': This is the location of the legend. The default value is 'best', which will place it automatically. Other valid values are
'upper right', 'upper left', 'lower left', 'lower right', 'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'.
'shadow': This can be either True or False, and it renders the legend with a shadow effect.
'fancybox': This can be either True or False and renders the legend with a rounded box.
'title': This renders the legend with the title passed as a parameter.
'ncol': This forces the passed value to be the number of columns for the legend
x =np.linspace(0, 6,1024)
y1 =np.sin(x)
y2 =np.cos(x)
plt.xlabel('Sin Wave')
plt.ylabel('Cos Wave')
plt.plot(x, y1, c='b', lw =3.0, label ='Sin(x)') # labels are specified
plt.plot(x, y2, c ='r', lw =3.0, ls ='--', label ='Cos(x)')
plt.legend(loc ='best', shadow = True, fancybox = False, title ='Waves', ncol =1) # displays the labels
plt.grid(True, lw = 2, ls ='--', c='.75') # adds grid lines to the figure
plt.show()
Shapes
#Paths for several kinds of shapes are available in the matplotlib.patches module
import matplotlib.patches as patches
dis = patches.Circle((0,0), radius = 1.0, color ='.75' )
plt.gca().add_patch(dis) # used to render the image.
dis = patches.Rectangle((2.5, -.5), 2.0, 1.0, color ='.75') #patches.Rectangle((x, y) of the lower-left corner, width, height)
plt.gca().add_patch(dis)
dis = patches.Ellipse((0, -2.0), 2.0, 1.0, angle =45, color ='.00')
plt.gca().add_patch(dis)
dis = patches.FancyBboxPatch((2.5, -2.5), 2.0, 1.0, boxstyle ='roundtooth', color ='g')
plt.gca().add_patch(dis)
plt.grid(True)
plt.axis('scaled') # displays the images within the prescribed axis
plt.show()
#FancyBox: This is like a rectangle but takes an additional boxstyle parameter
#(either 'larrow', 'rarrow', 'round', 'round4', 'roundtooth', 'sawtooth', or 'square')
import matplotlib.patches as patches
theta = np.linspace(0, 2 * np.pi, 8) # generates an array
vertical = np.vstack((np.cos(theta), np.sin(theta))).transpose() # vertical stack clubs the two arrays.
#print vertical, print and see how the array looks
plt.gca().add_patch(patches.Polygon(vertical, color ='y'))
plt.axis('scaled')
plt.grid(True)
plt.show()
#The matplotlib.patches.Polygon()constructor takes a list of coordinates as the inputs, that is, the vertices of the polygon
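One detail worth noting about the vertex arrays above: np.linspace(0, 2 * np.pi, 8) includes both end points, so the first and last vertices coincide and the polygon has 7 distinct corners. A hedged sketch of a helper that returns exactly n distinct vertices follows; `regular_polygon` is a hypothetical name, not a Matplotlib function.

```python
import numpy as np

def regular_polygon(n):
    # Hypothetical helper: n distinct vertices on the unit circle.
    # endpoint=False leaves out 2*pi, which would duplicate the vertex at angle 0.
    theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
    return np.vstack((np.cos(theta), np.sin(theta))).T

# With the default endpoint=True, the last vertex repeats the first one.
theta_closed = np.linspace(0, 2 * np.pi, 8)
closed_vertices = np.vstack((np.cos(theta_closed), np.sin(theta_closed))).T

print(regular_polygon(8).shape)
print(np.allclose(closed_vertices[0], closed_vertices[-1]))
```

The result of regular_polygon(n) can be passed to patches.Polygon in the same way as the vertical array above.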
# a polygon can be inscribed in a circle
theta = np.linspace(0, 2 * np.pi, 6) # generates an array
vertical = np.vstack((np.cos(theta), np.sin(theta))).transpose() # vertical stack clubs the two arrays.
#print vertical, print and see how the array looks
plt.gca().add_patch(plt.Circle((0,0), radius =1.0, color ='b'))
plt.gca().add_patch(plt.Polygon(vertical, fill =None, lw =4.0, ls ='dashed', edgecolor ='w'))
plt.axis('scaled')
plt.grid(True)
plt.show()
Ticks in Matplotlib
#In matplotlib, ticks are small marks on both the axes of a figure
import matplotlib.ticker as ticker
X = np.linspace(-12, 12, 1024)
Y = .25 * (X + 4.) * (X + 1.) * (X - 2.)
pl =plt.axes() #the object that manages the axes of a figure
pl.xaxis.set_major_locator(ticker.MultipleLocator(5))
pl.xaxis.set_minor_locator(ticker.MultipleLocator(1))
plt.plot(X, Y, c = 'y')
plt.grid(True, which ='major') # which can take three values: minor, major and both
plt.show()
name_list = ('Omar', 'Serguey', 'Max', 'Zhou', 'Abidin')
value_list = np.random.randint(0, 99, size = len(name_list))
pos_list = np.arange(len(name_list))
ax = plt.axes()
ax.xaxis.set_major_locator(ticker.FixedLocator((pos_list)))
ax.xaxis.set_major_formatter(ticker.FixedFormatter((name_list)))
plt.bar(pos_list, value_list, color = '.75',align = 'center')
plt.show()
Part 4:
Part 4 covers some more complex figures.
Working with figures
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
T = np.linspace(-np.pi, np.pi, 1024) #
fig, (ax0, ax1) = plt.subplots(ncols =2)
ax0.plot(np.sin(2 * T), np.cos(0.5 * T), c = 'k')
ax1.plot(np.cos(3 * T), np.sin(T), c = 'k')
plt.show()
Setting aspect ratio
T = np.linspace(0, 2 * np.pi, 1024)
plt.plot(2. * np.cos(T), np.sin(T), c = 'k', lw = 3.)
plt.gca().set_aspect('equal') # gca() returns the current axes; remove this line of code and see how the figure looks
plt.show()
X = np.linspace(-6, 6, 1024)
Y1, Y2 = np.sinc(X), np.cos(X)
plt.figure(figsize=(10.24, 2.56)) #sets size of the figure
plt.plot(X, Y1, c='r', lw = 3.)
plt.plot(X, Y2, c='.75', lw = 3.)
plt.show()
X = np.linspace(-6, 6, 1024)
plt.ylim(-.5, 1.5)
plt.plot(X, np.sinc(X), c = 'k')
plt.show()
X = np.linspace(-6, 6, 1024)
Y = np.sinc(X)
X_sub = np.linspace(-3, 3, 1024)#coordinates of subplot
Y_sub = np.sinc(X_sub) # coordinates of sub plot
plt.plot(X, Y, c = 'b')
sub_axes = plt.axes([.6, .6, .25, .25])# [left, bottom, width, height] of the inset axes, as fractions of the figure
sub_axes.plot(X_sub, Y_sub, c = 'r')
plt.show()
Log Scale
X = np.linspace(1, 10, 1024)
plt.yscale('log') # set the y scale to log; use plt.xscale('log') for the x axis
plt.plot(X, X, c = 'k', lw = 2., label = r'$f(x)=x$')
plt.plot(X, 10 ** X, c = '.75', ls = '--', lw = 2., label = r'$f(x)=10^x$')
plt.plot(X, np.log(X), c = '.75', lw = 2., label = r'$f(x)=\log(x)$')
plt.legend()
plt.show()
#The logarithm base is 10 by default, but it can be changed with the optional parameters basex and basey (renamed to base in Matplotlib 3.3 and later).
Polar Coordinates
T = np.linspace(0 , 2 * np.pi, 1024)
plt.axes(polar = True) # show polar coordinates
plt.plot(T, 1. + .25 * np.sin(16 * T), c= 'k')
plt.show()
import matplotlib.patches as patches # import patch module from matplotlib
ax = plt.axes(polar = True)
theta = np.linspace(0, 2 * np.pi, 8, endpoint = False)
radius = .25 + .75 * np.random.random(size = len(theta))
points = np.vstack((theta, radius)).transpose()
plt.gca().add_patch(patches.Polygon(points, color = '.75'))
plt.show()
x = np.linspace(-6,6,1024)
y= np.sin(x)
plt.plot(x,y)
plt.savefig('bigdata.png', transparent = True) # savefig writes the figure to a file
# This creates a file named bigdata.png. Its size in pixels is the figure size in inches times the dpi: 800 x 600 pixels for an 8 x 6 inch figure at 100 dpi, in 8-bit colors (24-bits per pixel)
theta =np.linspace(0, 2 *np.pi, 8)
points =np.vstack((np.cos(theta), np.sin(theta))).T
plt.figure(figsize =(6.0, 6.0))
plt.gca().add_patch(plt.Polygon(points, color ='r'))
plt.axis('scaled')
plt.grid(True)
plt.savefig('pl.png', dpi =300) # try 'pl.pdf' or 'pl.svg'
#dpi is dots per inch. The figure is 6 x 6 inches, so 6*300 x 6*300 = 1800 x 1800 pixels
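The pixel size of a saved figure is its size in inches multiplied by the dpi. The sketch below computes that for a 6 x 6 inch figure at 300 dpi without writing a file. The variable names are chosen for this example, and the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; not needed in a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6.0, 6.0))   # width and height in inches
dpi = 300

# Pixels = inches * dots per inch, for width and height separately.
width_px, height_px = (fig.get_size_inches() * dpi).round().astype(int)

print(width_px, height_px)
```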
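Summary
One of the easiest mistakes to make when learning Python is trying to learn too many libraries at once. When you try to learn everything in one go, you spend a lot of time switching between different concepts, get frustrated, and eventually move on to something else.
So stay focused on this process:
- Understand the Python basics
- Learn Numpy
- Learn Pandas
- Learn Matplotlib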