Python Data Mining Notes - LyndonYoung

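1. The basic tasks of data mining include classification and prediction, cluster analysis, association rules, time-series patterns, deviation detection, and intelligent recommendation, all of which help a business extract the commercial value contained in its data.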

2. Anaconda is a Python distribution that bundles the common Python data libraries.

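3. When Python 2 source code contains Chinese (or any other non-ASCII) characters, declare the encoding at the top of the file: # -*- coding:utf-8 -*- (Python 3 already assumes UTF-8 by default.)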

4. Python extension libraries for data mining (install with pip or apt-get, for example: sudo pip install numpy; sudo apt-get install python-numpy)

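Numpy: array support, plus efficient functions for processing arrays

Scipy: matrix support, plus matrix-based numerical computing modules

Matplotlib: a powerful data visualization and plotting library

Pandas: a powerful, flexible tool for data analysis and exploration

StatsModels: statistical modeling and econometrics, including descriptive statistics and the estimation and inference of statistical models

Scikit-Learn: a powerful machine learning library supporting regression, classification, clustering, and more

Keras: a deep learning library for building neural networks and deep learning models

Gensim: a library for text topic models, which may be useful in text mining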

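5. Do not give a Python program the same name as a library (for example, numpy.py); the script would shadow the library on import and cause many unexpected errors.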
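One quick way to check whether a local file is shadowing a library is to print the path the module was loaded from. This is only a minimal sketch, assuming numpy is installed; the variable name module_path is chosen here for illustration.

```python
import numpy

# Path of the module that was actually imported. If this points at a file
# in your own project directory rather than the installed package, a local
# script named numpy.py is shadowing the library.
module_path = numpy.__file__
print(module_path)
```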

6. Basic use of numpy

#!/usr/bin/python
# -*- coding:utf-8 -*-
# basic use of numpy

import numpy as np  # import under the conventional alias
a = np.array([2, 0, 1, 5])  # create a one-dimensional array
print(a)
print(a[:3])  # slice: the first three elements
print(a.min())  # smallest element
a.sort()  # sort in place
print(a)
b = np.array([[1, 2, 3], [4, 5, 6]])  # two-dimensional array
print(b * b)  # element-wise square, not a matrix product

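numpy provides multidimensional arrays, but they are general-purpose arrays rather than matrices: arithmetic operators such as * act element by element.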
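To make the difference concrete, the sketch below contrasts the element-wise product with a matrix product on the same array b. The @ operator (Python 3.5+) and the names elementwise and matrix_prod are not in the original notes and are used here for illustration only.

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

elementwise = b * b     # multiplies matching entries, shape stays (2, 3)
matrix_prod = b @ b.T   # matrix product of (2, 3) and (3, 2), shape (2, 2)

print(elementwise)
print(matrix_prod)
```

Note that b @ b would fail, because a (2, 3) array cannot be matrix-multiplied by another (2, 3) array; the transpose b.T makes the shapes compatible.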

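7. scipy provides true matrices and a large number of matrix-based operations and functions

scipy covers optimization, linear algebra, integration, interpolation, fitting, special functions, the fast Fourier transform, signal and image processing, solvers for ordinary differential equations, and more.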

#!/usr/bin/python
# -*- coding:utf-8 -*-
# basic use of scipy
# solve the system of nonlinear equations: 2*x1 - x2^2 = 1, x1^2 - x2 = 2
from scipy import integrate  # numerical integration
from scipy.optimize import fsolve  # root finder for systems of equations

def f(x):
    x1 = x[0]
    x2 = x[1]
    return [2 * x1 - x2 ** 2 - 1, x1 ** 2 - x2 - 2]

# [1, 1] is the initial guess for the roots of f(x) = 0;
# a different guess may converge to a different root or not converge at all
result = fsolve(f, [1, 1])
print(result)

def g(x):
    return (1 - x ** 2) ** 0.5  # upper half of the unit circle

pi_2, err = integrate.quad(g, -1, 1)  # integral value (pi / 2) and estimated error
print(pi_2 * 2)
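Since fsolve and quad both return numerical approximations, it can help to check the answers. The sketch below substitutes the root back into the equations and compares the integral with numpy's value of pi; the names root, residual and half_pi are chosen here for illustration and are not part of the original notes.

```python
import numpy as np
from scipy import integrate
from scipy.optimize import fsolve

def f(x):
    x1, x2 = x
    return [2 * x1 - x2 ** 2 - 1, x1 ** 2 - x2 - 2]

root = fsolve(f, [1, 1])
residual = np.array(f(root))  # should be close to [0, 0] if root solves the system
print(residual)

def g(x):
    return (1 - x ** 2) ** 0.5

half_pi, err = integrate.quad(g, -1, 1)
print(abs(2 * half_pi - np.pi))  # difference from numpy's pi
```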

For more on scipy, see: http://docs.scipy.org/doc/scipy/reference/

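8. Matplotlib is the best-known plotting library; it is mainly used for two-dimensional plots, though it can also draw simple three-dimensional ones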

#!/usr/bin/python
# -*- coding:utf-8 -*-
# the basic use of matplotlib
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 1000)
y = np.sin(x) + 1
z = np.cos(x ** 2) + 1
plt.figure(figsize=(8, 4))  # figure size in inches
# raw strings keep the LaTeX backslashes from being read as escape sequences
plt.plot(x, y, label=r'$\sin x+1$', color='red', linewidth=2)  # set label, color and line width
plt.plot(x, z, 'b--', label=r'$\cos x^2+1$')  # blue dashed line with its own label
plt.xlabel('Time(s)')  # x-axis label
plt.ylabel('Volt')  # y-axis label
plt.title('A Simple Example')  # title
plt.ylim(0, 2.2)  # range of the y axis
plt.legend()  # show the legend
plt.show()  # display the figure

Drawing a pie chart

#!/usr/bin/python
# -*- coding:utf-8 -*-
import matplotlib.pyplot as plt
labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
sizes = [15, 30, 45, 10]
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
explode = (0, 0.1, 0, 0)  # pull out the second slice, i.e. Hogs
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()

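For more information, see: http://matplotlib.org/contents.html

Gallery: http://matplotlib.org/gallery.html

Chinese labels may not display correctly because the default font is an English font. The fix is to set a Chinese default font by hand before plotting, for example SimHei (黑体):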

plt.rcParams['font.sans-serif'] = ['SimHei']

If the minus sign does not display in a plot, the following setting fixes it:

plt.rcParams['axes.unicode_minus'] = False
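The two settings are usually applied together before any plotting call. The following is a minimal sketch, assuming the SimHei font is installed on the machine (otherwise matplotlib falls back to its default font with a warning). The Agg backend, the in-memory buffer and the sample curve are illustration choices, not part of the original notes.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['font.sans-serif'] = ['SimHei']  # assumes the SimHei font is installed
plt.rcParams['axes.unicode_minus'] = False    # draw minus signs with the ASCII hyphen

x = np.linspace(-5, 5, 100)  # includes negative values, so minus signs appear on the axes
fig = plt.figure(figsize=(8, 4))
plt.plot(x, x ** 3, label='立方曲线')  # Chinese legend label
plt.title('中文标题示例')              # Chinese title
plt.legend()

buffer = io.BytesIO()
fig.savefig(buffer, format='png')  # render to memory instead of calling plt.show()
plt.close(fig)
```

In an interactive session, plt.show() can replace the savefig call.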

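9. Pandas is the most powerful data analysis and exploration tool for Python (the book Python for Data Analysis (利用python进行数据分析) was written by one of the authors of pandas and covers much more of the library). The basic data structures of Pandas are Series and DataFrame: a Series is a sequence, similar to a one-dimensional array, and a DataFrame is a two-dimensional table, similar to a two-dimensional array, in which every column is a Series.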

#!/usr/bin/python
# -*- coding:utf-8 -*-
# the basic use of pandas
import pandas as pd
import numpy as np
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])  # create a Series with a custom index
d = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])  # create a table
# d = pd.DataFrame(np.random.randn(3, 4))
d2 = pd.DataFrame(s)  # build a one-column DataFrame from the Series
# d.head(3)
print(d.describe())  # summary statistics for each column
print(d2.describe())
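The claim that every column of a DataFrame is a Series can be checked directly. The sketch below reuses the same table d; the names col and summary are chosen here for illustration.

```python
import pandas as pd

d = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

col = d['a']               # selecting one column returns a Series
print(type(col).__name__)
print(col.tolist())

summary = d.describe()     # per-column count, mean, std, min, quartiles, max
print(summary.loc['mean'])
```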

For more information, see: http://pandas.pydata.org/

More updates to follow...

Reference book: Python数据分析与挖掘实战

posted on 2016-04-23 11:50 LyndonYoung
