Data Analysis with Python

Module 2 Data Wrangling

Handling Missing Values

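Dropping missing values in Python
Use pandas' DataFrame.dropna(). When the inplace parameter is True, rows are dropped directly in the original DataFrame instead of in a returned copy.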
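
A minimal sketch of both calling styles. The column names and values are made up for illustration and are not from the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price and one missing make
df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0],
    "make": ["audi", "bmw", None],
})

# Default: returns a new DataFrame, df itself is left unchanged
cleaned = df.dropna(subset=["price"])

# inplace=True: modifies df directly and returns None
df.dropna(subset=["price"], inplace=True)
```

The subset argument restricts the check to the listed columns, so the row whose only missing value is in "make" is kept.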

Data Formatting


Data Normalization

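
The original figure is missing, so the following is only a sketch of three common normalization schemes (simple feature scaling, min-max, and z-score). The "length" column and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column to normalize
df = pd.DataFrame({"length": [168.8, 171.2, 176.6, 192.7]})

col = df["length"]

# Simple feature scaling: divide by the maximum
df["length_simple"] = col / col.max()

# Min-max scaling: map values into [0, 1]
df["length_minmax"] = (col - col.min()) / (col.max() - col.min())

# Z-score: subtract the mean, divide by the (sample) standard deviation
df["length_z"] = (col - col.mean()) / col.std()
```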

Data Binning (Grouping)

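
One way to group a numeric column into ranges is pandas.cut(). This is a sketch under assumed data; the prices, the three equal-width bins, and the labels are all illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical prices to bin
df = pd.DataFrame({"price": [5000, 12000, 18000, 30000, 45000]})

# Four equally spaced edges give three equal-width bins
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
labels = ["low", "medium", "high"]

# include_lowest=True keeps the minimum value inside the first bin
df["price_binned"] = pd.cut(
    df["price"], bins=bins, labels=labels, include_lowest=True
)
```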

Data Conversion (Categorical → Numeric)

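The article "Why One-Hot Encode Data in Machine Learning?" describes two ways to convert categorical data into numeric data:
1. Integer encoding, for ordinal categorical variables
2. One-hot encoding, for nominal (unordered) categorical variables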

pandas.get_dummies() can be used for the one-hot conversion.
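
A small sketch of both encodings. The columns "size" (ordinal) and "fuel" (nominal) and the ordering in size_order are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data: "size" is ordinal, "fuel" is nominal
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas"],
    "size": ["small", "large", "medium"],
})

# Integer encoding: the mapping preserves the order small < medium < large
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)

# One-hot encoding: one indicator column per category
dummies = pd.get_dummies(df["fuel"], prefix="fuel")
df = pd.concat([df, dummies], axis=1)
```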

Module 3 Exploratory Data Analysis(EDA)

Descriptive Statistics

df.describe()
value_counts()
Box Plots
seaborn.boxplot
Scatter Plot
matplotlib.pyplot.scatter()
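
A sketch of the two non-plotting calls in the list above, on made-up data. The column names are assumptions.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column
df = pd.DataFrame({
    "drive_wheels": ["fwd", "rwd", "fwd", "4wd", "fwd"],
    "price": [9000.0, 21000.0, 11000.0, 15000.0, 10000.0],
})

# Count, mean, std, min, quartiles and max of the numeric columns
summary = df.describe()

# Frequency of each category, sorted from most to least common
counts = df["drive_wheels"].value_counts()
```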

Groupby in Python

df.groupby()
Pivot table
df.pivot_table(): the reshaped table is easier to read and inspect, but less convenient for further data processing
Heat Map
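
A sketch of grouping and then pivoting, with hypothetical column names and values. The resulting grid is the kind of table a heat map is usually drawn from.

```python
import pandas as pd

# Hypothetical data: two categorical keys and a numeric value
df = pd.DataFrame({
    "drive_wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body_style": ["sedan", "wagon", "sedan", "wagon"],
    "price": [10000.0, 12000.0, 20000.0, 22000.0],
})

# Long format: one row per (drive_wheels, body_style) pair
grouped = df.groupby(["drive_wheels", "body_style"], as_index=False)["price"].mean()

# Wide format: drive_wheels as rows, body_style as columns
pivot = grouped.pivot_table(index="drive_wheels", columns="body_style", values="price")
```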

Analysis of Variance (ANOVA)

scipy.stats.f_oneway()
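
A sketch of a one-way ANOVA on three made-up groups. A large F statistic with a small p-value suggests that at least one group mean differs from the others.

```python
from scipy import stats

# Hypothetical measurements for three groups
group_a = [10.0, 11.0, 12.0, 13.0]
group_b = [20.0, 21.0, 22.0, 23.0]
group_c = [10.5, 11.5, 12.5, 13.5]

# f_oneway takes one sequence per group
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
```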

Correlation Analysis

Correlation doesn’t imply causation!


Statistical Correlation

Pearson Correlation

scipy.stats.pearsonr()

The two variables are strongly correlated.
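
A sketch of the call on made-up, nearly linear data (not the data from the course). A coefficient close to 1 or -1 together with a small p-value indicates a strong linear relationship.

```python
from scipy import stats

# Hypothetical, almost perfectly linear data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

# Returns the correlation coefficient and the two-sided p-value
r, p_value = stats.pearsonr(x, y)
```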

Module 4 Model Development

Linear Regression

Simple Linear Regression (SLR)

Example: $\hat{y} = b_{0} + b_{1} x$
scikit-learn's sklearn.linear_model.LinearRegression() can be used for training and prediction.
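
A minimal sketch on noise-free toy data generated from y = 1 + 2x, so the fitted intercept and slope can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 1 + 2x exactly (an assumption for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # scikit-learn expects a 2-D feature array
y = np.array([3.0, 5.0, 7.0, 9.0])

lm = LinearRegression()
lm.fit(X, y)

b0 = lm.intercept_   # estimate of b_0
b1 = lm.coef_[0]     # estimate of b_1

# Predict for a new x value
y_hat = lm.predict(np.array([[5.0]]))
```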

Multiple Linear Regression (MLR)

Example: $\hat{Y} = b_{0} + b_{1} x_{1} + b_{2} x_{2} + b_{3} x_{3} + b_{4} x_{4}$
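
The same estimator handles several predictors. A sketch with four synthetic features and known coefficients; all of the values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: four predictors, noise-free target
X = rng.normal(size=(50, 4))
true_coefs = np.array([2.0, -1.0, 0.5, 3.0])  # b_1 .. b_4
y = 1.0 + X @ true_coefs                      # b_0 = 1

lm = LinearRegression().fit(X, y)

# lm.intercept_ estimates b_0, lm.coef_ estimates b_1 .. b_4
```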

Model Evaluation Using Visualization

Regression Plot
seaborn.regplot()
Residual Plot

seaborn.residplot()
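A residual plot shows how the residuals are distributed, which gives a rough check of whether the relationship is linear: residuals scattered randomly around zero suggest a linear model is appropriate, while a visible curve suggests it is not.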
Distribution Plot
seaborn.displot()

Polynomial Regression and Pipelines

Polynomial Regression


Pipelines

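
The original figures are missing, so this is only a sketch of how polynomial regression and a pipeline can be combined in scikit-learn. The step names, the degree, and the quadratic toy data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Toy data following a quadratic curve exactly
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2

# Each step's output feeds the next; the last step is the estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])

pipe.fit(x, y)
y_hat = pipe.predict(x)
```

Calling fit() or predict() on the pipeline runs the scaling and polynomial expansion automatically, so the same transformations are applied to training data and new data.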

Measures for In-Sample Evaluation

Mean Squared Error (MSE)
$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}$
sklearn.metrics.mean_squared_error()
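
A sketch that computes the MSE by hand and with scikit-learn on three made-up values, to show that both follow the same formula.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# Mean of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same quantity via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)
```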

Prediction and Decision Making


Multiple Linear Regression(MLR)
Simple Linear Regression(SLR)
Mean Squared Errors(MSE)

Module 5 Model Evaluation

I have a question about the last point: wouldn't it cause overfitting?

Splitting the dataset into a training set and a test set:
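
A sketch of the split using sklearn.model_selection.train_test_split(). The toy arrays, the 30% test fraction, and the random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```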
Cross-validation diagram:

3-fold cross-validation:
Viewing the predictions:
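
A sketch of 3-fold cross-validation and of retrieving the out-of-fold predictions, on synthetic data. The data-generating coefficients and the noise level are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(0)

# Synthetic, nearly linear data
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)

lm = LinearRegression()

# One score per fold; for regressors the default score is R^2
scores = cross_val_score(lm, X, y, cv=3)

# Each prediction comes from a model that did not see that sample
y_hat = cross_val_predict(lm, X, y, cv=3)
```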

Over-fitting, Under-fitting and Model Selection


Ridge Regression

sklearn.linear_model.Ridge()
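Ridge regression (see the Chinese-language documentation): the parameter α controls the strength of the regularization penalty, which shrinks the coefficients and helps prevent overfitting.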
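
A sketch comparing a weak and a strong penalty on synthetic data. The two alpha values and the data are assumptions; the point is only that a larger alpha shrinks the coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data with five predictors
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + rng.normal(scale=0.5, size=20)

# Larger alpha means a stronger penalty on coefficient size
weak = Ridge(alpha=0.1).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
```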

Grid Search

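Grid search tunes hyperparameters automatically by using cross-validation.
Hyperparameters in machine learning
The difference between a parameter and a hyperparameter (original article): put simply, parameters are adjusted automatically from the data during training, whereas hyperparameters are generally chosen by hand and control how the model is estimated.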

sklearn.model_selection.GridSearchCV()

Viewing the results for each combination:
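
A sketch of a grid search over the Ridge alpha, followed by a look at the per-candidate scores. The grid values, the data, and the 3-fold setting are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic data with five predictors
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + rng.normal(scale=0.5, size=40)

# Candidate hyperparameter values to try
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Every candidate is scored with 3-fold cross-validation
grid = GridSearchCV(Ridge(), param_grid, cv=3)
grid.fit(X, y)

best_alpha = grid.best_params_["alpha"]

# Mean cross-validated score for each candidate, in grid order
mean_scores = grid.cv_results_["mean_test_score"]
```

After fitting, grid.best_estimator_ holds a model refit on all of the data with the best alpha.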

The videos are only an introduction; what really matters is practicing with the Jupyter notebooks in the labs.

References:
1. https://cognitiveclass.ai/
2. http://pandas.pydata.org/pandas-docs/stable/index.html
3. http://scikit-learn.org/stable/index.html

posted @ 2018-02-09 14:01 huidan
