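Data Preprocessing
I. Handling Missing Values
Four common strategies: mean imputation, median imputation, mode (most frequent value) imputation, and constant imputation.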
import pandas as pd
import numpy as np
# Sample 4x4 array with two missing values (np.nan)
c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])
C = pd.DataFrame(c)
C
  | 0 | 1 | 2 | 3 |
0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 4.0 | 5.0 | 6.0 | NaN |
2 | 5.0 | 6.0 | 7.0 | 8.0 |
3 | 9.0 | 4.0 | NaN | 8.0 |
(1) Import the SimpleImputer class from scikit-learn's imputation module
from sklearn.impute import SimpleImputer
(2) Use SimpleImputer to create the imputer object imp_mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
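Parameters:
Parameter | Meaning and accepted input
missing_values | Tells SimpleImputer what a missing value looks like in the data. Default: np.nan.
strategy | The strategy used to fill missing values. Default: "mean".
    "mean": fill with the column mean (numeric features only)
    "median": fill with the column median (numeric features only)
    "most_frequent": fill with the column mode (numeric and string features)
    "constant": fill with the value given in fill_value (numeric and string features)
fill_value | Used when strategy is "constant". Accepts a string or a number as the fill value; 0 is a common choice.
copy | Default True, which imputes a copy of the feature matrix. If False, the missing values are filled into the original feature matrix where possible.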
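The strategy descriptions above say that "most_frequent" and "constant" also work on string features. A minimal sketch of that, assuming a made-up one-column city feature (the data and the "unknown" placeholder are illustrative and not part of the original example):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical string feature with one missing entry (illustrative data only)
cities = np.array([["Beijing"], ["Shanghai"], ["Beijing"], [np.nan]], dtype=object)

# most_frequent: fill with the most common value in the column
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(cities)

# constant: fill with the value passed as fill_value
const_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="unknown")
filled_const = const_imputer.fit_transform(cities)

print(filled_mode.ravel())
print(filled_const.ravel())
```

Using "mean" or "median" on a column like this would fail, since neither is defined for strings.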
# fit_transform learns each column's mean and returns the filled data as a NumPy array
filled = imp_mean.fit_transform(C)
filled
array([[1.        , 2.        , 3.        , 4.        ],
       [4.        , 5.        , 6.        , 6.66666667],
       [5.        , 6.        , 7.        , 8.        ],
       [9.        , 4.        , 5.33333333, 8.        ]])
pd.DataFrame(filled)
  | 0 | 1 | 2 | 3 |
0 | 1.0 | 2.0 | 3.000000 | 4.000000 |
1 | 4.0 | 5.0 | 6.000000 | 6.666667 |
2 | 5.0 | 6.0 | 7.000000 | 8.000000 |
3 | 9.0 | 4.0 | 5.333333 | 8.000000 |
1. Mean imputation
Filling missing values in a specific column
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])
C = pd.DataFrame(c)
# SimpleImputer expects 2-D input, so select column 2 as a one-column DataFrame
age = C[[2]]
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# ravel() flattens the (4, 1) result back to one value per row
C[2] = imp_mean.fit_transform(age).ravel()
C
  | 0 | 1 | 2 | 3 |
0 | 1.0 | 2.0 | 3.000000 | 4.0 |
1 | 4.0 | 5.0 | 6.000000 | NaN |
2 | 5.0 | 6.0 | 7.000000 | 8.0 |
3 | 9.0 | 4.0 | 5.333333 | 8.0 |
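For a single column, the same result can also be obtained with pandas alone. The sketch below is an alternative shown for comparison, not the method used in this post; it relies on Series.mean() skipping NaN by default.

```python
import numpy as np
import pandas as pd

c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])
C = pd.DataFrame(c)

# Mean of the non-missing values in column 2, i.e. (3 + 6 + 7) / 3
col_mean = C[2].mean()

# Fill only column 2; the NaN in column 3 is left untouched
C[2] = C[2].fillna(col_mean)
print(C)
```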
Filling every column of the array
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])
C = pd.DataFrame(c)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# Each NaN is replaced by the mean of its own column
filled = imp_mean.fit_transform(C)
pd.DataFrame(filled)
  | 0 | 1 | 2 | 3 |
0 | 1.0 | 2.0 | 3.000000 | 4.000000 |
1 | 4.0 | 5.0 | 6.000000 | 6.666667 |
2 | 5.0 | 6.0 | 7.000000 | 8.000000 |
3 | 9.0 | 4.0 | 5.333333 | 8.000000 |
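To see where the fill values 5.333333 and 6.666667 come from, one can inspect the per-column statistics the imputer learned. A small sketch, assuming the same sample array; statistics_ is the fitted attribute of SimpleImputer that holds one fill value per column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])
C = pd.DataFrame(c)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(C)

# One learned mean per column, computed over the non-missing entries
print(imputer.statistics_)

# pandas computes the same column means, since DataFrame.mean() skips NaN by default
print(C.mean().to_numpy())
```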
2. Median imputation
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])
C = pd.DataFrame(c)
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
# Each NaN is replaced by the median of its own column
filled = imp_median.fit_transform(C)
pd.DataFrame(filled)
  | 0 | 1 | 2 | 3 |
0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 4.0 | 5.0 | 6.0 | 8.0 |
2 | 5.0 | 6.0 | 7.0 | 8.0 |
3 | 9.0 | 4.0 | 6.0 | 8.0 |
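fit_transform combines two steps: fit learns the per-column medians, and transform applies them. Keeping the steps separate lets the same medians fill rows that were not seen during fitting. A rough sketch, where new_rows is a hypothetical extra row invented for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])

imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
imp_median.fit(c)  # learn the median of each column from the sample array

# Hypothetical new data (not in the original post) with three missing entries
new_rows = np.array([[np.nan, 1.0, np.nan, np.nan]])

# Missing entries are filled with the medians learned from c, not from new_rows
new_filled = imp_median.transform(new_rows)
print(new_filled)
```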
3. Mode imputation
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])
C = pd.DataFrame(c)
imp_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Each NaN is replaced by the most frequent value in its own column
filled = imp_most_frequent.fit_transform(C)
pd.DataFrame(filled)
  | 0 | 1 | 2 | 3 |
0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 4.0 | 5.0 | 6.0 | 8.0 |
2 | 5.0 | 6.0 | 7.0 | 8.0 |
3 | 9.0 | 4.0 | 3.0 | 8.0 |
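In the result above, column 2 is filled with 3.0 even though 3, 6 and 7 each occur exactly once. The scikit-learn documentation states that when several values are tied for most frequent, only the smallest is used. The sketch below isolates that behavior on the same sample array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(c)

# Column 2 holds 3, 6, 7 (a three-way tie); column 3 holds 4, 8, 8 (8 is the clear mode)
print(imputer.statistics_[2], imputer.statistics_[3])
```

Because of this tie-breaking rule, mode imputation on a column with mostly unique values can produce a fill value that is not representative of the column.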
4. Constant imputation
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
c = np.array([[1, 2, 3, 4], [4, 5, 6, np.nan], [5, 6, 7, 8], [9, 4, np.nan, 8]])
C = pd.DataFrame(c)
imp_0 = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
# Every NaN is replaced by the constant given in fill_value
filled = imp_0.fit_transform(C)
pd.DataFrame(filled)
  | 0 | 1 | 2 | 3 |
0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 4.0 | 5.0 | 6.0 | 0.0 |
2 | 5.0 | 6.0 | 7.0 | 8.0 |
3 | 9.0 | 4.0 | 0.0 | 8.0 |