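Background: my girlfriend recently passed the teacher recruitment exam, so I got hold of the written-test and interview data for that exam and analyzed how the written test and the interview relate to getting hired.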
# Import the required packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import svm
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import plotting

sns.set_style("whitegrid")
# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'
plt.style.use('seaborn')

# Load the dataset
io = r'G:\Python\Learn\iris\data\DataCalculate.xlsx'
data = pd.read_excel(io, sheet_name='Sheet1')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   ranking_written   39 non-null     int64
 1   written           39 non-null     float64
 2   ranking_audition  39 non-null     int64
 3   audition          39 non-null     float64
 4   total             39 non-null     float64
 5   ranking_total     39 non-null     int64
 6   complete          39 non-null     object
    ranking_written  written  ranking_audition  ...  total  ranking_total complete
0                 1    84.75                 2  ...  87.30              1       ON
1                 2    78.70                 3  ...  84.40              2       ON
2                 7    75.15                 1  ...  83.58              3       ON
3                12    72.70                 4  ...  81.88              4       ON
4                 8    74.70                 8  ...  81.72              5       ON
5                 4    75.70                15  ...  81.52              6       ON
6                 3    76.15                21  ...  81.34              7       ON
7                13    72.05                 6  ...  81.26              8       ON
8                 6    75.20                19  ...  81.08              9       ON
9                11    73.95                16  ...  80.82             10       ON
10               15    70.70                 7  ...  80.60             11       ON
11               10    73.95                22  ...  80.46             12       ON
12               14    71.65                10  ...  80.38             13       ON
13                9    74.15                29  ...  79.82             14      OFF
14                5    75.55                33  ...  79.78             15      OFF
15               29    65.10                 5  ...  78.72             16      OFF
16               19    68.80                18  ...  78.64             17      OFF
17               21    67.05                11  ...  78.30             18      OFF
18               17    69.60                31  ...  77.76             19      OFF
19               25    65.70                13  ...  77.64             20      OFF
20               20    68.35                26  ...  77.62             21      OFF
21               22    66.50                20  ...  77.60             22      OFF
22               26    65.60                14  ...  77.60             23      OFF
23               30    65.10                12  ...  77.52             24      OFF
24               32    63.85                 9  ...  77.38             25      OFF
25               16    70.20                35  ...  77.16             26      OFF
26               24    65.75                23  ...  76.82             27      OFF
27               27    65.55                25  ...  76.62             28      OFF
28               31    64.95                24  ...  76.50             29      OFF
29               18    69.10                38  ...  76.12             30      OFF
30               28    65.45                32  ...  76.10             31      OFF
31               23    65.85                34  ...  75.78             32      OFF
32               38    59.30                17  ...  74.96             33      OFF
33               34    60.65                27  ...  74.54             34      OFF
34               36    60.00                28  ...  74.28             35      OFF
35               33    62.35                37  ...  73.78             36      OFF
36               39    59.25                30  ...  73.74             37      OFF
37               35    60.20                36  ...  73.16             38      OFF
38               37    59.90                39  ...  23.96             39      OFF
Use violinplot and pointplot to explore, through distribution and slope, how the written test and the interview relate to getting hired.
# Set the color theme
antV = ['#1890FF', '#2FC25B', '#FACC14', '#223273',
        '#8543E0', '#13C2C2', '#3436c7', '#F04864']
# Draw the violinplot
# Relationship between each feature and getting hired
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.violinplot(x='complete', y='ranking_written', data=data, palette=antV, ax=axes[0, 0])
sns.violinplot(x='complete', y='written', data=data, palette=antV, ax=axes[0, 1])
sns.violinplot(x='complete', y='ranking_audition', data=data, palette=antV, ax=axes[1, 0])
sns.violinplot(x='complete', y='audition', data=data, palette=antV, ax=axes[1, 1])
# Draw the pointplot
# Relationship between each feature and getting hired
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.pointplot(x='complete', y='ranking_written', data=data, color=antV[0], ax=axes[0, 0])
sns.pointplot(x='complete', y='written', data=data, color=antV[0], ax=axes[0, 1])
sns.pointplot(x='complete', y='ranking_audition', data=data, color=antV[0], ax=axes[1, 0])
sns.pointplot(x='complete', y='audition', data=data, color=antV[0], ax=axes[1, 1])
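To make the "slope" in a pointplot concrete: by default seaborn plots the mean of each group, so the line between ON and OFF is the difference of the two group means. The sketch below recomputes that for the `written` column using only the values printed in the table above; the variable names are illustrative and not part of the original notebook.

```python
from statistics import mean

# Written-test scores copied from the printed table, split by `complete`
written_on = [84.75, 78.70, 75.15, 72.70, 74.70, 75.70, 76.15,
              72.05, 75.20, 73.95, 70.70, 73.95, 71.65]
written_off = [74.15, 75.55, 65.10, 68.80, 67.05, 69.60, 65.70, 68.35, 66.50,
               65.60, 65.10, 63.85, 70.20, 65.75, 65.55, 64.95, 69.10, 65.45,
               65.85, 59.30, 60.65, 60.00, 62.35, 59.25, 60.20, 59.90]

mean_on = mean(written_on)
mean_off = mean(written_off)

# The pointplot line runs from the ON mean to the OFF mean
slope = mean_off - mean_on
print(f"ON mean: {mean_on:.2f}, OFF mean: {mean_off:.2f}, difference: {slope:.2f}")
```

With the full DataFrame loaded, `data.groupby('complete')['written'].mean()` should give the same two means.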
sns.pairplot(data=data, palette=antV, hue='complete')
Andrews curves are well suited to data validation: they make anomalous records in the data stand out.
plt.subplots(figsize=(10, 8))
plotting.andrews_curves(data, 'complete', colormap='cool')
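As a rough illustration of what an Andrews curve is, each row (x1, x2, x3, ...) is mapped to the function f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... for t in [-pi, pi]. Rows with similar values give similar curves, so a record that is far from the others (for example the last row of the table, whose total is 23.96) should trace a visibly different curve. The helper name `andrews_curve` below is hypothetical, and this is a minimal sketch of the formula rather than the pandas implementation.

```python
import math

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one data row at angle t."""
    value = row[0] / math.sqrt(2.0)
    for i, x in enumerate(row[1:]):
        harmonic = i // 2 + 1          # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 0:
            value += x * math.sin(harmonic * t)
        else:
            value += x * math.cos(harmonic * t)
    return value

# First three visible features of row 0: ranking_written, written, ranking_audition
row0 = [1.0, 84.75, 2.0]
ts = [-math.pi + k * (2 * math.pi / 8) for k in range(9)]
curve0 = [andrews_curve(row0, t) for t in ts]
print([round(v, 2) for v in curve0])
```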
Linear regression analysis, first on written score versus interview score, then on written rank versus interview rank:
sns.lmplot(data=data, x='written', y='audition', palette=antV, hue='complete')
sns.lmplot(data=data, x='ranking_written', y='ranking_audition', palette=antV, hue='complete')
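Because `hue='complete'` is set, lmplot fits one regression line per group. As a sketch of what one of those lines is, the following computes an ordinary least-squares fit by hand for the ON group, using the `ranking_written` and `ranking_audition` values from the printed table. The function name `fit_line` is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Ranks of the 13 candidates with complete == 'ON', copied from the table
rank_written_on = [1, 2, 7, 12, 8, 4, 3, 13, 6, 11, 15, 10, 14]
rank_audition_on = [2, 3, 1, 4, 8, 15, 21, 6, 19, 16, 7, 22, 10]

slope, intercept = fit_line(rank_written_on, rank_audition_on)
print(f"ranking_audition ~ {slope:.3f} * ranking_written + {intercept:.3f}")
```

A slope near zero would suggest that, among hired candidates, the written rank says little about the interview rank.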
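Finally, use a heatmap to find the correlations between the attributes; the sign of each value in the heatmap shows whether the correlation is positive or negative: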
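The original heatmap code is not shown here, so the following is only a sketch of one way to produce it. `pearson` shows what a single heatmap cell holds, using six rows from the table above, and `plot_corr_heatmap` is a hypothetical helper that would draw the full matrix with `sns.heatmap`. The colormap and figure size are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First six rows of the printed table
written = [84.75, 78.70, 75.15, 72.70, 74.70, 75.70]
ranking_written = [1, 2, 7, 12, 8, 4]

# A higher score means a smaller rank number, so this value is negative
r_written_vs_rank = pearson(written, ranking_written)
print(f"{r_written_vs_rank:.3f}")

def plot_corr_heatmap(data):
    """Hypothetical helper: heatmap of correlations between numeric columns."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    corr = data.select_dtypes('number').corr()
    plt.subplots(figsize=(8, 8))
    sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, square=True)
    plt.show()
```

Calling `plot_corr_heatmap(data)` on the loaded DataFrame would show negative cells between each score and its own rank, since rank 1 is the best score.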
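Next, use machine learning to predict whether a candidate gets hired from the written and interview scores, with the written rank and interview rank as auxiliary features.
Before training, split the dataset into a training set and a test set, and convert the hired/not-hired label to 0/1.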
# Load the features and the labels
X = data[['ranking_written', 'written', 'ranking_audition', 'audition', 'total', 'ranking_total']]
Y = data['complete']

# Encode the labels
encoder = LabelEncoder()
y = encoder.fit_transform(Y)
print(y)
[1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0]
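The 13 ones followed by 26 zeros follow from how LabelEncoder assigns codes: it sorts the distinct labels and numbers them in that order, so 'OFF' becomes 0 and 'ON' becomes 1. A plain-Python sketch of the same mapping, with a short made-up label list for illustration:

```python
# Illustrative labels, not the full column
labels = ['ON', 'ON', 'OFF', 'ON', 'OFF']

# Sorted distinct classes get consecutive integer codes
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

On the fitted encoder, `encoder.classes_` lists the classes in code order and `encoder.inverse_transform` maps codes back to the original labels.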
Split the dataset 7:3 into training data and test data.
# Machine learning on the ranks and scores of each stage and the final outcome
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=101)
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
(27, 6) (27,) (12, 6) (12,)
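The 27/12 split is not exactly 7:3 because 39 rows cannot be divided that way. The arithmetic below assumes the test share is rounded up and the remainder goes to training, which is consistent with the shapes printed above.

```python
import math

n_samples = 39
test_size = 0.3

# 0.3 * 39 = 11.7, rounded up to a whole number of test rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

With a test set of only 12 rows, each misclassified candidate changes the accuracy by 1/12, so the scores reported later can only take a handful of values.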
# Generic way of testing a machine-learning model
model = svm.SVC()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the SVM is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
The accuracy of the SVM is: 1.0
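One possible reading of this perfect score: the feature set includes `total` and `ranking_total`, and in the table printed earlier `complete` is ON exactly for `ranking_total` 1 to 13. Under that reading the label is a direct function of one feature. The sketch below checks this with a hypothetical cutoff rule; the cutoff of 13 is read off the table, not stated by the data source.

```python
# ranking_total and complete as they appear in the printed table:
# ranks 1..39 in order, ON for the first 13 rows and OFF for the other 26
ranking_total = list(range(1, 40))
complete = ['ON'] * 13 + ['OFF'] * 26

CUTOFF = 13  # assumption: read off the table

def rule(rank):
    """Hypothetical rule: hired if the total rank is within the cutoff."""
    return 'ON' if rank <= CUTOFF else 'OFF'

predicted = [rule(r) for r in ranking_total]
rule_accuracy = sum(p == c for p, c in zip(predicted, complete)) / len(complete)
print(rule_accuracy)
```

If that holds, the more informative experiments are the ones that use only the written-test or only the interview features.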
# Relationship between the written-test attributes and the final result
written = data[['ranking_written', 'written', 'complete']]
train_w, test_w = train_test_split(written, test_size=0.3, random_state=0)
train_x_w = train_w[['ranking_written', 'written']]
train_y_w = train_w.complete
test_x_w = test_w[['ranking_written', 'written']]
test_y_w = test_w.complete
model = svm.SVC()
model.fit(train_x_w, train_y_w)
prediction = model.predict(test_x_w)
print('The accuracy of the SVM using Written is: {0}'.format(metrics.accuracy_score(test_y_w, prediction)))
# Relationship between the interview attributes and the final result
audition = data[['ranking_audition', 'audition', 'complete']]
train_a, test_a = train_test_split(audition, test_size=0.3, random_state=0)
train_x_a = train_a[['ranking_audition', 'audition']]
train_y_a = train_a.complete
test_x_a = test_a[['ranking_audition', 'audition']]
test_y_a = test_a.complete
model = svm.SVC()
model.fit(train_x_a, train_y_a)
prediction = model.predict(test_x_a)
print('The accuracy of the SVM using audition is: {0}'.format(metrics.accuracy_score(test_y_a, prediction)))
# Relationship between the total-score attributes and the final result
total = data[['ranking_total', 'total', 'complete']]
train_t, test_t = train_test_split(total, test_size=0.3, random_state=0)
train_x_t = train_t[['ranking_total', 'total']]
train_y_t = train_t.complete
test_x_t = test_t[['ranking_total', 'total']]
test_y_t = test_t.complete
model = svm.SVC()
model.fit(train_x_t, train_y_t)
prediction = model.predict(test_x_t)
print('The accuracy of the SVM using total is: {0}'.format(metrics.accuracy_score(test_y_t, prediction)))
The accuracy of the SVM is: 1.0
The accuracy of the SVM using Written is: 0.9166666666666666
The accuracy of the SVM using audition is: 0.8333333333333334
The accuracy of the SVM using total is: 1.0
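These scores are easier to read as counts. Accuracy is the fraction of test rows predicted correctly, and each test set here has 12 rows, so 0.9167 corresponds to 11 of 12 correct and 0.8333 to 10 of 12. The sketch below shows the calculation on made-up labels; `accuracy` is an illustrative helper, not the scikit-learn function.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up 12-row test set with a single wrong prediction
y_true = ['ON'] * 4 + ['OFF'] * 8
y_pred = ['ON'] * 3 + ['OFF'] * 9
one_miss = accuracy(y_true, y_pred)

# Convert the reported scores back to counts of correct rows out of 12
n_test = 12
reported = {'written': 0.9166666666666666, 'audition': 0.8333333333333334, 'total': 1.0}
correct_counts = {name: round(score * n_test) for name, score in reported.items()}

print(one_miss)
print(correct_counts)
```

So the written-test model misses one test candidate and the interview model misses two, which is a small enough difference that a different `random_state` could plausibly change the ordering.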