titan

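The Titanic survival prediction task is one of the best-known and most popular introductory problems in machine learning. The goal is to predict, from a dataset of passengers aboard the Titanic, who survived after the ship struck an iceberg. The dataset includes attributes such as passenger class, sex, age, and port of embarkation. This post walks through a solution covering data exploration, data preprocessing, feature selection, model training, and evaluation.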

## 1. Data exploration

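Exploring the data is key to building a high-performing machine learning model. Understanding the features in the dataset helps us choose an appropriate model and tune it effectively. We start by importing the Python libraries needed to load, process, and visualize the data.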

```python
# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# loading the dataset
titanic_dataset = pd.read_csv("train.csv")
titanic_dataset.head()
```

Next, we use the `info()` method to inspect each feature's data type and check for missing values.

```python
titanic_dataset.info()
```

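The training set contains 891 records, and the `Age`, `Cabin`, and `Embarked` features have missing values. `Cabin` is missing in too many rows to impute sensibly, so we simply drop the feature. For `Age` and `Embarked`, we fill in the missing values.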
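To put numbers on the gaps that `info()` reveals, one option is to count the nulls per column. The sketch below uses a tiny hand-made frame (hypothetical values, not the real `train.csv`) so that it runs on its own; the same two calls should apply unchanged to `titanic_dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for train.csv
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number and share of missing values per column
missing_counts = sample.isnull().sum()
missing_share = sample.isnull().mean()
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

A column with a very high missing share is a candidate for dropping, while columns with only a few gaps are usually easier to impute.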

## 2. Data preprocessing

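Data preprocessing is an important step in any machine learning problem. We want a clean, consistent, and complete dataset that is better suited to model development. In this example we apply several preprocessing techniques: handling missing values, encoding categorical features, and feature scaling. We start with the missing values.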

### 2.1 Handling missing values

We fill missing `Age` values with the median age and missing `Embarked` values with the mode (the most frequent port of embarkation).

```python
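# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and may not apply to the original frame.
titanic_dataset['Age'] = titanic_dataset['Age'].fillna(titanic_dataset['Age'].median())
titanic_dataset['Embarked'] = titanic_dataset['Embarked'].fillna(titanic_dataset['Embarked'].mode()[0])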

# Checking after processing missing values.
titanic_dataset.info()
```

`Age` and `Embarked` no longer contain null values. Some preprocessing remains, though: `Cabin` is missing in more than 70% of rows, so we drop it.

```python
titanic_dataset = titanic_dataset.drop(['Cabin'],axis=1)
titanic_dataset.info()
```

### 2.2 Encoding categorical variables

We use one-hot encoding to convert the categorical data into binary indicator columns, via the pandas `get_dummies()` function.

```python
# Getting dummies from the Sex and Embarked columns via pd.get_dummies
titanic_dataset = pd.concat([titanic_dataset, pd.get_dummies(titanic_dataset['Sex'], prefix='Sex', drop_first=True)], axis=1)
titanic_dataset = pd.concat([titanic_dataset, pd.get_dummies(titanic_dataset['Embarked'], prefix='Embarked', drop_first=True)], axis=1)

# Dropping unnecessary columns
titanic_dataset = titanic_dataset.drop(['Sex','Embarked', 'Name', 'Ticket'],axis=1)
titanic_dataset.head()
```

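The last step of the snippet drops columns that are no longer needed: `Name` and `Ticket`, plus the original `Sex` and `Embarked` columns, which have been replaced by their dummy columns.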
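To see which columns `get_dummies()` produces when `drop_first=True`, here is a minimal sketch on a hypothetical three-row frame (not the real data). The categories are sorted and the first one is dropped, so `Sex` yields a single `Sex_male` column and `Embarked` yields `Embarked_Q` and `Embarked_S`.

```python
import pandas as pd

# Hypothetical toy rows with the same category values as the Titanic data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True removes the first category in sorted order
# ("female" and "C"), which avoids perfectly collinear dummy columns
sex_dummies = pd.get_dummies(toy["Sex"], prefix="Sex", drop_first=True)
embarked_dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", drop_first=True)

print(list(sex_dummies.columns))
print(list(embarked_dummies.columns))
```

The dropped category becomes the implicit baseline: a row with both `Embarked_Q` and `Embarked_S` equal to zero corresponds to port `C`.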

### 2.3 Feature scaling

We standardize the data to rescale it. The simplest way to do this is with scikit-learn's `StandardScaler`.

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
titanic_dataset[['Age', 'Fare']] = scaler.fit_transform(titanic_dataset[['Age', 'Fare']])
titanic_dataset.head()
```
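The snippet above fits the scaler on the full dataset, which is simple but lets rows that later land in the test split influence the scaling statistics. A common alternative, sketched below on synthetic numbers, is to fit on the training rows only and reuse those statistics for the test rows. The array `features` is a hypothetical stand-in for the `Age` and `Fare` columns, and this is a variation on the post's approach, not what the post itself does.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the Age and Fare columns (200 rows, 2 columns)
features = rng.normal(loc=[30.0, 32.0], scale=[14.0, 50.0], size=(200, 2))

train_part, test_part = train_test_split(features, test_size=0.25, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_part)  # mean and std come from training rows only
test_scaled = scaler.transform(test_part)        # test rows reuse the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The training columns end up with zero mean and unit variance, while the test columns are only approximately standardized, which is the expected behavior.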

Our data is now ready for model training.

## 3. Feature selection

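Feature selection is an important step in developing a machine learning model; it can improve performance and help avoid overfitting. In this example, we use `corr()` together with seaborn's `heatmap` to plot the pairwise correlations between all features, including the target variable `Survived`.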

```python
# Using heatmap to plot the correlation matrix
fig, ax = plt.subplots(figsize=(20,12))
sns.heatmap(titanic_dataset.corr(), annot=True, cmap='coolwarm', ax=ax)
plt.show()
```

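The heatmap shows the correlation between each pair of variables. In this example, `Pclass`, `Fare`, and `Sex_male` contribute the most to predicting survival: they are the features most strongly correlated with the target, while the correlation between `Age` and the target is comparatively weak. We nevertheless keep all features here, since correlation alone does not give a strong enough reason to drop any of them.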
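Reading individual cells off a large heatmap can be tedious, so it may help to pull out only the target column and rank features by absolute correlation. The sketch below does this on a hypothetical six-row frame; the values are made up for illustration and do not reproduce the real correlations.

```python
import pandas as pd

# Hypothetical toy rows; in the post this would be titanic_dataset
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Sex_male": [1, 0, 0, 1, 0, 1],
    "Age":      [22, 38, 26, 35, 27, 54],
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["Survived"].drop("Survived")

# Rank by magnitude so strong negative correlations are not buried
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value matters here because features such as `Pclass` and `Sex_male` are negatively correlated with survival.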

## 4. Training the model

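In this example we use logistic regression as the predictive model. We develop the model locally and train it on a hold-out split: 75% of the rows for training and 25% for testing.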

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Converting the Survived column to integer
titanic_dataset['Survived'] = titanic_dataset['Survived'].astype(int)

# Separating the target column
X = titanic_dataset.drop(['Survived'], axis=1)
y = titanic_dataset['Survived']

# training test split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)

# initializing the model
logmodel = LogisticRegression()

# training model
logmodel.fit(X_train,y_train)
```
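A single hold-out split gives one estimate that depends on which rows happen to land in the test set. For a more stable estimate, k-fold cross-validation is a common alternative. The sketch below uses scikit-learn's `cross_val_score` on synthetic data from `make_classification`, a hypothetical stand-in for the prepared Titanic features; `max_iter=1000` is an assumption made here to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features and target (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

cv_model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(cv_model, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

With the real data, `X_demo` and `y_demo` would be replaced by the training portion (`X_train`, `y_train`), keeping the test split untouched for the final evaluation.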

The model is now trained. Next, we evaluate its performance by making predictions on the test set.

## 5. Evaluating the model

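We use a confusion matrix and a classification report to evaluate the model. It is best to touch the test set only after the model has been developed and its best parameters found; otherwise the reported score can be overly optimistic. In this example, we evaluate on the 25% test split held out during model development.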

```python
from sklearn.metrics import confusion_matrix, classification_report

# Making predictions
predictions = logmodel.predict(X_test)

# Printing classification report and confusion matrix
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))
```

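According to the classification report, the model's accuracy on the test set is about 81%. This number may vary depending on how the data is randomly split.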

```
              precision    recall  f1-score   support

           0       0.82      0.86      0.84       134
           1       0.79      0.73      0.76        89

    accuracy                           0.81       223
   macro avg       0.81      0.80      0.80       223
weighted avg       0.81      0.81      0.81       223

[[115  19]
 [ 24  65]]
```

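The confusion matrix shows that the model performs reasonably well on this imbalanced dataset (fewer than 50% of passengers survived), though it identifies non-survivors more reliably than survivors (recall of 0.86 versus 0.73). The overall accuracy is modest, and because the dataset is small, this result should not be treated as a final assessment.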
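To make the link between the confusion matrix and the report explicit, the following sketch recomputes accuracy, per-class recall, and the share of survivors in the test split directly from the matrix shown above. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[115, 19],
               [24, 65]])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct per class / true count per class
survivor_share = cm[1].sum() / cm.sum()          # fraction of test rows with Survived == 1

print(round(float(accuracy), 2), recall_per_class.round(2), round(float(survivor_share), 2))
```

The survivor share of roughly 40% in the test split is what makes per-class recall more informative than accuracy alone.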

posted @ 2023-03-10 23:53 天琴Lyrae