[Machine Learning with Python] My First Data Preprocessing Pipeline with Titanic Dataset
The dataset was acquired from https://www.kaggle.com/c/titanic.
For data preprocessing, I first defined three transformers:
- DataFrameSelector: Select features to handle.
- CombinedAttributesAdder: Add a categorical feature Age_cat that divides all passengers into three categories (child, adult, old) according to their ages.
- ImputeMostFrequent: Since the mean and median strategies of SimpleImputer only work on numerical variables, I wrote a transformer to impute missing string values with the mode (SimpleImputer(strategy="most_frequent") can also handle strings). Here I was inspired by https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn.
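To make the mode imputation idea concrete, here is a minimal sketch on a made-up column (the toy DataFrame is hypothetical, not the Titanic file). It fills the missing string once by hand with the same value_counts() trick, and once with SimpleImputer(strategy="most_frequent"), which should give the same result on this data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one string column with a missing value.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q"]}, dtype=object)

# By hand: take the most frequent value of every column, then fill with it.
fill = pd.Series([toy[c].value_counts().index[0] for c in toy], index=toy.columns)
filled_manual = toy.fillna(fill)

# Built-in: the "most_frequent" strategy also accepts string columns.
imputer = SimpleImputer(strategy="most_frequent")
filled_sklearn = imputer.fit_transform(toy)  # returns a NumPy array

print(filled_manual["Embarked"].tolist())
print(filled_sklearn.ravel().tolist())
```

One practical difference: the hand-written version returns a DataFrame, while SimpleImputer returns a NumPy array, so the column names are lost unless you put them back yourself.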
Then I wrote separate pipelines for the different kinds of features:
- For numerical features, I applied DataFrameSelector, SimpleImputer and StandardScaler
- For categorical features, I applied DataFrameSelector, ImputeMostFrequent and OneHotEncoder
- For the newly created feature Age_cat, which is categorical but derived from a numerical feature, I wrote a separate pipeline to impute the missing values and encode the categories.
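The Age_cat steps can be sketched on a few made-up ages (hypothetical values, not taken from the dataset): bin the ages with pd.cut, fill the missing category with the mode, then one-hot encode. This is only an illustration of the idea outside of a Pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy ages, including one missing value.
ages = pd.Series([5.0, 30.0, np.nan, 70.0, 45.0], name="Age")

# 1. Bin ages into (0, 18], (18, 60] and (60, 100]; a missing age stays missing.
binned = pd.cut(ages, [0, 18, 60, 100], labels=["child", "adult", "old"])
age_cat = binned.astype(object).to_frame("Age_Cat")

# 2. Fill the missing category with the most frequent one ("adult" here).
mode = age_cat["Age_Cat"].value_counts().index[0]
age_cat = age_cat.fillna(mode)

# 3. One-hot encode; the columns follow the sorted category names.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(age_cat).toarray()

print(encoder.categories_[0].tolist())
print(encoded)
```

Note that the imputation happens after binning here, so a passenger with an unknown age simply gets the most common age group.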
Finally, we can build a full pipeline through FeatureUnion. Here is the code:
# Read data
import pandas as pd
import numpy as np

titanic_train = pd.read_csv('Dataset/Titanic/train.csv')
titanic_test = pd.read_csv('Dataset/Titanic/test.csv')
submission = pd.read_csv('Dataset/Titanic/gender_submission.csv')

# Divide attributes and labels
titanic_labels = titanic_train['Survived'].copy()
titanic = titanic_train.drop(['Survived'], axis=1)

# Feature Selection
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # Select on a copy so the caller's DataFrame is not modified
        X = X[self.attribute_name].copy()
        if 'Pclass' in self.attribute_name:
            # Treat passenger class as a category rather than a number
            X['Pclass'] = X['Pclass'].astype(str)
        return X

# Feature Creation
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        # Bin ages into (0, 18], (18, 60], (60, 100]; missing ages stay NaN
        Age_cat = pd.cut(X['Age'], [0, 18, 60, 100], labels=['child', 'adult', 'old'])
        Age_cat = np.array(Age_cat)
        return pd.DataFrame(Age_cat, columns=['Age_Cat'])

# Impute Categorical variables
class ImputeMostFrequent(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Remember the most frequent value of every column
        self.fill = pd.Series([X[c].value_counts().index[0] for c in X], index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

# Pipeline
from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age', 'SibSp', 'Parch', 'Fare'])),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Pclass', 'Sex', 'Embarked'])),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

new_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age'])),
    #('imputer', SimpleImputer(strategy="median")),
    ('attr_adder', CombinedAttributesAdder()),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

full_pipeline = FeatureUnion([
    ("num", num_pipeline),
    ("cat", cat_pipeline),
    ("new", new_pipeline),
])

titanic_prepared = full_pipeline.fit_transform(titanic)
Another thing I want to mention is that scikit-learn transformers expect a 2D array rather than a 1D array, so the output of each pipeline step should be 2D. If you select only one feature, don't forget to convert the 1D array to 2D with the reshape() method. Otherwise, you will receive an error like
ValueError: Expected 2D array, got 1D array instead
Specifically, apply reshape(-1,1) if your data has a single feature (one column) and reshape(1,-1) if it contains a single sample (one row). More about the issue can be found at https://stackoverflow.com/questions/51150153/valueerror-expected-2d-array-got-1d-array-instead.
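Here is a small sketch of that error and the fix, using a few made-up fare values (hypothetical numbers) and StandardScaler as the example transformer.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical fares as a 1D array of shape (4,).
fares = np.array([7.25, 71.28, 8.05, 53.10])

# Passing the 1D array directly raises the ValueError quoted above.
try:
    StandardScaler().fit_transform(fares)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# One feature, many samples: make it a single column of shape (4, 1).
column = fares.reshape(-1, 1)
scaled = StandardScaler().fit_transform(column)

# One sample, many features: make it a single row of shape (1, 4).
row = fares.reshape(1, -1)

print(column.shape, row.shape, scaled.shape)
```

With a pandas DataFrame the same issue shows up when selecting columns: X['Age'] gives a 1D Series, while X[['Age']] keeps a 2D DataFrame, which is why DataFrameSelector is given a list such as ['Age'] even for a single feature.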