[Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn
In this article, we discuss the main steps of data preparation.
Drop Labels
First, we separate the labels from the training set using the drop() method from the Pandas library.
housing = strat_train_set.drop("median_house_value", axis=1)  # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()
Here are some tips:
- The drop() function deletes rows by default. To delete columns instead, don't forget to set the parameter axis=1.
- The drop() function does not modify the DataFrame by default; instead, it returns a copy of the DataFrame with the given rows/columns removed. To modify the DataFrame in place, set inplace=True.
- Note the copy() call: it creates a copy of the labels, so later changes to it will not affect the original DataFrame.
Impute Missing Values
Firstly, let's check the missing values:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
Here are three methods to impute missing values:
Option 1: drop the rows
sample_incomplete_rows.dropna(subset=["total_bedrooms"])
Option 2: drop the columns
sample_incomplete_rows.drop("total_bedrooms", axis=1)
Option 3: impute with the median value
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)
Alternatively, we can use the sklearn.impute.SimpleImputer class, introduced in Scikit-Learn 0.20.
try:
    from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer

imputer = SimpleImputer(strategy="median")
# Remove the text attribute because the median can only be calculated on numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)
We can check the computed statistics with imputer.statistics_ and the imputation strategy with imputer.strategy.
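To see what these attributes hold, here is a self-contained sketch on a toy numerical frame (the values are made up); after fitting, statistics_ contains the learned median of each column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy numerical frame standing in for housing_num (values are made up)
num = pd.DataFrame({"total_bedrooms": [1.0, np.nan, 3.0],
                    "rooms": [2.0, 4.0, 6.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(num)

# statistics_ holds the per-column medians (NaNs are ignored when fitting)
assert np.array_equal(imputer.statistics_, num.median().values)
assert imputer.strategy == "median"
```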
Finally, transform the training set:
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=list(housing.index.values))
Encode Categorical Attributes
We need to convert text labels to numbers. There are two methods.
Option 1: Label Encoding
Convert a categorical attribute into an integer attribute.
try:
    from sklearn.preprocessing import OrdinalEncoder
except ImportError:
    from future_encoders import OrdinalEncoder  # Scikit-Learn < 0.20

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
Option 2: One-Hot Encoding
Convert a categorical attribute into a series of binary integers.
try:
    from sklearn.preprocessing import OrdinalEncoder  # just to raise an ImportError if Scikit-Learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder  # Scikit-Learn < 0.20

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
By default, the OneHotEncoder class returns a sparse matrix, but we can convert it to a dense array if needed by calling the toarray() method:
housing_cat_1hot.toarray()
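As a self-contained sketch (the category values below are made-up stand-ins for ocean_proximity), one-hot encoding produces one binary column per category learned during fitting, exposed via the encoder's categories_ attribute:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy categorical column (values are made up for illustration)
cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cats).toarray()

# one binary column per learned category, in sorted order
assert encoder.categories_[0].tolist() == ["INLAND", "NEAR BAY"]
assert one_hot.tolist() == [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```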
Alternatively, you can set sparse=False when creating the OneHotEncoder:
cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
Feature Engineering
Sometimes we need to add features that better capture the variation of the target variable. Let's create a custom transformer to add extra attributes. It must implement three methods: fit() (returning self), transform(), and fit_transform(). You get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kwargs in your constructor), you get two extra methods (get_params() and set_params()) that are useful for automatic hyperparameter tuning.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# column indices
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
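To see what BaseEstimator buys us, here is a minimal sketch (the transform logic is stubbed out, so this is not the full transformer above): get_params() and set_params() are generated automatically from the constructor signature, which is exactly what tools like GridSearchCV rely on.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X  # attribute-adding logic omitted in this sketch

adder = CombinedAttributesAdder(add_bedrooms_per_room=False)

# get_params()/set_params() come for free from BaseEstimator
assert adder.get_params() == {"add_bedrooms_per_room": False}
adder.set_params(add_bedrooms_per_room=True)
assert adder.add_bedrooms_per_room is True
```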