[Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn

In this article, we discuss the main steps in data preparation.

Drop Labels

First, we separate the labels from the training set, using the drop() method from the Pandas library.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

Here are some tips:

  • The drop function deletes rows by default. To delete columns, set the parameter axis=1.
  • The drop function doesn't modify the DataFrame by default. Instead, it returns a copy of the DataFrame with the given rows/columns removed. To modify the DataFrame in place, set inplace=True.
  • Note the copy() call here. It creates a copy, so later changes to housing_labels will not affect the original DataFrame.
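The tips above can be checked on a small example. This is a minimal sketch on a toy DataFrame; the column names a and b and the row labels are made up for illustration and are not part of the housing data.

```python
import pandas as pd

# Toy data; names are illustrative only
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

no_row = df.drop("x")          # default axis=0: removes the row labelled "x"
no_col = df.drop("b", axis=1)  # axis=1: removes the column "b"

# drop() returned new objects, so df itself still has 3 rows and 2 columns
print(df.shape, no_row.shape, no_col.shape)

labels = df["b"].copy()
labels.iloc[0] = 100           # modifies only the copy
print(df.loc["x", "b"])        # the original value is untouched
```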

Impute Missing Values

First, let's look at a few rows that contain missing values:

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
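Besides listing incomplete rows, it can help to count the missing values per column. The sketch below uses a toy DataFrame that only borrows two column names from the housing data; the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (values are invented)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows with at least one missing value (same idiom as in the article)
incomplete_rows = toy[toy.isnull().any(axis=1)]
print(incomplete_rows)
```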

Here are three ways to handle missing values:

Option 1: drop the rows

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Option 2: drop the columns

sample_incomplete_rows.drop("total_bedrooms", axis=1) 

Option 3: impute with the median value

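median = housing["total_bedrooms"].median()
# reassign instead of a chained inplace fillna, which recent pandas versions warn about
sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median})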
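To compare the three options side by side, here is a small sketch on a toy DataFrame. The values are invented, and the names opt1, opt2 and opt3 are only labels for this illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (values are invented)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

opt1 = toy.dropna(subset=["total_bedrooms"])   # option 1: drop the incomplete row
opt2 = toy.drop("total_bedrooms", axis=1)      # option 2: drop the whole column
median = toy["total_bedrooms"].median()        # NaN is skipped when computing the median
opt3 = toy.fillna({"total_bedrooms": median})  # option 3: fill the gap with the median

print(opt1.shape, opt2.shape, opt3.shape)
print(opt3)
```

Option 1 loses a row, option 2 loses a column, and option 3 keeps the full shape of the data.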

 

Alternatively, we can use the SimpleImputer class from sklearn.impute, which is available since Scikit-Learn 0.20.

try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer

imputer = SimpleImputer(strategy="median")
# Remove the text attribute because median can only be calculated on numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)

We can check the learned statistics with imputer.statistics_ and the strategy with imputer.strategy.
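As a quick sanity check, the values stored in statistics_ should match the per-column medians computed by Pandas. A minimal sketch on invented toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data (values are invented)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy)

print(imputer.strategy)      # the strategy passed to the constructor
print(imputer.statistics_)   # one median per column, learned by fit()
print(toy.median().values)   # the same medians, computed by Pandas
```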

Finally, transform the training set:

X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)
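The reason for wrapping the result is that transform() returns a plain NumPy array by default, so the column names and the index are lost. The sketch below shows the round trip on toy data; the index values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with a non-default index (values and index are invented)
toy = pd.DataFrame(
    {"total_rooms": [880.0, 7099.0, 1467.0],
     "total_bedrooms": [129.0, np.nan, 190.0]},
    index=[17606, 18632, 14650],
)

X = SimpleImputer(strategy="median").fit_transform(toy)
print(type(X))  # a NumPy array: no column names, no index

# Rebuild a DataFrame with the original column names and index
toy_tr = pd.DataFrame(X, columns=toy.columns, index=toy.index)
print(toy_tr)
```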

Encode Categorical Attributes

We need to convert text labels to numbers. There are two methods.

Option 1: Label Encoding

Convert a categorical attribute into an integer attribute.

try:
    from sklearn.preprocessing import OrdinalEncoder
except ImportError:
    from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20

# assumed definition: the text attribute that was dropped from housing_num above
housing_cat = housing[["ocean_proximity"]]

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
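To see which integer each category receives, we can inspect the categories_ attribute after fitting. A minimal sketch on a toy column; the category strings are only examples.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column (example category strings)
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]}
)

enc = OrdinalEncoder()
encoded = enc.fit_transform(toy_cat)

print(enc.categories_)   # sorted categories; position in the list = integer code
print(encoded.ravel())   # one code per row
```

One caveat: the integer codes imply an ordering between categories, and some algorithms may treat nearby codes as similar values. That is one motivation for the one-hot encoding described next.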

 

Option 2: One-Hot Encoding

Convert a categorical attribute into a set of binary attributes, one per category.

try:
    from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder # Scikit-Learn < 0.20

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

By default, the OneHotEncoder class returns a SciPy sparse matrix, but we can convert it to a dense NumPy array if needed by calling the toarray() method:

housing_cat_1hot.toarray()
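Here is a small sketch of what the sparse result and its dense form look like, using a toy column with two example categories:

```python
import pandas as pd
import scipy.sparse
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (example category strings)
toy_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

enc = OneHotEncoder()
onehot = enc.fit_transform(toy_cat)   # sparse by default
print(scipy.sparse.issparse(onehot))

dense = onehot.toarray()              # one column per category, one 1 per row
print(enc.categories_)
print(dense)
```

The sparse format only stores the positions of the nonzero entries, which saves memory when an attribute has many categories.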

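Alternatively, you can set sparse=False when creating the OneHotEncoder (this parameter was renamed to sparse_output in Scikit-Learn 1.2):

cat_encoder = OneHotEncoder(sparse=False) # use sparse_output=False in Scikit-Learn 1.2+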
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

Feature Engineering

Sometimes we need to add features that better describe the variation of the target variable. Let's create a custom transformer to add extra attributes. It needs three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kwargs in your constructor), you will get two extra methods, get_params() and set_params(), that will be useful for automatic hyperparameter tuning.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
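To illustrate what the two base classes provide, here is a minimal sketch with a much smaller transformer. RatioAdder and its add_ratio parameter are hypothetical names invented for this example; they are not part of Scikit-Learn or of the housing code above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column 0 / column 1 when enabled."""
    def __init__(self, add_ratio=True):  # explicit keyword argument only
        self.add_ratio = add_ratio
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        if self.add_ratio:
            return np.c_[X, X[:, 0] / X[:, 1]]
        return X

X = np.array([[6.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()

params = adder.get_params()        # provided by BaseEstimator
print(params)

out = adder.fit_transform(X)       # provided by TransformerMixin
print(out)

adder.set_params(add_ratio=False)  # provided by BaseEstimator
print(adder.transform(X).shape)
```

Because the hyperparameter is exposed through get_params() and set_params(), tools such as grid search can switch it on and off without any extra code in the transformer.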

 

posted @ 2019-01-02 09:34  Sherrrry