[Machine Learning with Python] Data Preparation through Transformation Pipeline
In the previous article, "Data Preparation by Pandas and Scikit-Learn", we discussed a series of data preparation steps. Scikit-Learn provides the Pipeline class to help with such sequences of transformations.
The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must implement fit() and transform(), and therefore support fit_transform()). The names can be anything you like, as long as they are unique and do not contain double underscores.
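As a minimal sketch of the constructor, the toy pipeline below uses two built-in transformers. The pipeline and step names here are illustrative choices, not taken from the housing example. It also shows what the step names are used for: looking up a step and addressing its hyperparameters.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; the names here are arbitrary.
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# The names let you look up an individual step...
step_names = [name for name, _ in toy_pipeline.steps]
imputer = toy_pipeline.named_steps["imputer"]

# ...and address its hyperparameters as <step name>__<parameter name>.
toy_pipeline.set_params(imputer__strategy="mean")
```

The double-underscore syntax in set_params() is the reason step names themselves cannot contain double underscores.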
When you call the pipeline’s fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.
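To make the chaining concrete, here is a small sketch that fits a pipeline and then repeats the same sequence by hand. The array X and the choice of steps are made up for illustration; the point is only that both routes should give the same result.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data with one missing value (made up for illustration).
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("min_max", MinMaxScaler()),
    ("std_scaler", StandardScaler()),
])
pipe.fit(X)  # fit_transform() on the first two steps, fit() on the last
via_pipeline = pipe.transform(X)

# The same chain written out by hand.
imputer = SimpleImputer(strategy="median")
min_max = MinMaxScaler()
std_scaler = StandardScaler()
X_step1 = imputer.fit_transform(X)
X_step2 = min_max.fit_transform(X_step1)
std_scaler.fit(X_step2)
by_hand = std_scaler.transform(X_step2)

same_result = np.allclose(via_pipeline, by_hand)
```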
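The pipeline exposes the same methods as the final estimator. In the num_pipeline example below, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform() method that we could use instead of calling fit() and then transform()). The example then uses a ColumnTransformer to apply num_pipeline to the numerical columns and a OneHotEncoder to the categorical column, concatenating the two outputs.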
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# CombinedAttributesAdder, housing and housing_num are defined in the
# previous article.

# Pipeline for the numerical attributes: impute, add attributes, scale.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

try:
    from sklearn.compose import ColumnTransformer
except ImportError:
    from future_encoders import ColumnTransformer  # Scikit-Learn < 0.20

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# Apply num_pipeline to the numerical columns and one-hot encode the
# categorical column.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
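The listing above depends on objects defined in the previous article. For a version that runs on its own, the sketch below swaps in a tiny made-up DataFrame (toy_housing, with invented values) and leaves out the custom CombinedAttributesAdder step, so it only approximates the real pipeline. It assumes Scikit-Learn 0.20 or later and pandas.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for the housing DataFrame (values are illustrative only).
toy_housing = pd.DataFrame({
    "median_income": [8.3, 7.2, np.nan, 5.6],
    "housing_median_age": [41.0, 21.0, 52.0, 52.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})

num_attribs = ["median_income", "housing_median_age"]
cat_attribs = ["ocean_proximity"]

# Numerical columns: fill missing values with the median, then standardize.
toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Route each group of columns to its own transformer and
# concatenate the outputs column-wise.
toy_full_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

toy_prepared = toy_full_pipeline.fit_transform(toy_housing)

# 2 scaled numerical columns + 3 one-hot columns (one per category).
n_rows, n_cols = toy_prepared.shape
```

Depending on how sparse the combined output is, ColumnTransformer may return either a NumPy array or a SciPy sparse matrix, so downstream code should not assume one or the other.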