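Handling Categorical Features in Data Mining
Categorical features, also called discrete features, usually have the object dtype. Most machine learning models can only handle numeric data, so Categorical features must be converted to Numeric features.
Categorical features come in two kinds; we need to understand what each one means and convert it accordingly.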
- Ordinal: these Categorical values have a natural order, so they can be sorted in ascending or descending order. For example, a grade feature may take the four levels A, B, C, D, and ranking by performance gives A > B > C > D.
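- Nominal: these are ordinary Categorical values with no order, so they cannot be sorted. For example, a blood type feature may take the values A, B, O, AB, but you cannot conclude that A > B > O > AB.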
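The difference between the two kinds can be sketched with pandas' Categorical type. This is only an illustration, assuming pandas is available; the sample values and the variable names are made up.

```python
import pandas as pd

# Ordinal: grades, declared with the order D < C < B < A
grades = pd.Series(
    pd.Categorical(["B", "A", "D"], categories=["D", "C", "B", "A"], ordered=True)
)

# Nominal: blood types, declared without any order
blood = pd.Series(
    pd.Categorical(["A", "O", "AB"], categories=["A", "B", "O", "AB"])
)

# Order comparisons are defined for the ordinal feature
best_grade = grades.max()
better_than_b = (grades > "B").tolist()

# For the nominal feature, pandas refuses order comparisons
try:
    blood > "A"
    nominal_is_comparable = True
except TypeError:
    nominal_is_comparable = False
```

Declaring the order explicitly keeps the encoding from depending on how the labels happen to sort alphabetically.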
Ordinal and Nominal data are converted to numbers in different ways.
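Ordinal data can be encoded with LabelEncoder. For example, the four grade levels A, B, C, D are mapped to 0, 1, 2, 3, which keeps the values in their natural sequence. Note that LabelEncoder assigns codes by the sorted order of the labels, so this works only when alphabetical order matches the natural order; otherwise use an explicit mapping.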
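A minimal sketch of ordinal encoding, assuming pandas and scikit-learn are installed. The sample data is made up. LabelEncoder numbers the sorted labels starting from 0, which suits the grades but not a feature such as small/medium/big, where an explicit mapping is safer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

grades = pd.Series(["B", "A", "D", "C", "A"])

# LabelEncoder sorts the distinct labels and numbers them from 0,
# so A, B, C, D become 0, 1, 2, 3.
grade_codes = LabelEncoder().fit_transform(grades)

sizes = pd.Series(["small", "big", "medium", "small"])

# Sorted alphabetically, the labels become big=0, medium=1, small=2,
# which does not follow the natural order small < medium < big.
size_codes_alpha = LabelEncoder().fit_transform(sizes)

# An explicit mapping keeps the intended order.
size_order = {"small": 1, "medium": 2, "big": 3}
size_codes = sizes.map(size_order)
```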
Nominal data can be one-hot encoded with OneHotEncoder. With pandas, the equivalent steps are:
- Use the pandas get_dummies() function to return a new DataFrame containing one new column for each dummy variable
- Use the pandas concat() function to add these dummy columns back to the original DataFrame
- Then drop the original columns entirely using the drop method
- If you are dealing with an ordinal feature, map its values to 1, 2, 3, 4 or 3, 2, 1 or similar, if they are not already mapped. An ordinal feature is one whose values can be arranged in an order that makes logical sense. For example, take a feature “Size” with the alphanumeric values “small, medium, big”: “big” is bigger than “small”, so comparing the values makes sense, and you can map “small, medium, big” to 1, 2, 3. Example in Titanic: Pclass is an ordinal feature, since Pclass=1 is better than Pclass=3. In this case Pclass is already mapped to 1, 2, 3, so you don’t have to do anything with it. You would have to map it if Pclass contained alphanumeric values like “high_class, medium_class, low_class”.
- If you are dealing with a categorical (nominal) feature, look at how many categories (possible values of that feature) you have. If you have only 2 categories, map them to 0 and 1 or to -1 and 1, and that’s it. If you have more than 2 categories, create dummy variables. Example in Titanic: Sex is a categorical variable with 2 categories, ‘male’ and ‘female’; map them to 0 and 1, for example, and that’s it. It is not ordinal because male is neither better nor worse than female; you can’t logically compare them. Embarked is a categorical feature too, but it has 3 categories instead of 2, so you make dummy variables out of it. Make just 2, not 3: the 3rd one is redundant, since it is fully determined by the other two. Well, this feature is redundant by itself, but anyway.
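The steps in the list above can be sketched as follows, assuming pandas is available. The rows are made-up values shaped like the Titanic columns named in the text, not the real dataset.

```python
import pandas as pd

# Toy rows shaped like the Titanic columns (values are made up)
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],          # ordinal and already numeric: leave as is
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Two categories: a single 0/1 column is enough.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Three categories: create dummy columns, keeping 2 of the 3
# (drop_first=True drops the first category, here "C").
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", drop_first=True)

# Add the dummy columns back, then drop the original column.
df = pd.concat([df, dummies], axis=1)
df = df.drop(columns=["Embarked"])
```

Dropping one dummy loses no information: a row with zeros in both Embarked_Q and Embarked_S must belong to the dropped category.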
Edit: following further discussion, there are cases where turning ordinal features into dummies may improve your score a bit. It’s hard to tell beforehand, so it can be useful to build 2 sets of features, one with the ordinal data and the other with the ordinal data one-hot encoded, compare the results on various models, and pick the one that works best in your specific case.
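One possible way to run that comparison is sketched below, assuming pandas and scikit-learn are installed. The data is a made-up toy sample, and LogisticRegression with 3-fold cross-validation is just one arbitrary choice of model and scoring; in practice you would repeat this with the models you actually care about.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up toy data, for illustration only
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 3, 2, 1, 3],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})
y = df["Survived"]

# Feature set 1: keep the ordinal column as numbers
X_ordinal = df[["Pclass"]]

# Feature set 2: the same column turned into dummy variables
X_onehot = pd.get_dummies(
    df["Pclass"], prefix="Pclass", drop_first=True
).astype(int)

# Score both feature sets with the same model and the same folds
feature_sets = {"ordinal": X_ordinal, "one-hot": X_onehot}
scores = {
    name: cross_val_score(LogisticRegression(), X, y, cv=3).mean()
    for name, X in feature_sets.items()
}
best = max(scores, key=scores.get)
```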