Dummy variables vs. one-hot encoding for machine learning

Source: https://www.quora.com/What-is-the-difference-between-One-Hot-Encoding-and-a-CountVectorizer

As the previous answers suggested, the two are very closely related, but there is a subtle difference in how they are used and thought of. In layman’s terms:

Dummy variables:

  • You replace the categorical variable with several boolean variables (taking value 0 or 1) that encode whether or not the categorical variable took a certain value. To encode a categorical variable that can take k values, you only need k-1 dummy variables.
  • Often used in more statistical domains, as it uses the “correct number of degrees of freedom”.

One-hot encoding:

  • You replace the categorical variable with a vector indicating “in which dimension” your variable lives. This vector has k dimensions.
  • Often used in CS domains.

The basic takeaway is that dummy variables take into account the fact that we can use one fewer dimension, by realising that if a sample is not in any of the previous categories then it must be in the last one. To give a concrete example (if you know a bit about ML): dummy variables would be used in linear regression if you’re using an intercept (“to keep the correct number of degrees of freedom”), while one-hot encoding would be used in XGBoost, because otherwise the last category could only be taken into account by trees that had access to all the other dummy variables.

PS: note that people use the two terms interchangeably. Ex: get_dummies in pandas returns a one-hot encoding by default.
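
For illustration, a minimal pandas sketch of both encodings; the drop_first flag is what switches get_dummies from k one-hot columns to k-1 dummy variables:

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"])

# One-hot encoding: k boolean columns, one per category.
print(pd.get_dummies(colors))

# Dummy coding: k-1 columns; the dropped first category becomes
# the implicit baseline (a row of all zeros).
print(pd.get_dummies(colors, drop_first=True))
```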

Hope it helps :)

---

One-hot encoding is a binary representation of categorical data. It became popular after deep learning came into practice, because categorical data can’t be used directly with many ML algorithms.

It is very simple, and one can understand it as follows. Let’s say we have three colors: ‘red’, ‘green’, and ‘blue’.

  1. We first convert these to integers, for example red->1, green->2, and blue->3.
  2. In one-hot encoding, each category is represented by a vector of the same fixed length. The vector length is the number of categories (here, the number of colors), and exactly one entry, the one at the position corresponding to the category’s integer value, is 1; all others are 0. Our three colors can be represented as follows:

red-> [1, 0, 0],
green-> [0, 1, 0],
blue-> [0, 0, 1].
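
A minimal plain-Python sketch of these two steps (using 0-based indices rather than the 1-based ones above):

```python
colors = ["red", "green", "blue"]

# Step 1: map each color to an integer index.
to_index = {color: i for i, color in enumerate(colors)}

# Step 2: turn each index into a vector with a single 1 at that position.
def one_hot(color):
    vec = [0] * len(colors)
    vec[to_index[color]] = 1
    return vec

print(one_hot("red"))    # [1, 0, 0]
print(one_hot("green"))  # [0, 1, 0]
print(one_hot("blue"))   # [0, 0, 1]
```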

---

One-hot encoding transforms categorical features into a format that works better with classification and regression algorithms.

Let’s take the following example: seven sample inputs of categorical data belonging to four categories. Now, I could encode these as integer labels, but that wouldn’t make sense from a machine learning perspective. We can’t say that the category “Penguin” is greater or smaller than “Human”; integer labels would treat the categories as ordinal values, when they are actually nominal.

What we do instead is generate one boolean column for each category. For each sample, exactly one of these columns takes the value 1; hence the term one-hot encoding.

This works very well with most machine learning algorithms. Some algorithms, like random-forest implementations in certain libraries, handle categorical values natively; then one-hot encoding is not necessary. The process of one-hot encoding may seem tedious, but fortunately, most modern machine learning libraries can take care of it.
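
As a sketch of the boolean-column construction with scikit-learn (“Human” and “Penguin” come from the text above; the other two category names and the exact seven samples are invented for illustration):

```python
from sklearn.preprocessing import OneHotEncoder

# Seven samples in four categories; "Octopus" and "Alien" are
# stand-ins for the two categories not named in the text.
X = [["Human"], ["Penguin"], ["Octopus"], ["Human"],
     ["Alien"], ["Penguin"], ["Human"]]

encoder = OneHotEncoder(sparse_output=False)  # sparse_output needs scikit-learn >= 1.2
encoded = encoder.fit_transform(X)

print(encoder.categories_)  # the four category labels, sorted alphabetically
print(encoded)              # 7x4 matrix with exactly one 1.0 per row
```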
