What happens in one hot encoding?
Lets discuss this with an example.
Suppose you are going to choose ice cream for your desert. But you heed for certain information about ice cream before selecting. Those are the flavor, capacity and the brand. These three are the features.
You have following values for those features.
flavor = {chocolate, vanilla, butterscotch, strawberry, cheesecake}
capacity = {0.5L, 1L, 2L}
brand = {Cargills, Elephant house, Baskin-Robbins}
Depending on the various combinations of those features you will choose your desert. Here in one hot encoding, for each distinct value in a single feature, it allocates a place ( a separate column) in the resulting matrix. If a particular, record ( a combination) has a value for a feature, it represents 1 in the corresponding position in the resulting matrix.
Consider capacity :
This feature has only 3 distinct values, so only 3 bits are allocated for this.
0.5L 1L 2L
---------------------------------------
| 1 | 0 | 0 |
| | | |
---------------------------------------
Flavor and brand have 4 and 3 distinct values, so they get 4 and 3 bits respectively
Likewise we do this to every feature. Then, they are concatenated horizontally.
Note that for a particular feature, only a single place can have the value one and others should be zero.
Suppose we have the following row in the resulting matrix.
0 1 0 0 1 0 0 0 0 1
| |
flavor capacity brand
This represents a combination of vanilla, 0.5L, Baskin-Robbins ice cream.
What makes One hot encoder better?
This method does not make much effect on non-linear learning algorithms such as decision trees, KNN, etc. But improvement of descriptiveness of the data effects on linear learning algorithms such as logistic regression.
The most imperative advantage of this method arises when it comes to feature selection. When we search for the best set of features in data mining, what we does usually is, removing the most irrelevant features by some algorithm. But in those removed features there can be important values which has bigger correlation to the target, but got rejected due to other less important values. Even in the selected set of features, there can be particular value that doesn't make any sense. So, by removing or keeping the entire feature, we cannot deal with that issue.
One hot encoder eliminates the requirement of bothering about this. It presents each distinct value as a feature, so that every value can be tested and handled appropriately.
Disadvantages:
- Consumes large amount of memory to keep data.(obvious)
- Extra time to do the encoding.
But for better results, it is worth spending extra memory and time.
“May it be a light to you in dark places, when all other lights go out.”
― J.R.R. Tolkien, The Fellowship of the Ring
― J.R.R. Tolkien, The Fellowship of the Ring
Thanks & Cheers!