Friday, July 26, 2013

One hot encoding for machine learning

One hot encoding is a way of representing categorical data for predictive modeling. It turns each categorical feature into a set of binary (0/1) columns, one per distinct value, which makes the data more descriptive for many learning algorithms.

What happens in one hot encoding?

Let's discuss this with an example.
Suppose you are going to choose an ice cream for your dessert, but you need certain information about the ice cream before selecting: the flavor, the capacity and the brand. These three are the features.
You have following values for those features.

flavor = {chocolate, vanilla, butterscotch, strawberry, cheesecake}
capacity = {0.5L, 1L, 2L}
brand = {Cargills, Elephant house, Baskin-Robbins}

Depending on the combination of these feature values, you will choose your dessert. In one hot encoding, each distinct value of a feature is allocated its own place (a separate column) in the resulting matrix. If a particular record (a combination) takes a given value for a feature, the corresponding position in the resulting matrix is set to 1.

Consider capacity :

This feature has only 3 distinct values, so only 3 bits are allocated for this.

   0.5L     1L      2L
-------------------------
|   1    |   0    |   0   |
-------------------------

Flavor and brand have 5 and 3 distinct values, so they get 5 and 3 bits respectively.
Likewise, we do this for every feature and then concatenate the results horizontally.
Note that, for a particular feature, only a single place can have the value one; all the others must be zero.

Suppose we have the following row in the resulting matrix.

0    1    0    0    0   |   1    0    0   |   0    0    1
        flavor              capacity            brand

This represents a combination of vanilla, 0.5L, Baskin-Robbins ice cream.
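To make this concrete, here is a minimal sketch in plain Python/NumPy of how such a row could be built by hand. The one_hot helper and the value lists are purely illustrative, not part of any library.

import numpy as np

# Distinct values of each feature, in a fixed order (taken from the example above)
flavors    = ['chocolate', 'vanilla', 'butterscotch', 'strawberry', 'cheesecake']
capacities = ['0.5L', '1L', '2L']
brands     = ['Cargills', 'Elephant house', 'Baskin-Robbins']

def one_hot(value, categories):
    # One bit per distinct value; only the matching position is set to 1
    bits = np.zeros(len(categories), dtype=int)
    bits[categories.index(value)] = 1
    return bits

# Encode one record (a combination of feature values) and concatenate horizontally
record = ('vanilla', '0.5L', 'Baskin-Robbins')
row = np.concatenate([one_hot(record[0], flavors),
                      one_hot(record[1], capacities),
                      one_hot(record[2], brands)])
print(row)   # [0 1 0 0 0 1 0 0 0 0 1]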

What makes one hot encoding better?

This method does not have much effect on non-linear learning algorithms such as decision trees, KNN, etc. But the improved descriptiveness of the data does matter for linear learning algorithms such as logistic regression, which would otherwise have to treat an arbitrary numeric coding of the categories as if it were an ordered quantity.
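As a rough illustration: with an integer coding, a linear model such as logistic regression learns only one weight for the whole feature, whereas with one hot columns it learns a separate weight per value. The tiny data set below is invented purely for the sketch.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up example: one categorical feature with 3 values, integer-coded as 0, 1, 2
X_int = np.array([[0], [1], [2], [0], [1], [2]])
y     = np.array([ 1,   0,   1,   1,   0,   1 ])   # only value 1 behaves differently

# Integer coding forces the model to treat 0 < 1 < 2 as an ordered quantity
clf_int = LogisticRegression().fit(X_int, y)

# One hot coding gives each value its own column, hence its own weight
X_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
clf_onehot = LogisticRegression().fit(X_onehot, y)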

The most important advantage of this method shows up in feature selection. When we search for the best set of features in data mining, what we usually do is remove the most irrelevant features using some algorithm. But a removed feature can still contain individual values that correlate strongly with the target; they get rejected together with the other, less useful values. Likewise, a selected feature can contain particular values that carry no information. By removing or keeping an entire feature, we cannot deal with that issue.
One hot encoding eliminates the need to worry about this. It presents each distinct value as its own feature, so every value can be tested and handled appropriately.
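For example, a univariate test such as chi-squared can then score each one hot column (i.e. each distinct value) on its own, so an informative value can be kept even if the rest of its original feature is dropped. A minimal sketch with made-up data:

import numpy as np
from sklearn.feature_selection import chi2

# Made-up one hot matrix: 3 columns = 3 distinct values of one original feature
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
y = np.array([1, 0, 1, 1, 0, 1])

# chi2 returns one score (and p-value) per column, i.e. per distinct value,
# so each value can be kept or dropped independently
scores, pvalues = chi2(X, y)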


Disadvantages:

  • Consumes a large amount of memory to store the data (which is obvious).
  • Takes extra time to do the encoding.

But for better results, it is worth spending extra memory and time.


“May it be a light to you in dark places, when all other lights go out.”
― J.R.R. Tolkien, The Fellowship of the Ring

Thanks & Cheers!

Saturday, February 16, 2013

Evaluating missing features using the scikit-learn library in Python


This code is still being improved...

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier


# Make sure you provide only the important features (after eliminating unwanted ones).
# training_data : numpy array of strings, where missing values are empty strings ''
# feature_num   : index of the particular missing feature you want to predict
def feature_predict(training_data, feature_num, num_of_rows, num_of_cols):

    # fatt[i] will hold the sorted distinct values of the i-th remaining feature
    fatt = np.zeros((num_of_cols - 1,), dtype=object)

    # Rows where the target feature is present: drop the target column, keep the rest
    known = training_data[0::, feature_num] != ''
    train_new = np.concatenate((training_data[known, 0:feature_num],
                                training_data[known, feature_num + 1:]), 1)

    # The values of the target feature in those rows become the class labels
    targets = training_data[known, feature_num]
    targets_new = np.zeros(len(targets), dtype=int)

    # Map each distinct target value to an integer code (sorted for a stable ordering)
    tar = sorted(set(targets))
    tell = {}
    for i in xrange(len(tar)):
        tell[tar[i]] = i

    for i in xrange(len(targets)):
        targets_new[i] = tell[targets[i]]

    # Collect the sorted distinct values of every remaining feature
    rest = np.concatenate((training_data[0::, 0:feature_num],
                           training_data[0::, feature_num + 1:]), 1)
    for i in xrange(num_of_cols - 1):
        fatt[i] = sorted(set(rest[:, i]))

    # Rows where the target feature is missing: these are the rows we will predict
    missing = training_data[0::, feature_num] == ''
    valid = np.concatenate((training_data[missing, 0:feature_num],
                            training_data[missing, feature_num + 1:]), 1)

    feature = np.zeros((len(train_new), num_of_cols - 1), dtype=int)
    valid_new = np.zeros((len(valid), num_of_cols - 1), dtype=int)

    # Replace every string value with its index in the sorted distinct-value list,
    # so the classifier receives an integer-coded matrix
    for i in xrange(num_of_cols - 1):
        for j in xrange(len(train_new)):
            for k, element in enumerate(fatt[i]):
                if train_new[j, i] == element:
                    feature[j, i] = k
                    break

    for i in xrange(num_of_cols - 1):
        for j in xrange(len(valid)):
            for k, element in enumerate(fatt[i]):
                if valid[j, i] == element:
                    valid_new[j, i] = k
                    break

    # Train on the rows with known values and predict the missing ones
    clf = ExtraTreesClassifier(n_estimators=10)
    clf.fit(feature, targets_new)
    res = clf.predict(valid_new)

    # res holds integer codes; tar[code] gives back the original feature value
    return res
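
A minimal, made-up example of how the function might be called (missing values are represented by empty strings, which is what the masks above expect):

# Made-up data: 3 columns, column 1 has some missing (empty string) entries
data = np.array([['red',  '',    'small'],
                 ['blue', 'cat', 'large'],
                 ['red',  'dog', 'small'],
                 ['blue', 'dog', 'large'],
                 ['red',  '',    'large']])

# Predict the missing values in column 1 from the other two columns
predictions = feature_predict(data, 1, len(data), data.shape[1])

Note that the returned array contains integer codes rather than the original strings: code i stands for the i-th value in the sorted list of distinct values seen in the predicted column.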