Note on using OneHotEncoder in scikit-learn to work on categorical features

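OneHotEncoder transforms a categorical feature into a set of binary features, one per category. Its fit method takes an array of integers. One thing the documentation does not state clearly is that np.max(int_array) + 1 should equal the number of categories, since the number of output columns is derived from the largest value. If your category codes are scattered integers and some of them are very large, the encoder ends up with far more columns than you have categories. This is not a memory leak in the strict sense, but memory use blows up and you can get a MemoryError.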
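To make the size mismatch concrete, here is a small sketch using only NumPy. The array values are made up for illustration, and the code only compares the two counts; it is not meant to show what the encoder does internally.

```python
import numpy as np

# Hypothetical category codes: only three distinct values, one of them very large.
int_array = np.array([3, 10, 5000000, 3, 10])

# Number of categories actually present in the data.
n_categories = len(np.unique(int_array))

# Number of columns implied by sizing the output as max + 1.
n_columns_from_max = int(np.max(int_array)) + 1

print(n_categories)        # 3
print(n_columns_from_max)  # 5000001
```

With codes like these, the two numbers differ by more than six orders of magnitude, which is where the memory goes.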

So the best way is to run LabelEncoder() first, which maps the scattered integers to contiguous integers from 0 to n_categories - 1, giving a much smaller max value:

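import sklearn.preprocessing

# Map the original category values to contiguous integers 0..n_categories-1
label_encoder = sklearn.preprocessing.LabelEncoder()
data['category_feature'] = label_encoder.fit_transform(data['category_feature'])

# One-hot encode the relabelled column (the encoder expects a 2-D array;
# .values replaces .as_matrix(), which newer pandas versions no longer provide)
encoder = sklearn.preprocessing.OneHotEncoder()
data_feature_one_hot_encoded = encoder.fit_transform(data[['category_feature']].values)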

The result is a sparse matrix holding the one-hot encoded categorical feature, with one column per category.
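For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same two steps. The DataFrame and its values are invented for the example, and it assumes pandas and scikit-learn are installed; treat it as an illustration rather than a recipe for every version of the library.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical data: three distinct category codes, one of them very large.
data = pd.DataFrame({'category_feature': [3, 10, 5000000, 3, 10]})

# Step 1: relabel the codes as contiguous integers (classes are sorted first).
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(data['category_feature'])

# Step 2: one-hot encode the relabelled column, passed as a 2-D array.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(codes.reshape(-1, 1))

print(list(codes))     # [0, 1, 2, 0, 1]
print(one_hot.shape)   # (5, 3)

# The original codes can be recovered from the label encoder if needed.
original = label_encoder.inverse_transform(codes)
```

Keeping the fitted label_encoder around matters: it is the only record of which column corresponds to which original category value.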
