Note on using OneHotEncoder in scikit-learn to work on categorical features

OneHotEncoder transforms a categorical feature into a set of binary features, one per category. The fit method takes an array of ints. One thing the documentation does not state clearly is that np.max(int_array) + 1 should equal the number of categories. Otherwise, if your integer codes are sparse and some of them are very large, the encoding can be sized by the largest value rather than by the number of distinct categories. This is not a memory leak, but memory use can grow until you get a MemoryError.
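To make the mismatch concrete, here is a small sketch that compares the number of distinct categories with np.max(int_array) + 1. The ids in int_array are made up for illustration, and implied_width is just a name chosen here, not something scikit-learn exposes.

```python
import numpy as np

# Hypothetical ids: only three distinct categories, but one id is very large.
int_array = np.array([7, 1000000, 42, 7])

# Number of categories that actually occur in the data.
n_categories = len(np.unique(int_array))

# Width implied by treating every integer from 0 to the maximum as a category.
implied_width = int(np.max(int_array)) + 1

print(n_categories, implied_width)
```

When the two numbers differ by orders of magnitude, as they do here, the ids should be relabeled before one-hot encoding.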

So the safest approach is to use LabelEncoder() first. It maps the discrete integers to contiguous integers from 0 to n_categories - 1, so the maximum value stays small:

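import sklearn.preprocessing  # data is assumed to be a pandas DataFrame

# Relabel the raw category values as contiguous integers 0..n_categories-1
label_encoder = sklearn.preprocessing.LabelEncoder()
data_label_encoded = label_encoder.fit_transform(data['category_feature'])
data['category_feature'] = data_label_encoded

# One-hot encode the relabeled column (.values replaces the deprecated as_matrix())
encoder = sklearn.preprocessing.OneHotEncoder()
data_feature_one_hot_encoded = encoder.fit_transform(data[['category_feature']].values)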

The result is a sparse matrix containing the one-hot encoded categorical feature, with one column per category.
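As a quick check of the two-step approach, the sketch below runs it on a toy DataFrame and inspects the result. The DataFrame contents are invented for illustration; only LabelEncoder, OneHotEncoder and the column name category_feature come from the note above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data: three distinct category ids, one of them very large.
data = pd.DataFrame({"category_feature": [7, 1000000, 42, 7]})

# Step 1: relabel the ids as contiguous integers (sorted order: 7, 42, 1000000).
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(data["category_feature"])

# Step 2: one-hot encode the relabeled column; the encoder expects a 2-D array.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(codes.reshape(-1, 1))

# Convert to a dense array only for inspection; keep it sparse for real data.
dense = one_hot.toarray()
print(codes)
print(one_hot.shape)
print(dense)
```

The encoded matrix has one column per distinct category rather than one per possible integer value, and label_encoder.classes_ keeps the mapping back to the original ids.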
