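OneHotEncoder transforms a categorical feature into a set of binary features, one per category. In older scikit-learn releases (before 0.20), the fit method takes an array of integers. One thing the documentation does not state clearly is that np.max(int_array) + 1 should equal the number of categories. Otherwise, if your feature holds discrete integer IDs and some of them are very large, the encoder sizes its output from the largest value rather than from the number of distinct values. That is not a memory leak in the strict sense, but memory use blows up and you can end up with a MemoryError.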
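To make the mismatch concrete, here is a minimal sketch in plain Python. The ID values are made up for illustration, and the hand-rolled remapping only shows the idea of compact codes; it is not how scikit-learn implements anything.

```python
# Hypothetical category IDs: only four distinct values, but one is very large.
category_ids = [3, 17, 3, 1_000_000, 42, 17]

# Width implied when the output is sized from the largest value (max + 1).
width_from_max = max(category_ids) + 1

# Width actually needed: one column per distinct category.
distinct_ids = sorted(set(category_ids))
width_needed = len(distinct_ids)

# Remapping each ID to its rank gives contiguous codes 0..k-1,
# for which max + 1 equals the number of categories.
code_of = {value: code for code, value in enumerate(distinct_ids)}
compact_codes = [code_of[value] for value in category_ids]

print(width_from_max)  # 1000001
print(width_needed)    # 4
print(compact_codes)   # [0, 1, 0, 3, 2, 1]
```

Four categories need four columns, yet sizing from the maximum asks for more than a million.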
So the best way is to run LabelEncoder() first, which maps the discrete integers to contiguous codes 0 to k-1 with a much smaller maximum value:
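    import sklearn.preprocessing
    encoder = sklearn.preprocessing.OneHotEncoder()
    label_encoder = sklearn.preprocessing.LabelEncoder()
    # Map the raw integer IDs to contiguous codes 0..k-1.
    data_label_encoded = label_encoder.fit_transform(data['category_feature'])
    data['category_feature'] = data_label_encoded
    # OneHotEncoder expects a 2-D array, hence the double brackets.
    # .values replaces as_matrix(), which newer pandas versions no longer provide.
    data_feature_one_hot_encoded = encoder.fit_transform(data[['category_feature']].values)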
The result is a sparse matrix containing the one-hot encoded categorical feature, with one column per category.
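For a version you can run end to end, here is a small sketch on toy data. The ID values are invented for illustration, and a plain list stands in for the DataFrame column; it assumes a scikit-learn installation where OneHotEncoder returns a sparse matrix by default.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data (hypothetical): four distinct category IDs, one very large.
category_ids = [3, 17, 3, 1_000_000, 42, 17]

# Step 1: map the raw IDs to contiguous codes 0..k-1.
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(category_ids)

# Step 2: one-hot encode the codes; OneHotEncoder expects a 2-D array,
# so the 1-D codes are reshaped into a single column.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(codes.reshape(-1, 1))

print(codes)              # contiguous codes, one per row
print(one_hot.shape)      # (rows, number of distinct categories)
print(one_hot.toarray())  # dense view, fine for a toy example only

# The original IDs can be recovered from the codes when needed.
original_ids = label_encoder.inverse_transform(codes)
```

Keeping the fitted label_encoder around matters in practice: the same mapping has to be applied to any new data before it goes through the one-hot encoder.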