Categorical Encoding
Many machine learning models cannot work with categorical data directly and require numerical inputs. sorix provides encoding tools to handle these scenarios.
1. OneHotEncoder
The OneHotEncoder creates one new binary column for each unique category in a feature. A 1 marks the presence of that category in a row, and a 0 marks its absence.
Example
Let's see an example with categorical features.
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
import numpy as np
import pandas as pd
import sorix
from sorix.preprocessing import OneHotEncoder
# Create sample data with categorical features
data = {
'color': ['red', 'blue', 'green', 'blue', 'red'],
'size': ['S', 'M', 'L', 'S', 'M']
}
X = pd.DataFrame(data)
X
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)
print("OneHot Encoded data:\n", X_encoded)
print("\nEncoded features names:\n", encoder.get_features_names())
OneHot Encoded data:
 [[0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0.]]

Encoded features names:
 ['color_blue', 'color_green', 'color_red', 'size_L', 'size_M', 'size_S']
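As an independent cross-check of the matrix above, the same sample data can be encoded with pandas' own `get_dummies` function. This is only a comparison sketch: it does not use sorix, and the variable name `dummies` is chosen here for illustration.

```python
import pandas as pd

# Same sample data as in the example above
X = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red"],
    "size": ["S", "M", "L", "S", "M"],
})

# pandas builds one indicator column per category, named "<column>_<category>"
dummies = pd.get_dummies(X, dtype=float)

print(list(dummies.columns))
print(dummies.to_numpy())
```

If both approaches agree, the column names and the rows printed here should line up with the encoder output shown above.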
2. Conceptual Note
The OneHotEncoder is especially useful for nominal categorical data (where there is no inherent order). For each category, it creates a new column with binary values. If we have $k$ categories for a feature, we get $k$ binary features.
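The "$k$ categories give $k$ binary features" rule can be sketched for a single feature with plain NumPy. This is a minimal illustration under the assumption that categories are ordered alphabetically; it is not how sorix necessarily implements the encoding.

```python
import numpy as np

colors = np.array(["red", "blue", "green", "blue", "red"])

# np.unique returns the sorted categories and, for each row, the index of its category
categories, codes = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)     # the k unique categories
print(one_hot.shape)  # (n_samples, k)
```

Each row contains exactly one 1, so the encoding carries no ordering between categories, which is why it suits nominal data.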
Methods available include:
- fit: Learns the unique categories from the data.
- transform: Encodes the categories into binary vectors.
- inverse_transform: Returns the index of the encoded categories.
- get_features_names: Returns the names of the new binary columns.
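To make the role of each method concrete, here is a small stand-alone sketch of an encoder with the same four method names. The class `TinyOneHot`, its dict-of-lists input, and the dict returned by `inverse_transform` are all assumptions made for illustration; sorix's actual implementation and return types may differ.

```python
import numpy as np

class TinyOneHot:
    """Hypothetical illustration only; not sorix's implementation."""

    def fit(self, columns):
        # columns: dict mapping feature name -> list of category values
        self.names = list(columns)
        self.categories = {name: sorted(set(values)) for name, values in columns.items()}
        return self

    def transform(self, columns):
        blocks = []
        for name in self.names:
            cats = self.categories[name]
            block = np.zeros((len(columns[name]), len(cats)))
            for row, value in enumerate(columns[name]):
                block[row, cats.index(value)] = 1.0  # raises ValueError on unseen values
            blocks.append(block)
        return np.hstack(blocks)

    def inverse_transform(self, encoded):
        # Recover, per feature, the index of the active category in each row
        indices, start = {}, 0
        for name in self.names:
            width = len(self.categories[name])
            indices[name] = encoded[:, start:start + width].argmax(axis=1)
            start += width
        return indices

    def get_features_names(self):
        return [f"{name}_{cat}" for name in self.names for cat in self.categories[name]]


data = {"color": ["red", "blue", "green"], "size": ["S", "M", "L"]}
enc = TinyOneHot().fit(data)
encoded = enc.transform(data)
print(enc.get_features_names())
print(enc.inverse_transform(encoded))
```

The key design point is that `fit` is the only step that learns anything: `transform`, `inverse_transform`, and `get_features_names` all read the categories stored during fitting.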
# Save the fitted encoder (the whole object) to a .sor file
sorix.save(encoder, 'my_encoder.sor')
# Reload the encoder instance
loaded_encoder = sorix.load('my_encoder.sor')
# Verify consistency
X_test = pd.DataFrame({'color': ['blue'], 'size': ['S']})
assert np.allclose(encoder.transform(X_test), loaded_encoder.transform(X_test))
print("OneHotEncoder object successfully saved and reloaded using sorix components!")
OneHotEncoder object successfully saved and reloaded using sorix components!
B. Using state_dict and load_state_dict
If you only want to save the internal categories without pickling the entire object, you can save the state_dict explicitly.
# 1. Extract the state dictionary
params_dict = encoder.state_dict()
# 2. Save the dictionary with sorix.save (.sor extension)
sorix.save(params_dict, 'encoder_params.sor')
# 3. Load the dictionary with sorix.load
loaded_params = sorix.load('encoder_params.sor')
# 4. Apply state to a fresh instance
new_encoder = OneHotEncoder()
new_encoder.load_state_dict(loaded_params)
# 5. Verify results
assert np.allclose(encoder.transform(X_test), new_encoder.transform(X_test))
print("Encoder state_dict successfully saved and loaded!")
Encoder state_dict successfully saved and loaded!
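The idea behind the state-dict approach, separating the learned state from the object that holds it, can be sketched without sorix. The class `TinyEncoder` and the `"categories"` key are hypothetical, and JSON is used purely for illustration; the layout of sorix's real `state_dict` and of `.sor` files is not described here.

```python
import json

class TinyEncoder:
    """Hypothetical stand-in for an encoder; not sorix's OneHotEncoder."""

    def __init__(self):
        self.categories = {}

    def fit(self, columns):
        self.categories = {name: sorted(set(values)) for name, values in columns.items()}
        return self

    def state_dict(self):
        # Only the learned state, as plain Python containers
        return {"categories": self.categories}

    def load_state_dict(self, state):
        self.categories = {name: list(cats) for name, cats in state["categories"].items()}


fitted = TinyEncoder().fit({"color": ["red", "blue", "green"], "size": ["S", "M", "L"]})

# Round-trip the state through JSON text instead of serializing the object itself
payload = json.dumps(fitted.state_dict())
restored = TinyEncoder()
restored.load_state_dict(json.loads(payload))

print(restored.categories == fitted.categories)
```

Because only plain data is stored, the saved state stays readable even if the class definition changes, as long as the new class can interpret the same keys.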