Cardinal Encoding

When categorical features in the dataset contain many levels (also known as high cardinality features), typical One Hot Encoding creates a very large number of new features, which slows the experiment down and can introduce noise for certain machine learning algorithms. Features with high cardinality can be handled in PyCaret using the high_cardinality_features parameter within setup. It supports two methods for cardinal encoding: Frequency / Count Based and Clustering. The method is selected with the high_cardinality_method parameter within setup.
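To make the problem concrete, here is a minimal plain-Python sketch of one hot encoding. It is not PyCaret's implementation; the one_hot helper and the sample values are hypothetical and exist only to show that the number of new columns equals the number of distinct levels.

```python
def one_hot(values):
    """Return the sorted distinct levels and one 0/1 row per input value."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical sample column; a real high cardinality feature has far more levels.
countries = ["United-States", "Mexico", "India", "United-States", "Cuba", "Mexico"]
levels, rows = one_hot(countries)

print(len(levels))  # 4 distinct levels, so 4 new columns
print(rows[0])      # [0, 0, 0, 1] -> "United-States" is the last level alphabetically
```

A feature with hundreds of distinct levels would add hundreds of mostly-zero columns in the same way.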


Parameters in setup

high_cardinality_features: list of strings, default = None
When the data contains features with high cardinality, they can be compressed into fewer levels by passing their column names as a list. Features are compressed using the method defined in the high_cardinality_method parameter.

high_cardinality_method: string, default = 'frequency'
When set to 'frequency', it replaces each original value of the feature with its frequency in the data and converts the feature to numeric. The other available method is 'clustering', which performs clustering on the statistical attributes of the data and replaces each original value of the feature with its cluster label. The number of clusters is determined using a combination of the Calinski-Harabasz and Silhouette criteria.
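The idea behind the 'frequency' method can be sketched in plain Python. This is an illustration only, not PyCaret's internal code: the frequency_encode helper and the sample values are hypothetical, and it assumes raw counts are used (PyCaret may normalize them differently).

```python
from collections import Counter


def frequency_encode(values):
    """Replace each categorical value with the number of times it occurs."""
    counts = Counter(values)
    return [counts[value] for value in values]


# Hypothetical sample column standing in for a high cardinality feature.
countries = ["United-States", "Mexico", "India", "United-States", "Cuba", "Mexico"]
encoded = frequency_encode(countries)

print(encoded)  # [2, 2, 1, 2, 1, 2]
```

The feature stays a single numeric column regardless of how many levels it has. Note that levels with the same count, such as "United-States" and "Mexico" here, become indistinguishable after encoding.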


How to use?


# Importing dataset
from pycaret.datasets import get_data
income = get_data('income')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = income, target = 'income >50K', high_cardinality_features = ['native-country'])


Notice how the native-country variable is transformed into a numeric variable. By default, the count-based method is used to encode high cardinality features. To change the method, use the high_cardinality_method parameter within setup.
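Switching to the clustering method might look like the sketch below. It assumes a PyCaret release that accepts the high_cardinality_features and high_cardinality_method parameters described on this page. The names setup_kwargs and run_setup are hypothetical and used only for illustration; the PyCaret imports sit inside the function so the snippet can be loaded even where PyCaret is not installed.

```python
# Keyword arguments for setup; only high_cardinality_method differs
# from the default behaviour shown in the example above.
setup_kwargs = {
    "target": "income >50K",
    "high_cardinality_features": ["native-country"],
    "high_cardinality_method": "clustering",
}


def run_setup():
    """Load the income dataset and initialize setup with clustering-based encoding."""
    # Imported here so that defining the function does not require PyCaret.
    from pycaret.datasets import get_data
    from pycaret.classification import setup

    income = get_data("income")
    return setup(data=income, **setup_kwargs)
```

Calling run_setup() in an environment with PyCaret installed should initialize the experiment with native-country replaced by cluster labels instead of counts.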

