Combine Rare Levels

Sometimes a dataset has one or more categorical features with a very large number of levels (i.e. high-cardinality features). When such features are encoded into numeric values (for example, with one-hot encoding), the resulting matrix is sparse. This slows the experiment down, because the number of features and therefore the size of the dataset grows many times over, and it also introduces noise into the experiment. A sparse matrix can be avoided by combining the rare levels of the high-cardinality features. In PyCaret, this is done with the combine_rare_levels parameter of setup.
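To make the sparsity problem concrete, here is a minimal sketch using pandas only. The toy column and its values are hypothetical and not taken from any PyCaret dataset. It one-hot encodes a single categorical column and measures how much of the resulting matrix is zero.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with 6 distinct levels,
# four of which ("C" to "F") appear only once.
df = pd.DataFrame({"city": ["A"] * 5 + ["B"] * 3 + ["C", "D", "E", "F"]})

# One-hot encoding creates one column per level.
encoded = pd.get_dummies(df["city"])
n_columns = encoded.shape[1]

# Each row has exactly one non-zero cell, so the share of zeros
# grows with the number of levels.
zero_fraction = 1.0 - encoded.to_numpy().sum() / encoded.size

print(n_columns, round(zero_fraction, 3))
```

With only six levels, most of the encoded cells are already zero. A feature with hundreds of levels would add hundreds of mostly empty columns.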


Parameters in setup: 

combine_rare_levels: bool, default = False
When set to True, all levels in categorical features that fall below the threshold defined in the rare_level_threshold parameter are combined into a single level. At least two levels must fall below the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. This technique is generally used to limit the sparsity caused by a high number of levels in categorical features.

rare_level_threshold: float, default = 0.1
Percentile distribution of level frequency below which rare categories are combined. Only takes effect when combine_rare_levels is set to True.
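The idea behind these two parameters can be sketched in plain pandas. This is an illustration under stated assumptions, not PyCaret's internal implementation. It assumes the threshold is compared against each level's relative frequency. The helper name combine_rare and the replacement label "others_combined" are hypothetical.

```python
import pandas as pd

def combine_rare(series, threshold=0.1, new_label="others_combined"):
    """Hypothetical helper: merge levels whose relative frequency is
    below `threshold` into one level named `new_label`."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    # Mirror the documented rule: do nothing unless at least two
    # levels fall under the threshold.
    if len(rare) < 2:
        return series.copy()
    return series.where(~series.isin(rare), new_label)

# Hypothetical toy column: "C" to "F" each make up 1/12 of the rows.
city = pd.Series(["A"] * 5 + ["B"] * 3 + ["C", "D", "E", "F"])
combined = combine_rare(city, threshold=0.1)

print(combined.value_counts())
```

In this toy column, the four levels that each cover fewer than 10% of the rows collapse into one, so the feature drops from six levels to three before encoding.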


How to use?


# Importing dataset
from pycaret.datasets import get_data
income = get_data('income')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = income, target = 'income >50K', combine_rare_levels = True)

