Handle Unknown Levels

When unseen data contains levels of a categorical feature that were not present when the model was trained, the trained algorithm may fail to generate accurate predictions. One way to deal with such data points is to reassign them to a known level of the categorical feature, i.e. a level present in the training dataset. This can be achieved in PyCaret using the handle_unknown_categorical parameter, which is set to True by default. It supports two methods, ‘least_frequent’ and ‘most_frequent’, which can be selected using the unknown_categorical_method parameter within setup.


Parameters in setup 

handle_unknown_categorical: bool, default = True
When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level learned from the training data. The replacement method is chosen with the unknown_categorical_method parameter.

unknown_categorical_method: string, default = ‘least_frequent’
Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.
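
To make the two methods concrete, here is a minimal sketch of the replacement idea in plain Python. It is not PyCaret's internal implementation. The helper names fit_replacement_level and replace_unknown_levels, the toy region values, and the tie-breaking rule (the first level encountered wins) are all assumptions made for illustration.

```python
from collections import Counter


def fit_replacement_level(train_values, method="least_frequent"):
    """Pick the training level that will stand in for unknown levels.

    Hypothetical helper: ties are broken by whichever level appears
    first in train_values, which is an assumption of this sketch.
    """
    counts = Counter(train_values)
    if method == "most_frequent":
        return max(counts, key=counts.get)
    if method == "least_frequent":
        return min(counts, key=counts.get)
    raise ValueError(f"unsupported method: {method!r}")


def replace_unknown_levels(values, known_levels, replacement):
    """Keep known levels as-is and map every unseen level to replacement."""
    return [v if v in known_levels else replacement for v in values]


# Toy training column: 'northeast' never appears here.
train_region = [
    "southwest", "southeast", "southeast",
    "northwest", "southeast", "northwest",
]
# Toy unseen column containing the new level 'northeast'.
unseen_region = ["southeast", "northeast", "northwest"]

known = set(train_region)
most = fit_replacement_level(train_region, "most_frequent")
least = fit_replacement_level(train_region, "least_frequent")

cleaned_most = replace_unknown_levels(unseen_region, known, most)
cleaned_least = replace_unknown_levels(unseen_region, known, least)
```

In this toy data, 'southeast' is the most frequent training level and 'southwest' the least frequent, so the unknown 'northeast' becomes 'southeast' under ‘most_frequent’ and 'southwest' under ‘least_frequent’. The replacement level is chosen from the training data only, and the unseen data is never used to recompute frequencies.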


How to use?


# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(
    data = insurance,
    target = 'charges',
    handle_unknown_categorical = True,
    unknown_categorical_method = 'most_frequent',
)

