Ignore Low Variance
Sometimes a dataset has a categorical feature with multiple levels whose distribution is skewed, so that one level dominates the others. Such a feature shows little variation, may add little information to an ML model, and can therefore be ignored for modeling. In PyCaret this is done with the ignore_low_variance parameter in setup. A feature is considered low variance only when both of the conditions below are met.
- Count of unique values in a feature / sample size < 10%
- Count of most common value / count of second most common value > 20
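The two conditions above can be sketched with pandas. This is a minimal illustration and not PyCaret's internal implementation: the helper name is_low_variance and its threshold arguments are hypothetical, and treating a single-level column as meeting the second condition is an assumption, since the text does not say how that case is handled.

```python
import pandas as pd


def is_low_variance(col: pd.Series,
                    max_unique_ratio: float = 0.10,
                    min_dominance: float = 20.0) -> bool:
    """Hypothetical helper that applies both low-variance conditions to one column."""
    counts = col.value_counts()  # level counts, sorted from most to least common
    if counts.empty:
        return False

    # Condition 1: count of unique values / sample size < 10%
    unique_ratio = len(counts) / len(col)

    # Condition 2: most common count / second most common count > 20
    if len(counts) == 1:
        # Assumption: with no second level, treat the ratio as unbounded
        dominance = float("inf")
    else:
        dominance = counts.iloc[0] / counts.iloc[1]

    return bool(unique_ratio < max_unique_ratio and dominance > min_dominance)


# One level dominates: 3 levels in 100 rows, and 97 / 2 = 48.5
skewed = pd.Series(["a"] * 97 + ["b"] * 2 + ["c"])
# Few levels but evenly split: 50 / 50 = 1
balanced = pd.Series(["a"] * 50 + ["b"] * 50)

print(is_low_variance(skewed))    # True
print(is_low_variance(balanced))  # False
```

Both example columns pass the first condition, so only the dominance ratio separates them, which is why both checks are needed.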
Parameters in setup
ignore_low_variance: bool, default = False
When set to True, all categorical features with low variance are removed from the dataset. Variance is assessed using two ratios: the number of unique values to the number of samples, and the count of the most common value to the count of the second most common value.
How to use?
# Importing dataset
from pycaret.datasets import get_data
mice = get_data('mice')

# Filter the rows on the Genotype column to demonstrate the example
mice = mice[mice['Genotype'] == 'Control']

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = mice, target = 'class', ignore_low_variance = True)