Ignore Low Variance

Sometimes a dataset contains a categorical feature with multiple levels whose distribution is skewed, so that one level dominates the others. Such a feature carries little variation, adds little information for an ML model, and can be ignored during modeling. In PyCaret this is done with the ignore_low_variance parameter of setup. A feature is considered low variance only if both conditions below are met.

  • Count of unique values in the feature / sample size < 10%
  • Count of the most common value / count of the second most common value > 20
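
The two conditions above can be sketched with pandas as follows. This is a minimal illustration, not PyCaret's actual implementation: the function name is_low_variance and its threshold arguments are hypothetical, and treating a single-level column as low variance is an assumption made here because the second ratio is undefined in that case.

```python
import pandas as pd


def is_low_variance(series: pd.Series,
                    unique_ratio_max: float = 0.10,
                    dominance_min: float = 20.0) -> bool:
    """Return True if the series meets both low-variance conditions (hypothetical helper)."""
    n_samples = len(series)
    if n_samples == 0:
        return False

    # value_counts() sorts levels by frequency, most common first
    counts = series.value_counts()
    if len(counts) == 0:
        # Column contains only missing values; nothing to evaluate
        return False

    # Condition 1: count of unique values / sample size < 10%
    unique_ratio = len(counts) / n_samples

    # Condition 2: most common count / second most common count > 20
    if len(counts) == 1:
        # Assumption: a single-level column counts as fully dominated
        dominance = float("inf")
    else:
        dominance = counts.iloc[0] / counts.iloc[1]

    return bool(unique_ratio < unique_ratio_max and dominance > dominance_min)
```

For example, a column of 100 rows with 97 occurrences of one level and 3 of another has a unique-value ratio of 2/100 = 2% and a dominance ratio of 97/3 ≈ 32, so it meets both conditions. A column split 50/50 between two levels fails the second condition.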


Parameters in setup 

ignore_low_variance: bool, default = False
When set to True, all categorical features with statistically insignificant variance are removed from the dataset. Variance is assessed using two ratios: the number of unique values to the number of samples, and the count of the most common value to the count of the second most common value.


How to use?


# Importing dataset
from pycaret.datasets import get_data
mice = get_data('mice')

# Keep only rows where Genotype is 'Control' to demonstrate the example
mice = mice[mice['Genotype'] == 'Control']

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = mice, target = 'class', ignore_low_variance = True)

