Missing Value Imputation


Datasets often contain missing values or empty records, usually encoded as blanks or NaN. Most machine learning algorithms cannot handle missing or blank values. Removing samples with missing values is a basic strategy that is sometimes used, but it comes at the cost of losing potentially valuable data and the information or patterns associated with it. A better strategy is to impute the missing values. By default, PyCaret imputes missing values with the ‘mean’ for numeric features and a ‘constant’ for categorical features. To change the imputation method, use the numeric_imputation and categorical_imputation parameters within setup.

 

Parameters in setup: 


numeric_imputation: string, default = ‘mean’
If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available option is ‘median’, which imputes missing values using the median of the feature in the training dataset.

categorical_imputation: string, default = ‘constant’
If missing values are found in categorical features, they will be imputed with the constant value ‘not_available’. The other available option is ‘mode’, which imputes missing values using the most frequent value of the feature in the training dataset.
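
To make the four strategies concrete, here is a minimal plain-Python sketch of what each option computes. It is an illustration only, not PyCaret's implementation: the helper names numeric_fill, categorical_fill and impute are hypothetical, and missing values are assumed to be represented as None. The fill value is learned from the training values and then applied to any other data, which mirrors the "in the training dataset" wording above.

```python
from collections import Counter
from statistics import mean, median


def numeric_fill(train_values, strategy="mean"):
    """Return the fill value for a numeric feature, learned from training data."""
    observed = [v for v in train_values if v is not None]
    if strategy == "mean":
        return mean(observed)
    if strategy == "median":
        return median(observed)
    raise ValueError(f"unknown numeric strategy: {strategy!r}")


def categorical_fill(train_values, strategy="constant", constant="not_available"):
    """Return the fill value for a categorical feature, learned from training data."""
    if strategy == "constant":
        return constant
    if strategy == "mode":
        observed = [v for v in train_values if v is not None]
        # most_common(1) returns [(value, count)] for the most frequent value
        return Counter(observed).most_common(1)[0][0]
    raise ValueError(f"unknown categorical strategy: {strategy!r}")


def impute(values, fill):
    """Replace every missing (None) entry with the given fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical toy data, not taken from the hepatitis dataset
train_age = [30, None, 50, 40, 100]
test_age = [None, 25]
train_sex = ["male", None, "female", "male"]

mean_age = numeric_fill(train_age, "mean")
median_age = numeric_fill(train_age, "median")

train_age_imputed = impute(train_age, mean_age)
test_age_imputed = impute(test_age, median_age)   # training statistic reused on new data
sex_constant = impute(train_sex, categorical_fill(train_sex))
sex_mode = impute(train_sex, categorical_fill(train_sex, "mode"))
```

In this toy data the single large value (100) pulls the mean above the median, which is the usual reason to prefer ‘median’ for skewed numeric features.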

 

How to use?


 

# Importing dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

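# Importing module and initializing setup
# (uses the defaults: numeric_imputation = 'mean', categorical_imputation = 'constant')
from pycaret.classification import *
clf1 = setup(data = hepatitis, target = 'Class')

# To change the imputation method, pass the parameters explicitly, for example:
# clf1 = setup(data = hepatitis, target = 'Class', numeric_imputation = 'median', categorical_imputation = 'mode')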

 
