Missing Value Imputation
Datasets may contain missing values or empty records for a variety of reasons, often encoded as blanks or NaN. Most machine learning algorithms cannot handle missing or blank values. Removing samples with missing values is a basic strategy that is sometimes used, but it comes at the cost of losing potentially valuable data and the associated information or patterns. A better strategy is to impute the missing values. By default, PyCaret imputes missing values in the dataset using ‘mean’ for numeric features and ‘constant’ for categorical features. To change the imputation method, use the numeric_imputation and categorical_imputation parameters within setup.
Parameters in setup:
numeric_imputation: string, default = ‘mean’
If missing values are found in numeric features, they are imputed with the mean value of the feature. The other available option is ‘median’, which imputes missing values using the median of the feature in the training dataset.
categorical_imputation: string, default = ‘constant’
If missing values are found in categorical features, they are imputed with the constant value ‘not_available’. The other available option is ‘mode’, which imputes missing values using the most frequent value in the training dataset.
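To make the four options concrete, here is a minimal plain-Python sketch of what each strategy computes. It is an illustration only, not PyCaret's internal implementation: the helper names (is_missing, fit_fill_value, impute) and the toy columns are hypothetical. The fill value is learned from the training column and then reused on unseen data.

```python
import math
from statistics import mean, median, mode


def is_missing(value):
    """Treat None and float NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_fill_value(train_column, strategy):
    """Compute the replacement value from the observed training values."""
    observed = [v for v in train_column if not is_missing(v)]
    if strategy == "mean":
        return mean(observed)
    if strategy == "median":
        return median(observed)
    if strategy == "mode":
        return mode(observed)
    if strategy == "constant":
        return "not_available"
    raise ValueError(f"unknown strategy: {strategy}")


def impute(column, fill_value):
    """Replace every missing entry in the column with fill_value."""
    return [fill_value if is_missing(v) else v for v in column]


# Toy training columns with one missing entry each.
train_age = [30.0, 50.0, math.nan, 40.0, 80.0]
train_sex = ["male", None, "female", "male"]

age_fill_mean = fit_fill_value(train_age, "mean")
age_fill_median = fit_fill_value(train_age, "median")
sex_fill_mode = fit_fill_value(train_sex, "mode")
sex_fill_constant = fit_fill_value(train_sex, "constant")

train_age_imputed = impute(train_age, age_fill_mean)
train_sex_imputed = impute(train_sex, sex_fill_constant)

# Unseen data reuses the value learned from the training column.
test_age_imputed = impute([math.nan, 20.0], age_fill_mean)
```

With a skewed column such as the toy ages above, the mean and the median give different fill values, which is the usual reason to prefer ‘median’ when a feature has outliers.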
How to use?
# Importing dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = hepatitis, target = 'Class')
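The example above relies on the default imputation settings. As a hedged sketch, the non-default options described in this section could be passed to setup as shown below. The wrapper function setup_with_custom_imputation is a hypothetical name introduced for illustration. The only setup parameters used are the ones documented above.

```python
def setup_with_custom_imputation(data, target):
    """Initialize a PyCaret classification experiment with non-default imputation.

    Hypothetical helper: it only forwards the two imputation parameters
    described in this section to pycaret.classification.setup.
    """
    # Imported lazily so that defining this helper does not require PyCaret.
    from pycaret.classification import setup

    return setup(
        data=data,
        target=target,
        numeric_imputation="median",    # instead of the default 'mean'
        categorical_imputation="mode",  # instead of the default 'constant'
    )


# Example usage (requires PyCaret, plus network access for get_data):
# from pycaret.datasets import get_data
# hepatitis = get_data("hepatitis")
# clf2 = setup_with_custom_imputation(hepatitis, target="Class")
```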