Sampling


When the dataset contains more than 25,000 samples, PyCaret enables sampling of the dataset by default. It trains a preliminary linear model at various sample levels and prints a plot showing the performance of the trained model as a function of the sample level, which is shown on the x-axis. This plot can then be used to choose the sample size for training the models. For example, you may want to choose a smaller sample size to train models faster. To change the estimator from the default linear model, use the sample_estimator parameter within setup. To turn off sampling, set the sampling parameter to False.
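The idea of scoring a simple model at increasing sample levels can be sketched in plain Python. This is only an illustration of the concept, not PyCaret's internal code: the data is synthetic, the one-feature least-squares fit stands in for the preliminary linear model, R2 stands in for the plotted metrics, and the function names (fit_line, r2_score, performance_by_sample_level) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def performance_by_sample_level(xs, ys, levels, seed=0):
    """Fit on a growing share of the training rows, score on a fixed hold-out."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))  # 80% train pool, 20% hold-out (an assumption)
    train_idx, test_idx = idx[:cut], idx[cut:]
    test_x = [xs[i] for i in test_idx]
    test_y = [ys[i] for i in test_idx]
    results = []
    for level in levels:
        k = max(2, round(level * len(train_idx)))
        sub = train_idx[:k]
        slope, intercept = fit_line([xs[i] for i in sub], [ys[i] for i in sub])
        results.append((level, r2_score(test_x, test_y, slope, intercept)))
    return results


# Synthetic regression data: y = 3x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

levels = [i / 10 for i in range(1, 11)]  # sample levels 0.1 ... 1.0
results = performance_by_sample_level(xs, ys, levels)
for level, score in results:
    print(f"sample level {level:.1f}  ->  R2 {score:.4f}")
```

If the score flattens out well before the sample level reaches 1.0, a smaller sample is likely to give similar model quality in less training time.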

This functionality is only available in the pycaret.classification and pycaret.regression modules.

 

Parameters in setup 


sampling: bool, default = True
When the dataset exceeds 25,000 samples, PyCaret builds a base estimator at various sample sizes drawn from the original dataset. It returns a performance plot of common evaluation metrics at each sample level, which helps in deciding the preferred sample size for modeling. The desired sample size must then be entered for training and validation in the PyCaret environment. When the sample_size entered is less than 1, the remaining dataset (1 - sample_size) is used for fitting the model only when finalize_model() is called.

sample_estimator: object, default = None
Estimator used to evaluate performance at each sample level. If None, a linear model is used.
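The sample_size behaviour described under the sampling parameter can be pictured as a split into a working sample and a remainder that is set aside. The sketch below only illustrates that split under simple assumptions (a random shuffle with a fixed seed); split_by_sample_size is a hypothetical helper, not a PyCaret function, and it does not reproduce how PyCaret selects rows.

```python
import random


def split_by_sample_size(rows, sample_size, seed=0):
    """Return (sample, remainder) for a fraction 0 < sample_size <= 1."""
    if not 0 < sample_size <= 1:
        raise ValueError("sample_size must be in the interval (0, 1]")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(sample_size * len(shuffled))
    # 'sample' is what training and validation would see during the experiment;
    # 'remainder' is the (1 - sample_size) share that stays unused until the end.
    return shuffled[:cut], shuffled[cut:]


rows = list(range(30000))  # stand-in for a dataset with 30,000 samples
sample, remainder = split_by_sample_size(rows, 0.3)
print(len(sample), len(remainder))
```

With sample_size equal to 1 the remainder is empty, which matches the case where the whole dataset is used from the start.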

 

How to use?


 

# Importing dataset
from pycaret.datasets import get_data
bank = get_data('bank')

# Importing module and initializing setup
# 'deposit' is a categorical target, so the classification module is used
from pycaret.classification import *
clf1 = setup(data = bank, target = 'deposit')
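The two parameters documented on this page can be passed to setup as in the sketch below. It assumes a PyCaret version whose setup accepts sampling and sample_estimator as described here, and that sample_estimator takes an estimator object. The wrapper function setup_bank is hypothetical, and the scikit-learn RandomForestClassifier mentioned in the comments is chosen only for illustration.

```python
def setup_bank(sampling=True, sample_estimator=None):
    """Initialize a classification experiment on the 'bank' dataset.

    Imports are done inside the function so that this sketch can be defined
    even where PyCaret is not installed.
    """
    from pycaret.datasets import get_data
    from pycaret.classification import setup

    bank = get_data('bank')
    return setup(
        data=bank,
        target='deposit',
        sampling=sampling,                  # False turns sampling off
        sample_estimator=sample_estimator,  # None keeps the default linear model
    )


# Example calls (not executed here):
#   setup_bank(sampling=False)
#   from sklearn.ensemble import RandomForestClassifier
#   setup_bank(sample_estimator=RandomForestClassifier())
```

Turning sampling off skips the sample-size prompt, while a custom estimator changes only the model used to draw the performance plot.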

 
