Train Test Split

The goal in machine learning is to build a model that generalizes well to new data. For this reason, the dataset is split into a train dataset and a test dataset during a supervised machine learning experiment. The test dataset serves as a proxy for new data. In PyCaret, trained models are evaluated and their hyperparameters are optimized using k-fold cross-validation on the train dataset only. The test dataset (also known as the hold-out set) is not used to train models, so it can be passed through the predict_model function to compute metrics and check whether a model has overfitted the data. By default, PyCaret uses 70% of the dataset for training; this can be changed with the train_size parameter of setup.
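To make the idea concrete, here is a minimal pure-Python sketch of a hold-out split followed by k-fold cross-validation on the train rows only. It is not PyCaret's internal implementation: the helper names split_indices and k_fold, the 100-row dataset, the seed, and k = 10 are all assumptions chosen for illustration.

```python
import random

def split_indices(n_rows, train_size=0.7, seed=0):
    """Shuffle row indices and split them into train and hold-out parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_size)
    return indices[:n_train], indices[n_train:]

def k_fold(train_indices, k=10):
    """Yield (fit, validate) index lists for k-fold CV over the train rows only."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate

# Hypothetical dataset of 100 rows, split 70/30.
train_idx, test_idx = split_indices(n_rows=100, train_size=0.7)
folds = list(k_fold(train_idx, k=10))

# Every row touched by cross-validation comes from the train part;
# the hold-out rows stay unseen until the final evaluation.
used_in_cv = {idx for fit, validate in folds for idx in fit + validate}
print(len(train_idx), len(test_idx))    # 70 30
print(used_in_cv.isdisjoint(test_idx))  # True
```

Because the hold-out rows never appear in any fold, a large gap between cross-validated metrics and hold-out metrics is a useful signal of overfitting.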

This functionality is only available in the pycaret.classification and pycaret.regression modules.


Parameters in setup 

train_size: float, default = 0.7
Size of the training set. By default, 70% of the data will be used for training and validation. The remaining data will be used for a test / hold-out set.
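As a rough illustration of how train_size translates into row counts, the sketch below computes the approximate sizes of the two parts for a hypothetical 1,000-row dataset. The split_sizes helper, its range check, and its rounding rule are assumptions made for this example; PyCaret's exact validation and rounding may differ.

```python
def split_sizes(n_rows, train_size=0.7):
    """Return (train rows, hold-out rows) for a given train fraction."""
    # Assumed validation: the fraction must leave rows on both sides.
    if not 0 < train_size < 1:
        raise ValueError("train_size must be between 0 and 1")
    n_train = round(n_rows * train_size)
    return n_train, n_rows - n_train

# Hypothetical dataset of 1,000 rows.
for fraction in (0.7, 0.5, 0.9):
    n_train, n_test = split_sizes(1000, fraction)
    print(f"train_size={fraction}: {n_train} train rows, {n_test} hold-out rows")
```

A larger train_size gives the model more rows to learn from but leaves fewer hold-out rows, which makes the final hold-out metrics noisier.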


How to use?


# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', train_size = 0.5)


