Train Test Split
The goal in machine learning is to build a model that generalizes well to new data. For this reason, the dataset is split into a train dataset and a test dataset during a supervised machine learning experiment. The test dataset serves as a proxy for new data. In PyCaret, trained models are evaluated and hyperparameters are optimized using k-fold cross-validation on the train dataset only. The test dataset (also known as the hold-out set) is not used to train models, so it can be passed to the predict_model function to evaluate metrics and determine whether the model has overfitted the data. By default, PyCaret uses 70% of the dataset for training, which can be changed using the train_size parameter within setup.
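The idea of cross-validating on the train dataset only can be illustrated with a minimal, library-free sketch. This is not PyCaret's internal implementation: the helpers holdout_split and k_fold are hypothetical names introduced here, and the row count of 100 and the 10 folds are assumed values chosen for illustration.

```python
import random


def holdout_split(n_rows, train_size=0.7, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_size)
    return indices[:n_train], indices[n_train:]


def k_fold(train_indices, k=10):
    """Yield (fit, validate) index lists drawn from the train indices only."""
    for i in range(k):
        validate = train_indices[i::k]          # every k-th row forms one fold
        held_out = set(validate)
        fit = [j for j in train_indices if j not in held_out]
        yield fit, validate


# Hypothetical dataset of 100 rows: 70 for train, 30 for the hold-out set.
train_idx, test_idx = holdout_split(n_rows=100, train_size=0.7)

# Cross-validation folds never touch test_idx, so the hold-out set stays unseen.
folds = list(k_fold(train_idx, k=10))
```

Because every fold is built from train_idx, the rows in test_idx remain unseen until a final evaluation, which is the role the hold-out set plays in the description above.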
Parameters in setup
train_size: float, default = 0.7
Size of the training set. By default, 70% of the data will be used for training and validation. The remaining data will be used for a test / hold-out set.
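To see how train_size translates into row counts, here is a small arithmetic sketch. The function split_sizes is a hypothetical helper, the 1,000-row dataset is an assumed example, and the rounding and range check are assumptions made for illustration; PyCaret's exact behavior may differ.

```python
def split_sizes(n_rows, train_size=0.7):
    """Return (train_rows, test_rows) for a given fraction of training data."""
    # Assumption for this sketch: train_size is a fraction strictly between 0 and 1.
    if not 0 < train_size < 1:
        raise ValueError("train_size must be a fraction between 0 and 1")
    n_train = round(n_rows * train_size)
    return n_train, n_rows - n_train


default_split = split_sizes(1000)                # (700, 300)
half_split = split_sizes(1000, train_size=0.5)   # (500, 500)
```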
How to use?
# Importing dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = insurance, target = 'charges', train_size = 0.5)