Setting up Environment


Before we begin any machine learning experiment in PyCaret, we must set up the environment. This involves two simple steps:

 

Step 1: Importing a Module


Depending on the type of experiment you want to perform, one of the six currently supported modules must be imported into your Python environment. Importing a module prepares the environment for a specific task. For example, if you import the Classification module, the environment will be set up to perform classification tasks only.

S.No  Module                        How to Import
1     Classification                from pycaret.classification import *
2     Regression                    from pycaret.regression import *
3     Clustering                    from pycaret.clustering import *
4     Anomaly Detection             from pycaret.anomaly import *
5     Natural Language Processing   from pycaret.nlp import *
6     Association Rule Mining       from pycaret.arules import *

 

Step 2: Initializing the setup


Common to all modules in PyCaret, setup is the first and only mandatory step in any machine learning experiment. Besides performing some basic processing tasks by default, PyCaret also offers a wide array of pre-processing features that elevate an average machine learning experiment to an advanced solution. This section covers only the essential part of the setup function; detailed documentation of all the pre-processing features can be found here. Listed below are the essential default tasks performed by PyCaret when you initialize the setup:

Data Type Inference: Every experiment in PyCaret begins with determining the correct data types for all features. Based on the data types inferred by its internal algorithm, the setup function performs several downstream tasks such as ignoring ID and Date columns, categorical encoding, and missing value imputation. Once setup is executed, a dialogue box (see example below) appears listing all the features and their inferred data types. The inferences are usually correct, but the user should review the list for accuracy. If all data types are inferred correctly, you may press Enter to continue; if not, you may type 'quit' to stop the experiment.

 

If you choose to type 'quit' because one or more data types were not inferred correctly, you can overwrite them in the setup command by passing the categorical_features parameter to force categorical type and the numeric_features parameter to force numeric type. Similarly, to exclude certain features from the experiment, you can pass the ignore_features parameter within setup.
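Type overrides matter most when a categorical variable is stored as integers. The sketch below uses plain pandas (not PyCaret) to show why naive inference would mislabel such a column; the commented setup call illustrates how the override parameters described above would be passed. The column names here are made up for illustration:

```python
import pandas as pd

# Toy frame: 'size_code' is a category stored as integers, so plain
# type inference flags it as numeric; 'row_id' is an ID column that
# should not be part of the experiment.
df = pd.DataFrame({
    'size_code': [1, 2, 3, 1, 2],
    'price': [10.5, 20.0, 30.25, 12.0, 21.5],
    'row_id': [101, 102, 103, 104, 105],
})

print(df.dtypes['size_code'])  # int64 -- would be treated as numeric

# In PyCaret you would correct this at setup time (sketch, not run here):
# exp1 = setup(data = df, target = 'price',
#              categorical_features = ['size_code'],
#              ignore_features = ['row_id'])
```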

Note: If you don't want PyCaret to display the dialogue for confirmation of data types, you may pass silent as True within setup to perform an unattended run of the experiment. We don't recommend this unless you are absolutely sure the inferences are correct, you have performed the experiment before, or you are overwriting the data types using the numeric_features and categorical_features parameters.

Data Cleaning and Preparation: The setup function automatically performs missing value imputation and categorical encoding, as they are imperative for any machine learning experiment. By default, the mean value is used to impute numeric features and the most frequent value (mode) is used for categorical features. You may change the method using the numeric_imputation and categorical_imputation parameters. For classification problems, setup also performs target encoding if the target is not of numeric type.
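The defaults can be approximated in plain pandas. The sketch below mimics what setup does internally (mean for numeric features, mode for categorical features); it is an illustration of the idea, not PyCaret's actual implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 35.0],       # numeric feature with a gap
    'colour': ['red', 'blue', None, 'red'],  # categorical feature with a gap
})

# Numeric default: mean imputation (numeric_imputation = 'mean')
df['age'] = df['age'].fillna(df['age'].mean())

# Categorical default: most frequent value (categorical_imputation)
df['colour'] = df['colour'].fillna(df['colour'].mode()[0])

print(df)  # no missing values remain
```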

Data Sampling: If the sample size is greater than 25,000, PyCaret automatically builds a preliminary linear model on different sample sizes and provides a visual showing model performance as a function of sample size. The plot can then be used to evaluate whether performance improves as the sample size grows. If not, you may choose a smaller sample size in the interest of experiment efficiency. See the example below, where we have used the 'bank' dataset from PyCaret's repository, which has 45,211 samples.
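The idea behind the sampling plot can be sketched with scikit-learn: fit a preliminary linear model on increasing fractions of the training data and compare hold-out performance. This uses a synthetic dataset and is only an illustration of the concept, not PyCaret's internal code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large dataset
X, y = make_classification(n_samples=5000, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(len(X_train) * frac)  # train on a growing slice of the data
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    scores[frac] = model.score(X_test, y_test)
    print(f"{frac:>4}: accuracy = {scores[frac]:.3f}")
```

If accuracy flattens out early, a smaller sample is a reasonable trade-off.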

Train Test Split: The setup function also performs the train test split (stratified for classification problems). The default split ratio is 70:30, but you can change it using the train_size parameter within setup. Evaluation of a trained machine learning model and hyperparameter optimization in PyCaret are performed using k-fold cross-validation on the train set only.
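The equivalent split can be reproduced directly with scikit-learn's train_test_split; the sketch below mirrors PyCaret's 70:30 stratified default on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 700 + [1] * 300)  # imbalanced binary target

# 70:30 stratified split, mirroring PyCaret's default train_size = 0.7
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=123)

print(len(X_train), len(X_test))      # 700 300
print(y_train.mean(), y_test.mean())  # class ratio preserved in both halves
```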

Assigning Session ID as seed: The session id is a pseudo-random number generated by default if no session_id parameter is passed. PyCaret distributes this id as a seed to all functions to isolate the effects of randomization, which allows reproducibility at a later date in the same or a different environment.
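The effect of distributing one fixed seed is easy to demonstrate with NumPy; conceptually, session_id plays the same role across every PyCaret function:

```python
import numpy as np

session_id = 123  # analogous to setup(..., session_id = 123)

# Two independent runs seeded with the same id produce identical draws,
# so any randomized step repeated later gives the same result.
run1 = np.random.RandomState(session_id).rand(5)
run2 = np.random.RandomState(session_id).rand(5)
print(np.array_equal(run1, run2))  # True
```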

 

Classification Example

 

 

Code
# Importing dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

 

Output


Regression Example

 

Code
# Importing dataset
from pycaret.datasets import get_data
boston = get_data('boston')

# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = boston, target = 'medv')

 

Output


Clustering Example

 

Code
# Importing dataset
from pycaret.datasets import get_data
jewellery = get_data('jewellery')

# Importing module and initializing setup
from pycaret.clustering import * 
clu1 = setup(data = jewellery)

 

Output


Anomaly Detection Example

 

Code
# Importing dataset
from pycaret.datasets import get_data
anomalies = get_data('anomaly')

# Importing module and initializing setup
from pycaret.anomaly import *
ano1 = setup(data = anomalies)

 

Output


Natural Language Processing Example

 

Code
# Importing dataset
from pycaret.datasets import get_data
kiva = get_data('kiva')

# Importing module and initializing setup
from pycaret.nlp import *
nlp1 = setup(data = kiva, target = 'en')

 

Output

Association Rule Mining Example

 

Code
# Importing dataset
from pycaret.datasets import get_data
france = get_data('france')

# Importing module and initializing setup
from pycaret.arules import *
arules1 = setup(data = france, transaction_id = 'InvoiceNo', item_id = 'Description')

 

Output
