Setting up Environment


 

setup(data, target, train_size = 0.7, sampling = True, sample_estimator = None, categorical_features = None, categorical_imputation = 'constant', ordinal_features = None, high_cardinality_features = None, high_cardinality_method = 'frequency', numeric_features = None, numeric_imputation = 'mean', date_features = None, ignore_features = None, normalize = False, normalize_method = 'zscore', transformation = False, transformation_method = 'yeo-johnson', handle_unknown_categorical = True, unknown_categorical_method = 'least_frequent', pca = False, pca_method = 'linear', pca_components = None, ignore_low_variance = False, combine_rare_levels = False, rare_level_threshold = 0.10, bin_numeric_features = None, remove_outliers = False, outliers_threshold = 0.05, remove_multicollinearity = False, multicollinearity_threshold = 0.9, create_clusters = False, cluster_iter = 20, polynomial_features = False, polynomial_degree = 2, trigonometry_features = False, polynomial_threshold = 0.1, group_features = None, group_names = None, feature_selection = False, feature_selection_threshold = 0.8, feature_interaction = False, feature_ratio = False, interaction_threshold = 0.01, transform_target = False, transform_target_method = 'box-cox', session_id = None, silent = False, profile = False)

Description:

This function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: a dataframe {array-like, sparse matrix} and the name of the target column. All other parameters are optional.

Code
#import the dataset from pycaret repository
from pycaret.datasets import get_data
boston = get_data('boston')

#import regression module
from pycaret.regression import *

#intialize the setup
exp_reg = setup(boston, target = 'medv')

 

Output


'boston' is a pandas DataFrame and 'medv' is the name of the target column.

Parameters:

data: dataframe
array-like, sparse matrix, shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.

target: string
Name of the target column to be passed in as a string. Since this is the regression module, the target variable should be continuous / numeric.

train_size: float, default = 0.7
Size of the training set. By default, 70% of the data will be used for training  and validation. The remaining data will be used for a test / hold-out set.

sampling: bool, default = True
When the sample size exceeds 25,000 samples, pycaret will build a base estimator at various sample sizes from the original dataset. This will return a performance plot of R2 values at various sample levels, which will assist in deciding the preferred sample size for modeling. The desired sample size must then be entered for training and validation in the pycaret environment. When the sample_size entered is less than 1, the remaining dataset (1 - sample) is used for fitting the model only when finalize_model() is called.

sample_estimator: object, default = None
If None, Linear Regression is used by default.

categorical_features: string, default = None
If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If, when running setup, the type of 'column1' is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = ['column1'].
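
For example, a minimal sketch using the boston dataset from the example above; the binary 'chas' column is used here purely for illustration of overriding an inferred type:

# treat 'chas' as categorical instead of numeric (illustrative)
exp_reg = setup(data = boston, target = 'medv', categorical_features = ['chas'])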

categorical_imputation: string, default = 'constant'
If missing values are found in categorical features, they will be imputed with a constant 'not_available' value. The other available option is 'mode', which imputes the missing value using the most frequent value in the training dataset.

ordinal_features: dictionary, default = None
When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of 'low', 'medium', 'high' and it is known that low < medium < high, then it can be passed as ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }. The list sequence must be in increasing order from lowest to highest.

high_cardinality_features: string, default = None
When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using the method defined in the high_cardinality_method param.

high_cardinality_method: string, default = 'frequency'
When set to 'frequency', the original value of the feature is replaced with its frequency distribution, converting the feature into numeric. The other available method is 'clustering', which clusters the statistical attributes of the data and replaces the original value of the feature with the cluster label. The number of clusters is determined using a combination of the Calinski-Harabasz and Silhouette criteria.

numeric_features: string, default = None
If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. If, when running setup, the type of 'column1' is inferred as categorical instead of numeric, then this parameter can be used to overwrite it by passing numeric_features = ['column1'].

numeric_imputation: string, default = ‘mean’
If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available option is ‘median’ which imputes the value using the median value in the training dataset.
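
For example, a sketch combining both imputation options on the boston example from above:

# use mode imputation for categorical and median imputation for numeric features
exp_reg = setup(data = boston, target = 'medv',
                categorical_imputation = 'mode', numeric_imputation = 'median')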

date_features: string, default = None
If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = 'date_column_name'. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.

ignore_features: string, default = None
If any feature should be ignored for modeling, it can be passed to the param ignore_features. The ID and DateTime columns when inferred, are automatically
set to ignore for modeling.

normalize: bool, default = False
When set to True, the feature space is transformed using the normalize_method param. Generally, linear algorithms perform better with normalized data, however the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.

normalize_method: string, default = 'zscore'
Defines the method to be used for normalization. By default, the normalize method is set to 'zscore'. The standard zscore is calculated as z = (x - u) / s. The other available options are:
'minmax' : scales and translates each feature individually such that it is in the range of 0 - 1.
'maxabs' : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
'robust' : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, the robust scaler often gives better results.
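
A minimal sketch enabling robust scaling on the boston example from above:

exp_reg = setup(data = boston, target = 'medv',
                normalize = True, normalize_method = 'robust')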

transformation: bool, default = False
When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or  other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

transformation_method: string, default = 'yeo-johnson'
Defines the method for transformation. By default, the transformation method is set to 'yeo-johnson'. The other available option is 'quantile' transformation. Both transformations transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.

handle_unknown_categorical: bool, default = True
When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is  defined under the unknown_categorical_method param.

unknown_categorical_method: string, default = ‘least_frequent’
Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.

pca: bool, default = False
When set to True, dimensionality reduction is applied to project the data into  a lower dimensional space using the method defined in pca_method param. In  supervised learning pca is generally performed when dealing with high feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.

pca_method: string, default = 'linear'
The 'linear' method performs linear dimensionality reduction using Singular Value Decomposition. The other available options are:
'kernel' : dimensionality reduction through the use of the RBF kernel.
'incremental' : replacement for 'linear' pca when the dataset to be decomposed is too large to fit in memory.

pca_components: int/float, default = 0.99
Number of components to keep. if pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.
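
For example, a sketch that keeps 10 components using incremental PCA on the boston example from above (boston has 13 features, so 10 is a valid choice):

exp_reg = setup(data = boston, target = 'medv',
                pca = True, pca_method = 'incremental', pca_components = 10)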

ignore_low_variance: bool, default = False
When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique  values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.

combine_rare_levels: bool, default = False
When set to True, all levels in categorical features below the threshold defined in rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique  is applied to limit a sparse matrix caused by high numbers of levels in categorical features.

rare_level_threshold: float, default = 0.1
Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.

bin_numeric_features: list, default = None
When a list of numeric features is passed, they are transformed into categorical features using K-Means, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the 'sturges' method. It is only optimal for gaussian data and underestimates the number of bins for large non-gaussian datasets.
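
A sketch that bins a numeric column into categorical bins; 'age' is a column in the boston dataset, used here only for illustration:

exp_reg = setup(data = boston, target = 'medv',
                bin_numeric_features = ['age'])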

remove_outliers: bool, default = False
When set to True, outliers from the training data are removed using PCA linear dimensionality reduction with the Singular Value Decomposition technique.

outliers_threshold: float, default = 0.05
The percentage / proportion of outliers in the dataset can be defined using the outliers_threshold param. By default, 0.05 is used which means 0.025 of the values on each side of the distribution’s tail are dropped from training data.

remove_multicollinearity: bool, default = False
When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.

multicollinearity_threshold: float, default = 0.9
Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.
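
For example, a sketch that removes outliers and drops highly correlated features in the same setup call (boston example from above):

exp_reg = setup(data = boston, target = 'medv',
                remove_outliers = True, outliers_threshold = 0.05,
                remove_multicollinearity = True, multicollinearity_threshold = 0.8)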

create_clusters: bool, default = False
When set to True, an additional feature is created where each instance is assigned to a cluster. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.

cluster_iter: int, default = 20
Number of iterations used to create a cluster. Each iteration represents cluster size. Only comes into effect when create_clusters param is set to True.

polynomial_features: bool, default = False
When set to True, new features are created based on all polynomial combinations that exist within the numeric features in a dataset to the degree defined in
polynomial_degree param.

polynomial_degree: int, default = 2
Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].

trigonometry_features: bool, default = False
When set to True, new features are created based on all trigonometric combinations that exist within the numeric features in a dataset to the degree defined in the polynomial_degree param.

polynomial_threshold: float, default = 0.1
This is used to compress a sparse matrix of polynomial and trigonometric features. Polynomial and trigonometric features whose feature importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.
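
A sketch combining polynomial and trigonometric feature generation with a tighter threshold (boston example from above):

exp_reg = setup(data = boston, target = 'medv',
                polynomial_features = True, polynomial_degree = 2,
                trigonometry_features = True, polynomial_threshold = 0.05)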

group_features: list or list of list, default = None
When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (i.e ‘Col1’, ‘Col2’, ‘Col3’), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.

group_names: list, default = None
When group_features is passed, a name for the group can be passed into the group_names param as a list containing strings. The length of the group_names list must be equal to the length of group_features. When the length doesn't match or a name is not passed, new features are named sequentially such as group_1, group_2 etc.
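
A minimal sketch, assuming three hypothetical related columns; data and 'target_column' stand in for your own dataset and target:

# 'Col1', 'Col2' and 'Col3' are hypothetical column names used only for illustration
exp_reg = setup(data = data, target = 'target_column',
                group_features = [['Col1', 'Col2', 'Col3']],
                group_names = ['group_a'])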

feature_selection: bool, default = False
When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, Adaboost and Linear correlation with the target variable. The size of the subset is dependent on the feature_selection_threshold param. Generally, this is used to constrain the feature space in order to improve efficiency in modeling. When polynomial_features and feature_interaction are used, it is highly recommended to define the feature_selection_threshold param with a lower value.

feature_selection_threshold: float, default = 0.8
Threshold used for feature selection (including newly created polynomial features). A higher value will result in a larger feature space. It is recommended to do multiple trials with different values of feature_selection_threshold, especially in cases where polynomial_features and feature_interaction are used. Setting a very low value may be efficient but could result in under-fitting.
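
For example, a sketch that generates polynomial features and then constrains the feature space through feature selection with a lower threshold (boston example from above):

exp_reg = setup(data = boston, target = 'medv',
                polynomial_features = True,
                feature_selection = True, feature_selection_threshold = 0.5)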

feature_interaction: bool, default = False
When set to True, it will create new features by interacting (a * b) for all numeric variables in the dataset including polynomial and trigonometric features (if created). This feature is not scalable and may not work as expected on datasets with large feature space.

feature_ratio: bool, default = False
When set to True, it will create new features by calculating the ratios (a / b) of all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.

interaction_threshold: float, default = 0.01
Similar to polynomial_threshold, it is used to compress a sparse matrix of newly created features through interaction. Features whose importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.

transform_target: bool, default = False
When set to True, target variable is transformed using the method defined in transform_target_method param. Target transformation is applied separately from feature transformations.

transform_target_method: string, default = ‘box-cox’
‘Box-cox’ and ‘yeo-johnson’ methods are supported. Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive or negative data. When transform_target_method is ‘box-cox’ and target variable contains negative values, method is internally forced to ‘yeo-johnson’ to avoid exceptions.
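
A sketch applying a target transformation on the boston example from above ('medv' is strictly positive, so box-cox is applicable):

exp_reg = setup(data = boston, target = 'medv',
                transform_target = True, transform_target_method = 'box-cox')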

session_id: int, default = None
If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.

silent: bool, default = False
When set to True, confirmation of data types is not required. All preprocessing will be performed assuming automatically inferred data types. Not recommended for direct use except for established pipelines.

profile: bool, default = False
If set to true, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.

Returns:

Information Grid: Information grid is printed.

Environment: This function returns various outputs that are stored in a variable as a tuple. They are used by other functions in pycaret.

Compare Models


 

compare_models(blacklist = None, fold = 10, round = 4, sort = 'R2', turbo = True)

Description:

This function uses all models in the model library and scores them using K-fold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 Folds) of all the available models in model library.

Code
compare_models()

 

Output

When turbo is set to True, 'kr', 'ard' and 'mlp' are excluded due to their longer training times. By default, the turbo param is set to True. Specific models can also be blacklisted using the 'blacklist' parameter within compare_models().

Parameters:

blacklist: string, default = None
In order to omit certain models from the comparison, their abbreviation strings (see the estimator table under create_model below) can be passed as a list in the blacklist param. This is normally done to be more efficient with time.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

sort: string, default = 'R2'
The scoring measure specified is used for sorting the average score grid. Other options are 'MAE', 'MSE', 'RMSE', 'RMSLE' and 'MAPE'.

turbo: Boolean, default = True
When turbo is set to True, it blacklists estimators that have longer training times.
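
For example, a sketch that excludes two estimators, uses 5 folds and sorts the grid by MAE (the blacklisted models are chosen here only for illustration):

compare_models(blacklist = ['catboost', 'xgboost'], fold = 5, sort = 'MAE')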

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds is also returned.

Warnings:

  • compare_models() though attractive, might be time consuming with large datasets. By default turbo is set to True, which blacklists models that have longer training times. Changing turbo parameter to False may result in very high training times with datasets where number of samples exceed 10,000.
  • This function does not return a model object.

Create Model


 

create_model(estimator = None, ensemble = False, method = None, fold = 10, round = 4, verbose = True)

Description:

This function creates a model and scores it using K-fold Cross Validation (default = 10 folds). The output prints a score grid that shows MAE, MSE, RMSE, RMSLE, R2 and MAPE. This function returns a trained model object. setup() must be called before using create_model().

Code
lr = create_model('lr')

 

Output

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
 

Parameters:

estimator : string, default = None
Enter the abbreviated string of the estimator class. List of estimators supported:

Estimator Abbreviated String
Linear Regression ‘lr’
Lasso Regression ‘lasso’
Ridge Regression ‘ridge’
Elastic Net ‘en’
Least Angle Regression ‘lar’
Lasso Least Angle Regression ‘llar’
Orthogonal Matching Pursuit ‘omp’
Bayesian Ridge ‘br’
Automatic Relevance Determination ‘ard’
Passive Aggressive Regressor ‘par’
Random Sample Consensus ‘ransac’
TheilSen Regressor ‘tr’
Huber Regressor ‘huber’
Kernel Ridge ‘kr’
Support Vector Machine ‘svm’
K Neighbors Regressor ‘knn’
Decision Tree ‘dt’
Random Forest ‘rf’
Extra Trees Regressor ‘et’
AdaBoost Regressor ‘ada’
Gradient Boosting Regressor ‘gbr’
Multi Layer Perceptron ‘mlp’
Extreme Gradient Boosting ‘xgboost’
Light Gradient Boosting ‘lightgbm’
CatBoost Regressor ‘catboost’

ensemble: Boolean, default = False
True would result in an ensemble of estimator using the method parameter defined.

method: String, ‘Bagging’ or ‘Boosting’, default = None.
method must be defined when ensemble is set to True. Default method is set to None.
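
For example, a sketch that creates a bagged decision tree regressor in a single call:

# ensemble the decision tree regressor using Bagging (illustrative)
bagged_dt = create_model('dt', ensemble = True, method = 'Bagging')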

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are MAE, MSE, RMSE, RMSLE, R2 and MAPE. Mean and standard deviation of the scores across the folds are also returned.

Model: Trained model object

 

Tune Model


 

tune_model(estimator = None, fold = 10, round = 4, n_iter = 10, optimize = 'r2', ensemble = False, method = None, verbose = True)

Description:

This function tunes the hyperparameters of a model and scores it using K-fold Cross Validation. The output prints the score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (by default = 10 Folds). This function returns a trained model object.  

tune_model() only accepts a string parameter for estimator. 

Code
tuned_xgboost = tune_model('xgboost')

 

Output

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.5, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=20, min_child_weight=4, missing=None, n_estimators=300,
             n_jobs=-1, nthread=None, objective='reg:linear', random_state=786,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=0)

Parameters:

estimator : string, default = None
Enter abbreviated name of the estimator class. List of estimators supported:

Estimator Abbreviated String
Linear Regression ‘lr’
Lasso Regression ‘lasso’
Ridge Regression ‘ridge’
Elastic Net ‘en’
Least Angle Regression ‘lar’
Lasso Least Angle Regression ‘llar’
Orthogonal Matching Pursuit ‘omp’
Bayesian Ridge ‘br’
Automatic Relevance Determination ‘ard’
Passive Aggressive Regressor ‘par’
Random Sample Consensus ‘ransac’
TheilSen Regressor ‘tr’
Huber Regressor ‘huber’
Kernel Ridge ‘kr’
Support Vector Machine ‘svm’
K Neighbors Regressor ‘knn’
Decision Tree ‘dt’
Random Forest ‘rf’
Extra Trees Regressor ‘et’
AdaBoost Regressor ‘ada’
Gradient Boosting Regressor ‘gbr’
Multi Layer Perceptron ‘mlp’
Extreme Gradient Boosting ‘xgboost’
Light Gradient Boosting ‘lightgbm’
CatBoost Regressor ‘catboost’


fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

n_iter: integer, default = 10
Number of iterations within the Random Grid Search. For every iteration, the model randomly selects one value from the pre-defined grid of hyperparameters.

optimize: string, default = ‘r2’
Measure used to select the best model through hyperparameter tuning. The default scoring measure is ‘r2’. Other measures include ‘mae’, ‘mse’.
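
For example, a sketch that tunes a random forest over 50 random search iterations and optimizes MAE:

tuned_rf = tune_model('rf', n_iter = 50, optimize = 'mae')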

ensemble: Boolean, default = False
True enables ensembling of the model through the method defined in the 'method' param.

method: String, ‘Bagging’ or ‘Boosting’, default = None
method comes into effect only when ensemble = True. Default is set to None.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are  also returned.

Model: Trained and tuned model object

Warnings:

  • Estimator parameter takes an abbreviated string. Passing a trained model object returns an error. The tune_model() function internally calls create_model()  before tuning the hyperparameters.

Ensemble Model


 

ensemble_model(estimator, method = 'Bagging', fold = 10, n_estimators = 10, round = 4, verbose = True)

Description:

This function ensembles the trained base estimator using the method defined in ‘method’ param (default = ‘Bagging’). The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 Folds). Model must be created using create_model() or tune_model(). This function returns a trained model object.

Code
# create a decision tree model
dt = create_model('dt')

# ensemble trained decision tree model
ensembled_dt = ensemble_model(dt)

 

Output

BaggingRegressor(base_estimator=DecisionTreeRegressor(ccp_alpha=0.0,
                                                      criterion='mse',
                                                      max_depth=None,
                                                      max_features=None,
                                                      max_leaf_nodes=None,
                                                      min_impurity_decrease=0.0,
                                                      min_impurity_split=None,
                                                      min_samples_leaf=1,
                                                      min_samples_split=2,
                                                      min_weight_fraction_leaf=0.0,
                                                      presort='deprecated',
                                                      random_state=786,
                                                      splitter='best'),
                 bootstrap=True, bootstrap_features=False, max_features=1.0,
                 max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False,
                 random_state=786, verbose=0, warm_start=False)

This will return an ensembled Decision Tree model using ‘Bagging’.

Parameters:

estimator : object, default = None
A trained model object should be passed as an estimator.

method: String, default = 'Bagging'
The Bagging method will create an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset. The other available method is 'Boosting', which will create a meta-estimator by fitting a regressor on the original dataset and then fitting additional copies of the regressor on the same dataset, with instance weights adjusted according to the error of the current predictions so that subsequent regressors focus more on difficult cases.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

n_estimators: integer, default = 10
The number of base estimators in the ensemble. In case of perfect fit, the learning procedure is stopped early.
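
For example, a sketch that boosts the decision tree created earlier with 50 base estimators:

boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators = 50)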

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.

Model: Trained ensembled model object


Blend Models


 

blend_models(estimator_list = 'All', fold = 10, round = 4, turbo = True, verbose = True)

Description:

This function creates an ensemble meta-estimator that fits base regressors on the whole dataset. It then averages the predictions to form a final prediction. By default, this function will use all estimators in the model library (excluding the few estimators excluded when turbo is True) or specific trained estimators passed as a list in the estimator_list param. It scores them using K-fold Cross Validation. The output prints the score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 folds). This function returns a trained model object.

Code
#blend all models
blend_all = blend_models()

#create models for blending
lr = create_model('lr')
rf = create_model('rf')
knn = create_model('knn')

#blend trained models
blend_specific = blend_models(estimator_list = [lr,rf,knn])

 

Output

VotingRegressor(estimators=[('Linear Regression_0',
                             LinearRegression(copy_X=True, fit_intercept=True,
                                              n_jobs=None, normalize=False)),
                            ('Lasso_1',
                             Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                                   max_iter=1000, normalize=False,
                                   positive=False, precompute=False,
                                   random_state=786, selection='cyclic',
                                   tol=0.0001, warm_start=False)),
                            ('Ridge_2',
                             Ridge(alpha=1.0, copy_X=Tr...
                                           learning_rate=0.1, max_depth=-1,
                                           min_child_samples=20,
                                           min_child_weight=0.001,
                                           min_split_gain=0.0, n_estimators=100,
                                           n_jobs=-1, num_leaves=31,
                                           objective=None, random_state=786,
                                           reg_alpha=0.0, reg_lambda=0.0,
                                           silent=True, subsample=1.0,
                                           subsample_for_bin=200000,
                                           subsample_freq=0)),
                            ('CatBoost Regressor_21',
                             <catboost.core.CatBoostRegressor object at 0x0000026E392C09C8>)],
                n_jobs=None, weights=None)

Parameters:

estimator_list : string ('All') or list of objects, default = 'All'

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

turbo: Boolean, default = True
When turbo is set to True, it blacklists estimators that have longer training times.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.

Model: Trained Voting Regressor model object.

 

Stack Models


 

stack_models(estimator_list, meta_model = None, fold = 10, round = 4, restack = True, plot = False, finalize = False, verbose = True)

Description:

This function creates a meta model and scores it using K-fold Cross Validation. The predictions from the base level models as passed in the estimator_list param are used as input features for the meta model. The restack parameter controls the ability to expose raw features to the meta model when set to True (default = True). The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 folds).

This function returns a container which is the list of all models in stacking.

Code
# create models for stacking
dt = create_model('dt')
rf = create_model('rf')
ada = create_model('ada')
ridge = create_model('ridge')
knn = create_model('knn')

# stack trained models
stacked_models = stack_models(estimator_list=[dt,rf,ada,ridge,knn])

 

Output

[DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=786, splitter='best'),
 RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       max_samples=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=None, oob_score=False,
                       random_state=786, verbose=0, warm_start=False),
 AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                   n_estimators=50, random_state=786),
 Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=786, solver='auto', tol=0.001),
 KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform'),
 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 True]

This will create a meta model that will use the predictions of all the models provided in estimator_list param. By default, the meta model is Linear Regression but can be changed with meta_model param.

Parameters:

estimator_list : list of object

meta_model : object, default = None
If set to None, Linear Regression is used as the meta model.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

restack: Boolean, default = True
When restack is set to True, raw data will be exposed to meta model when making predictions, otherwise when False, only the predicted label is passed to meta model when making final predictions.
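
For example, a sketch that stacks the models created earlier with a custom meta model and without exposing raw features; the xgboost model is created here only for illustration:

# create a custom meta model candidate
xgb = create_model('xgboost')

# stack with a custom meta model and restacking disabled
stacked = stack_models(estimator_list = [dt, rf, ada, ridge, knn],
                       meta_model = xgb, restack = False)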

plot: Boolean, default = False
When plot is set to True, it will return the correlation plot of prediction
from all base models provided in estimator_list.

finalize: Boolean, default = False
When finalize is set to True, it will fit the stacker on the entire dataset including the hold-out sample created during the setup() stage. It is not recommended to set this to True here; if you would like to fit the stacker on the entire dataset including the hold-out, use finalize_model().

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.

Container: list of all the models where last element is meta model.

 

Create Stacknet


 

create_stacknet(estimator_list, meta_model = None, fold = 10, round = 4, restack = True, finalize = False, verbose = True)

Description:

This function creates a sequential stack net using cross validated predictions at each layer. The final score grid contains predictions from the meta model using K-fold Cross Validation. Base level models can be passed as estimator_list param, the layers can be organized as a sub list within the estimator_list object. Restacking param controls the ability to expose raw features to meta model.

This function returns a container which is the list of all models in stacking.

Code
# create models
dt = create_model('dt')
rf = create_model('rf') 
ada = create_model('ada') 
ridge = create_model('ridge') 
knn = create_model('knn') 

# create stacknet
stacknet = create_stacknet(estimator_list =[[dt,rf],[ada,ridge,knn]])

 

Output

[[DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                        max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=786, splitter='best'),
  RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                        max_depth=None, max_features='auto', max_leaf_nodes=None,
                        max_samples=None, min_impurity_decrease=0.0,
                        min_impurity_split=None, min_samples_leaf=1,
                        min_samples_split=2, min_weight_fraction_leaf=0.0,
                        n_estimators=100, n_jobs=None, oob_score=False,
                        random_state=786, verbose=0, warm_start=False)],
 [AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                    n_estimators=50, random_state=786),
  Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
        normalize=False, random_state=786, solver='auto', tol=0.001),
  KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                      metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                      weights='uniform')],
 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 True]

This will result in the stacking of models in multiple layers. The first layer contains dt and rf, the predictions of which are used by models in the second layer to generate predictions which are then used by the meta model to generate final predictions. By default, the meta model is Linear Regression but can be changed with meta_model param.

Parameters:

estimator_list : nested list of objects

meta_model : object, default = None
If set to None, Linear Regression is used as the meta model.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

restack: Boolean, default = True
When restack is set to True, raw data and prediction of all layers will be exposed to the meta model when making predictions. When set to False, only the predicted label of last layer is passed to meta model when making final predictions.

finalize: Boolean, default = False
When finalize is set to True, it will fit the stacker on entire dataset including the hold-out sample created during the setup() stage. It is not  recommended to set this to True here, if you would like to fit the stacker on the entire dataset including the hold-out, use finalize_model().

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.

Container: list of all the models where last element is meta model.

 

Plot Model


 

plot_model(estimator = None, plot = 'residuals')

Description:

This function takes a trained model object and returns a plot based on the test / hold-out set. The process may require the model to be re-trained in certain cases. See list of plots supported below. Model must be created using create_model() or tune_model().

Code
# create a model
lr = create_model('lr')

# plot a model 
plot_model(lr)

 

Output

This will return a residuals plot of the trained Linear Regression model.

Parameters:

estimator : object, default = None
A trained model object should be passed as an estimator.

plot : string, default = 'residuals'
Enter the abbreviation for the type of plot. The current list of plots supported is:

Plot Abbreviated String
Residuals Plot ‘residuals’
Prediction Error Plot ‘error’
Cooks Distance Plot ‘cooks’
Recursive Feature Selection ‘rfe’
Learning Curve ‘learning’
Validation Curve ‘vc’
Manifold Learning ‘manifold’
Feature Importance ‘feature’
Model Hyperparameter ‘parameter’
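
For example, a sketch that draws a prediction error plot and a feature importance plot for the model trained in the example above:

plot_model(lr, plot = 'error')
plot_model(lr, plot = 'feature')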

 

Returns:

Visual Plot: Prints the visual plot.

 

Evaluate Model


 

evaluate_model(estimator)

Description:

This function displays a user interface for all of the available plots for a given estimator. It internally uses the plot_model() function.

Code
# create a model
lr = create_model('lr')

# evaluate a model 
evaluate_model(lr)

 

Output

Parameters:

estimator : object, default = none
A trained model object should be passed as an estimator.

Returns:

User Interface : Displays the user interface for plotting.

 

Interpret Model


 

interpret_model(estimator, plot = 'summary', feature = None, observation = None)

Description:

This function takes a trained model object and returns an interpretation plot based on the test / hold-out set. It only supports tree based algorithms. This function is implemented based on the SHAP (SHapley Additive exPlanations), which is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations.

For more information : https://shap.readthedocs.io/en/latest/

Code
# create a model
dt = create_model('dt')

# interpret a model 
interpret_model(dt)

 

Output

Parameters:

estimator : object, default = none
A trained tree based model object should be passed as an estimator.

plot : string, default = ‘summary’
other available options are ‘correlation’ and ‘reason’.

feature: string, default = None
This parameter is only needed when plot = ‘correlation’. By default feature is
set to None which means the first column of the dataset will be used as a variable.
A feature parameter must be passed to change this.

observation: integer, default = None
This parameter only comes into effect when plot is set to ‘reason’. If no observation
number is provided, it will return an analysis of all observations with the option
to select the feature on x and y axes through drop down interactivity. For analysis at
the sample level, an observation parameter must be passed with the index value of the
observation in test / hold-out set.

Returns:

Visual Plot: Returns the visual plot. Returns the interactive JS plot when plot = ‘reason’.

 

Predict Model


 

predict_model(estimator, data=None, platform=None, authentication=None, round=4)

Description:

This function is used to predict new data using a trained estimator. It accepts an estimator created using one of the functions in pycaret that returns a trained model object, or a list of trained model objects created using stack_models() or create_stacknet(). New unseen data can be passed to the data param as a pandas DataFrame. If data is not passed, the test / hold-out set separated at the time of setup() is used to generate predictions.

Code
# create a model
lr = create_model('lr')

# generate predictions on holdout
lr_predictions_holdout = predict_model(lr)

 

Output

Parameters:

estimator : object or list of objects / string, default = None
When the estimator is passed as a string, load_model() is called internally to load the pickle file from the current active directory, or from the cloud platform when the platform param is passed.

data : {array-like, sparse matrix}, shape (n_samples, n_features)
where n_samples is the number of samples and n_features is the number of features. All features used during training must be present in the new dataset.

platform: string, default = None
Name of platform, if loading model from cloud. Current available options are: ‘aws’.

authentication : dict
dictionary of applicable authentication tokens.

When platform = ‘aws’:
{'bucket' : 'Name of Bucket on S3'}

round: integer, default = 4
Number of decimal places the predicted labels will be rounded to.
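
For example, a sketch that scores a new dataset; new_data is a hypothetical pandas DataFrame containing the same features used during training:

# new_data is hypothetical and shown only for illustration
predictions = predict_model(lr, data = new_data, round = 2)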

Returns:

Information Grid : Information grid is printed when data is None.

Warnings:

  • If the estimator passed is created using finalize_model(), then the metrics printed in the information grid may be misleading as the model is trained on the complete dataset including the test / hold-out set. Once finalize_model() is used, the model is considered ready for deployment and should be used on new unseen datasets only.

Finalize Model


 

finalize_model(estimator)

Description:

This function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare for final model deployment after experimentation.

Code
# create a model
lr = create_model('lr')

# finalize model
lr_final = finalize_model(lr)

 

Parameters:

estimator : object, default = none
A trained model object should be passed as an estimator.

Returns:

Model : Trained model object fitted on complete dataset.

Warnings:

  • If the model returned by finalize_model() is used with predict_model() without passing a new unseen dataset, then the information grid printed is misleading as the model is trained on the complete dataset including the test / hold-out sample. Once finalize_model() is used, the model is considered ready for deployment and should be used on new unseen datasets only.

Deploy Model


 

deploy_model(model, model_name, authentication, platform = 'aws')

Description:
(In Preview)

This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

Code
# create a model
lr = create_model('lr')

# deploy model
deploy_model(model = lr, model_name = 'deploy_lr', platform = 'aws', authentication = {'bucket' : 'pycaret-test'})

 

Output

This will deploy the model on an AWS S3 account under the bucket 'pycaret-test'.

For AWS users:

Before deploying a model to AWS S3 ('aws'), environment variables must be configured using the command line interface. To configure AWS environment variables, type aws configure in your command line. The following information is required, which can be generated using the Identity and Access Management (IAM) portal of your Amazon console account:

  • AWS Access Key ID
  • AWS Secret Key Access
  • Default Region Name (can be seen under Global settings on your AWS console)
  • Default output format (must be left blank)

Parameters:

model : object
A trained model object should be passed as an estimator.

model_name : string
Name of model to be passed as a string.

authentication : dict
dictionary of applicable authentication tokens.

When platform = ‘aws’:
{'bucket' : 'Name of Bucket on S3'}

platform: string, default = ‘aws’
Name of platform for deployment. Current available options are: ‘aws’.

Returns:

Message : Success Message

Warnings:

  • This function uses file storage services to deploy the model on a cloud platform. As such, it is efficient for batch use. Where the production objective is to obtain predictions at an instance level, this may not be an efficient choice as it transmits the binary pickle file between your local python environment and the platform.

Save Model


 

save_model(model, model_name, verbose=True)

Description:

This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

Code
# create a model
lr = create_model('lr')

# save a model
save_model(lr, 'lr_model_23122019')

 

Output

Parameters:

model : object, default = none
A trained model object should be passed as an estimator.

model_name : string, default = none
Name of pickle file to be passed as a string.

verbose: Boolean, default = True
Success message is not printed when verbose is set to False.

Returns:

Message : Success Message

 

Load Model


 

load_model(model_name, platform = None, authentication = None, verbose=True)

Description:

This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

Code
saved_lr = load_model('lr_model_23122019') 

 

Output

Parameters:

model_name : string, default = none
Name of pickle file to be passed as a string.

platform: string, default = None
Name of platform, if loading model from cloud. Current available options are: ‘aws’.

authentication : dict
dictionary of applicable authentication tokens.

When platform = ‘aws’:
{'bucket' : 'Name of Bucket on S3'}

verbose: Boolean, default = True
Success message is not printed when verbose is set to False.
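
For example, a sketch that loads the model deployed to AWS S3 in the deploy_model() example above:

saved_lr_aws = load_model('deploy_lr', platform = 'aws',
                          authentication = {'bucket' : 'pycaret-test'})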

Returns:

Message : Success Message

 

Save Experiment


 

save_experiment(experiment_name=None)

Description:

This function saves the entire experiment into the current active directory. All outputs using pycaret are internally saved into a binary list which is pickled when save_experiment() is used.

Code
save_experiment('experiment_23122019')

 

Output

Parameters:

experiment_name : string, default = none
Name of pickle file to be passed as a string.

Returns:

Message : Success Message

 

Load Experiment


 

load_experiment(experiment_name)

Description:

This function loads a previously saved experiment from the current active directory into current python environment. Load object must be a pickle file.

Code
saved_experiment = load_experiment('experiment_23122019')

 

Output


This will load the entire experiment pipeline into the object saved_experiment. The experiment file must be in current directory.

Parameters:

experiment_name : string, default = none
Name of pickle file to be passed as a string.

Returns:

Information Grid : Information Grid containing details of saved objects in experiment pipeline.

 
