Setting up Environment


 

setup(data, target, train_size = 0.7, sampling = True, sample_estimator = None, categorical_features = None, categorical_imputation = 'constant', ordinal_features = None, high_cardinality_features = None, high_cardinality_method = 'frequency', numeric_features = None, numeric_imputation = 'mean', date_features = None, ignore_features = None, normalize = False, normalize_method = 'zscore', transformation = False, transformation_method = 'yeo-johnson', handle_unknown_categorical = True, unknown_categorical_method = 'least_frequent', pca = False, pca_method = 'linear', pca_components = None, ignore_low_variance = False, combine_rare_levels = False, rare_level_threshold = 0.10, bin_numeric_features = None, remove_outliers = False, outliers_threshold = 0.05, remove_multicollinearity = False, multicollinearity_threshold = 0.9, remove_perfect_collinearity = False, create_clusters = False, cluster_iter = 20, polynomial_features = False, polynomial_degree = 2, trigonometry_features = False, polynomial_threshold = 0.1, group_features = None, group_names = None, feature_selection = False, feature_selection_threshold = 0.8, feature_interaction = False, feature_ratio = False, interaction_threshold = 0.01, fix_imbalance = False, fix_imbalance_method = None, data_split_shuffle = True, folds_shuffle = False, n_jobs = -1, html = True, session_id = None, log_experiment = False, experiment_name = None, log_plots = False, log_profile = False, log_data = False, silent = False, verbose = True, profile = False)

Description:

This function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: the dataframe {array-like, sparse matrix} and the name of the target column. All other parameters are optional.

Code
#import the dataset from pycaret repository
from pycaret.datasets import get_data
juice = get_data('juice')

#import classification module
from pycaret.classification import *

#intialize the setup
exp_clf = setup(juice, target = 'Purchase')

 

Output


'juice' is a pandas DataFrame and 'Purchase' is the name of the target column.
Parameters:

data: dataframe
array-like, sparse matrix, shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.

target: string
Name of the target column to be passed in as a string. The target variable could be binary or multiclass. In case of a multiclass target, all estimators are wrapped
with a OneVsRest classifier.

train_size: float, default = 0.7
Size of the training set. By default, 70% of the data will be used for training and validation. The remaining data will be used for a test / hold-out set.

sampling: bool, default = True
When the sample size exceeds 25,000 samples, pycaret will build a base estimator at various sample sizes from the original dataset. This will return a performance plot of AUC, Accuracy, Recall, Precision, Kappa and F1 values at various sample levels, which will assist in deciding the preferred sample size for modeling. The desired sample size must then be entered for training and validation in the pycaret environment. When the sample size entered is less than 1, the remaining dataset (1 - sample) is used for fitting the model only when finalize_model() is called.

sample_estimator: object, default = None
If None, Logistic Regression is used by default.

categorical_features: string, default = None
If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If when running setup the type of 'column1' is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = ['column1'].

categorical_imputation: string, default = ‘constant’
If missing values are found in categorical features, they will be imputed with a constant ‘not_available’ value. The other available option is ‘mode’ which imputes the missing value using most frequent value in the training dataset.

ordinal_features: dictionary, default = None
When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of 'low', 'medium', 'high' and it is known that low < medium < high, then it can be passed as ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }. The list sequence must be in increasing order from lowest to highest.
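For example, a minimal sketch (assuming a hypothetical ordinal column named 'size' in the dataset):

#encode a hypothetical 'size' column as ordinal
exp = setup(data = juice, target = 'Purchase',
            ordinal_features = {'size' : ['low', 'medium', 'high']})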

high_cardinality_features: string, default = None
When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using the method defined in the high_cardinality_method param.

high_cardinality_method: string, default = 'frequency'
When the method is set to 'frequency', it will replace the original value of the feature with the frequency distribution and convert the feature into numeric. The other available method is 'clustering', which performs clustering on the statistical attributes of the data and replaces the original value of the feature with the cluster label. The number of clusters is determined using a combination of the Calinski-Harabasz and Silhouette criteria.

numeric_features: string, default = None
If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. If when running setup the type of 'column1' is inferred as categorical instead of numeric, then this parameter can be used to overwrite by passing numeric_features = ['column1'].

numeric_imputation: string, default = ‘mean’
If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available option is ‘median’ which imputes the value using the median value in the training dataset.

date_features: string, default = None
If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = 'date_column_name'. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.

ignore_features: string, default = None
If any feature should be ignored for modeling, it can be passed to the param ignore_features. The ID and DateTime columns, when inferred, are automatically set to be ignored for modeling.

normalize: bool, default = False
When set to True, the feature space is transformed using the method defined in the normalize_method param. Generally, linear algorithms perform better with normalized data; however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.

normalize_method: string, default = ‘zscore’
Defines the method to be used for normalization. By default, normalize method is set to ‘zscore’. The standard zscore is calculated as z = (x – u) / s. The other available options are:
‘minmax’ : scales and translates each feature individually such that it is in the range of 0 – 1.
‘maxabs’ : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
‘robust’ : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.
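A minimal sketch applying robust scaling to the 'juice' dataset from the earlier example:

#normalize features with the robust scaler
exp = setup(data = juice, target = 'Purchase',
            normalize = True, normalize_method = 'robust')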

transformation: bool, default = False
When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

transformation_method: string, default = ‘yeo-johnson’
Defines the method for transformation. By default, the transformation method is set to 'yeo-johnson'. The other available option is 'quantile' transformation. Both transformations map the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.

handle_unknown_categorical: bool, default = True
When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is defined under the unknown_categorical_method param.

unknown_categorical_method: string, default = ‘least_frequent’
Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.

pca: bool, default = False
When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in the pca_method param. In supervised learning, pca is generally performed when dealing with a high-dimensional feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.

pca_method: string, default = 'linear'
The 'linear' method performs linear dimensionality reduction using Singular Value Decomposition. The other available options are:
'kernel' : dimensionality reduction through the use of the RBF kernel.
'incremental' : replacement for 'linear' pca when the dataset to be decomposed is too large to fit in memory.

pca_components: int/float, default = 0.99
Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer, it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.
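A short sketch requesting incremental PCA with 10 components:

#reduce dimensionality to 10 components with incremental PCA
exp = setup(data = juice, target = 'Purchase',
            pca = True, pca_method = 'incremental', pca_components = 10)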

ignore_low_variance: bool, default = False
When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.

combine_rare_levels: bool, default = False
When set to True, all levels in categorical features below the threshold defined in rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.

rare_level_threshold: float, default = 0.1
Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.

bin_numeric_features: list, default = None
When a list of numeric features is passed, they are transformed into categorical features using K-Means, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the 'sturges' method. It is only optimal for Gaussian data and underestimates the number of bins for large non-Gaussian datasets.
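As a sketch, assuming 'WeekofPurchase' is a numeric column in the juice dataset:

#bin a numeric feature into discrete levels using K-Means
exp = setup(data = juice, target = 'Purchase',
            bin_numeric_features = ['WeekofPurchase'])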

remove_outliers: bool, default = False
When set to True, outliers are removed from the training data using PCA linear dimensionality reduction with the Singular Value Decomposition technique.

outliers_threshold: float, default = 0.05
The percentage / proportion of outliers in the dataset can be defined using the outliers_threshold param. By default, 0.05 is used which means 0.025 of the values on each side of the distribution’s tail are dropped from training data.

remove_multicollinearity: bool, default = False
When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.

multicollinearity_threshold: float, default = 0.9
Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.

remove_perfect_collinearity: bool, default = False
When set to True, perfect collinearity (features with correlation = 1) is removed from the dataset. When two features are 100% correlated, one of them is randomly dropped from the dataset.

create_clusters: bool, default = False
When set to True, an additional feature is created where each instance is assigned to a cluster. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.

cluster_iter: int, default = 20
Number of iterations used to create a cluster. Each iteration represents cluster size. Only comes into effect when create_clusters param is set to True.

polynomial_features: bool, default = False
When set to True, new features are created based on all polynomial combinations that exist within the numeric features in a dataset to the degree defined in
polynomial_degree param.

polynomial_degree: int, default = 2
Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].

trigonometry_features: bool, default = False
When set to True, new features are created based on all trigonometric combinations that exist within the numeric features in a dataset to the degree defined in the polynomial_degree param.

polynomial_threshold: float, default = 0.1
This is used to compress a sparse matrix of polynomial and trigonometric features. Polynomial and trigonometric features whose feature importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.

group_features: list or list of list, default = None
When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (e.g. 'Col1', 'Col2', 'Col3'), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.

group_names: list, default = None
When group_features is passed, a name for the group can be passed into the group_names param as a list containing strings. The length of the group_names list must equal the length of group_features. When the length doesn't match or the name is not passed, new features are sequentially named such as group_1, group_2 etc.
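A short sketch, assuming 'PriceCH' and 'PriceMM' are related numeric columns in the juice dataset:

#extract group statistics (mean, median, etc.) from related columns
exp = setup(data = juice, target = 'Purchase',
            group_features = ['PriceCH', 'PriceMM'],
            group_names = ['price_group'])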

feature_selection: bool, default = False
When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, Adaboost and Linear correlation with the target variable. The size of the subset is dependent on the feature_selection_threshold param. Generally, this is used to constrain the feature space in order to improve efficiency in modeling. When polynomial_features and feature_interaction are used, it is highly recommended to define the feature_selection_threshold param with a lower value.

feature_selection_threshold: float, default = 0.8
Threshold used for feature selection (including newly created polynomial features). A higher value will result in a larger feature space. It is recommended to do multiple trials with different values of feature_selection_threshold, especially in cases where polynomial_features and feature_interaction are used. Setting a very low value may be efficient but could result in under-fitting.

feature_interaction: bool, default = False
When set to True, it will create new features by interacting (a * b) for all numeric variables in the dataset including polynomial and trigonometric features (if created). This feature is not scalable and may not work as expected on datasets with large feature space.

feature_ratio: bool, default = False
When set to True, it will create new features by calculating the ratios (a / b) of all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.

interaction_threshold: float, default = 0.01
Similar to polynomial_threshold, it is used to compress a sparse matrix of newly created features through interaction. Features whose importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.

fix_imbalance: bool, default = False
When the dataset has an unequal distribution of the target class, it can be fixed using the fix_imbalance parameter. When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is applied by default to create synthetic datapoints for the minority class.

fix_imbalance_method: obj, default = None
When fix_imbalance is set to True and fix_imbalance_method is None, 'smote' is applied by default to oversample the minority class during cross validation. This parameter accepts any module from 'imblearn' that supports the 'fit_resample' method.
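As a sketch, any imblearn resampler that exposes fit_resample can be passed; here RandomOverSampler stands in for the default SMOTE:

#use a custom resampler from imblearn instead of the default SMOTE
from imblearn.over_sampling import RandomOverSampler
exp = setup(data = juice, target = 'Purchase',
            fix_imbalance = True,
            fix_imbalance_method = RandomOverSampler())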

data_split_shuffle: bool, default = True
If set to False, prevents shuffling of rows when splitting data.

folds_shuffle: bool, default = False
If set to False, prevents shuffling of rows when using cross validation.

n_jobs: int, default = -1
The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor set n_jobs to None.

html: bool, default = True
If set to False, prevents the runtime display of the monitor. This must be set to False when using an environment that does not support HTML.

session_id: int, default = None
If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
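For example, fixing the seed makes the experiment reproducible across runs:

#fix the seed so the entire experiment can be reproduced later
exp = setup(data = juice, target = 'Purchase', session_id = 123)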

log_experiment: bool, default = False
When set to True, all metrics and parameters are logged on the MLflow server.

experiment_name: str, default = None
Name of experiment for logging. When set to None, ‘clf’ is by default used as alias for the experiment name.

log_plots: bool, default = False
When set to True, specific plots are logged in MLflow as a png file. By default, it is set to False.

log_profile: bool, default = False
When set to True, data profile is also logged on MLflow as a html file. By default, it is set to False.

log_data: bool, default = False
When set to True, train and test dataset are logged as csv.

silent: bool, default = False
When set to True, confirmation of data types is not required. All preprocessing will be performed assuming automatically inferred data types. Not recommended for direct use except for established pipelines.

verbose: Boolean, default = True
Information grid is not printed when verbose is set to False.

profile: bool, default = False
If set to true, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.

Returns:

Information Grid: Information grid is printed.

Environment: This function returns various outputs that are stored in a variable as a tuple. They are used by other functions in pycaret.

Compare Models


 

compare_models(blacklist = None, whitelist = None, fold = 10, round = 4, sort = 'Accuracy', n_select = 1, turbo = True, verbose = True)

Description:

This function trains all the models available in the model library and scores them using Stratified Cross Validation. The output prints a score grid with Accuracy, AUC, Recall, Precision, F1, Kappa and MCC (averaged across folds), determined by the fold parameter.
 
This function returns the best model based on metric defined in sort parameter.
 
To select the top N models, use the n_select parameter, which is set to 1 by default. When the n_select parameter > 1, it will return a list of trained model objects.
Code
# return best model
best = compare_models()

# return best model based on Recall
best = compare_models(sort = 'Recall') #default is 'Accuracy'

# compare specific models
best_specific = compare_models(whitelist = ['dt','rf','xgboost'])

# blacklist certain models
best_specific = compare_models(blacklist = ['catboost','svm'])

# return top 3 models based on Accuracy
top3 = compare_models(n_select = 3)

 

Sample Output

When turbo is set to True, (‘rbfsvm’, ‘gpc’ and ‘mlp’) are excluded due to longer training times. By default turbo param is set to True. Specific models can also be blacklisted using ‘blacklist’ parameter within compare_models().
Parameters:

blacklist: list of strings, default = None
In order to omit certain models from the comparison, model IDs can be passed as a list of strings in the blacklist param.

whitelist: list of strings, default = None
In order to run only certain models for the comparison, the model IDs can be passed as a list of strings in the whitelist param.

fold: integer, default = 10
Number of folds to be used in Kfold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

sort: string, default = ‘Accuracy’
The scoring measure specified is used for sorting the average score grid. Other options are ‘AUC’, ‘Recall’, ‘Precision’, ‘F1’, ‘Kappa’ and ‘MCC’.

n_select: int, default = 1
Number of top_n models to return. Use a negative argument for bottom selection; for example, n_select = -3 means bottom 3 models.

turbo: Boolean, default = True
When turbo is set to True, it blacklists estimators that have longer training time.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are also returned.

Warnings:

  • compare_models(), though attractive, might be time consuming with large datasets. By default turbo is set to True, which blacklists models that have longer training times. Changing the turbo parameter to False may result in very high training times with datasets where the number of samples exceeds 10,000.
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0).
  • This function doesn’t return model object.

Create Model


 

create_model(estimator = None, ensemble = False, method = None, fold = 10, round = 4, cross_validation = True, verbose = True, system = True, **kwargs)

Description:

This function creates a model and scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold (default = 10 Fold). This function returns a trained model object.

Code
# train logistic regression model
lr = create_model('lr') #lr is the id of the model

# check the model library to see all models
models()

# train rf model using 5 fold CV
rf = create_model('rf', fold = 5)

# train svm model without CV
svm = create_model('svm', cross_validation = False)

# train xgboost model with max_depth = 10
xgboost = create_model('xgboost', max_depth = 10)

# train xgboost model on gpu
xgboost_gpu = create_model('xgboost', tree_method = 'gpu_hist', gpu_id = 0) #0 is gpu-id

# train multiple lightgbm models with n learning_rate
import numpy as np
lgbms = [create_model('lightgbm', learning_rate = i) for i in np.arange(0.1,1,0.1)]

# train custom model
from gplearn.genetic import SymbolicClassifier
symclf = SymbolicClassifier(generations = 50)
sc = create_model(symclf)

 

Sample Output

Parameters:

estimator : string / object, default = None
Enter the ID of an estimator available in the model library or pass an untrained model object consistent with the fit / predict API to train and evaluate it. All estimators support binary or multiclass problems. List of estimators in the model library:

 
ID          Name
'lr'        Logistic Regression
'knn'       K Nearest Neighbour
'nb'        Naive Bayes
'dt'        Decision Tree Classifier
'svm'       SVM - Linear Kernel
'rbfsvm'    SVM - Radial Kernel
'gpc'       Gaussian Process Classifier
'mlp'       Multi Layer Perceptron
'ridge'     Ridge Classifier
'rf'        Random Forest Classifier
'qda'       Quadratic Discriminant Analysis
'ada'       Ada Boost Classifier
'gbc'       Gradient Boosting Classifier
'lda'       Linear Discriminant Analysis
'et'        Extra Trees Classifier
'xgboost'   Extreme Gradient Boosting
'lightgbm'  Light Gradient Boosting
'catboost'  CatBoost Classifier


ensemble: Boolean, default = False
True would result in an ensemble of the estimator using the method parameter defined.

method: String, 'Bagging' or 'Boosting', default = None
method must be defined when ensemble is set to True. Default method is set to None.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

cross_validation: bool, default = True
When cross_validation is set to False, the fold parameter is ignored and the model is trained on the entire training dataset. No metric evaluation is returned.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

system: Boolean, default = True
Must remain True at all times. Only to be changed by internal functions.

**kwargs:
Additional keyword arguments to pass to the estimator.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa and MCC. Mean and standard deviation of the scores across the folds are also returned.

Model: Trained model object

Warnings:

  • 'svm' and 'ridge' do not support the predict_proba method. As such, AUC will be returned as zero (0.0)
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0)
  • 'rbfsvm' and 'gpc' use non-linear kernels, and hence the fit time complexity is more than quadratic. These estimators are hard to scale on datasets with more than 10,000 samples.

Tune Model


 

tune_model(estimator = None, fold = 10, round = 4, n_iter = 10, custom_grid = None, optimize = 'Accuracy', choose_better = False, verbose = True)

Description:

This function tunes the hyperparameters of a model and scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC by fold (by default = 10 Folds). This function returns a trained model object.

Code
# train a decision tree model with default parameters
dt = create_model('dt')

# tune hyperparameters of decision tree
tuned_dt = tune_model(dt)

# tune hyperparameters with increased n_iter
tuned_dt = tune_model(dt, n_iter = 50)

# tune hyperparameters to optimize AUC
tuned_dt = tune_model(dt, optimize = 'AUC') #default is 'Accuracy'

# tune hyperparameters with custom_grid
import numpy as np  # needed for the random grids below
params = {"max_depth": np.random.randint(1, int(len(data.columns)*.85), 20),  # 'data' is the training dataframe, e.g. juice
          "max_features": np.random.randint(1, len(data.columns), 20),
          "min_samples_leaf": [2,3,4,5,6],
          "criterion": ["gini", "entropy"]
          }

tuned_dt_custom = tune_model(dt, custom_grid = params)

# tune multiple models dynamically
top3 = compare_models(n_select = 3)
tuned_top3 = [tune_model(i) for i in top3]

 

Sample Output

Parameters:

estimator : object, default = None

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

n_iter: integer, default = 10
Number of iterations within the Random Grid Search. For every iteration, the model randomly selects one value from the pre-defined grid of hyperparameters.

custom_grid: dictionary, default = None
To use custom hyperparameters for tuning pass a dictionary with parameter name and values to be iterated. When set to None it uses pre-defined tuning grid.

optimize: string, default = 'Accuracy'
Measure used to select the best model through hyperparameter tuning. The default scoring measure is 'Accuracy'. Other measures include 'AUC', 'Recall', 'Precision', 'F1'.

choose_better: Boolean, default = False
When set to True, the base estimator is returned when the performance doesn't improve by tune_model. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC. Mean and standard deviation of the scores across the folds are also returned.

Model: Trained and tuned model object

Warnings:

  • The estimator parameter accepts a trained model object, as shown in the examples above. The tune_model() function internally calls create_model() before tuning the hyperparameters.
  • If target variable is multiclass (more than 2 classes), optimize param ‘AUC’ is not acceptable.
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0)

Ensemble Model


 

ensemble_model(estimator, method = 'Bagging', fold = 10, n_estimators = 10, round = 4, choose_better = False, optimize = 'Accuracy', verbose = True)

Description:

This function ensembles the trained base estimator using the method defined in ‘method’ param (default = ‘Bagging’). The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 Fold). Model must be created using create_model() or tune_model(). This function returns a trained model object.

Code
# create a decision tree model
dt = create_model('dt') 

# ensemble trained decision tree model 
ensembled_dt = ensemble_model(dt)

 

Output

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=786,
                                                        splitter='best'),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=786, verbose=0,
                  warm_start=False)
This will return an ensembled Decision Tree model using 'Bagging'.
Parameters:

estimator : object, default = None

method: String, default = 'Bagging'
The Bagging method will create an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset. The other available method is 'Boosting', which creates a meta-estimator by fitting a classifier on the original dataset and then fitting additional copies of the classifier on the same dataset, where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

n_estimators: integer, default = 10
The number of base estimators in the ensemble. In case of perfect fit, the learning procedure is stopped early.
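For example, a sketch that boosts the decision tree from the code above with more estimators:

#boost the trained decision tree with 50 estimators
boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators = 50)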

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

choose_better: Boolean, default = False
When set to True, the base estimator is returned when the metric doesn't improve by ensemble_model. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.

optimize: string, default = 'Accuracy'
Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC'.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC. Mean and standard deviation of the scores across the folds are also returned.

Model: Trained ensembled model object

Warnings:

  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0).

Blend Models


 

blend_models(estimator_list = 'All', fold = 10, round = 4, choose_better = False, optimize = 'Accuracy', method = 'hard', turbo = True, verbose = True)

Description:

This function creates a Soft Voting / Majority Rule classifier for all the estimators in the model library (excluding the few when turbo is True) or for specific trained estimators passed as a list in estimator_list param. It scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default CV = 10 Folds). This function returns a trained model object.

Code
# train a votingclassifier on all models in library
blender = blend_models()

# train a voting classifier on specific models
dt = create_model('dt')
rf = create_model('rf')
adaboost = create_model('ada')
blender_specific = blend_models(estimator_list = [dt,rf,adaboost], method = 'soft')

# train a voting classifier dynamically
blender_top5 = blend_models(compare_models(n_select = 5))

 

Sample Output

Parameters:

estimator_list : string (‘All’) or list of object, default = ‘All’

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

choose_better: Boolean, default = False
When set to True, the base estimator is returned when the metric doesn't improve by blend_models. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.

optimize: string, default = 'Accuracy'
Only used when choose_better is set to True. The optimize parameter is used to compare the blended model with the base estimator. Values accepted in the optimize parameter are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC'.

method: string, default = 'hard'
'hard' uses predicted class labels for majority rule voting. 'soft' predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

turbo: Boolean, default = True
When turbo is set to True, it blacklists estimators that use a radial kernel.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC. Mean and standard deviation of the scores across the folds are also returned.

Model: Trained Voting Classifier model object.

Warnings:

  • When passing an estimator_list with method set to 'soft', all the models in the estimator_list must support the predict_proba function. 'svm' and 'ridge' do not support predict_proba, and hence an exception will be raised.
  • When estimator_list is set to 'All' and method is forced to 'soft', estimators that do not support the predict_proba function will be dropped from the estimator list.
  • CatBoost Classifier not supported in blend_models().
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0).

Stack Models


 

stack_models(estimator_list, meta_model = None, fold = 10, round = 4, method = 'soft', restack = True, plot = False, choose_better = False, optimize = 'Accuracy', finalize = False, verbose = True)

Description:

This function creates a meta model and scores it using Stratified Cross Validation. The predictions from the base level models, as passed in the estimator_list param, are used as input features for the meta model. The restack parameter controls the ability to expose raw features to the meta model when set to True (default = True). The output prints the score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 Folds). This function returns a container which is the list of all models in stacking.

 

WARNING : This function will adopt StackingClassifier() from sklearn in a future release of PyCaret 2.x.

Code
# create models for stacking
dt = create_model('dt')
rf = create_model('rf')
ada = create_model('ada')
ridge = create_model('ridge')
knn = create_model('knn')

# stack trained models
stacked_models = stack_models(estimator_list=[dt,rf,ada,ridge,knn])

 

Output

This will create a meta model that will use the predictions of all the models provided in estimator_list param. By default, the meta model is Logistic Regression but can be changed with meta_model param.
Parameters:

estimator_list : list of objects

meta_model : object, default = None
If set to None, Logistic Regression is used as a meta model.
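A sketch passing a custom meta model (assuming, as with the base estimators here, that a trained model object is accepted):

#use xgboost as the meta model instead of Logistic Regression
xgb = create_model('xgboost')
stacked_custom = stack_models(estimator_list = [dt, rf, ada], meta_model = xgb)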

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

method: string, default = ‘soft’
‘soft’, uses predicted probabilities as an input to the meta model.
‘hard’, uses predicted class labels as an input to the meta model.

restack: Boolean, default = True
When restack is set to True, raw data will be exposed to meta model when
making predictions, otherwise when False, only the predicted label or
probabilities is passed to meta model when making final predictions.

plot: Boolean, default = False
When plot is set to True, it will return the correlation plot of prediction
from all base models provided in estimator_list.

choose_better: Boolean, default = False
When set to True, the base estimator is returned when the metric doesn't improve by stack_models. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.

optimize: string, default = 'Accuracy'
Only used when choose_better is set to True. The optimize parameter is used to compare the stacked model with the base estimator. Values accepted in the optimize parameter are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC'.

finalize: Boolean, default = False
When finalize is set to True, it will fit the stacker on entire dataset
including the hold-out sample created during the setup() stage. It is not
recommended to set this to True here, If you would like to fit the stacker
on the entire dataset including the hold-out, use finalize_model().

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC. Mean and standard deviation of the scores across the folds are also returned.

Container: list of all the models where last element is meta model.

Warnings:

  • When the method is forced to be ‘soft’ and estimator_list param includes estimators that do not support the predict_proba method such as ‘svm’ or ‘ridge’, predicted values for those specific estimators only are used instead of probability when building the meta_model. The same rule applies when the stacker is used under predict_model() function.
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0).
  • Method 'soft' is not supported when the target is multiclass.

Create Stacknet


 

create_stacknet(estimator_list, meta_model = None, fold = 10, round = 4, method = 'soft', restack = True, choose_better = False, optimize = 'Accuracy', finalize = False, verbose = True)

Description:

This function creates a sequential stack net using cross validated predictions at each layer. The final score grid contains predictions from the meta model using Stratified Cross Validation. Base level models can be passed via the estimator_list param; the layers can be organized as sub-lists within the estimator_list object. The restack param controls the ability to expose raw features to the meta model. This function returns a container which is the list of all models in stacking.

 
 
WARNING : This function will be deprecated in future release of PyCaret 2.x.
Code
#create models for stacknet
dt = create_model('dt')
rf = create_model('rf')
ada = create_model('ada')
ridge = create_model('ridge')
knn = create_model('knn')

#create stacknet
stacknet = create_stacknet(estimator_list =[[dt,rf],[ada,ridge,knn]])

 

Output

This will result in the stacking of models in multiple layers. The first layer contains dt and rf, the predictions of which are used by models in the second  layer to generate predictions which are then used by the meta model to generate final predictions. By default, the meta model is Logistic Regression but can be changed with meta_model param.
Parameters:

estimator_list : nested list of objects

meta_model : object, default = None
If set to None, Logistic Regression is used as a meta model.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

method: string, default = ‘soft’
‘soft’, uses predicted probabilities as an input to the meta model.
‘hard’, uses predicted class labels as an input to the meta model.

restack: Boolean, default = True
When restack is set to True, raw data and prediction of all layers will be exposed to the meta model when making predictions. When set to False, only the predicted label or probabilities of last layer is passed to meta model when making final predictions.

choose_better: Boolean, default = False
When set to True, the base estimator is returned when the metric doesn't improve by create_stacknet. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.

optimize: string, default = 'Accuracy'
Only used when choose_better is set to True. The optimize parameter is used to compare the stacked model with the base estimator. Values accepted in the optimize parameter are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa' and 'MCC'.

finalize: Boolean, default = False
When finalize is set to True, it will fit the stacker on entire dataset including the hold-out sample created during the setup() stage. It is not recommended to set this to True here, if you would like to fit the stacker on the entire dataset including the hold-out, use finalize_model().

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC. Mean and standard deviation of the scores across the folds are also returned.

Container: list of all the models where last element is meta model.

Warnings:

  • When the method is forced to be ‘soft’ and estimator_list param includes estimators that do not support the predict_proba method such as ‘svm’ or ‘ridge’, predicted values for those specific estimators only are used instead of probability when building the meta_model. The same rule applies when the stacker is used under predict_model() function.
  • If target variable is multiclass (more than 2 classes), AUC will be returned as zero (0.0).
  • Method 'soft' is not supported when the target is multiclass.

Plot Model


 

plot_model(estimator = None, plot = 'auc', save = False, verbose = True, system = True)

Description:

This function takes a trained model object and returns a plot based on the test / hold-out set. The process may require the model to be re-trained in certain cases. See list of plots supported below. Model must be created using create_model() or tune_model().

Code
#create a model
lr = create_model('lr')

#plot a model
plot_model(lr)

 

Output
This will return an AUC plot of a trained Logistic Regression model.
Parameters:

estimator : object, default = None
A trained model object should be passed as an estimator.

plot : string, default = 'auc'
Enter the abbreviation for the type of plot. The current list of plots supported is:

Plot                 Name
'auc'                Area Under the Curve
'threshold'          Discrimination Threshold
'pr'                 Precision Recall Curve
'confusion_matrix'   Confusion Matrix
'error'              Class Prediction Error
'class_report'       Classification Report
'boundary'           Decision Boundary
'rfe'                Recursive Feature Selection
'learning'           Learning Curve
'manifold'           Manifold Learning
'calibration'        Calibration Curve
'vc'                 Validation Curve
'dimension'          Dimension Learning
'feature'            Feature Importance
'parameter'          Model Hyperparameter

 

save: Boolean, default = False
When set to True, Plot is saved as a ‘png’ file in current working directory.
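For example, a confusion matrix can be rendered and saved in one call:

#plot the confusion matrix and save it as a png file
plot_model(lr, plot = 'confusion_matrix', save = True)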

verbose: Boolean, default = True
Progress bar not shown when verbose set to False.

system: Boolean, default = True
Must remain True at all times. Only to be changed by internal functions.

Returns:

Visual Plot: Prints the visual plot.

Warnings:

  • 'svm' and 'ridge' do not support the predict_proba method. As such, AUC and calibration plots are not available for these estimators.
  • When the 'max_features' parameter of a trained model object is not equal to the number of samples in the training set, the 'rfe' plot is not available.
  • 'calibration', 'threshold', 'manifold' and 'rfe' plots are not available for multiclass problems.

Evaluate Model


 

evaluate_model(estimator)
Description:

This function displays a user interface for all of the available plots for a given estimator. It internally uses the plot_model() function.

Code
#create a model
lr = create_model('lr')

#evaluate a model
evaluate_model(lr)

 

Output
Parameters:
estimator : object, default = None
A trained model object should be passed as an estimator.
Returns:

User Interface : Displays the user interface for plotting.

Interpret Model


 

interpret_model(estimator, plot = 'summary', feature = None, observation = None)
Description:

This function takes a trained model object and returns an interpretation plot based on the test / hold-out set. It only supports tree based algorithms. This function is implemented based on the SHAP (SHapley Additive exPlanations), which is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations.

For more information : https://shap.readthedocs.io/en/latest/

Code
#create a model
dt = create_model('dt')

#interpret a model
interpret_model(dt)

 

Output
Parameters:
estimator : object, default = None
A trained tree based model object should be passed as an estimator.

plot : string, default = 'summary'
Other available options are 'correlation' and 'reason'.

feature: string, default = None
This parameter is only needed when plot = ‘correlation’. By default feature is
set to None which means the first column of the dataset will be used as a variable.
A feature parameter must be passed to change this.

observation: integer, default = None
This parameter only comes into effect when plot is set to ‘reason’. If no observation
number is provided, it will return an analysis of all observations with the option
to select the feature on x and y axes through drop down interactivity. For analysis at
the sample level, an observation parameter must be passed with the index value of the
observation in test / hold-out set.
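A short sketch of sample-level analysis using the 'reason' plot (observation 1 is an arbitrary example index):

#explain a single prediction from the test / hold-out set
interpret_model(dt, plot = 'reason', observation = 1)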

Returns:

Visual Plot: Returns the visual plot. Returns the interactive JS plot when plot = ‘reason’.

Warnings:

  • interpret_model doesn’t support multiclass problems.

Calibrate Model


 

calibrate_model(estimator, method = 'sigmoid', fold = 10, round = 4, verbose = True)
Description:

This function takes a trained estimator and performs probability calibration with sigmoid or isotonic regression. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 Fold). The output of the original estimator and the calibrated estimator (created using this function) might not differ much. To see the calibration differences, use the 'calibration' plot in plot_model to compare before and after.

Code
#create a boosting model
dt_boosted = create_model('dt', ensemble = True, method = 'Boosting')

#calibrate trained boosted dt
calibrated_dt = calibrate_model(dt_boosted)

 

Output

Parameters:
estimator : object
A trained model object should be passed as an estimator.

method : string, default = 'sigmoid'
The method to use for calibration. 'sigmoid' corresponds to Platt's method; the other available option is 'isotonic', which is a non-parametric approach.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

round: integer, default = 4
Number of decimal places the metrics in the score grid will be rounded to.

verbose: Boolean, default = True
Score grid is not printed when verbose is set to False.

Returns:

Score Grid: A table containing the scores of the model across the k-folds. Scoring metrics used are Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC. Mean and standard deviation of the scores across the folds are also returned.

Model: Trained and calibrated model object.

Warnings:

  • Avoid isotonic calibration with too few calibration samples (< 1000) since it tends to overfit.

Optimize Threshold


 

optimize_threshold(estimator, true_positive = 0, true_negative = 0, false_positive = 0, false_negative = 0)
Description:

This function optimizes the probability threshold for a trained model using a custom cost function that can be defined as a combination of True Positives, True Negatives, False Positives (also known as Type I error), and False Negatives (Type II error). This function returns a plot of optimized cost as a function of probability threshold between 0 and 100.

Code
#create a model
lr = create_model('lr')

#optimize threshold for trained model
optimize_threshold(lr, true_negative = 10, false_negative = -100)

 

Output
This will return a plot of optimized cost as a function of probability threshold.
Parameters:
estimator : object
A trained model object should be passed as an estimator.

true_positive : int, default = 0
Cost function or returns when prediction is true positive.

true_negative : int, default = 0
Cost function or returns when prediction is true negative.

false_positive : int, default = 0
Cost function or returns when prediction is false positive.

false_negative : int, default = 0
Cost function or returns when prediction is false negative.

Returns:

Visual Plot : Prints the visual plot.

Warnings:

  • This function is not supported for multiclass problems.

Predict Model


 

predict_model(estimator, data=None, probability_threshold=None, platform=None, authentication=None, verbose=True)

Description:

This function is used to predict new data using a trained estimator. It accepts an estimator created using one of the functions in pycaret that returns a trained model object, or a list of trained model objects created using stack_models() or create_stacknet(). New unseen data can be passed to the data param as a pandas DataFrame. If data is not passed, the test / hold-out set separated at the time of setup() is used to generate predictions.

Code
#create a model
lr = create_model('lr')

#generate predictions on hold-out set using trained model
lr_predictions_holdout = predict_model(lr)

 

Output
Parameters:

estimator : object or list of objects / string, default = None
When the estimator is passed as a string, load_model() is called internally to load the pickle file from the current active directory, or from a cloud platform when the platform param is passed.

data : {array-like, sparse matrix}, shape (n_samples, n_features)
where n_samples is the number of samples and n_features is the number of features. All features used during training must be present in the new dataset.

probability_threshold : float, default = None
Threshold used to convert predicted probability values into a binary outcome. By default the probability threshold for all binary classifiers is 0.5 (50%). This can be changed using the probability_threshold param.
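For example, the decision threshold can be raised to favor precision:

#classify as positive only when predicted probability exceeds 0.75
predictions = predict_model(lr, probability_threshold = 0.75)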

platform: string, default = None
Name of platform, if loading model from cloud. Current available options are: ‘aws’.

authentication : dict
dictionary of applicable authentication tokens.

When platform = ‘aws’:
{‘bucket’ : ‘Name of Bucket on S3’}

verbose: bool, default = True 
Holdout score grid is not printed when verbose is set to False.

Returns:

Information Grid : Information grid is printed when data is None.

Warnings:

  • If the estimator passed is created using finalize_model(), then the metrics printed in the information grid may be misleading as the model is trained on the complete dataset including the test / hold-out set. Once finalize_model() is used, the model is considered ready for deployment and should be used on new unseen datasets only.

Finalize Model


 

finalize_model(estimator)
Description:

This function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare for final model deployment after experimentation.

Code
#create a model
lr = create_model('lr')

#finalize trained model
finalize_model(lr)

 

Parameters:
estimator : object, default = None
A trained model object should be passed as an estimator.
Returns:

Model : Trained model object fitted on complete dataset.

Warnings:

  • If the model returned by finalize_model() is used with predict_model() without passing a new unseen dataset, then the information grid printed is misleading as the model is trained on the complete dataset including the test / hold-out sample. Once finalize_model() is used, the model is considered ready for deployment and should be used on new unseen datasets only.

Deploy Model


 

deploy_model(model, model_name, authentication, platform = 'aws')
Description:
(In Preview)

This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

Code
#create a model
lr = create_model('lr')

#deploy trained model on cloud
deploy_model(model = lr, model_name = 'deploy_lr', platform = 'aws', authentication = {'bucket' : 'pycaret-test'})

 

Output
This will deploy the model on an AWS S3 account under the bucket 'pycaret-test'.

For AWS users:

Before deploying a model to AWS S3 ('aws'), environment variables must be configured using the command line interface. To configure AWS environment variables, type aws configure in your python command line. The following information is required, which can be generated using the Identity and Access Management (IAM) portal of your Amazon console account:

  • AWS Access Key ID
  • AWS Secret Key Access
  • Default Region Name (can be seen under Global settings on your AWS console)
  • Default output format (must be left blank)
Parameters:
model : object
A trained model object should be passed as an estimator.

model_name : string
Name of model to be passed as a string.

authentication : dict
dictionary of applicable authentication tokens.

When platform = ‘aws’:
{‘bucket’ : ‘Name of Bucket on S3’}

platform: string, default = ‘aws’
Name of platform for deployment. Current available options are: ‘aws’.

Returns:

Message : Success Message

Warnings:

  • This function uses file storage services to deploy the model on a cloud platform. As such, it is efficient for batch use. When the production objective is to obtain predictions at an instance level, this may not be the most efficient choice, as it transmits the binary pickle file between your local Python environment and the platform.
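Once deployed, the pipeline can be loaded back from the same bucket using load_model(); a minimal sketch reusing the bucket from the example above:

Code
#load the model deployed above from AWS S3
loaded_lr = load_model('deploy_lr', platform = 'aws', authentication = {'bucket' : 'pycaret-test'})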

Save Model


 

save_model(model, model_name, verbose=True)
Description:

This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

Code
#create a model
lr = create_model('lr')

#save trained model
save_model(lr, 'lr_model_23122019')

 

Output
Parameters:
model : object, default = None
A trained model object should be passed as an estimator.

model_name : string, default = None
Name of pickle file to be passed as a string.

verbose: bool, default = True
Success message is not printed when verbose is set to False.

Returns:

Message : Success Message
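A minimal sketch of the verbose param, suppressing the success message:

Code
#save trained model without printing the success message
save_model(lr, 'lr_model_23122019', verbose = False)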

Load Model


 

load_model(model_name, platform = None, authentication = None, verbose=True)
Description:

This function loads a previously saved transformation pipeline and model from the current active directory into the current Python environment. The loaded object must be a pickle file.

Code
saved_lr = load_model('lr_model_23122019')

 

Output
Parameters:
model_name : string, default = None
Name of pickle file to be passed as a string.

platform: string, default = None
Name of platform, if loading model from cloud. Currently available options are: ‘aws’.

authentication : dict
dictionary of applicable authentication tokens.

When platform = ‘aws’:
{‘bucket’ : ‘Name of Bucket on S3’}

verbose: bool, default = True
Success message is not printed when verbose is set to False.

Returns:

Model : Transformation pipeline and trained model object. A success message is also printed.
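Since the returned object includes the transformation pipeline, it can be passed directly to predict_model(); a minimal sketch (new_data is a hypothetical pandas DataFrame):

Code
#generate predictions using the loaded pipeline
predictions = predict_model(saved_lr, data = new_data)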

AutoML


 

automl(optimize = ‘Accuracy’, use_holdout = False)

Description:

This function returns the best model out of all models created in the current active environment, based on the metric defined in the optimize parameter.

Code
# selecting best model
best = automl()

 

Output
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Parameters:

optimize : string, default = ‘Accuracy’
Other values you can pass in optimize param are ‘AUC’, ‘Recall’, ‘Precision’, ‘F1’, ‘Kappa’, and ‘MCC’.


use_holdout: bool, default = False
When set to True, metrics are evaluated on the holdout set instead of cross-validation (CV).

Returns:

Model : Trained and finalized model object.
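A minimal sketch combining both params, selecting the best model by AUC evaluated on the holdout set:

Code
# select best model by AUC on the holdout set
best_auc = automl(optimize = 'AUC', use_holdout = True)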

Pull


 

pull()

Description:

Returns the last printed score grid as a pandas DataFrame.
Code
# store score grid in dataframe 
df = pull()

 

Output
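pull() captures whichever score grid was printed last; for example, after compare_models():

Code
# compare baseline models, then capture the printed comparison grid
best = compare_models()
comparison_df = pull()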

Models


 

models(type = None)

Description:

Returns a table of models available in the model library.
Code
# show all models in library 
all_models = models()

 

Output

Parameters:

type : string, default = None
– linear : filters and returns only linear models
– tree : filters and returns only tree-based models
– ensemble : filters and returns only ensemble models

Returns:

Dataframe : Pandas DataFrame with metadata for all models.
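A minimal sketch of the type param, filtering the library to tree-based models:

Code
# show only tree based models in library
tree_models = models(type = 'tree')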

Get Logs


 

get_logs(experiment_name = None, save = False)

Description:

Returns a table of experiment logs consisting of run details, parameters, metrics, and tags.
Code
# store experiment logs in pandas dataframe
logs = get_logs()

 

Output

Parameters:
experiment_name : string, default = None
When set to None, the current active run is used.
 
save : bool, default = False
When set to True, a csv file is saved in the current directory.

Returns:

Dataframe : Pandas DataFrame with logs.
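A minimal sketch of the save param, writing the logs to a csv file in the current directory:

Code
# store experiment logs and save them as csv
logs = get_logs(save = True)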

Get Config


 

get_config(variable)

Description:

This function is used to access global environment variables. The following variables can be accessed:
 
  • X: Transformed dataset (X)
  • y: Transformed dataset (y)  
  • X_train: Transformed train dataset (X)
  • X_test: Transformed test/holdout dataset (X)
  • y_train: Transformed train dataset (y)
  • y_test: Transformed test/holdout dataset (y)
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • folds_shuffle_param: shuffle parameter used in Kfolds
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • create_model_container: results grid storage container
  • master_model_container: model storage container
  • display_container: results display container
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup
  • fix_imbalance_param: fix_imbalance param set through setup
  • fix_imbalance_method_param: fix_imbalance_method param set through setup
Code
# get X_train dataframe
X_train = get_config('X_train') 

 

Output
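Any of the variables listed above can be retrieved the same way; for example, the fitted preprocessing pipeline:

Code
# get the transformation pipeline configured through setup
prep_pipe = get_config('prep_pipe')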

Set Config


 

set_config(variable, value)

Description:
This function is used to reset global environment variables. The following variables can be accessed:

 

  • X: Transformed dataset (X)
  • y: Transformed dataset (y)
  • X_train: Transformed train dataset (X)
  • X_test: Transformed test/holdout dataset (X)
  • y_train: Transformed train dataset (y)
  • y_test: Transformed test/holdout dataset (y)
  • seed: random state set through session_id
  • prep_pipe: Transformation pipeline configured through setup
  • folds_shuffle_param: shuffle parameter used in Kfolds
  • n_jobs_param: n_jobs parameter used in model training
  • html_param: html_param configured through setup
  • create_model_container: results grid storage container
  • master_model_container: model storage container
  • display_container: results display container
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup
  • fix_imbalance_param: fix_imbalance param set through setup
  • fix_imbalance_method_param: fix_imbalance_method param set through setup
Code
# change seed value in environment to '999'
set_config('seed', 999)

 

Output

No output.

Get System Logs


 

get_system_logs()

Description:

Reads and prints the ‘logs.log’ file from the current active directory.
Code
# Reading system logs in Notebook
get_system_logs()

 

Output

MLflow UI


 

mlflow ui

Description:

Run this command from the current working directory to open the MLflow server on localhost:5000.
Code
# loading dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# initializing setup
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', log_experiment = True, experiment_name = 'diabetes1')

# compare all baseline models and select top 5
top5 = compare_models(n_select = 5) 

# tune top 5 base models
tuned_top5 = [tune_model(i) for i in top5]

# ensemble top 5 tuned models
bagged_top5 = [ensemble_model(i) for i in tuned_top5]

# blend top 5 base models 
blender = blend_models(estimator_list = top5) 

# run mlflow server (notebook)
!mlflow ui

# use just 'mlflow ui' when running from the command line.

 

Output
