Setting up Environment


 

setup(data, target=None, custom_stopwords=None, session_id=None)

Description:

This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: data, which can be a dataframe {array-like, sparse matrix} or an object of type list. If a dataframe is passed, the target column containing text must be specified. When the data passed is of type list, no target parameter is required. All other parameters are optional. This module only supports the English language at this time.

Code
#import the dataset from pycaret repository
from pycaret.datasets import get_data
kiva = get_data('kiva')

#import nlp module
from pycaret.nlp import *

#initialize the setup
exp_nlp = setup(data = kiva, target = 'en')

 

Output

‘kiva‘ is a pandas DataFrame and ‘en‘ is the name of the column containing text.

Parameters:

data : dataframe or list
{array-like, sparse matrix}, shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features; or an object of type list of length n.

target: string
If data is of type DataFrame, the name of the column containing text values must be passed as a string.

custom_stopwords: list, default = None
List containing custom stopwords.

session_id: int, default = None
If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
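When the data is a plain list of documents, no target is needed. A minimal sketch combining the optional parameters (the document strings and stopwords below are illustrative, not taken from the kiva dataset):

```python
from pycaret.nlp import setup

# list input: each element is one document, so no target is required
docs = [
    'loan to expand the family grocery store',
    'funds to buy seeds and fertilizer for the farm',
    'capital to purchase a second sewing machine',
]

# custom_stopwords removes domain-specific words during preprocessing;
# session_id fixes the seed so the experiment can be reproduced later
exp_nlp = setup(data=docs,
                custom_stopwords=['loan', 'funds', 'capital'],
                session_id=123)
```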

Returns:

Information Grid: Information grid is printed.

Environment: This function returns various outputs that are stored in a variable as a tuple. They are used by other functions in pycaret.

Warnings:

Some functionalities in pycaret.nlp require you to have the English language model. The language model is not downloaded automatically when you install pycaret. You will have to download two models using your Anaconda Prompt or Python command line interface. To download the models, please type the following in your command line:

  • python -m spacy download en_core_web_sm
  • python -m textblob.download_corpora

Once downloaded, please restart your kernel and re-run the setup.

Create Model


 

create_model(model=None, multi_core=False, num_topics=None, verbose=True)

Description:

This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model(). This function returns a trained model object. 

Code
lda = create_model('lda')

 

Output
LdaModel(num_terms=2742, num_topics=4, decay=0.5, chunksize=100)

This will return a trained Latent Dirichlet Allocation model.

Parameters:

model : string, default = None
Enter abbreviated string of the model class. List of models supported:

Estimator                          Abbreviated String
Latent Dirichlet Allocation        ‘lda’
Latent Semantic Indexing           ‘lsi’
Hierarchical Dirichlet Process     ‘hdp’
Random Projections                 ‘rp’
Non-Negative Matrix Factorization  ‘nmf’

multi_core: Boolean, default = False
True would utilize all CPU cores to parallelize and speed up model training. Only available for ‘lda’. For all other models, the multi_core parameter is ignored.

num_topics: integer, default = None
Number of topics to be created. If None, the default of 4 topics is used.

verbose: Boolean, default = True
Status update is not printed when verbose is set to False.

Returns:

Model: Trained model object.
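The optional parameters can be combined in a single call. A sketch, assuming setup() has already been run as described above:

```python
# train a Non-Negative Matrix Factorization model with 6 topics,
# suppressing the status printout
nmf = create_model('nmf', num_topics=6, verbose=False)

# 'lda' is the only model that can use all CPU cores via multi_core;
# for other models the parameter is ignored
lda = create_model('lda', multi_core=True)
```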

 

Assign Model


 

assign_model(model, verbose=True)

Description:

This function assigns each data point in the dataset passed during the setup stage to one of the topics, using the trained model object passed as the model param. create_model() must be called before using assign_model(). This function returns a dataframe with topic weights, the dominant topic and the % of the dominant topic (where applicable).

Code
# create a model
lda = create_model('lda')

# label the data using trained model
lda_df = assign_model(lda)

 

Output

This will return a dataframe with inferred topics using trained Latent Dirichlet Allocation model.

Parameters:

model : trained model object
A trained model object created using create_model() must be passed.

verbose: Boolean, default = True
Status update is not printed when verbose is set to False.

Returns:

Data frame: Returns dataframe with inferred topics using trained model object.
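For intuition only, the dominant topic is simply the topic with the largest weight for each document. The sketch below mimics the shape of the returned frame with pandas; the column names and weights are illustrative assumptions, not pycaret's exact output schema:

```python
import pandas as pd

# hypothetical topic weights for three documents
weights = pd.DataFrame({
    'Topic_0': [0.70, 0.10, 0.25],
    'Topic_1': [0.20, 0.80, 0.25],
    'Topic_2': [0.10, 0.10, 0.50],
})

# dominant topic = column with the largest weight in each row
weights['Dominant_Topic'] = weights.idxmax(axis=1)

# % of the dominant topic = that largest weight
weights['Perc_Dominant_Topic'] = weights[['Topic_0', 'Topic_1', 'Topic_2']].max(axis=1)

print(weights['Dominant_Topic'].tolist())  # ['Topic_0', 'Topic_1', 'Topic_2']
```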

 

Plot Model


 

plot_model(model=None, plot='frequency', topic_num=None)

Description:

This function takes a trained model object (optional) and returns a plot based on the inferred dataset, by internally calling assign_model before generating the plot. When no model parameter is passed, the plot is generated on the entire dataset instead of at the topic level; as such, plot_model can be used with or without a model. When a trained model object is passed, plots are based on the first topic, i.e. ‘Topic 0’. This can be changed using the topic_num param.

Code
# create a model
lda = create_model('lda')

# plot a model
plot_model(lda)

 

Output

This will return a word token frequency plot of the trained Latent Dirichlet Allocation model.

Parameters:

model : object, default = None
A trained model object can be passed. Model must be created using create_model().

plot : string, default = ‘frequency’
Enter abbreviation for type of plot. The current list of plots supported are:

Name                       Abbreviated String
Word Token Frequency       ‘frequency’
Word Distribution Plot     ‘distribution’
Bigram Frequency Plot      ‘bigram’
Trigram Frequency Plot     ‘trigram’
Sentiment Polarity Plot    ‘sentiment’
Part of Speech Frequency   ‘pos’
t-SNE (3d) Dimension Plot  ‘tsne’
Topic Model (pyLDAvis)     ‘topic_model’
Topic Infer Distribution   ‘topic_distribution’
Word Cloud                 ‘wordcloud’
UMAP Dimensionality Plot   ‘umap’

topic_num : string, default = None
Topic number to be passed as a string, e.g. ‘Topic 1’. If set to None, the plot is generated on ‘Topic 0’.

Returns:

Visual Plot: Prints the visual plot.
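A sketch of both modes, assuming lda was created with create_model('lda'):

```python
# topic-level plot: bigram frequencies for the second topic
plot_model(lda, plot='bigram', topic_num='Topic 1')

# corpus-level plot: sentiment polarity over the entire dataset
# (no model passed, so the plot is not at the topic level)
plot_model(plot='sentiment')
```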

Warnings:

  • ‘pos’ and ‘umap’ plots are not available at the model level; hence the model parameter is ignored and the result is always based on the entire training corpus.
  • ‘topic_model’ plot is based on the pyLDAvis implementation; hence it is not available for model = ‘lsi’, ‘rp’ and ‘nmf’.

Evaluate Model


 

evaluate_model(model)

Description:

This function displays the user interface for all the available plots for a given model. It internally uses the plot_model() function.

Code
# create a model
lda = create_model('lda')

# evaluate a model
evaluate_model(lda)

 

Output

Parameters:

model : object, default = None
A trained model object should be passed.

Returns:

User Interface : Displays the user interface for plotting.

 

Tune Model


 

tune_model(model=None, multi_core=False, supervised_target=None, estimator=None, optimize=None, auto_fe=True, fold=10)

Description:

This function tunes the num_topics parameter of the model using a predefined grid, with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, the supervised estimator is a linear model. This function returns the tuned model object.

Code
tuned_lda = tune_model(model = 'lda', supervised_target = 'status')

 

Output
<gensim.models.ldamodel.LdaModel at 0x1ac5ccef408>
 

Parameters:

model : string, default = None
Enter abbreviated name of the model. List of available models supported:

Estimator                          Abbreviated String
Latent Dirichlet Allocation        ‘lda’
Latent Semantic Indexing           ‘lsi’
Hierarchical Dirichlet Process     ‘hdp’
Random Projections                 ‘rp’
Non-Negative Matrix Factorization  ‘nmf’

multi_core: Boolean, default = False
True would utilize all CPU cores to parallelize and speed up model training. Only available for ‘lda’. For all other models, multi_core parameter is ignored.

supervised_target: string
Name of the target column for supervised learning. If None, the model coherence value is used as the objective function.

estimator: string, default = None

Estimator                          Abbrev. String  Task
Logistic Regression                ‘lr’            Classification
K Nearest Neighbour                ‘knn’           Classification
Naive Bayes                        ‘nb’            Classification
Decision Tree                      ‘dt’            Classification
SVM (Linear)                       ‘svm’           Classification
SVM (RBF)                          ‘rbfsvm’        Classification
Gaussian Process                   ‘gpc’           Classification
Multi-Layer Perceptron             ‘mlp’           Classification
Ridge Classifier                   ‘ridge’         Classification
Random Forest                      ‘rf’            Classification
Quadratic Disc. Analysis           ‘qda’           Classification
AdaBoost                           ‘ada’           Classification
Gradient Boosting Classifier       ‘gbc’           Classification
Linear Disc. Analysis              ‘lda’           Classification
Extra Trees Classifier             ‘et’            Classification
Extreme Gradient Boosting          ‘xgboost’       Classification
Light Gradient Boosting            ‘lightgbm’      Classification
CatBoost Classifier                ‘catboost’      Classification
Linear Regression                  ‘lr’            Regression
Lasso Regression                   ‘lasso’         Regression
Ridge Regression                   ‘ridge’         Regression
Elastic Net                        ‘en’            Regression
Least Angle Regression             ‘lar’           Regression
Lasso Least Angle Regression       ‘llar’          Regression
Orthogonal Matching Pursuit        ‘omp’           Regression
Bayesian Ridge                     ‘br’            Regression
Automatic Relevance Determination  ‘ard’           Regression
Passive Aggressive Regressor       ‘par’           Regression
Random Sample Consensus            ‘ransac’        Regression
TheilSen Regressor                 ‘tr’            Regression
Huber Regressor                    ‘huber’         Regression
Kernel Ridge                       ‘kr’            Regression
Support Vector Machine             ‘svm’           Regression
K Neighbors Regressor              ‘knn’           Regression
Decision Tree                      ‘dt’            Regression
Random Forest                      ‘rf’            Regression
Extra Trees Regressor              ‘et’            Regression
AdaBoost Regressor                 ‘ada’           Regression
Gradient Boosting Regressor        ‘gbr’           Regression
Multi-Layer Perceptron             ‘mlp’           Regression
Extreme Gradient Boosting          ‘xgboost’       Regression
Light Gradient Boosting Machine    ‘lightgbm’      Regression
CatBoost Regressor                 ‘catboost’      Regression

If set to None, Linear model is used by default for both classification and regression tasks.

optimize: string, default = None

For Classification tasks:
Accuracy, AUC, Recall, Precision, F1, Kappa

For Regression tasks:
MAE, MSE, RMSE, R2, RMSLE, MAPE

If set to None, default is ‘Accuracy’ for classification and ‘R2’ for regression tasks.

auto_fe: boolean, default = True
Automatic text feature engineering. When set to True, text-based features such as polarity, subjectivity and word counts are generated for use in supervised learning. Ignored when supervised_target is set to None.

fold: integer, default = 10
Number of folds to be used in K-fold CV. Must be at least 2.

Returns:

Visual Plot: Visual plot with the number of topics (k) on the x-axis and the metric to optimize on the y-axis. Coherence is used when learning is unsupervised. Also prints the best model metric.

Model: Trained model object with the best number of topics (k).
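A sketch of both tuning modes, assuming the kiva dataset from setup() (which contains a ‘status’ column, as in the example above):

```python
# unsupervised tuning: the number of topics is chosen by topic coherence
tuned_unsupervised = tune_model(model='lda')

# supervised tuning: optimize the AUC of a Random Forest classifier
# trained on the 'status' column, using 5-fold CV
tuned_lda = tune_model(model='lda',
                       supervised_target='status',
                       estimator='rf',
                       optimize='AUC',
                       fold=5)
```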

Warnings:

  • Random Projections (‘rp’) and Non-Negative Matrix Factorization (‘nmf’) are not available for unsupervised learning. An error is raised when ‘rp’ or ‘nmf’ is passed without supervised_target.
  • Estimators using kernel-based methods such as Kernel Ridge Regressor, Automatic Relevance Determination, Gaussian Process Classifier, Radial Basis Support Vector Machine and Multi-Layer Perceptron may have longer training times.

Save Model


 

save_model(model, model_name, verbose=True)

Description:

This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

Code
# create a model
lda = create_model('lda')

# save a model
save_model(lda, 'lda_model_23122019')

 

Output

Parameters:

model : object, default = None
A trained model object should be passed as an estimator.

model_name : string, default = None
Name of the pickle file to be passed as a string.

verbose: Boolean, default = True
Success message is not printed when verbose is set to False.

Returns:

Message : Success Message

 

Load Model


 

load_model(model_name, platform = None, authentication = None, verbose=True)

Description:

This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

Code
saved_lda = load_model('lda_model_23122019')

 

Output

Parameters:

model_name : string, default = None
Name of the pickle file to be passed as a string.

platform: string, default = None
Name of platform, if loading model from cloud. Current available options are: ‘aws’.

authentication : dict
dictionary of applicable authentication tokens.

When platform = ‘aws’:
{‘bucket’ : ‘Name of Bucket on S3’}

verbose: Boolean, default = True
Success message is not printed when verbose is set to False.

Returns:

Message : Success Message
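save_model() and load_model() store and restore the trained object as an ordinary pickle file. For intuition, a minimal round-trip sketch using only the standard library (the model object here is a stand-in dictionary, not a real trained model):

```python
import pickle

# stand-in for a trained model object (illustrative only)
model = {'model_type': 'lda', 'num_topics': 4}

# save: equivalent in spirit to save_model(model, 'lda_model_23122019')
with open('lda_model_23122019.pkl', 'wb') as f:
    pickle.dump(model, f)

# load: equivalent in spirit to load_model('lda_model_23122019')
with open('lda_model_23122019.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored['num_topics'])  # 4
```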

 

Save Experiment


 

save_experiment(experiment_name=None)

Description:

This function saves the entire experiment into the current active directory. All outputs using pycaret are internally saved into a binary list which is pickled when save_experiment() is used.

Code
save_experiment('experiment_23122019')

 

Output

Parameters:

experiment_name : string, default = None
Name of the pickle file to be passed as a string.

Returns:

Message : Success Message

 

Load Experiment


 

load_experiment(experiment_name)

Description:

This function loads a previously saved experiment from the current active directory into current python environment. Load object must be a pickle file.

Code
saved_experiment = load_experiment('experiment_23122019')

 

Output


This will load the entire experiment pipeline into the object saved_experiment. The experiment file must be in current directory.

Parameters:

experiment_name : string
Name of the pickle file to be passed as a string.

Returns:

Information Grid : Information Grid containing details of saved objects in experiment pipeline.

 
