Setting up Environment


 

setup(data, target=None, custom_stopwords=None, html=True, session_id=None, log_experiment=False, experiment_name=None, log_plots=False, log_data=False, verbose=True)

Description:

This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: data, which can be a dataframe {array-like, sparse matrix} or an object of type list. If a dataframe is passed, the target column containing text must be specified. When the data passed is of type list, no target parameter is required. All other parameters are optional. This module only supports the English language at this time.

Code
#import the dataset from pycaret repository
from pycaret.datasets import get_data
kiva = get_data('kiva')

#import nlp module
from pycaret.nlp import *

# initialize the setup
exp_nlp = setup(data = kiva, target = 'en')

 

Output

'kiva' is a pandas DataFrame and 'en' is the name of the column containing text.
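Since a list input needs no target while a dataframe does, the input rule can be sketched in plain Python (an illustrative stand-in, not PyCaret internals):

```python
# illustrative sketch of setup()'s input rule (not PyCaret internals):
# a dataframe needs a target column name, a plain list does not
def extract_documents(data, target=None):
    if isinstance(data, list):
        return data                    # list input: each element is a document
    if target is None:
        raise ValueError("target must be specified for a dataframe")
    return list(data[target])          # dataframe input: pull the text column

# a dict of columns stands in for a pandas DataFrame here
docs = extract_documents({'en': ['loan for farm', 'loan for shop']}, target='en')
assert docs == ['loan for farm', 'loan for shop']
```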

Parameters:

data : dataframe or list
{array-like, sparse matrix}, shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features or object of type list with n length.

target: string
If data is of type DataFrame, name of column containing text values must be passed as string.

custom_stopwords: list, default = None
list containing custom stopwords.

html: bool, default = True
If set to False, prevents runtime display of the monitor. This must be set to False when using an environment that doesn't support HTML.

session_id: int, default = None
If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.

log_experiment: bool, default = False
When set to True, all metrics and parameters are logged on MLFlow server.

experiment_name: str, default = None
Name of experiment for logging. When set to None, ‘nlp’ is by default used as alias for the experiment name.

log_plots: bool, default = False
When set to True, specific plots are logged in MLflow as a png file.

log_data: bool, default = False
When set to True, train and test dataset are logged as csv.

verbose: Boolean, default = True
Information grid is not printed when verbose is set to False.
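The role of session_id above — one seed generated (or supplied) up front and reused in every downstream step — can be sketched with the standard library (illustrative only, not PyCaret internals):

```python
import random

# illustrative sketch of session_id handling (not PyCaret internals)
def init_environment(session_id=None):
    # when no session_id is given, a random seed is generated (as setup() does)
    seed = session_id if session_id is not None else random.randint(150, 9000)
    rng = random.Random(seed)          # this seed drives all downstream steps
    return seed, [rng.random() for _ in range(3)]

# supplying the same session_id reproduces the run exactly
seed_a, run_a = init_environment(session_id=123)
seed_b, run_b = init_environment(session_id=123)
assert run_a == run_b
```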

Returns:

Information Grid: Information grid is printed.

Environment: This function returns various outputs that are stored in variable as tuple. They are used by other functions in pycaret.

Warnings:

Some functionalities in pycaret.nlp require you to have the English language model. The language model is not downloaded automatically when you install pycaret. You will have to download two models using your Anaconda Prompt or python command line interface. To download the models, please type the following in your command line:

  • python -m spacy download en_core_web_sm
  • python -m textblob.download_corpora

Once downloaded, please restart your kernel and re-run the setup.

Create Model


 

create_model(model=None, multi_core=False, num_topics = None, verbose=True, system = True, **kwargs)

Description:

This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model(). This function returns a trained model object. 

Code

# create an lda model
lda = create_model('lda')

 

Output
LdaModel(num_terms=2742, num_topics=4, decay=0.5, chunksize=100)

This will return a trained Latent Dirichlet Allocation model.

Parameters:

model : string, default = None
Enter abbreviated string of the model class. List of models supported:

ID Model
‘lda’ Latent Dirichlet Allocation
‘lsi’ Latent Semantic Indexing
‘hdp’ Hierarchical Dirichlet Process
‘rp’ Random Projections
‘nmf’ Non-Negative Matrix Factorization

multi_core: Boolean, default = False
True would utilize all CPU cores to parallelize and speed up model training. Only available for ‘lda’. For all other models, the multi_core parameter is ignored.

num_topics: integer, default = 4
Number of topics to be created. If None, default is set to 4.

verbose: Boolean, default = True
Status update is not printed when verbose is set to False.

system: Boolean, default = True
Must remain True at all times. Only to be changed by internal functions.

**kwargs:
Additional keyword arguments to pass to the estimator.

Returns:

Model: Trained model object.
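How num_topics and **kwargs reach the estimator can be sketched as follows (an illustrative stand-in; the real estimator arguments, such as decay, belong to gensim):

```python
# illustrative: num_topics defaults to 4, and extra keyword
# arguments are forwarded untouched to the underlying estimator
def create_model_sketch(model, num_topics=None, **kwargs):
    num_topics = 4 if num_topics is None else num_topics
    # a plain dict stands in for the estimator's configuration
    return {'model': model, 'num_topics': num_topics, **kwargs}

cfg = create_model_sketch('lda', num_topics=6, decay=0.7)
assert cfg == {'model': 'lda', 'num_topics': 6, 'decay': 0.7}
```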

 

Assign Model


 

assign_model(model, verbose=True)

Description:

This function assigns each data point in the dataset passed during the setup stage to one of the topics, using the trained model object passed as the model param. create_model() must be called before using assign_model(). This function returns a dataframe with topic weights, dominant topic and % of the dominant topic (where applicable).

Code
# create a model
lda = create_model('lda')

# label the data using trained model
lda_df = assign_model(lda)

 

Output

This will return a dataframe with inferred topics using trained Latent Dirichlet Allocation model.

Parameters:

model : trained model object, default = None

verbose: Boolean, default = True
Status update is not printed when verbose is set to False.

Returns:

Data frame: Returns dataframe with inferred topics using trained model object.
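The dominant-topic columns in the returned dataframe can be sketched as follows (an illustrative stand-in for the derivation, not PyCaret internals):

```python
# illustrative: given one document's topic weights, derive the
# dominant topic and its percentage, as in assign_model()'s output
def dominant_topic(weights):
    idx = max(range(len(weights)), key=weights.__getitem__)
    perc = round(100 * weights[idx] / sum(weights), 2)
    return 'Topic %d' % idx, perc

topic, perc = dominant_topic([0.1, 0.6, 0.2, 0.1])
assert topic == 'Topic 1' and perc == 60.0
```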

 

Plot Model


 

plot_model(model = None, plot = 'frequency', topic_num = None, save = False, system = True)

Description:

This function takes a trained model object (optional) and returns a plot based on the inferred dataset, by internally calling assign_model before generating the plot. When no model parameter is passed, the plot is generated on the entire dataset instead of at the topic level; as such, plot_model can be used with or without a model. When a trained model object is passed, plots are by default based on the first topic, i.e. 'Topic 0'. This can be changed using the topic_num param.

Code
# create a model
lda = create_model('lda')

# plot a model
plot_model(lda)

 

Output

This will return a word token frequency plot of the entire corpus (the default 'frequency' plot).

Parameters:

model : object, default = None
A trained model object can be passed. Model must be created using create_model().

plot : string, default = 'frequency'
Enter abbreviation for type of plot. The current list of plots supported are:

Plot Name
‘frequency’ Word Token Frequency
‘distribution’ Word Distribution Plot
‘bigram’ Bigram Frequency Plot
‘trigram’ Trigram Frequency Plot
‘sentiment’ Sentiment Polarity Plot
‘pos’ Part of Speech Frequency
‘tsne’ t-SNE (3d) Dimension Plot
‘topic_model’ Topic Model (pyLDAvis)
‘topic_distribution’ Topic Infer Distribution
‘wordcloud’ Word cloud
‘umap’ UMAP Dimensionality Plot

topic_num : string, default = None
Topic number to be passed as a string. If set to None, the plot is generated on 'Topic 0'.

save: Boolean, default = False
Plot is saved as png file in local directory when save parameter set to True.

system: Boolean, default = True
Must remain True at all times. Only to be changed by internal functions.

Returns:

Visual Plot: Prints the visual plot.

Warnings:

  • ‘pos’ and ‘umap’ plots are not available at the model level; the model parameter is ignored and the result is always based on the entire training corpus.
  • ‘topic_model’ plot is based on the pyLDAvis implementation, hence it is not available for model = ‘lsi’, ‘rp’ and ‘nmf’.

Evaluate Model


 

evaluate_model(model)

Description:

This function displays the user interface for all the available plots for a given model. It internally uses the plot_model() function.

Code
# create a model
lda = create_model('lda')

# evaluate a model
evaluate_model(lda)

 

Output

Parameters:

model : object, default = None
A trained model object should be passed.

Returns:

User Interface : Displays the user interface for plotting.

 

Tune Model


 

tune_model(model=None, multi_core=False, supervised_target=None, estimator=None, optimize=None, custom_grid = None, auto_fe = True, fold=10, verbose = True)

Description:

This function tunes the num_topics parameter of the model using a predefined grid, with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, the supervised estimator is Linear. This function returns the tuned model object.

Code
tuned_lda = tune_model(model = 'lda', supervised_target = 'status')

 

Output
<gensim.models.ldamodel.LdaModel at 0x1ac5ccef408>
 

Parameters:

model : string, default = None
Enter the ID of a model available in the model library:

ID Name
‘lda’ Latent Dirichlet Allocation
‘lsi’ Latent Semantic Indexing
‘hdp’ Hierarchical Dirichlet Process
‘rp’ Random Projections
‘nmf’ Non-Negative Matrix Factorization

multi_core: Boolean, default = False
True would utilize all CPU cores to parallelize and speed up model training. Only available for ‘lda’. For all other models, multi_core parameter is ignored.

supervised_target: string
Name of the target column for supervised learning. If None, the model coherence value is used as the objective function.

estimator: string, default = None

ID Estimator Task
‘lr’ Logistic Regression Classification
‘knn’ K Nearest Neighbour Classification
‘nb’ Naive Bayes Classification
‘dt’ Decision Tree Classification
‘svm’ SVM (Linear) Classification
‘rbfsvm’ SVM (RBF) Classification
‘gpc’ Gaussian Process Classification
‘mlp’ Multi Level Perceptron Classification
‘ridge’ Ridge Classifier Classification
‘rf’ Random Forest Classification
‘qda’ Quadratic Disc. Analysis Classification
‘ada’ AdaBoost Classification
‘gbc’ Gradient Boosting Classifier Classification
‘lda’ Linear Disc. Analysis Classification
‘et’ Extra Trees Classifier Classification
‘xgboost’ Extreme Gradient Boosting Classification
‘lightgbm’ Light Gradient Boosting Classification
‘catboost’ Cat Boost Classifier Classification
‘lr’ Linear Regression Regression
‘lasso’ Lasso Regression Regression
‘ridge’ Ridge Regression Regression
‘en’ Elastic Net Regression
‘lar’ Least Angle Regression Regression
‘llar’ Lasso Least Angle Regression Regression
‘omp’ Orthogonal Matching Pursuit Regression
‘br’ Bayesian Ridge Regression
‘ard’ Automatic Relevance Determination Regression
‘par’ Passive Aggressive Regressor Regression
‘ransac’ Random Sample Consensus Regression
‘tr’ TheilSen Regressor Regression
‘huber’ Huber Regressor Regression
‘kr’ Kernel Ridge Regression
‘svm’ Support Vector Machine Regression
‘knn’ K Neighbors Regressor Regression
‘dt’ Decision Tree Regression
‘rf’ Random Forest Regression
‘et’ Extra Trees Regressor Regression
‘ada’ AdaBoost Regressor Regression
‘gbr’ Gradient Boosting Regressor Regression
‘mlp’ Multi Level Perceptron Regression
‘xgboost’ Extreme Gradient Boosting Regression
‘lightgbm’ Light Gradient Boosting Machine Regression
‘catboost’ CatBoost Regressor Regression

If set to None, Linear model is used by default for both classification and regression tasks.

optimize: string, default = None
Metric to optimize for the supervised objective. If set to None, Accuracy is used for classification tasks and R2 for regression tasks.

For Classification tasks:
Accuracy, AUC, Recall, Precision, F1, Kappa

For Regression tasks:
MAE, MSE, RMSE, R2, ME

custom_grid: list, default = None
By default, a pre-defined number of topics is iterated over to optimize the supervised objective. To overwrite the default iterations, pass a list of num_topics values to iterate over in the custom_grid param.

verbose: Boolean, default = True
Status update is not printed when verbose is set to False.

Returns:

Visual Plot: Visual plot with k number of topics on x-axis with metric to optimize on y-axis. Coherence is used when learning is unsupervised. Also, prints the best model metric.

Model: Trained model object with best K number of topics.
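At its core, the tuning loop scores each candidate number of topics and keeps the best, which can be sketched as follows (hypothetical scores stand in for the real supervised metric or coherence value):

```python
# illustrative sketch of tune_model()'s search over num_topics
# (a toy score function stands in for the supervised metric / coherence)
def tune_num_topics(grid, score_fn):
    # fit-and-score each candidate k, keep the best-scoring one
    return max(grid, key=score_fn)

scores = {2: 0.41, 4: 0.55, 8: 0.62, 16: 0.58}   # hypothetical metric values
best_k = tune_num_topics([2, 4, 8, 16], scores.get)
assert best_k == 8
```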

Warnings:

  • Random Projections (‘rp’) and Non-Negative Matrix Factorization (‘nmf’) are not available for unsupervised learning. An error is raised when ‘rp’ or ‘nmf’ is passed without a supervised_target.
  • Estimators using kernel-based methods such as Kernel Ridge Regressor, Automatic Relevance Determination, Gaussian Process Classifier, Radial Basis Support Vector Machine and Multi Level Perceptron may have longer training times.

Save Model


 

save_model(model, model_name, verbose=True)

Description:

This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

Code
# create a model
lda = create_model('lda')

# save a model
save_model(lda, 'lda_model_23122019')

 

Output

Parameters:

model : object, default = None
A trained model object should be passed as an estimator.

model_name : string, default = None
Name of pickle file to be passed as a string.

verbose: Boolean, default = True
Success message is not printed when verbose is set to False.

Returns:

Message : Success Message

 

Load Model


 

load_model(model_name, platform = None, authentication = None, verbose=True)

Description:

This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

Code
saved_lda = load_model('lda_model_23122019')

 

Output

Parameters:

model_name : string, default = None
Name of pickle file to be passed as a string.

platform: string, default = None
Name of platform, if loading the model from cloud. Currently available option is: ‘aws’.

authentication : dict
dictionary of applicable authentication tokens.

When platform = ‘aws’:
{‘bucket’ : ‘Name of Bucket on S3’}

verbose: Boolean, default = True
Success message is not printed when verbose is set to False.

Returns:

Message : Success Message

 

Models


 

models(type = None)

Description:

Returns a table of models available in the model library.
Code
# show all models in library 
all_models = models()

 

Output

Parameters:

No parameters

Returns:

Dataframe: Pandas dataframe with meta data for all models.

Get Logs


 

get_logs(experiment_name = None, save = False)

Description:

Returns a table of experiment logs consisting of run details, parameters, metrics and tags.
Code
# store experiment logs in pandas dataframe
logs = get_logs()

 

Output

Parameters:
experiment_name : string, default = None
When set to None, the current active run is used.
 
save : bool, default = False
When set to True, csv file is saved in current directory.

Returns:

Dataframe: Pandas dataframe with logs.

Get Config


 

get_config(variable)

Description:

This function is used to access global environment variables. The following variables can be accessed:
 
  • text: Tokenized words as a list with length = # documents
  • data_: Dataframe containing text after all processing
  • corpus: List containing tuples of id to word mapping
  • id2word: gensim.corpora.dictionary.Dictionary
  • seed: random state set through session_id
  • target_: Name of column containing text. ‘en’ by default.
  • html_param: html_param configured through setup
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup
Code
# get the processed text
text = get_config('text')

 

Output

Set Config


 

set_config(variable, value)

Description:
This function is used to reset global environment variables. The following variables can be accessed:

  • text: Tokenized words as a list with length = # documents
  • data_: Dataframe containing text after all processing
  • corpus: List containing tuples of id to word mapping
  • id2word: gensim.corpora.dictionary.Dictionary
  • seed: random state set through session_id
  • target_: Name of column containing text. ‘en’ by default.
  • html_param: html_param configured through setup
  • exp_name_log: Name of experiment set through setup
  • logging_param: log_experiment param set through setup
  • log_plots_param: log_plots param set through setup
  • USI: Unique session ID parameter set through setup
Code
# change seed value in environment to 999
set_config('seed', 999)

 

Output

No output.
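Together, get_config and set_config behave like a read and a write on one global store, which can be sketched as follows (a plain dict stands in for PyCaret's environment; not the actual internals):

```python
# illustrative: one global store backs both functions (not PyCaret internals)
_env = {'seed': 123, 'html_param': True}

def get_config(variable):
    # read a named variable from the shared environment
    return _env[variable]

def set_config(variable, value):
    # overwrite a named variable in the shared environment
    _env[variable] = value

set_config('seed', 999)
assert get_config('seed') == 999
```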

Get System Logs


 

get_system_logs()

Description:

Read and print the 'logs.log' file from the current active directory.
Code
# Reading system logs in Notebook
get_system_logs()

 

Output

MLFlow UI


 

mlflow ui

Description:

Execute this command in the current working directory to open the MLflow server on localhost:5000.
Code
# loading dataset
from pycaret.datasets import get_data
data = get_data('kiva')

# initializing setup
from pycaret.nlp import *
nlp1 = setup(data, target = 'en', log_experiment = True, experiment_name = 'kiva1')

# create lda model
lda = create_model('lda') 

# create nmf model
nmf = create_model('nmf')

# run mlflow server (notebook)
!mlflow ui

# just 'mlflow ui' when running through the command line.

 

Output
