\n", "

\n", "`from pycaret.utils import enable_colab`

\n", "`enable_colab()`\n", "\n", "## 1.4 See also:\n", "- __[Regression Tutorial (REG101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Regression%20Tutorial%20Level%20Beginner%20-%20REG101.ipynb)__\n", "- __[Regression Tutorial (REG103) - Level Expert](https://github.com/pycaret/pycaret/blob/master/Tutorials/Regression%20Tutorial%20Level%20Expert%20-%20REG103.ipynb)__" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "h1iXlqnO1RRd" }, "source": [ "# 2.0 Brief overview of techniques covered in this tutorial\n", "Before we into the practical execution of the techniques mentioned above in Section 1, it is important to understand what are these techniques are and when to use them. More often than not most of these techniques will help linear and parametric algorithms, however it is not surprising to also see performance gains in tree-based models. The Below explanations are only brief and we recommend that you do extra reading to dive deeper and get a more thorough understanding of these techniques.\n", "\n", "- **Normalization:** Normalization / Scaling (often used interchangeably with standardization) is used to transform the actual values of numeric variables in a way that provides helpful properties for machine learning. Many algorithms such as Linear Regression, Support Vector Machine and K Nearest Neighbors assume that all features are centered around zero and have variances that are at the same level of order. If a particular feature in a dataset has a variance that is larger in order of magnitude than other features, the model may not understand all features correctly and could perform poorly. __[Read more](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html#z-score-standardization-or-min-max-scaling)__

\n", "

\n", "- **Transformation:** While normalization transforms the range of data to remove the impact of magnitude in variance, transformation is a more radical technique as it changes the shape of the distribution so that transformed data can be represented by a normal or approximate normal distirbution. In general, you should transform the data if using algorithms that assume normality or a gaussian distribution. Examples of such models are Linear Regression, Lasso Regression and Ridge Regression. __[Read more](https://en.wikipedia.org/wiki/Power_transform)__

\n", "

\n", "- **Target Transformation:** This is similar to the `transformation` technique explained above with the exception that this is only applied to the target variable. __[Read more](https://scikit-learn.org/stable/auto_examples/compose/plot_transformed_target.html)__ to understand the effects of transforming the target variable in regression.

\n", "

\n", "- **Combine Rare Levels:** Sometimes categorical features have levels that are insignificant in the frequency distribution. As such, they may introduce noise into the dataset due to a limited sample size for learning. One way to deal with rare levels in categorical features is to combine them into a new class.

\n", "

\n", "- **Bin Numeric Variables:** Binning or discretization is the process of transforming numerical variables into categorical features. An example would be `Carat Weight` in this experiment. It is a continious distribution of numeric values that can be discretized into intervals. Binning may improve the accuracy of a predictive model by reducing the noise or non-linearity in the data. PyCaret automatically determines the number and size of bins using Sturges rule. __[Read more](https://www.vosesoftware.com/riskwiki/Sturgesrule.php)__

\n", "

\n", "- **Model Ensembling and Stacking:** Ensemble modeling is a process where multiple diverse models are created to predict an outcome. This is achieved either by using many different modeling algorithms or using different samples of training data sets. The ensemble model then aggregates the predictions of each base model resulting in one final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction. As long as the base models are diverse and independent, the prediction error of the model decreases when the ensemble approach is used. The two most common methods in ensemble learning are `Bagging` and `Boosting`. Stacking is also a type of ensemble learning where predictions from multiple models are used as input features for a meta model that predicts the final outcome. __[Read more](https://blog.statsbot.co/ensemble-learning-d1dcd548e936)__

\n", "

\n", "- **Tuning Hyperparameters of Ensemblers:** Similar to hyperparameter tuning for a single machine learning model, we will also learn how to tune hyperparameters for an ensemble model." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zrNWkm4v1RRh" }, "source": [ "# 3.0 Dataset for the Tutorial" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "n_N5u8MZ1RRm" }, "source": [ "For this tutorial we will be using the same dataset that was used in __[Regression Tutorial (REG101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Regression%20Tutorial%20Level%20Beginner%20-%20REG101.ipynb)__.\n", "\n", "#### Dataset Acknowledgements:\n", "This case was prepared by Greg Mills (MBA ’07) under the supervision of Phillip E. Pfeifer, Alumni Research Professor of Business Administration. Copyright (c) 2007 by the University of Virginia Darden School Foundation, Charlottesville, VA. All rights reserved.\n", "\n", "The original dataset and description can be __[found here.](https://github.com/DardenDSC/sarah-gets-a-diamond)__ " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "BuL2VDPq1RRq" }, "source": [ "# 4.0 Getting the Data" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OljCicBG1RRu" }, "source": [ "You can download the data from the original source __[found here](https://github.com/DardenDSC/sarah-gets-a-diamond)__ and load it using the pandas read_csv function or you can use PyCaret's data respository to load the data using the get_data function (This will require internet connection)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "id": "p84oCDfz1RRz", "outputId": "41972d81-7bf0-4023-9a62-8a75fb059d4b" }, "outputs": [ { "data": { "text/html": [ "\n", "