User Guide - Final (#1322)
* User Guide - Final

* Query Profiler notebook

* Create user_guide_performance_qprof_get_qplan_tree.PNG

* vdf

* big bug

* multiple corrections

* correcting the ML bug - wrong types

* read_csv correction

* fix feature engineering notebook

* correction

---------

Co-authored-by: Umar Farooq Ghumman <umarfarooq.ghumman@vertica.com>
oualib and mail4umar authored Oct 24, 2024
1 parent 7d38dc8 commit 6326d27
Showing 42 changed files with 1,240 additions and 224 deletions.
4 changes: 2 additions & 2 deletions docs/source/contribution_guidelines_code_auto_doc_example.rst
@@ -315,13 +315,13 @@ And to reference a module named vDataFrame:
.. seealso::
- :py:mod:`vDataFrame`
+ :py:mod:`~verticapy.vDataFrame`
**Output:**

.. seealso::

- :py:mod:`vDataFrame`
+ :py:mod:`~verticapy.vDataFrame`

Now you can go through the examples below to understand the usage in detail. From these examples, you will note a few things:

2 changes: 1 addition & 1 deletion docs/source/examples_business_base_station.rst
@@ -787,7 +787,7 @@ The :py:func:`~verticapy.machine_learning.model_selection.elbow` curve seems to
Predicting Base Station Workload
+++++++++++++++++++++++++++++++++

- With the predictive power of AutoML, we can predict the workload of the base stations. :py:func:`~verticapy.machine_learning.vertica.automl.AutoML` is a powerful technique that tests multiple models to maximize the input score.
+ With the predictive power of AutoML, we can predict the workload of the base stations. :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML` is a powerful technique that tests multiple models to maximize the input score.

The features used to train our model will be longitude, latitude, total number of distinct users, average duration of the connections, total duration of connections, total number of connections, the cluster they belong to, total number of base stations in the cluster, and the workload of the clusters.

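For reference, a minimal sketch of the AutoML workflow this hunk refers to; the table name, column list, and constructor arguments below are hypothetical assumptions, not the example's actual code:

.. code-block:: python

    from verticapy.machine_learning.vertica.automl import AutoML

    # AutoML trains several candidate models and keeps the one that
    # maximizes the input score. All names below are hypothetical.
    model = AutoML("base_station_automl")
    model.fit(
        "base_stations_workload",
        X = ["longitude", "latitude", "total_users", "avg_duration"],
        y = "workload",
    )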
2 changes: 1 addition & 1 deletion docs/source/examples_business_battery.rst
@@ -587,7 +587,7 @@ We'll define new features that describe the minimum and maximum temperature duri
Machine Learning
-----------------

- :py:func:`~verticapy.machine_learning.vertica.AutoML` tests several models and returns input scores for each. We can use this to find the best model for our dataset.
+ :py:mod:`~verticapy.machine_learning.vertica.AutoML` tests several models and returns input scores for each. We can use this to find the best model for our dataset.

.. note:: We are only using three algorithms, but you can change the ``estimator`` parameter to try all of the native algorithms: ``estimator = 'native'``.

4 changes: 2 additions & 2 deletions docs/source/examples_business_booking.rst
@@ -234,7 +234,7 @@ We can see huge links between some of the variables ('mode_hotel_cluster_count'
Machine Learning
-----------------

- Let's create our :py:func:`~verticapy.machine_learning.vertica.LogisticRegression` model.
+ Let's create our :py:mod:`~verticapy.machine_learning.vertica.LogisticRegression` model.

.. ipython:: python
@@ -279,7 +279,7 @@ It looks like there are two main predictors: 'mode_hotel_cluster_count' and 'tri
- look for a shorter trip duration.
- not click as much (spend more time at the same web page).

- Let's add our prediction to the :py:mod:`vDataFrame`.
+ Let's add our prediction to the :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
2 changes: 1 addition & 1 deletion docs/source/examples_business_churn.rst
@@ -203,7 +203,7 @@ ________
Machine Learning
-----------------

- :py:func:`~verticapy.machine_learning.vertica.LogisticRegression` is a very powerful algorithm and we can use it to detect churns. Let's split our :py:mod:`vDataFrame` into training and testing set to evaluate our model.
+ :py:mod:`~verticapy.machine_learning.vertica.LogisticRegression` is a very powerful algorithm that we can use to detect churn. Let's split our :py:mod:`~verticapy.vDataFrame` into training and testing sets to evaluate our model.

.. ipython:: python
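For reference, a minimal sketch of the split-and-train flow this hunk describes; the relation, columns, and ``train_test_split`` usage are assumptions:

.. code-block:: python

    from verticapy.machine_learning.vertica import LogisticRegression

    # Hypothetical vDataFrame 'churn' with a binary response column.
    train, test = churn.train_test_split(test_size = 0.2)

    model = LogisticRegression("churn_lr")  # hypothetical model name
    model.fit(train, X = ["tenure", "monthly_charges"], y = "churned")
    model.report()  # classification metrics for the fitted model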
6 changes: 3 additions & 3 deletions docs/source/examples_business_credit_card_fraud.rst
@@ -330,7 +330,7 @@ Supervision

Supervising would make this pretty easy since it would just be a binary classification problem. We can use different algorithms to optimize the prediction. Our dataset is unbalanced, so the AUC might be a good metric to evaluate the model. The PRC AUC would also be a relevant metric.

- :py:func:`~verticapy.machine_learning.vertica.LogisticRegression` works well with monotonic relationships. Since we have a lot of independent features that correlate with the response, it should be a good first model to use.
+ :py:mod:`~verticapy.machine_learning.vertica.LogisticRegression` works well with monotonic relationships. Since we have a lot of independent features that correlate with the response, it should be a good first model to use.

.. code-block:: python
@@ -401,7 +401,7 @@ Due to the complexity of the computations, anomalies are difficult to detect in

- **Machine Learning:** We need to use easily-deployable algorithms to perform real-time fraud detection. Isolation forests and ``k-means`` can be easily deployed and they work well for detecting anomalies.
- **Rules & Thresholds:** The z-score can be an efficient solution for detecting global outliers.
- - **Decomposition:** Robust :py:func:`~verticapy.machine_learning.vertica.PCA` is another technique for detecting outliers.
+ - **Decomposition:** Robust :py:mod:`~verticapy.machine_learning.vertica.PCA` is another technique for detecting outliers.

Before using these techniques, let's draw some scatter plots to get a better idea of what kind of anomalies we can expect.

@@ -642,7 +642,7 @@ We can catch outliers with a neighbors score. Again, the main problem with these
Other Techniques
+++++++++++++++++

- Other scalable techniques that can solve this problem are robust :py:func:`~verticapy.machine_learning.vertica.PCA` and isolation forest.
+ Other scalable techniques that can solve this problem are robust :py:mod:`~verticapy.machine_learning.vertica.PCA` and isolation forest.

Conclusion
-----------
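As a concrete illustration of the rules-and-thresholds approach mentioned above, a z-score check can run entirely in Vertica; the column names are hypothetical, and the SQL-string assignment is an assumption:

.. code-block:: python

    # Flag global outliers more than 3 standard deviations from the mean.
    mean_amount = transactions["amount"].mean()
    std_amount = transactions["amount"].std()
    transactions["amount_zscore"] = (transactions["amount"] - mean_amount) / std_amount

    # String assignments are assumed to be parsed as SQL expressions.
    transactions["is_outlier"] = "ABS(amount_zscore) > 3"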
4 changes: 2 additions & 2 deletions docs/source/examples_business_football.rst
@@ -979,7 +979,7 @@ To compute a ``k-means`` model, we need to find a value for 'k'. Let's draw an :
model_kmeans.fit("football_clustering", predictors)
model_kmeans.clusters_
- Let's add the prediction to the :py:mod:`vDataFrame`.
+ Let's add the prediction to the :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
@@ -1983,7 +1983,7 @@ Looking at the importance of each feature, it seems like direct confrontations a
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_football_features_importance.html

- Let's add the predictions to the :py:mod:`vDataFrame`.
+ Let's add the predictions to the :py:mod:`~verticapy.vDataFrame`.

Draws are pretty rare, so we'll only consider them if a tie was very likely to occur.

2 changes: 1 addition & 1 deletion docs/source/examples_business_insurance.rst
@@ -38,7 +38,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
- Let's create a new schema and assign the data to a :py:mod:`vDataFrame` object.
+ Let's create a new schema and assign the data to a :py:mod:`~verticapy.vDataFrame` object.

.. code-block:: ipython
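For reference, a minimal sketch of the connect-and-load step; the DSN comes from the page itself, while the CSV path and schema name are hypothetical:

.. code-block:: python

    import verticapy as vp

    vp.connect("VerticaDSN")

    # read_csv parses the file, materializes it in Vertica, and
    # returns a vDataFrame; path and schema are hypothetical.
    insurance = vp.read_csv("insurance.csv", schema = "insurance")
    insurance.head(5)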
8 changes: 4 additions & 4 deletions docs/source/examples_business_movies.rst
@@ -43,7 +43,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
- Let's create a new schema and assign the data to a :py:mod:`vDataFrame` object.
+ Let's create a new schema and assign the data to a :py:mod:`~verticapy.vDataFrame` object.

.. code-block:: ipython
@@ -349,7 +349,7 @@ Let's join our notoriety metrics for actors and directors with the main dataset.
],
)
- As we did many operation, it can be nice to save the :py:mod:`vDataFrame` as a table in the Vertica database.
+ As we have performed many operations, it can be useful to save the :py:mod:`~verticapy.vDataFrame` as a table in the Vertica database.

.. code-block:: python
@@ -754,7 +754,7 @@ Let's create a model to evaluate an unbiased score for each different movie.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_movies_filmtv_complete_model_report.html

- The model is good. Let's add it in our :py:mod:`vDataFrame`.
+ The model is good. Let's add it to our :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
@@ -926,7 +926,7 @@ By looking at the elbow curve, we can choose 15 clusters. Let's create a ``k-mea
model_kmeans.fit(filmtv_movies_complete, predictors)
model_kmeans.clusters_
- Let's add the clusters in the :py:mod:`vDataFrame`.
+ Let's add the clusters to the :py:mod:`~verticapy.vDataFrame`.


.. code-block:: python
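A sketch of persisting the relation, as suggested in the second hunk above; the ``to_db`` arguments shown are assumptions:

.. code-block:: python

    # Save the modified vDataFrame as a table so the accumulated
    # pipeline of operations is not recomputed from the source data.
    filmtv_movies_complete.to_db(
        '"movies"."filmtv_movies_complete"',
        relation_type = "table",
    )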
2 changes: 1 addition & 1 deletion docs/source/examples_business_smart_meters.rst
@@ -44,7 +44,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
- Create the :py:mod:`vDataFrame` of the datasets:
+ Create the :py:mod:`~verticapy.vDataFrame` of the datasets:

.. code-block:: python
4 changes: 2 additions & 2 deletions docs/source/examples_business_spam.rst
@@ -106,7 +106,7 @@ Let's compute some statistics using the length of the message.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spam_table_describe.html

- .. note:: Spam tends to be longer than a normal message. First, let's create a view with just spam. Then, we'll use the :py:func:`~verticapy.machine_learning.vertica.CountVectorizer` to create a dictionary and identify keywords.
+ .. note:: Spam tends to be longer than a normal message. First, let's create a view with just spam. Then, we'll use the :py:mod:`~verticapy.machine_learning.vertica.CountVectorizer` to create a dictionary and identify keywords.

.. code-block:: python
@@ -138,7 +138,7 @@ Let's compute some statistics using the length of the message.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spam_table_clean_2.html

- Let's add the most occurent words in our :py:mod:`vDataFrame` and compute the correlation vector.
+ Let's add the most frequent words to our :py:mod:`~verticapy.vDataFrame` and compute the correlation vector.

.. code-block:: python
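A sketch of the correlation-vector step mentioned in the last hunk; the keyword indicator and column names are hypothetical, and the ``str_contains`` usage is an assumption:

.. code-block:: python

    # Hypothetical keyword indicator built from the message text.
    messages["has_free"] = messages["content"].str_contains("free")

    # Correlation of every column against the binary 'spam' label.
    messages.corr(focus = "spam")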
12 changes: 6 additions & 6 deletions docs/source/examples_business_spotify.rst
@@ -88,7 +88,7 @@ Create a new schema, "spotify".
Data Loading
-------------

- Load the datasets into the :py:mod:`vDataFrame` with :py:func:`~verticapy.read_csv` and then view them with :py:func:`~verticapy.vDataFrame.head`.
+ Load the datasets into the :py:mod:`~verticapy.vDataFrame` with :py:func:`~verticapy.read_csv` and then view them with :py:func:`~verticapy.vDataFrame.head`.

.. code-block::
@@ -521,14 +521,14 @@ Define a list of predictors and the response, and then save the normalized versi
Machine Learning
-----------------

- We can use :py:func:`~verticapy.machine_learning.vertica.automl.AutoML` to easily get a well-performing model.
+ We can use :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML` to easily get a well-performing model.

.. ipython:: python
# define a random seed so models tested by AutoML produce consistent results
vp.set_option("random_state", 2)
- :py:func:`~verticapy.machine_learning.vertica.automl.AutoML` automatically tests several machine learning models and picks the best performing one.
+ :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML` automatically tests several machine learning models and picks the best performing one.

.. ipython:: python
:okwarning:
@@ -569,7 +569,7 @@ Train the model.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spotify_automl_plot.html

- Extract the best model according to :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`. From here, we can look at the model type and its hyperparameters.
+ Extract the best model according to :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML`. From here, we can look at the model type and its hyperparameters.

.. ipython:: python
@@ -581,7 +581,7 @@ Extract the best model according to :py:func:`~verticapy.machine_learning.vertic
print(bm_type)
print(hyperparams)
- Thanks to :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`, we know best model type and its hyperparameters. Let's create a new model with this information in mind.
+ Thanks to :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML`, we know the best model type and its hyperparameters. Let's create a new model with this information in mind.

.. code-block::
@@ -915,4 +915,4 @@ Let's see how our model groups these artists together:
Conclusion
-----------

- We were able to predict the popularity Polish songs with a :py:func:`~verticapy.machine_learning.vertica.RandomForestRegressor` model suggested by :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`. We then created a ``k-means`` model to group artists into "genres" (clusters) based on the feature-commonalities in their tracks.
+ We were able to predict the popularity of Polish songs with a :py:mod:`~verticapy.machine_learning.vertica.RandomForestRegressor` model suggested by :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML`. We then created a ``k-means`` model to group artists into "genres" (clusters) based on the feature commonalities in their tracks.
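A sketch of extracting the winning model after an AutoML fit, as the hunks above describe; the ``best_model_`` attribute name is an assumption:

.. code-block:: python

    # The attribute exposing the winning estimator is assumed here.
    best_model = model.best_model_
    print(type(best_model).__name__)  # e.g. RandomForestRegressor
    print(best_model.get_params())    # its tuned hyperparameters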
6 changes: 3 additions & 3 deletions docs/source/examples_learn_commodities.rst
@@ -320,12 +320,12 @@ Moving on to the correlation matrix, we can see many events that changed drastic
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_commodities_table_corr_2.html

- We can see strong correlations between most of the variables. A vector autoregression (:py:func:`~verticapy.machine_learning.vertica.VAR`) model seems ideal.
+ We can see strong correlations between most of the variables. A vector autoregression (:py:mod:`~verticapy.machine_learning.vertica.VAR`) model seems ideal.

Machine Learning
-----------------

- Let's create the :py:func:`~verticapy.machine_learning.vertica.VAR` model to predict the value of various commodities.
+ Let's create the :py:mod:`~verticapy.machine_learning.vertica.VAR` model to predict the value of various commodities.

.. code-block:: python
@@ -446,7 +446,7 @@ Dol_Eur:
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_commodities_table_pred_plot_4.html

- The model performs well but may be somewhat unstable. To improve it, we could apply data preparation techniques, such as seasonal decomposition, before building the :py:func:`~verticapy.machine_learning.vertica.VAR` model.
+ The model performs well but may be somewhat unstable. To improve it, we could apply data preparation techniques, such as seasonal decomposition, before building the :py:mod:`~verticapy.machine_learning.vertica.VAR` model.

Conclusion
-----------
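For reference, a minimal sketch of fitting a vector autoregression; the import path, constructor, and ``fit`` signature are assumptions, and all names are hypothetical:

.. code-block:: python

    from verticapy.machine_learning.vertica import VAR

    # With lag order p, each series is regressed on the lags of all series.
    model = VAR("commodities_var", p = 3)
    model.fit(
        "commodities",                    # hypothetical table
        ts = "date",                      # time column
        y = ["gold", "oil", "dol_eur"],   # co-evolving series
    )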
8 changes: 4 additions & 4 deletions docs/source/examples_learn_iris.rst
@@ -170,7 +170,7 @@ Our strategy is simple: we'll use two Linear Support Vector Classification (SVC)
Machine Learning
-----------------

- Let's build the first :py:func:`~verticapy.machine_learning.vertica.LinearSVC` to predict if a flower is an Iris setosa.
+ Let's build the first :py:mod:`~verticapy.machine_learning.vertica.LinearSVC` to predict if a flower is an Iris setosa.

.. code-block:: python
@@ -221,7 +221,7 @@ Let's plot the model to see the perfect separation.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_model_plot.html

- We can add this probability to the :py:mod:`vDataFrame`.
+ We can add this probability to the :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
@@ -275,7 +275,7 @@ Let's create a model to classify the Iris virginica.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_iris_table_ml_cv_2.html

- We have another excellent model. Let's add it to the :py:mod:`vDataFrame`.
+ We have another excellent model. Let's add it to the :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
@@ -294,7 +294,7 @@ We have another excellent model. Let's add it to the :py:mod:`vDataFrame`.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_model_predict_proba_2.html

- Let's evaluate our final model (the combination of two :py:func:`~verticapy.machine_learning.vertica.LinearSVC`).
+ Let's evaluate our final model (the combination of two :py:mod:`~verticapy.machine_learning.vertica.LinearSVC`).

.. code-block:: python
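A minimal sketch of one of the two binary classifiers this example combines; the relation, columns, and ``predict_proba`` signature are assumptions:

.. code-block:: python

    from verticapy.machine_learning.vertica import LinearSVC

    model = LinearSVC("svc_setosa")  # hypothetical model name
    model.fit(
        "iris",
        X = ["PetalLengthCm", "PetalWidthCm"],  # hypothetical predictors
        y = "Species_setosa",                   # hypothetical 0/1 indicator
    )

    # Write the probability of the positive class back into the vDataFrame.
    model.predict_proba(
        iris,
        X = ["PetalLengthCm", "PetalWidthCm"],
        name = "proba_setosa",
        pos_label = 1,
    )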
2 changes: 1 addition & 1 deletion docs/source/examples_learn_pokemon.rst
@@ -250,7 +250,7 @@ In terms of missing values, our only concern is the Pokemon's second type (Type_
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_pokemon_table_clean_2.html

- Let's use the current_relation method to see how our data preparation so far on the :py:mod:`vDataFrame` generates SQL code.
+ Let's use the ``current_relation`` method to see how our data preparation so far on the :py:mod:`~verticapy.vDataFrame` generates SQL code.

.. ipython:: python
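The generated SQL can be inspected at any point; a one-line sketch (the vDataFrame name is hypothetical):

.. code-block:: python

    # Every transformation so far is compiled into a single SQL relation.
    print(pokemon.current_relation())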
6 changes: 3 additions & 3 deletions docs/source/examples_learn_titanic.rst
@@ -217,7 +217,7 @@ The "sibsp" column represents the number of siblings for each passenger, while t
titanic["family_size"] = titanic["parch"] + titanic["sibsp"] + 1
- Let's move on to outliers. We have several tools for locating outliers (:py:func:`~verticapy.machine_learning.vertica.LocalOutlierFactor`, :py:func:`~verticapy.machine_learning.vertica.DBSCAN`, ``k-means``...), but we'll just use winsorization in this example. Again, "fare" has many outliers, so we'll start there.
+ Let's move on to outliers. We have several tools for locating outliers (:py:mod:`~verticapy.machine_learning.vertica.LocalOutlierFactor`, :py:mod:`~verticapy.machine_learning.vertica.DBSCAN`, ``k-means``...), but we'll just use winsorization in this example. Again, "fare" has many outliers, so we'll start there.

.. code-block:: python
@@ -302,7 +302,7 @@ Survival correlates strongly with whether or not a passenger has a lifeboat (the
- Passengers with a lifeboat
- Passengers without a lifeboat

- Before we move on: we did a lot of work to clean up this data, but we haven't saved anything to our Vertica database! Let's look at the modifications we've made to the :py:mod:`vDataFrame`.
+ Before we move on: we did a lot of work to clean up this data, but we haven't saved anything to our Vertica database! Let's look at the modifications we've made to the :py:mod:`~verticapy.vDataFrame`.

.. ipython:: python
@@ -322,7 +322,7 @@ VerticaPy dynamically generates SQL code whenever you make modifications to your
vp.set_option("sql_on", False)
print(titanic.info())
- Let's move on to modeling our data. Save the :py:mod:`vDataFrame` to your Vertica database.
+ Let's move on to modeling our data. Save the :py:mod:`~verticapy.vDataFrame` to your Vertica database.

.. ipython:: python
:okwarning:
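The SQL tracing mentioned in this hunk can be toggled around any block of operations; a short sketch built from the lines shown above:

.. code-block:: python

    import verticapy as vp

    # Print the SQL sent to Vertica for each subsequent operation...
    vp.set_option("sql_on", True)
    titanic["family_size"] = titanic["parch"] + titanic["sibsp"] + 1
    # ...then turn tracing back off.
    vp.set_option("sql_on", False)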
4 changes: 2 additions & 2 deletions docs/source/examples_understand_africa_education.rst
@@ -260,7 +260,7 @@ Eight seems to be a suitable number of clusters. Let's compute a ``k-means`` mod
model = KMeans(n_cluster = 8)
model.fit(africa, X = ["lon", "lat"])
- We can add the prediction to the :py:mod:`vDataFrame` and draw the scatter map.
+ We can add the prediction to the :py:mod:`~verticapy.vDataFrame` and draw the scatter map.

.. code-block:: python
@@ -500,7 +500,7 @@ Let's look at the feature importance for each model.

Feature importance between the math score and the reading score is almost identical.

- We can add these predictions to the main :py:mod:`vDataFrame`.
+ We can add these predictions to the main :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
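A sketch of adding the cluster labels after the ``k-means`` fit shown in the first hunk; the ``predict`` signature is an assumption:

.. code-block:: python

    # Write each point's cluster assignment back into the vDataFrame;
    # the scatter map can then be colored by this new column.
    model.predict(africa, X = ["lon", "lat"], name = "cluster")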
10 changes: 5 additions & 5 deletions docs/source/examples_understand_covid19.rst
@@ -283,14 +283,14 @@ Because of the upward monotonic trend, we can also look at the correlation betwe
covid19["elapsed_days"] = covid19["date"] - fun.min(covid19["date"])._over(by = [covid19["state"]])
- We can generate the SQL code of the :py:mod:`vDataFrame`
- to see what happens behind the scenes when we modify our data from within the :py:mod:`vDataFrame`.
+ We can generate the SQL code of the :py:mod:`~verticapy.vDataFrame`
+ to see what happens behind the scenes when we modify our data from within the :py:mod:`~verticapy.vDataFrame`.

.. ipython:: python
print(covid19.current_relation())
- The :py:mod:`vDataFrame` memorizes all of our operations on the data to dynamically generate the correct SQL statement and passes computation and aggregation to Vertica.
+ The :py:mod:`~verticapy.vDataFrame` memorizes all of our operations on the data to dynamically generate the correct SQL statement and passes computation and aggregation to Vertica.

Let's see the correlation between the number of deaths and the other variables.

@@ -307,7 +307,7 @@ Let's see the correlation between the number of deaths and the other variables.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_covid19_table_plot_corr_5.html

- We can see clearly a high correlation for some variables. We can use them to compute a ``SARIMAX`` model, but we'll stick to a :py:func:`~verticapy.machine_learning.vertica.VAR` model for this study.
+ We can clearly see a high correlation for some variables. We could use them to compute a ``SARIMAX`` model, but we'll stick to a :py:mod:`~verticapy.machine_learning.vertica.VAR` model for this study.

Let's compute the total number of deaths and cases to create our VAR model.

@@ -335,7 +335,7 @@ Let's compute the total number of deaths and cases to create our VAR model.
Machine Learning
-----------------

- Let's create a :py:func:`~verticapy.machine_learning.vertica.VAR` model to predict the number of COVID-19 deaths and cases in the USA.
+ Let's create a :py:mod:`~verticapy.machine_learning.vertica.VAR` model to predict the number of COVID-19 deaths and cases in the USA.

.. code-block:: python
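The ``elapsed_days`` feature in the first hunk uses an analytic (window) function; here is the pattern in isolation, assuming the conventional ``fun`` alias for VerticaPy's SQL functions module:

.. code-block:: python

    import verticapy.sql.functions as fun

    # Days elapsed since each state's first reported date, i.e.
    # date - MIN(date) OVER (PARTITION BY state), computed in Vertica.
    covid19["elapsed_days"] = covid19["date"] - fun.min(covid19["date"])._over(by = [covid19["state"]])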