User Guide - Final (#1322)
* User Guide - Final

* Query Profiler notebook

* Create user_guide_performance_qprof_get_qplan_tree.PNG

* vdf

* big bug

* multiple corrections

* correcting the ML bug - wrong types

* read_csv correction

* fix feature engineering notebook

* correction

---------

Co-authored-by: Umar Farooq Ghumman <umarfarooq.ghumman@vertica.com>
oualib and mail4umar authored Oct 24, 2024
1 parent 7d38dc8 commit 6326d27
Showing 42 changed files with 1,240 additions and 224 deletions.
4 changes: 2 additions & 2 deletions docs/source/contribution_guidelines_code_auto_doc_example.rst
@@ -315,13 +315,13 @@ And to reference a module named vDataFrame:
.. seealso::
- :py:mod:`vDataFrame`
+ :py:mod:`~verticapy.vDataFrame`
**Output:**

.. seealso::

- :py:mod:`vDataFrame`
+ :py:mod:`~verticapy.vDataFrame`

Now you can go through the examples below to understand the usage in detail. From these examples, you will note a few things:

2 changes: 1 addition & 1 deletion docs/source/examples_business_base_station.rst
@@ -787,7 +787,7 @@ The :py:func:`~verticapy.machine_learning.model_selection.elbow` curve seems to
Predicting Base Station Workload
+++++++++++++++++++++++++++++++++

- With the predictive power of AutoML, we can predict the workload of the base stations. :py:func:`~verticapy.machine_learning.vertica.automl.AutoML` is a powerful technique that tests multiple models to maximize the input score.
+ With the predictive power of AutoML, we can predict the workload of the base stations. :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML` is a powerful technique that tests multiple models to maximize the input score.

The features used to train our model will be longitude, latitude, total number of distinct users, average duration of the connections, total duration of connections, total number of connections, the cluster they belong to, total number of base stations in the cluster, and the workload of the clusters.

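For reference, a minimal sketch of the AutoML workflow this hunk refers to; the table name, column list, and constructor arguments below are hypothetical assumptions, not the example's actual code:

.. code-block:: python

    from verticapy.machine_learning.vertica.automl import AutoML

    # AutoML trains several candidate models and keeps the one that
    # maximizes the input score. All names below are hypothetical.
    model = AutoML("base_station_automl")
    model.fit(
        "base_stations_workload",
        X = ["longitude", "latitude", "total_users", "avg_duration"],
        y = "workload",
    )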
2 changes: 1 addition & 1 deletion docs/source/examples_business_battery.rst
@@ -587,7 +587,7 @@ We'll define new features that describe the minimum and maximum temperature duri
Machine Learning
-----------------

- :py:func:`~verticapy.machine_learning.vertica.AutoML` tests several models and returns input scores for each. We can use this to find the best model for our dataset.
+ :py:mod:`~verticapy.machine_learning.vertica.AutoML` tests several models and returns input scores for each. We can use this to find the best model for our dataset.

.. note:: We are only using three algorithms, but you can change the ``estimator`` parameter to try all of the native algorithms: ``estimator = 'native'``.

4 changes: 2 additions & 2 deletions docs/source/examples_business_booking.rst
@@ -234,7 +234,7 @@ We can see huge links between some of the variables ('mode_hotel_cluster_count'
Machine Learning
-----------------

- Let's create our :py:func:`~verticapy.machine_learning.vertica.LogisticRegression` model.
+ Let's create our :py:mod:`~verticapy.machine_learning.vertica.LogisticRegression` model.

.. ipython:: python
@@ -279,7 +279,7 @@ It looks like there are two main predictors: 'mode_hotel_cluster_count' and 'tri
- look for a shorter trip duration.
- not click as much (spend more time at the same web page).

- Let's add our prediction to the :py:mod:`vDataFrame`.
+ Let's add our prediction to the :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
2 changes: 1 addition & 1 deletion docs/source/examples_business_churn.rst
@@ -203,7 +203,7 @@ ________
Machine Learning
-----------------

- :py:func:`~verticapy.machine_learning.vertica.LogisticRegression` is a very powerful algorithm and we can use it to detect churns. Let's split our :py:mod:`vDataFrame` into training and testing set to evaluate our model.
+ :py:mod:`~verticapy.machine_learning.vertica.LogisticRegression` is a very powerful algorithm that we can use to detect churn. Let's split our :py:mod:`~verticapy.vDataFrame` into training and testing sets to evaluate our model.

.. ipython:: python
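For reference, a minimal sketch of the split-and-train flow this hunk describes; the relation, columns, and ``train_test_split`` usage are assumptions:

.. code-block:: python

    from verticapy.machine_learning.vertica import LogisticRegression

    # Hypothetical vDataFrame 'churn' with a binary response column.
    train, test = churn.train_test_split(test_size = 0.2)

    model = LogisticRegression("churn_lr")  # hypothetical model name
    model.fit(train, X = ["tenure", "monthly_charges"], y = "churned")
    model.report()  # classification metrics for the fitted model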
6 changes: 3 additions & 3 deletions docs/source/examples_business_credit_card_fraud.rst
@@ -330,7 +330,7 @@ Supervision

Supervising would make this pretty easy since it would just be a binary classification problem. We can use different algorithms to optimize the prediction. Our dataset is unbalanced, so the AUC might be a good metric to evaluate the model. The PRC AUC would also be a relevant metric.

- :py:func:`~verticapy.machine_learning.vertica.LogisticRegression` works well with monotonic relationships. Since we have a lot of independent features that correlate with the response, it should be a good first model to use.
+ :py:mod:`~verticapy.machine_learning.vertica.LogisticRegression` works well with monotonic relationships. Since we have a lot of independent features that correlate with the response, it should be a good first model to use.

.. code-block:: python
@@ -401,7 +401,7 @@ Due to the complexity of the computations, anomalies are difficult to detect in

- **Machine Learning:** We need to use easily-deployable algorithms to perform real-time fraud detection. Isolation forests and ``k-means`` can be easily deployed and they work well for detecting anomalies.
- **Rules & Thresholds:** The z-score can be an efficient solution for detecting global outliers.
- - **Decomposition:** Robust :py:func:`~verticapy.machine_learning.vertica.PCA` is another technique for detecting outliers.
+ - **Decomposition:** Robust :py:mod:`~verticapy.machine_learning.vertica.PCA` is another technique for detecting outliers.

Before using these techniques, let's draw some scatter plots to get a better idea of what kind of anomalies we can expect.

@@ -642,7 +642,7 @@ We can catch outliers with a neighbors score. Again, the main problem with these
Other Techniques
+++++++++++++++++

- Other scalable techniques that can solve this problem are robust :py:func:`~verticapy.machine_learning.vertica.PCA` and isolation forest.
+ Other scalable techniques that can solve this problem are robust :py:mod:`~verticapy.machine_learning.vertica.PCA` and isolation forest.

Conclusion
-----------
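As a concrete illustration of the rules-and-thresholds approach mentioned above, a z-score check can run entirely in Vertica; the column names are hypothetical, and the SQL-string assignment is an assumption:

.. code-block:: python

    # Flag global outliers more than 3 standard deviations from the mean.
    mean_amount = transactions["amount"].mean()
    std_amount = transactions["amount"].std()
    transactions["amount_zscore"] = (transactions["amount"] - mean_amount) / std_amount

    # String assignments are assumed to be parsed as SQL expressions.
    transactions["is_outlier"] = "ABS(amount_zscore) > 3"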
4 changes: 2 additions & 2 deletions docs/source/examples_business_football.rst
@@ -979,7 +979,7 @@ To compute a ``k-means`` model, we need to find a value for 'k'. Let's draw an :
model_kmeans.fit("football_clustering", predictors)
model_kmeans.clusters_
- Let's add the prediction to the :py:mod:`vDataFrame`.
+ Let's add the prediction to the :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
@@ -1983,7 +1983,7 @@ Looking at the importance of each feature, it seems like direct confrontations a
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_football_features_importance.html

- Let's add the predictions to the :py:mod:`vDataFrame`.
+ Let's add the predictions to the :py:mod:`~verticapy.vDataFrame`.

Draws are pretty rare, so we'll only consider them if a tie was very likely to occur.

2 changes: 1 addition & 1 deletion docs/source/examples_business_insurance.rst
@@ -38,7 +38,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
- Let's create a new schema and assign the data to a :py:mod:`vDataFrame` object.
+ Let's create a new schema and assign the data to a :py:mod:`~verticapy.vDataFrame` object.

.. code-block:: ipython
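For reference, a minimal sketch of the connect-and-load step; the DSN comes from the page itself, while the CSV path and schema name are hypothetical:

.. code-block:: python

    import verticapy as vp

    vp.connect("VerticaDSN")

    # read_csv parses the file, materializes it in Vertica, and
    # returns a vDataFrame; path and schema are hypothetical.
    insurance = vp.read_csv("insurance.csv", schema = "insurance")
    insurance.head(5)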
8 changes: 4 additions & 4 deletions docs/source/examples_business_movies.rst
@@ -43,7 +43,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
- Let's create a new schema and assign the data to a :py:mod:`vDataFrame` object.
+ Let's create a new schema and assign the data to a :py:mod:`~verticapy.vDataFrame` object.

.. code-block:: ipython
@@ -349,7 +349,7 @@ Let's join our notoriety metrics for actors and directors with the main dataset.
],
)
- As we did many operation, it can be nice to save the :py:mod:`vDataFrame` as a table in the Vertica database.
+ As we have performed many operations, it can be useful to save the :py:mod:`~verticapy.vDataFrame` as a table in the Vertica database.

.. code-block:: python
@@ -754,7 +754,7 @@ Let's create a model to evaluate an unbiased score for each different movie.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_movies_filmtv_complete_model_report.html

- The model is good. Let's add it in our :py:mod:`vDataFrame`.
+ The model is good. Let's add it to our :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
@@ -926,7 +926,7 @@ By looking at the elbow curve, we can choose 15 clusters. Let's create a ``k-mea
model_kmeans.fit(filmtv_movies_complete, predictors)
model_kmeans.clusters_
- Let's add the clusters in the :py:mod:`vDataFrame`.
+ Let's add the clusters to the :py:mod:`~verticapy.vDataFrame`.


.. code-block:: python
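A sketch of persisting the relation, as suggested in the second hunk above; the ``to_db`` arguments shown are assumptions:

.. code-block:: python

    # Save the modified vDataFrame as a table so the accumulated
    # pipeline of operations is not recomputed from the source data.
    filmtv_movies_complete.to_db(
        '"movies"."filmtv_movies_complete"',
        relation_type = "table",
    )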
2 changes: 1 addition & 1 deletion docs/source/examples_business_smart_meters.rst
@@ -44,7 +44,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
- Create the :py:mod:`vDataFrame` of the datasets:
+ Create the :py:mod:`~verticapy.vDataFrame` of the datasets:

.. code-block:: python
4 changes: 2 additions & 2 deletions docs/source/examples_business_spam.rst
@@ -106,7 +106,7 @@ Let's compute some statistics using the length of the message.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spam_table_describe.html

- .. note:: Spam tends to be longer than a normal message. First, let's create a view with just spam. Then, we'll use the :py:func:`~verticapy.machine_learning.vertica.CountVectorizer` to create a dictionary and identify keywords.
+ .. note:: Spam tends to be longer than a normal message. First, let's create a view with just spam. Then, we'll use the :py:mod:`~verticapy.machine_learning.vertica.CountVectorizer` to create a dictionary and identify keywords.

.. code-block:: python
@@ -138,7 +138,7 @@ Let's compute some statistics using the length of the message.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spam_table_clean_2.html

- Let's add the most occurent words in our :py:mod:`vDataFrame` and compute the correlation vector.
+ Let's add the most frequent words to our :py:mod:`~verticapy.vDataFrame` and compute the correlation vector.

.. code-block:: python
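A sketch of the correlation-vector step mentioned in the last hunk; the keyword indicator and column names are hypothetical, and the ``str_contains`` usage is an assumption:

.. code-block:: python

    # Hypothetical keyword indicator built from the message text.
    messages["has_free"] = messages["content"].str_contains("free")

    # Correlation of every column against the binary 'spam' label.
    messages.corr(focus = "spam")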
12 changes: 6 additions & 6 deletions docs/source/examples_business_spotify.rst
@@ -88,7 +88,7 @@ Create a new schema, "spotify".
Data Loading
-------------

- Load the datasets into the :py:mod:`vDataFrame` with :py:func:`~verticapy.read_csv` and then view them with :py:func:`~verticapy.vDataFrame.head`.
+ Load the datasets into the :py:mod:`~verticapy.vDataFrame` with :py:func:`~verticapy.read_csv` and then view them with :py:func:`~verticapy.vDataFrame.head`.

.. code-block::
@@ -521,14 +521,14 @@ Define a list of predictors and the response, and then save the normalized versi
Machine Learning
-----------------

- We can use :py:func:`~verticapy.machine_learning.vertica.automl.AutoML` to easily get a well-performing model.
+ We can use :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML` to easily get a well-performing model.

.. ipython:: python
# define a random seed so models tested by AutoML produce consistent results
vp.set_option("random_state", 2)
- :py:func:`~verticapy.machine_learning.vertica.automl.AutoML` automatically tests several machine learning models and picks the best performing one.
+ :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML` automatically tests several machine learning models and picks the best performing one.

.. ipython:: python
:okwarning:
@@ -569,7 +569,7 @@ Train the model.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spotify_automl_plot.html

- Extract the best model according to :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`. From here, we can look at the model type and its hyperparameters.
+ Extract the best model according to :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML`. From here, we can look at the model type and its hyperparameters.

.. ipython:: python
@@ -581,7 +581,7 @@ Extract the best model according to :py:func:`~verticapy.machine_learning.vertic
print(bm_type)
print(hyperparams)
- Thanks to :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`, we know best model type and its hyperparameters. Let's create a new model with this information in mind.
+ Thanks to :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML`, we know the best model type and its hyperparameters. Let's create a new model with this information in mind.

.. code-block::
@@ -915,4 +915,4 @@ Let's see how our model groups these artists together:
Conclusion
-----------

- We were able to predict the popularity Polish songs with a :py:func:`~verticapy.machine_learning.vertica.RandomForestRegressor` model suggested by :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`. We then created a ``k-means`` model to group artists into "genres" (clusters) based on the feature-commonalities in their tracks.
+ We were able to predict the popularity of Polish songs with a :py:mod:`~verticapy.machine_learning.vertica.RandomForestRegressor` model suggested by :py:mod:`~verticapy.machine_learning.vertica.automl.AutoML`. We then created a ``k-means`` model to group artists into "genres" (clusters) based on the feature commonalities in their tracks.
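A sketch of extracting the winning model after an AutoML fit, as the hunks above describe; the ``best_model_`` attribute name is an assumption:

.. code-block:: python

    # The attribute exposing the winning estimator is assumed here.
    best_model = model.best_model_
    print(type(best_model).__name__)  # e.g. RandomForestRegressor
    print(best_model.get_params())    # its tuned hyperparameters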
6 changes: 3 additions & 3 deletions docs/source/examples_learn_commodities.rst
@@ -320,12 +320,12 @@ Moving on to the correlation matrix, we can see many events that changed drastic
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_commodities_table_corr_2.html

- We can see strong correlations between most of the variables. A vector autoregression (:py:func:`~verticapy.machine_learning.vertica.VAR`) model seems ideal.
+ We can see strong correlations between most of the variables. A vector autoregression (:py:mod:`~verticapy.machine_learning.vertica.VAR`) model seems ideal.

Machine Learning
-----------------

- Let's create the :py:func:`~verticapy.machine_learning.vertica.VAR` model to predict the value of various commodities.
+ Let's create the :py:mod:`~verticapy.machine_learning.vertica.VAR` model to predict the value of various commodities.

.. code-block:: python
@@ -446,7 +446,7 @@ Dol_Eur:
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_commodities_table_pred_plot_4.html

- The model performs well but may be somewhat unstable. To improve it, we could apply data preparation techniques, such as seasonal decomposition, before building the :py:func:`~verticapy.machine_learning.vertica.VAR` model.
+ The model performs well but may be somewhat unstable. To improve it, we could apply data preparation techniques, such as seasonal decomposition, before building the :py:mod:`~verticapy.machine_learning.vertica.VAR` model.

Conclusion
-----------
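For reference, a minimal sketch of fitting a vector autoregression; the import path, constructor, and ``fit`` signature are assumptions, and all names are hypothetical:

.. code-block:: python

    from verticapy.machine_learning.vertica import VAR

    # With lag order p, each series is regressed on the lags of all series.
    model = VAR("commodities_var", p = 3)
    model.fit(
        "commodities",                    # hypothetical table
        ts = "date",                      # time column
        y = ["gold", "oil", "dol_eur"],   # co-evolving series
    )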
8 changes: 4 additions & 4 deletions docs/source/examples_learn_iris.rst
@@ -170,7 +170,7 @@ Our strategy is simple: we'll use two Linear Support Vector Classification (SVC)
Machine Learning
-----------------

- Let's build the first :py:func:`~verticapy.machine_learning.vertica.LinearSVC` to predict if a flower is an Iris setosa.
+ Let's build the first :py:mod:`~verticapy.machine_learning.vertica.LinearSVC` to predict if a flower is an Iris setosa.

.. code-block:: python
@@ -221,7 +221,7 @@ Let's plot the model to see the perfect separation.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_model_plot.html

- We can add this probability to the :py:mod:`vDataFrame`.
+ We can add this probability to the :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
@@ -275,7 +275,7 @@ Let's create a model to classify the Iris virginica.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_iris_table_ml_cv_2.html

- We have another excellent model. Let's add it to the :py:mod:`vDataFrame`.
+ We have another excellent model. Let's add it to the :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
@@ -294,7 +294,7 @@ We have another excellent model. Let's add it to the :py:mod:`vDataFrame`.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_model_predict_proba_2.html

- Let's evaluate our final model (the combination of two :py:func:`~verticapy.machine_learning.vertica.LinearSVC`).
+ Let's evaluate our final model (the combination of two :py:mod:`~verticapy.machine_learning.vertica.LinearSVC`).

.. code-block:: python
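A minimal sketch of one of the two binary classifiers this example combines; the relation, columns, and ``predict_proba`` signature are assumptions:

.. code-block:: python

    from verticapy.machine_learning.vertica import LinearSVC

    model = LinearSVC("svc_setosa")  # hypothetical model name
    model.fit(
        "iris",
        X = ["PetalLengthCm", "PetalWidthCm"],  # hypothetical predictors
        y = "Species_setosa",                   # hypothetical 0/1 indicator
    )

    # Write the probability of the positive class back into the vDataFrame.
    model.predict_proba(
        iris,
        X = ["PetalLengthCm", "PetalWidthCm"],
        name = "proba_setosa",
        pos_label = 1,
    )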
2 changes: 1 addition & 1 deletion docs/source/examples_learn_pokemon.rst
@@ -250,7 +250,7 @@ In terms of missing values, our only concern is the Pokemon's second type (Type_
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_pokemon_table_clean_2.html

- Let's use the current_relation method to see how our data preparation so far on the :py:mod:`vDataFrame` generates SQL code.
+ Let's use the ``current_relation`` method to see how our data preparation so far on the :py:mod:`~verticapy.vDataFrame` generates SQL code.

.. ipython:: python
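The generated SQL can be inspected at any point; a one-line sketch (the vDataFrame name is hypothetical):

.. code-block:: python

    # Every transformation so far is compiled into a single SQL relation.
    print(pokemon.current_relation())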
6 changes: 3 additions & 3 deletions docs/source/examples_learn_titanic.rst
@@ -217,7 +217,7 @@ The "sibsp" column represents the number of siblings for each passenger, while t
titanic["family_size"] = titanic["parch"] + titanic["sibsp"] + 1
- Let's move on to outliers. We have several tools for locating outliers (:py:func:`~verticapy.machine_learning.vertica.LocalOutlierFactor`, :py:func:`~verticapy.machine_learning.vertica.DBSCAN`, ``k-means``...), but we'll just use winsorization in this example. Again, "fare" has many outliers, so we'll start there.
+ Let's move on to outliers. We have several tools for locating outliers (:py:mod:`~verticapy.machine_learning.vertica.LocalOutlierFactor`, :py:mod:`~verticapy.machine_learning.vertica.DBSCAN`, ``k-means``...), but we'll just use winsorization in this example. Again, "fare" has many outliers, so we'll start there.

.. code-block:: python
@@ -302,7 +302,7 @@ Survival correlates strongly with whether or not a passenger has a lifeboat (the
- Passengers with a lifeboat
- Passengers without a lifeboat

- Before we move on: we did a lot of work to clean up this data, but we haven't saved anything to our Vertica database! Let's look at the modifications we've made to the :py:mod:`vDataFrame`.
+ Before we move on: we did a lot of work to clean up this data, but we haven't saved anything to our Vertica database! Let's look at the modifications we've made to the :py:mod:`~verticapy.vDataFrame`.

.. ipython:: python
@@ -322,7 +322,7 @@ VerticaPy dynamically generates SQL code whenever you make modifications to your
vp.set_option("sql_on", False)
print(titanic.info())
- Let's move on to modeling our data. Save the :py:mod:`vDataFrame` to your Vertica database.
+ Let's move on to modeling our data. Save the :py:mod:`~verticapy.vDataFrame` to your Vertica database.

.. ipython:: python
:okwarning:
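The SQL tracing mentioned in this hunk can be toggled around any block of operations; a short sketch built from the lines shown above:

.. code-block:: python

    import verticapy as vp

    # Print the SQL sent to Vertica for each subsequent operation...
    vp.set_option("sql_on", True)
    titanic["family_size"] = titanic["parch"] + titanic["sibsp"] + 1
    # ...then turn tracing back off.
    vp.set_option("sql_on", False)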
4 changes: 2 additions & 2 deletions docs/source/examples_understand_africa_education.rst
@@ -260,7 +260,7 @@ Eight seems to be a suitable number of clusters. Let's compute a ``k-means`` mod
model = KMeans(n_cluster = 8)
model.fit(africa, X = ["lon", "lat"])
- We can add the prediction to the :py:mod:`vDataFrame` and draw the scatter map.
+ We can add the prediction to the :py:mod:`~verticapy.vDataFrame` and draw the scatter map.

.. code-block:: python
@@ -500,7 +500,7 @@ Let's look at the feature importance for each model.

Feature importance between the math score and the reading score is almost identical.

- We can add these predictions to the main :py:mod:`vDataFrame`.
+ We can add these predictions to the main :py:mod:`~verticapy.vDataFrame`.

.. code-block:: python
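A sketch of adding the cluster labels after the ``k-means`` fit shown in the first hunk; the ``predict`` signature is an assumption:

.. code-block:: python

    # Write each point's cluster assignment back into the vDataFrame;
    # the scatter map can then be colored by this new column.
    model.predict(africa, X = ["lon", "lat"], name = "cluster")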
10 changes: 5 additions & 5 deletions docs/source/examples_understand_covid19.rst
@@ -283,14 +283,14 @@ Because of the upward monotonic trend, we can also look at the correlation betwe
covid19["elapsed_days"] = covid19["date"] - fun.min(covid19["date"])._over(by = [covid19["state"]])
- We can generate the SQL code of the :py:mod:`vDataFrame`
- to see what happens behind the scenes when we modify our data from within the :py:mod:`vDataFrame`.
+ We can generate the SQL code of the :py:mod:`~verticapy.vDataFrame`
+ to see what happens behind the scenes when we modify our data from within the :py:mod:`~verticapy.vDataFrame`.

.. ipython:: python
print(covid19.current_relation())
- The :py:mod:`vDataFrame` memorizes all of our operations on the data to dynamically generate the correct SQL statement and passes computation and aggregation to Vertica.
+ The :py:mod:`~verticapy.vDataFrame` memorizes all of our operations on the data to dynamically generate the correct SQL statement and passes computation and aggregation to Vertica.

Let's see the correlation between the number of deaths and the other variables.

@@ -307,7 +307,7 @@ Let's see the correlation between the number of deaths and the other variables.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_covid19_table_plot_corr_5.html

- We can see clearly a high correlation for some variables. We can use them to compute a ``SARIMAX`` model, but we'll stick to a :py:func:`~verticapy.machine_learning.vertica.VAR` model for this study.
+ We can clearly see a high correlation for some variables. We could use them to compute a ``SARIMAX`` model, but we'll stick to a :py:mod:`~verticapy.machine_learning.vertica.VAR` model for this study.

Let's compute the total number of deaths and cases to create our VAR model.

@@ -335,7 +335,7 @@ Let's compute the total number of deaths and cases to create our VAR model.
Machine Learning
-----------------

- Let's create a :py:func:`~verticapy.machine_learning.vertica.VAR` model to predict the number of COVID-19 deaths and cases in the USA.
+ Let's create a :py:mod:`~verticapy.machine_learning.vertica.VAR` model to predict the number of COVID-19 deaths and cases in the USA.

.. code-block:: python
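The ``elapsed_days`` feature in the first hunk uses an analytic (window) function; here is the pattern in isolation, assuming the conventional ``fun`` alias for VerticaPy's SQL functions module:

.. code-block:: python

    import verticapy.sql.functions as fun

    # Days elapsed since each state's first reported date, i.e.
    # date - MIN(date) OVER (PARTITION BY state), computed in Vertica.
    covid19["elapsed_days"] = covid19["date"] - fun.min(covid19["date"])._over(by = [covid19["state"]])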