corrections 2
oualib committed Oct 23, 2024

1 parent 512900d commit d593c2a
Showing 42 changed files with 151 additions and 185 deletions.
3 changes: 0 additions & 3 deletions docs/source/examples.rst
@@ -1,12 +1,9 @@
.. _examples:


============
Examples
============



.. grid:: 1 1 2 2

.. grid-item::
4 changes: 2 additions & 2 deletions docs/source/examples_business_africa_education.rst
@@ -260,7 +260,7 @@ Eight seems to be a suitable number of clusters. Let's compute a ``k-means`` mod
model = KMeans(n_cluster = 8)
model.fit(africa, X = ["lon", "lat"])
We can add the prediction to the ``vDataFrame`` and draw the scatter map.
We can add the prediction to the :py:mod:`vDataFrame` and draw the scatter map.
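A minimal sketch of what that step can look like, assuming the ``model`` and ``africa`` objects defined above (the ``"cluster"`` column name is our own choice, not the example's actual code):

.. code-block:: python

    # Write the cluster assignment into a new column of the vDataFrame,
    # then draw a geographic scatter plot colored by cluster.
    model.predict(africa, X = ["lon", "lat"], name = "cluster")
    africa.scatter(["lon", "lat"], by = "cluster")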


.. code-block:: python
@@ -501,7 +501,7 @@ Let's look at the feature importance for each model.

Feature importance for the math score and the reading score is almost identical.

We can add these predictions to the main ``vDataFrame``.
We can add these predictions to the main :py:mod:`vDataFrame`.
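As a rough sketch of that step (the model and column names below, ``model_math``, ``model_reading``, ``pred_math``, and ``pred_reading``, are placeholders, not the example's actual identifiers):

.. code-block:: python

    # Compare the two feature-importance charts, then write each model's
    # prediction back into the main vDataFrame as a new column.
    model_math.features_importance()
    model_reading.features_importance()
    model_math.predict(africa, name = "pred_math")
    model_reading.predict(africa, name = "pred_reading")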

.. code-block:: python
18 changes: 5 additions & 13 deletions docs/source/examples_business_battery.rst
@@ -20,11 +20,7 @@ Dataset
++++++++

In this example of **predictive maintenance**, we propose a data-driven method
to estimate the health of a battery using the
`Li-ion battery dataset <https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/>`_
released by NASA.


to estimate the health of a battery using the `Li-ion battery dataset <https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/>`_ released by NASA.

This dataset includes information on Li-ion batteries over several charge
and discharge cycles at room temperature. Charging was at a constant current
@@ -87,8 +83,7 @@ Let us now ingest the data.
Understanding the Data
-----------------------

Let's examine our data. Here, we use `vDataFrame.head()`
to retrieve the first five rows of the dataset.
Let's examine our data. Here, we use :py:func:`~verticapy.vDataFrame.head` to retrieve the first five rows of the dataset.

.. ipython:: python
:suppress:
@@ -103,7 +98,7 @@ to retrieve the first five rows of the dataset.
:file: /project/data/VerticaPy/docs/figures/examples_battery_table_head.html


Let's perform a few aggregations with `vDataFrame.describe()` to get a high-level overview of the dataset.
Let's perform a few aggregations with :py:func:`~verticapy.vDataFrame.describe` to get a high-level overview of the dataset.
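A minimal sketch of this kind of exploration (``battery`` is a placeholder name for the ingested vDataFrame):

.. code-block:: python

    # Look at the first rows, then compute summary statistics
    # for the numerical columns.
    battery.head(5)
    battery.describe(method = "numerical")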


.. code-block:: python
@@ -567,12 +562,9 @@ and the time needed to reach minimum voltage and maximum temperature.
Machine Learning
-----------------

AutoML tests several models and returns input scores for each. We can use this to find the best model for our dataset.

AutoML tests several models and returns input
scores for each. We can use this to find the best model for our dataset.

.. note:: We are only using the three algorithms, but you can change the `estiamtor` parameter to try all the 'native' algorithms.
``estiamtor = 'native' ``
.. note:: We are only using the three algorithms, but you can change the `estimator` parameter to try all the 'native' algorithms: ``estimator = 'native'``.
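A hedged sketch of such a run (the vDataFrame name, predictor list, and response column below are illustrative assumptions):

.. code-block:: python

    from verticapy.machine_learning.vertica.automl import AutoML

    # Let AutoML benchmark the Vertica-native algorithms and keep the best one.
    model = AutoML(estimator = "native")
    model.fit(battery, X = predictors, y = "failure")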

.. code-block:: python
12 changes: 6 additions & 6 deletions docs/source/examples_business_booking.rst
@@ -77,7 +77,7 @@ Data Exploration and Preparation

Sessionization is the process of grouping a user's clicks over a period of time. We usually consider that the user session ends after 30 minutes of inactivity (``date_time - lag(date_time) > 30 minutes``). For these kinds of use cases, aggregating sessions with meaningful statistics is the key to making accurate predictions.

We start by using the ``sessionize`` method to create the variable 'session_id'. We can then use this variable to aggregate the data.
We start by using the :py:func:`~verticapy.vDataFrame.sessionize` method to create the variable 'session_id'. We can then use this variable to aggregate the data.
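A minimal sketch of that call, assuming the clickstream vDataFrame is named ``expedia`` and has ``date_time`` and ``user_id`` columns (these names are assumptions):

.. code-block:: python

    # Group clicks into sessions: a new session starts after
    # 30 minutes of inactivity for the same user.
    expedia.sessionize(
        ts = "date_time",
        by = ["user_id"],
        session_threshold = "30 minutes",
        name = "session_id",
    )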

.. code-block:: python
@@ -234,7 +234,7 @@ We can see huge links between some of the variables ('mode_hotel_cluster_count'
Machine Learning
-----------------

Let's create our ``LogisticRegression`` model.
Let's create our :py:func:`~verticapy.machine_learning.vertica.LogisticRegression` model.

.. ipython:: python
@@ -279,7 +279,7 @@ It looks like there are two main predictors: 'mode_hotel_cluster_count' and 'tri
- look for a shorter trip duration.
- not click as much (spend more time at the same web page).

Let's add our prediction to the ``vDataFrame``.
Let's add our prediction to the :py:mod:`vDataFrame`.

.. code-block:: python
@@ -304,7 +304,7 @@ Let's add our prediction to the ``vDataFrame``.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_expedia_predict_proba_1.html

While analyzing the following boxplot (prediction partitioned by 'is_booking'), we can notice that the ``cutoff`` is around 0.22 because most of the positive predictions have a probability between 0.23 and 0.5. Most of the negative predictions are between 0.05 and 0.2.
While analyzing the following boxplot (prediction partitioned by 'is_booking'), we can notice that the `cutoff` is around 0.22 because most of the positive predictions have a probability between 0.23 and 0.5. Most of the negative predictions are between 0.05 and 0.2.

.. code-block:: python
@@ -320,13 +320,13 @@ While analyzing the following boxplot (prediction partitioned by 'is_booking'),
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_expedia_predict_boxplot_1.html

Let's confirm our hypothesis by computing the best ``cutoff``.
Let's confirm our hypothesis by computing the best `cutoff`.

.. ipython:: python
model_logit.score(metric = "best_cutoff")
Let's look at the efficiency of our model with a cutoff of ``0.22``.
Let's look at the efficiency of our model with a cutoff of 0.22.
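For instance, a hedged sketch of scoring at that cutoff (assuming the fitted ``model_logit`` from above, and that ``score`` accepts a ``cutoff`` argument for classification metrics, as in recent VerticaPy versions):

.. code-block:: python

    # Evaluate the classifier when probabilities above 0.22 are labeled positive.
    model_logit.score(metric = "accuracy", cutoff = 0.22)
    model_logit.score(metric = "f1", cutoff = 0.22)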

.. code-block:: python
2 changes: 1 addition & 1 deletion docs/source/examples_business_churn.rst
@@ -203,7 +203,7 @@ ________
Machine Learning
-----------------

``LogisticRegression`` is a very powerful algorithm and we can use it to detect churns. Let's split our ``vDataFrame`` into training and testing set to evaluate our model.
:py:func:`~verticapy.machine_learning.vertica.LogisticRegression` is a very powerful algorithm and we can use it to detect churns. Let's split our :py:mod:`vDataFrame` into training and testing set to evaluate our model.
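A sketch of that workflow, with placeholder names for the vDataFrame (``churn``), the predictor list, and the response column:

.. code-block:: python

    from verticapy.machine_learning.vertica import LogisticRegression

    # Split the data, train the model, and display a full evaluation report.
    train, test = churn.train_test_split(test_size = 0.2)
    model_churn = LogisticRegression()
    model_churn.fit(train, X = predictors, y = "churned", test_relation = test)
    model_churn.report()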

.. ipython:: python
10 changes: 5 additions & 5 deletions docs/source/examples_business_credit_card_fraud.rst
@@ -328,7 +328,7 @@ We will split the dataset into a train (day 1) and a test (day 2).

A supervised approach would make this pretty easy since it would just be a binary classification problem. We can use different algorithms to optimize the prediction. Our dataset is imbalanced, so the AUC might be a good metric to evaluate the model. The PRC AUC would also be relevant.

``LogisticRegression`` works well with monotonic relationships. Since we have a lot of independent features that correlate with the response, it should be a good first model to use.
:py:func:`~verticapy.machine_learning.vertica.LogisticRegression` works well with monotonic relationships. Since we have a lot of independent features that correlate with the response, it should be a good first model to use.

.. code-block:: python
@@ -398,7 +398,7 @@ Due to the complexity of the computations, anomalies are difficult to detect in

- **Machine Learning:** We need to use easily-deployable algorithms to perform real-time fraud detection. Isolation forests and ``k-means`` can be easily deployed and they work well for detecting anomalies.
- **Rules & Thresholds:** The z-score can be an efficient solution for detecting global outliers.
- **Decomposition:** Robust ``PCA`` is another technique for detecting outliers.
- **Decomposition:** Robust :py:func:`~verticapy.machine_learning.vertica.PCA` is another technique for detecting outliers.

Before using these techniques, let's draw some scatter plots to get a better idea of what kind of anomalies we can expect.

@@ -453,7 +453,7 @@ For the rest of this example, we'll investigate labels and how they can help us

We begin by examining ``k-means`` clustering, which partitions the data into k clusters.

We can use an elbow curve to find a suitable number of clusters. We can then add more clusters then the amount suggested by the ``elbow`` curve to create clusters mainly composed of anomalies. Clusters with relatively fewer elements can then be investigated by an expert to label the anomalies.
We can use an elbow curve to find a suitable number of clusters. We can then add more clusters than the number suggested by the :py:func:`~verticapy.machine_learning.model_selection.elbow` curve to create clusters mainly composed of anomalies. Clusters with relatively fewer elements can then be investigated by an expert to label the anomalies.

From there, we perform the following procedure:

@@ -535,7 +535,7 @@ Notice that clusters with fewer elements tend to contain much more fraudulent e

**Outliers of the distribution**

Let's use the ``z-score`` to detect global outliers of the distribution.
Let's use the ``Z-score`` to detect global outliers of the distribution.
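A minimal sketch of a Z-score based flag (the vDataFrame name, column list, and threshold below are assumptions for illustration):

.. code-block:: python

    # Flag rows whose Z-score exceeds 3 on the selected features.
    data.outliers(
        columns = ["Amount", "V1", "V2"],
        name = "global_outlier",
        threshold = 3.0,
    )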

.. code-block:: python
@@ -635,7 +635,7 @@ We can catch outliers with a neighbors score. Again, the main problem with these

**Other Techniques**

Other scalable techniques that can solve this problem are robust ``PCA`` and isolation forest.
Other scalable techniques that can solve this problem are robust :py:func:`~verticapy.machine_learning.vertica.PCA` and isolation forest.

Conclusion
-----------
6 changes: 3 additions & 3 deletions docs/source/examples_business_football.rst
Original file line number Diff line number Diff line change
@@ -903,7 +903,7 @@ Let's export the result to our Vertica database.
Team Rankings with k-means
---------------------------

To compute a ``k-means`` model, we need to find a value for 'k'. Let's draw an ``elbow`` curve to find a suitable number of clusters.
To compute a ``k-means`` model, we need to find a value for 'k'. Let's draw an :py:func:`~verticapy.machine_learning.model_selection.elbow` curve to find a suitable number of clusters.
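A hedged sketch of that step, assuming the features live in the ``football_clustering`` relation used below and that ``predictors`` is the column list defined earlier in the example:

.. code-block:: python

    from verticapy.machine_learning.model_selection import elbow

    # Plot the within-cluster dispersion for a range of candidate k values.
    elbow("football_clustering", X = predictors, n_cluster = (1, 15))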

.. code-block:: python
@@ -975,7 +975,7 @@ To compute a ``k-means`` model, we need to find a value for 'k'. Let's draw an `
model_kmeans.fit("football_clustering", predictors)
model_kmeans.clusters_
Let's add the prediction to the ``vDataFrame``.
Let's add the prediction to the :py:mod:`vDataFrame`.

.. code-block:: python
@@ -1974,7 +1974,7 @@ Looking at the importance of each feature, it seems like direct confrontations a
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_football_features_importance.html

Let's add the predictions to the ``vDataFrame``.
Let's add the predictions to the :py:mod:`vDataFrame`.

Draws are pretty rare, so we'll only consider them if a tie was very likely to occur.

2 changes: 1 addition & 1 deletion docs/source/examples_business_insurance.rst
@@ -38,7 +38,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
Let's create a new schema and assign the data to a ``vDataFrame`` object.
Let's create a new schema and assign the data to a :py:mod:`vDataFrame` object.

.. code-block:: ipython
10 changes: 5 additions & 5 deletions docs/source/examples_business_movies.rst
@@ -43,7 +43,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
Let's create a new schema and assign the data to a ``vDataFrame`` object.
Let's create a new schema and assign the data to a :py:mod:`vDataFrame` object.

.. code-block:: ipython
@@ -349,7 +349,7 @@ Let's join our notoriety metrics for actors and directors with the main dataset.
],
)
As we did many operation, it can be nice to save the ``vDataFrame`` as a table in the Vertica database.
As we have performed many operations, it can be useful to save the :py:mod:`vDataFrame` as a table in the Vertica database.

.. code-block:: python
@@ -754,7 +754,7 @@ Let's create a model to evaluate an unbiased score for each different movie.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_movies_filmtv_complete_model_report.html

The model is good. Let's add it in our ``vDataFrame``.
The model is good. Let's add it in our :py:mod:`vDataFrame`.

.. code-block:: python
@@ -871,7 +871,7 @@ Since ``k-means`` clustering is sensitive to unnormalized data, let's normalize
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_movies_filmtv_normalize_minmax.html

Let's compute the ``elbow`` curve to find a suitable number of clusters.
Let's compute the :py:func:`~verticapy.machine_learning.model_selection.elbow` curve to find a suitable number of clusters.

.. ipython:: python
@@ -926,7 +926,7 @@ By looking at the elbow curve, we can choose 15 clusters. Let's create a ``k-mea
model_kmeans.fit(filmtv_movies_complete, predictors)
model_kmeans.clusters_
Let's add the clusters in the ``vDataFrame``.
Let's add the clusters in the :py:mod:`vDataFrame`.


.. code-block:: python
4 changes: 2 additions & 2 deletions docs/source/examples_business_smart_meters.rst
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@ You can skip the below cell if you already have an established connection.
vp.connect("VerticaDSN")
Create the ``vDataFrames`` of the datasets:
Create the :py:mod:`vDataFrame` of the datasets:

.. code-block:: python
@@ -217,7 +217,7 @@ The dataset 'sm_meters' is pretty important. In particular, the type of residenc
:width: 100%
:align: center

Based on the scatter plot, five seems like the optimal number of clusters. Let's verify this hypothesis using an ``elbow`` curve.
Based on the scatter plot, five seems like the optimal number of clusters. Let's verify this hypothesis using an :py:func:`~verticapy.machine_learning.model_selection.elbow` curve.

.. code-block:: python
4 changes: 2 additions & 2 deletions docs/source/examples_business_spam.rst
@@ -106,7 +106,7 @@ Let's compute some statistics using the length of the message.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spam_table_describe.html

**Notice:** spam tends to be longer than a normal message. First, let's create a view with just spam. Then, we'll use the ``CountVectorizer`` to create a dictionary and identify keywords.
**Notice:** spam tends to be longer than a normal message. First, let's create a view with just spam. Then, we'll use the :py:func:`~verticapy.machine_learning.vertica.CountVectorizer` to create a dictionary and identify keywords.
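A rough sketch of that idea (the view and column names are placeholders, and the exact :py:func:`~verticapy.machine_learning.vertica.CountVectorizer` calls may differ slightly between VerticaPy versions):

.. code-block:: python

    from verticapy.machine_learning.vertica import CountVectorizer

    # Build a dictionary of tokens from the spam-only view and inspect it.
    vocab_model = CountVectorizer()
    vocab_model.fit("spam_only_view", X = ["content"])
    vocab_model.transform()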

.. code-block:: python
@@ -138,7 +138,7 @@ Let's compute some statistics using the length of the message.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spam_table_clean_2.html

Let's add the most occurent words in our ``vDataFrame`` and compute the correlation vector.
Let's add the most frequent words to our :py:mod:`vDataFrame` and compute the correlation vector.

.. code-block:: python
14 changes: 7 additions & 7 deletions docs/source/examples_business_spotify.rst
Original file line number Diff line number Diff line change
@@ -88,7 +88,7 @@ Create a new schema, "spotify".
Data Loading
-------------

Load the datasets into the ``vDataFrame`` with ``read_csv()`` and then view them with ``display()``.
Load the datasets into the :py:mod:`vDataFrame` with :py:func:`~verticapy.read_csv` and then view them with :py:func:`~verticapy.vDataFrame.head`.
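A minimal sketch of that loading step (the file path is an illustrative assumption):

.. code-block:: python

    import verticapy as vp

    # Ingest the CSV file into the "spotify" schema and take a quick look.
    tracks = vp.read_csv("tracks.csv", schema = "spotify")
    tracks.head(10)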

.. code-block::
@@ -521,14 +521,14 @@ Define a list of predictors and the response, and then save the normalized versi
Machine Learning
-----------------

We can use ``AutoML`` to easily get a well-performing model.
We can use :py:func:`~verticapy.machine_learning.vertica.automl.AutoML` to easily get a well-performing model.

.. ipython:: python
# define a random seed so models tested by AutoML produce consistent results
vp.set_option("random_state", 2)
``AutoML`` automatically tests several machine learning models and picks the best performing one.
:py:func:`~verticapy.machine_learning.vertica.automl.AutoML` automatically tests several machine learning models and picks the best performing one.

.. ipython:: python
:okwarning:
@@ -569,7 +569,7 @@ Train the model.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spotify_automl_plot.html

Extract the best model according to ``AutoML``. From here, we can look at the model type and its hyperparameters.
Extract the best model according to :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`. From here, we can look at the model type and its hyperparameters.

.. ipython:: python
@@ -581,7 +581,7 @@ Extract the best model according to ``AutoML``. From here, we can look at the mo
print(bm_type)
print(hyperparams)
Thanks to ``AutoML``, we know best model type and its hyperparameters. Let's create a new model with this information in mind.
Thanks to :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`, we know the best model type and its hyperparameters. Let's create a new model with this information in mind.

.. code-block::
@@ -797,7 +797,7 @@ Let's start by taking the averages of these numerical features for each artist.
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_spotify_artists_features.html

Grouping means clustering, so we use an ``elbow`` curve to find a suitable number of clusters.
Grouping means clustering, so we use an :py:func:`~verticapy.machine_learning.model_selection.elbow` curve to find a suitable number of clusters.

.. ipython:: python
:okwarning:
@@ -915,4 +915,4 @@ Let's see how our model groups these artists together:
Conclusion
-----------

We were able to predict the popularity Polish songs with a ``RandomForestRegressor`` model suggested by ``AutoML``. We then created a ``k-means`` model to group artists into "genres" (clusters) based on the feature-commonalities in their tracks.
We were able to predict the popularity of Polish songs with a :py:func:`~verticapy.machine_learning.vertica.RandomForestRegressor` model suggested by :py:func:`~verticapy.machine_learning.vertica.automl.AutoML`. We then created a ``k-means`` model to group artists into "genres" (clusters) based on the feature commonalities in their tracks.
6 changes: 3 additions & 3 deletions docs/source/examples_learn_commodities.rst
@@ -320,12 +320,12 @@ Moving on to the correlation matrix, we can see many events that changed drastic
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_commodities_table_corr_2.html

We can see strong correlations between most of the variables. A vector autoregression (``VAR``) model seems ideal.
We can see strong correlations between most of the variables. A vector autoregression (:py:func:`~verticapy.machine_learning.vertica.VAR`) model seems ideal.

Machine Learning
-----------------

Let's create the ``VAR`` model to predict the value of various commodities.
Let's create the :py:func:`~verticapy.machine_learning.vertica.VAR` model to predict the value of various commodities.
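A heavily hedged sketch of that model (the vDataFrame, column names, and the exact ``fit`` signature below are assumptions; check the :py:func:`~verticapy.machine_learning.vertica.VAR` reference for your VerticaPy version):

.. code-block:: python

    from verticapy.machine_learning.vertica import VAR

    # Fit a vector autoregression of order 3 on the commodity price columns.
    # Column and parameter names here are placeholders.
    model_var = VAR(p = 3)
    model_var.fit(commodities, ts = "date", y = ["Gold", "Oil", "Silver"])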

.. code-block:: python
@@ -441,7 +441,7 @@ Our model is excellent. Let's predict the values these commodities in the near f
.. raw:: html
:file: /project/data/VerticaPy/docs/figures/examples_commodities_table_pred_plot_4.html

The model performs well but may be somewhat unstable. To improve it, we could apply data preparation techniques, such as seasonal decomposition, before building the ``VAR`` model.
The model performs well but may be somewhat unstable. To improve it, we could apply data preparation techniques, such as seasonal decomposition, before building the :py:func:`~verticapy.machine_learning.vertica.VAR` model.

Conclusion
-----------