.. _user_guide.full_stack.to_json:

=========================
Example: XGBoost.to_json
=========================

Connect to Vertica
--------------------

For a demonstration of how to create a new connection to Vertica,
see :ref:`connection`. In this example, we will use an
existing connection named 'VerticaDSN'.

.. code-block:: python

    import verticapy as vp

    vp.connect("VerticaDSN")

Create a Schema (Optional)
---------------------------

Schemas allow you to organize database objects into collections,
similar to namespaces. If you create a database object
without specifying a schema, Vertica uses the 'public'
schema. For example, to reference the 'example_table' table in the
'example_schema' schema, you would use: 'example_schema.example_table'.

To keep things organized, this example creates the 'xgb_to_json'
schema and drops it (and its associated tables, views, etc.) at the end:

.. ipython:: python
    :suppress:

    import verticapy as vp

.. ipython:: python

    vp.drop("xgb_to_json", method = "schema")
    vp.create_schema("xgb_to_json")

Load Data
----------

VerticaPy lets you load many well-known datasets, such as Iris, Titanic, and Amazon.
For a full list, see :ref:`datasets`.

.. ipython:: python

    from verticapy.datasets import load_titanic

    vdf = load_titanic(
        name = "titanic",
        schema = "xgb_to_json",
    )

You can also load your own data. To ingest data from a CSV file,
use the :py:func:`verticapy.read_csv` function.
Create a vDataFrame
--------------------

vDataFrames allow you to prepare and explore your data without
modifying its representation in your Vertica database. Any changes
you make are applied to the vDataFrame as modifications to the SQL
query for the underlying table.

To create a vDataFrame from a table in your Vertica database,
specify its schema and table name with standard SQL syntax. For
example, to create a vDataFrame from the 'titanic' table in the
'xgb_to_json' schema:

.. ipython:: python

    vdf = vp.vDataFrame("xgb_to_json.titanic")

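To illustrate the idea of query-based modifications, here is a plain-Python sketch (not VerticaPy's actual internals) of an object that composes a SQL query instead of touching the table; the ``LazyRelation`` class and its methods are hypothetical names invented for this illustration:

```python
class LazyRelation:
    """Minimal sketch of a lazily-evaluated relation: operations
    rewrite the generated SQL rather than the stored table."""

    def __init__(self, relation):
        self.relation = relation
        self.columns = "*"

    def select(self, cols):
        # Record a column selection as part of the future query.
        self.columns = ", ".join(cols)
        return self

    def current_query(self):
        # The table itself is never modified; only this query changes.
        return f"SELECT {self.columns} FROM {self.relation}"

rel = LazyRelation("xgb_to_json.titanic").select(["age", "fare"])
print(rel.current_query())
# -> SELECT age, fare FROM xgb_to_json.titanic
```

The real vDataFrame works on the same principle at a much larger scale: filters, encodings, and drops all become part of the generated SQL.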
Create an XGB model
-------------------

Create a :py:class:`verticapy.machine_learning.vertica.ensemble.XGBClassifier` model.

Unlike a vDataFrame object, which simply queries the table it
was created with, the VerticaPy
:py:class:`verticapy.machine_learning.vertica.ensemble.XGBClassifier`
object creates and then references a model in Vertica, so it must be
stored in a schema like any other database object.

This example creates the 'my_model'
:py:class:`verticapy.machine_learning.vertica.ensemble.XGBClassifier`
model in the 'xgb_to_json' schema:

.. ipython:: python

    from verticapy.machine_learning.vertica.ensemble import XGBClassifier

    model = XGBClassifier(
        "xgb_to_json.my_model",
        max_ntree = 4,
        max_depth = 3,
    )

Prepare the Data
-----------------

While Vertica XGBoost supports columns of type VARCHAR,
Python XGBoost does not, so you must encode the categorical
columns you want to use. You must also drop or impute missing values.

This example keeps only the 'age', 'fare', 'sex', 'embarked', and
'survived' columns of the vDataFrame, drops rows with missing
values, and then encodes the 'sex' and 'embarked' columns. These
changes are applied to the vDataFrame's query and do not affect the
underlying 'xgb_to_json.titanic' table stored in Vertica:

.. ipython:: python

    vdf = vdf[["age", "fare", "sex", "embarked", "survived"]];
    vdf.dropna();
    vdf["sex"].label_encode();
    vdf["embarked"].label_encode();

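Label encoding maps each distinct category to an integer code. The following plain-Python sketch shows the idea; the sorted-order assignment here is an illustrative assumption, not necessarily the ordering VerticaPy uses internally:

```python
def label_encode(values):
    # Map each distinct category to an integer code.
    # Sorted order is an illustrative choice, not necessarily
    # VerticaPy's internal ordering.
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

encoded = label_encode(["male", "female", "female", "male"])
# -> [1, 0, 0, 1]  ('female' -> 0, 'male' -> 1 under sorted order)
```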
.. ipython:: python
    :suppress:
    :okwarning:

    res = vdf
    html_file = open("/project/data/VerticaPy/docs/figures/ug_fs_to_json_vdf.html", "w")
    html_file.write(res._repr_html_())
    html_file.close()

.. raw:: html
    :file: /project/data/VerticaPy/docs/figures/ug_fs_to_json_vdf.html

Split your data into training and testing sets:

.. ipython:: python

    train, test = vdf.train_test_split(0.05);

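Conceptually, the split above works like the following plain-Python sketch. VerticaPy performs the split in-database, so this standalone ``train_test_split`` function is only an illustration of the idea, not VerticaPy's implementation:

```python
import random

def train_test_split(rows, test_size=0.05, seed=42):
    # Shuffle a copy of the rows, then cut at the test fraction.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = train_test_split(list(range(100)), test_size=0.05)
# With 100 rows and test_size=0.05: 95 training rows, 5 testing rows.
```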
Train the Model
----------------

Define the predictor and the response columns:

.. ipython:: python

    relation = train;
    X = ["age", "fare", "sex", "embarked"]
    y = "survived"

Train the model with ``fit()``:

.. ipython:: python
    :okwarning:

    model.fit(relation, X, y)

Evaluate the Model
--------------------

Evaluate the model with ``report()``:

.. code-block:: ipython

    model.report()

.. ipython:: python
    :suppress:
    :okwarning:

    res = model.report()
    html_file = open("/project/data/VerticaPy/docs/figures/ug_fs_to_json_report.html", "w")
    html_file.write(res._repr_html_())
    html_file.close()

.. raw:: html
    :file: /project/data/VerticaPy/docs/figures/ug_fs_to_json_report.html

Use ``to_json()`` to export the model to a JSON file. If you omit a
filename, VerticaPy prints the model instead:

.. ipython:: python

    model.to_json()

To export and save the model as a JSON file, specify a filename:

.. ipython:: python

    model.to_json("exported_xgb_model.json");

Unlike Python XGBoost, Vertica does not store some information, such
as 'sum_hessian' or 'loss_changes', and the model exported by
``to_json()`` replaces this information with lists of zeros.

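For example, after parsing the exported file with Python's ``json`` module, those fields contain only zeros. The fragment below is a hand-written, hypothetical illustration of the shape; the real exported file follows the full XGBoost JSON layout:

```python
import json

# Hypothetical fragment for illustration only: fields Vertica does not
# store (e.g. 'sum_hessian', 'loss_changes') come back zero-filled.
exported = '{"sum_hessian": [0.0, 0.0, 0.0], "loss_changes": [0.0, 0.0]}'
stats = json.loads(exported)

assert all(v == 0.0 for v in stats["sum_hessian"])
assert all(v == 0.0 for v in stats["loss_changes"])
```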
Make Predictions with an Exported Model
----------------------------------------

This exported model can be used with the Python XGBoost API right away,
and exported models make identical predictions in Vertica and Python:

.. ipython:: python

    import pytest
    import xgboost as xgb

    model_python = xgb.XGBClassifier();
    model_python.load_model("exported_xgb_model.json");
    # Convert the test set to numpy format
    X_test = test[["age", "fare", "sex", "embarked"]].to_numpy();
    y_test_vertica = model.to_python(return_proba = True)(X_test);
    y_test_python = model_python.predict_proba(X_test);
    # Compute the mean squared difference between the two sets of probabilities
    result = (y_test_vertica - y_test_python) ** 2;
    result = result.sum() / len(result);
    assert result == pytest.approx(0.0, abs = 1.0E-14)

For multiclass classifiers, the probabilities returned by VerticaPy
and the exported model may differ slightly because of normalization:
Vertica uses multinomial logistic regression, while Python XGBoost
uses softmax. This difference does not affect the model's final
predictions. Categorical predictors must be encoded before training.
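The normalization difference can be sketched in plain Python: softmax and a per-class sigmoid followed by renormalization yield slightly different probabilities but the same argmax. This illustrates the general effect only; it is not Vertica's exact multinomial formula:

```python
import math

def softmax(scores):
    # Python XGBoost-style normalization for multiclass scores.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_renormalized(scores):
    # Per-class sigmoid, then renormalize so probabilities sum to 1
    # (a rough stand-in for a multinomial-logistic-style normalization).
    probs = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(probs)
    return [p / total for p in probs]

scores = [2.0, 0.5, -1.0]
p_soft = softmax(scores)
p_sig = sigmoid_renormalized(scores)

# The probabilities differ slightly...
assert abs(p_soft[0] - p_sig[0]) > 1e-3
# ...but the predicted class (the argmax) is identical.
assert p_soft.index(max(p_soft)) == p_sig.index(max(p_sig))
```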

Clean the Example Environment
------------------------------

Drop the 'xgb_to_json' schema, using CASCADE to drop any
database objects stored inside (the 'titanic' table, the
:py:class:`verticapy.machine_learning.vertica.ensemble.XGBClassifier`
model, etc.), then delete the 'exported_xgb_model.json' file:

.. ipython:: python

    import os

    os.remove("exported_xgb_model.json")
    vp.drop("xgb_to_json", method = "schema")

Conclusion
-----------

VerticaPy lets you create, train, evaluate, and export
Vertica machine learning models. There are some notable
nuances when importing a Vertica XGBoost model into
Python XGBoost, but these do not affect the accuracy of the model
or its predictions:

- Some information computed during the training phase may not
  be stored (for example, 'sum_hessian' and 'loss_changes').
- The exact probabilities of multiclass classifiers in a
  Vertica model may differ from those in Python, but both
  will make the same predictions.
- Python XGBoost does not support categorical predictors,
  so you must encode them before training the model in VerticaPy.