This project demonstrates the use of GridGain as a high-performance prediction cache for a product recommendation model built on Google Analytics data.
- BigQuery Integration: Create datasets, tables, and recommendation models in BigQuery.
- Data Export: Export data from BigQuery to Google Cloud Storage (GCS) in Parquet format.
- AWS S3 Integration: Create S3 buckets and transfer data from GCS to S3.
- GridGain-Based Prediction Caching: Create tables and push data to GridGain for fast, in-memory access.
- Recommendation Engine:
  - Generate recommendations using BigQuery ML.
  - Retrieve cached recommendations from GridGain.
- RESTful API: All functionality is exposed through a RESTful API.
- Flexible Configuration: The project, dataset, and access credentials are all passed as parameters to the API.
- FastAPI Application (`api.py`): Sets up the FastAPI application and defines all the API endpoints for BigQuery, AWS, and GridGain operations.
- GCP Helper (`gcp_helper.py`): Contains functions for interacting with Google Cloud Platform services, including BigQuery and Google Cloud Storage.
- GridGain Helper (`gg_helper.py`): Handles operations related to GridGain, including table creation, data pushing, and cached recommendation retrieval.
- Python 3.11.7
  - You can use `pyenv` to manage multiple Python versions (optional):
    - Install `pyenv`: `brew install pyenv` (or use your system's package manager)
    - Create and activate the environment: `pyenv virtualenv 3.11.7 ga-demo-env`, then `source $HOME/.pyenv/versions/ga-demo-env/bin/activate`
  - Alternatively, ensure Python 3.11.7 is installed directly.
- GCP
  - GCP CLI: You should have the `gcloud` CLI installed and configured with your GCP Application Default Credentials.
  - GCP Project: Create a project in GCP with the following APIs enabled:
    - Dataform API
    - Analytics Hub API
    - BigQuery API
    - BigQuery Connection API
    - BigQuery Data Policy API
    - BigQuery Migration API
    - BigQuery Reservation API
    - BigQuery Storage API
    - Cloud Dataplex API
    - Google Cloud Data Catalog API
    - Google Cloud Storage JSON API
    - Storage Insights API
  - GCP Roles: The following roles must be assigned to the user on the GCP project:
    - BigQuery Data Editor
    - BigQuery Job User
Before training the retail recommender model, you must set up a BigQuery slot reservation in your Google Cloud project:
- Enable BigQuery Reservation API
  - Navigate to: https://console.cloud.google.com/apis/library
  - Search for "BigQuery Reservation API"
  - Click "Enable" if not already enabled
- Create Slot Reservation
- Create Assignment
Cost Considerations:
- Flex slots are billed by the second
- Minimum commitment: 100 slots
- Can be deleted after model training is complete
- Consider monthly/annual commitments for production workloads
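The per-second billing and 100-slot minimum above can be turned into a rough cost estimate. The sketch below is only illustrative arithmetic: the function name `flex_slot_cost` and the rate passed to it are hypothetical, and the actual price per slot-hour must be taken from current BigQuery pricing.

```python
MIN_SLOTS = 100  # minimum commitment stated in the list above


def flex_slot_cost(slots: int, seconds: float, usd_per_slot_hour: float) -> float:
    """Estimate the cost of holding `slots` for `seconds` with per-second billing.

    `usd_per_slot_hour` is a placeholder input; look up the real rate
    for your region before relying on the result.
    """
    if slots < MIN_SLOTS:
        raise ValueError(f"at least {MIN_SLOTS} slots are required")
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    hours = seconds / 3600.0
    return slots * hours * usd_per_slot_hour


# Example: 100 slots held for a 30-minute training run at a made-up rate.
estimate = flex_slot_cost(slots=100, seconds=30 * 60, usd_per_slot_hour=0.04)
print(f"Estimated cost: ${estimate:.2f}")
```

Deleting the reservation as soon as training finishes keeps `seconds`, and therefore the cost, as small as possible.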
- Install project dependencies using pip:
pip install pygridgain s3fs pandas==2.2.2 numpy==1.26.4 google-cloud-storage google-cloud-bigquery fastapi==0.111.0 pydantic==2.7.4 uvicorn==0.30.1 pyarrow==16.1.0 requests==2.32.3
Authenticate to GCP:
The `gcloud` CLI credentials expire after some time, so you need to reauthenticate periodically.
Run `gcloud auth application-default login` to reauthenticate.
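Before starting the server, it can help to check whether an Application Default Credentials file is present at all. The sketch below is not part of this project; the helper name `find_adc_file` is hypothetical. It assumes the default Linux/macOS location `~/.config/gcloud/application_default_credentials.json` (Windows uses a different directory) and honors the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. It only checks that the file exists, not that the credentials are still valid.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the path of the credentials file if one exists, else None."""
    env = os.environ if env is None else env
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        # An explicit path takes precedence over the default location.
        path = Path(explicit)
        return path if path.is_file() else None
    home = Path.home() if home is None else home
    default = home / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None


if find_adc_file() is None:
    print("No credentials file found; run: gcloud auth application-default login")
```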
Start FastAPI Server:
cd src
uvicorn api:app --reload
- Access the Swagger UI at http://localhost:8000/docs to explore and test the API endpoints.
This application provides endpoints for managing BigQuery datasets, creating recommendation models, and interacting with GridGain and AWS S3.
An important point to note:
We do not necessarily need to load data from AWS to GCS; the GridGain cache can start empty and be populated on each call to the `/get_recommendations` API.
- Endpoint: `/bigquery/create_dataset`
- Method: POST
- Description: Creates a BigQuery dataset.
- Parameters:
{ "project_id": "ga-ignite-test", "dataset_id": "ga_dataset" }
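As an alternative to the Swagger UI, the endpoints can be called from a script. The sketch below uses only the standard library and assumes that the server runs at the default `http://localhost:8000` and that the parameters are sent as a JSON request body; the helper name `build_post` is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default uvicorn host and port


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one of the endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_post(
    "/bigquery/create_dataset",
    {"project_id": "ga-ignite-test", "dataset_id": "ga_dataset"},
)

# Sending the request requires the FastAPI server to be running:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

The same helper works for the other endpoints by changing the path and payload.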
- Endpoint: `/bigquery/create_aggregate_web_stats_table`
- Method: POST
- Description: Creates or replaces the aggregate_web_stats table.
- Parameters:
{ "project_id": "ga-ignite-test", "dataset_id": "ga_dataset" }
- Endpoint: `/bigquery/create_retail_recommender_model`
- Method: POST
- Description: Creates or replaces the retail_recommender matrix factorization model.
- Note: Requires a BigQuery slot reservation. Without one, model creation fails with the following error:
  `google.api_core.exceptions.BadRequest: 400 Training Matrix Factorization models is not available for on-demand usage. To train, please set up a reservation (flex or regular) based on instructions in BigQuery public docs.`
- Parameters:
{ "project_id": "ga-ignite-test", "dataset_id": "ga_dataset" }
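For orientation, a BigQuery ML matrix factorization model is created with a `CREATE OR REPLACE MODEL` statement. The sketch below builds such a statement as a string. It is not the query used by `gcp_helper.py`: the column names (`visitorId`, `contentId`, `session_duration`) and the `SELECT *` source query are assumptions for illustration, while the model and table names come from this README.

```python
def build_create_model_sql(
    project_id: str,
    dataset_id: str,
    user_col: str = "visitorId",           # assumed column name
    item_col: str = "contentId",           # assumed column name
    rating_col: str = "session_duration",  # assumed column name
) -> str:
    """Return a BigQuery ML statement for an implicit-feedback recommender.

    Identifiers are interpolated directly, so pass trusted values only.
    """
    model = f"`{project_id}.{dataset_id}.retail_recommender`"
    source = f"`{project_id}.{dataset_id}.aggregate_web_stats`"
    return (
        f"CREATE OR REPLACE MODEL {model}\n"
        "OPTIONS (\n"
        "  model_type = 'matrix_factorization',\n"
        "  feedback_type = 'implicit',\n"
        f"  user_col = '{user_col}',\n"
        f"  item_col = '{item_col}',\n"
        f"  rating_col = '{rating_col}'\n"
        ") AS\n"
        f"SELECT * FROM {source}"
    )


print(build_create_model_sql("ga-ignite-test", "ga_dataset"))
```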
- Endpoint: `/bigquery/generate_recommendations`
- Method: POST
- Description: Generates recommendations and stores them in the recommend_content table.
- Parameters:
{ "project_id": "ga-ignite-test", "dataset_id": "ga_dataset" }
- Endpoint: `/bigquery/create_all_recommendations_table`
- Method: POST
- Description: Creates or replaces the all_recommendations table with unique IDs.
- Parameters:
{ "project_id": "ga-ignite-test", "dataset_id": "ga_dataset" }
- Endpoint: `/gridgain/create_table`
- Method: POST
- Description: Creates or replaces the all_recommendations table in GridGain.
- Parameters:
{ "username": "<your gridgain cluster username>", "password": "<your gridgain cluster password>", "url": "<your gridgain cluster url>", "port": 10800 }
- Endpoint: `/get_recommendations`
- Method: POST
- Description: Gets a recommendation from the GridGain cache. If none is found, gets a recommendation from the BigQuery ML model and stores it in the GridGain cache.
- Parameters:
{ "gg": { "username": "<your gridgain cluster username>", "password": "<your gridgain cluster password>", "url": "<your gridgain cluster url>", "port": 10800 }, "gcp": { "project_id": "ga-ignite-test", "visitor_id": "8016003971239765913-2" } }
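The behavior described for `/get_recommendations` is a cache-aside lookup. The sketch below shows the pattern in isolation, with a plain dictionary standing in for GridGain and a callable standing in for the BigQuery ML query. All names in it are hypothetical; it is not the code in `api.py`.

```python
from typing import Callable, List, MutableMapping, Tuple


def get_recommendations(
    visitor_id: str,
    cache: MutableMapping[str, List[str]],
    predict: Callable[[str], List[str]],
) -> Tuple[List[str], str]:
    """Return (recommendations, source), where source is 'cache' or 'model'."""
    cached = cache.get(visitor_id)
    if cached is not None:
        return cached, "cache"
    # Cache miss: ask the (slow) model, then store the result for next time.
    fresh = predict(visitor_id)
    cache[visitor_id] = fresh
    return fresh, "model"


def fake_model(visitor_id: str) -> List[str]:
    # Stand-in for a BigQuery ML prediction query.
    return [f"item-for-{visitor_id}"]


store: dict = {}
first = get_recommendations("8016003971239765913-2", store, fake_model)
second = get_recommendations("8016003971239765913-2", store, fake_model)
print(first[1], second[1])
```

The first call falls through to the model and fills the cache; the second call is served from the cache, which is why the cache can start empty.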
- Endpoint: `/bigquery/get_predicted_recommendations`
- Method: POST
- Description: Gets a recommendation from the BigQuery ML model. This is the older method; it does not cache the recommendation in GridGain.
- Parameters:
{ "project_id": "ga-ignite-test", "visitor_id": "8016003971239765913-2" }
- Endpoint: `/gridgain/get_cached_recommendations`
- Method: POST
- Description: Gets a recommendation from the cache.
- Parameters:
{ "username": "<your gridgain cluster username>", "password": "<your gridgain cluster password>", "url": "<your gridgain cluster url>", "port": 10800, "visitor_id": "8016003971239765913-2" }
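A cached lookup of this kind usually comes down to a parameterized SQL query against the `all_recommendations` table. The sketch below assumes a pygridgain-style client object whose `sql` method accepts a `query_args` list, and a `visitor_id` column in the table; both are assumptions, and the function `fetch_cached` is hypothetical rather than taken from `gg_helper.py`. A small stand-in client is used so the example runs without a cluster.

```python
from typing import Any, Iterable, List, Tuple


def fetch_cached(client: Any, visitor_id: str) -> List[Tuple]:
    """Query the cache table for one visitor using a bound parameter."""
    query = "SELECT * FROM all_recommendations WHERE visitor_id = ?"  # assumed column
    return [tuple(row) for row in client.sql(query, query_args=[visitor_id])]


class FakeClient:
    """Stand-in for a connected GridGain client, for illustration only."""

    def __init__(self, rows: Iterable[Tuple[str, str]]):
        self.rows = list(rows)

    def sql(self, query: str, query_args: list) -> Iterable[Tuple[str, str]]:
        # Emulates the WHERE clause by filtering on the first field.
        return [row for row in self.rows if row[0] == query_args[0]]


client = FakeClient([("8016003971239765913-2", "item-1"), ("other", "item-2")])
print(fetch_cached(client, "8016003971239765913-2"))
```

Binding the visitor ID through `query_args` rather than string formatting avoids SQL injection through the request body.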