Prediction cache GA

This project demonstrates the use of GridGain as a high-performance prediction cache for a product recommendation model built on Google Analytics data.

Features

BigQuery Integration: Create datasets, tables, and recommendation models in BigQuery.
Data Export: Export data from BigQuery to Google Cloud Storage (GCS) in Parquet format.
AWS S3 Integration: Create S3 buckets and transfer data from GCS to S3.
GridGain-Based Prediction Caching: Create tables and push data to GridGain for fast, in-memory access.
Recommendation Engine:
- Generate recommendations using BigQuery ML.
- Retrieve cached recommendations from GridGain.
RESTful API: All functionality is exposed through a well-structured RESTful API.
Flexible Configuration: The project, dataset, and access credentials are all parameterized within the API.

Architecture & Project Structure

FastAPI Application (api.py): Sets up the FastAPI application and defines all the API endpoints for BigQuery, AWS, and GridGain operations.
GCP Helper (gcp_helper.py): Contains functions for interacting with Google Cloud Platform services, including BigQuery and Google Cloud Storage.
GridGain Helper (gg_helper.py): Handles operations related to GridGain, including table creation, data pushing, and cached recommendation retrieval.

Setup Instructions

Prerequisites

Python 3.11.7
- You can use pyenv to manage multiple Python versions (optional):
  1. Install pyenv: brew install pyenv (or your system's package manager)
  2. Create and activate the environment:
```
pyenv virtualenv 3.11.7 ga-demo-env
source $HOME/.pyenv/versions/ga-demo-env/bin/activate 
```
- Alternatively, ensure Python 3.11.7 is installed directly.
GCP
- GCP CLI
  1. Install the gcloud CLI and configure it with your GCP Application Default Credentials.
- GCP Project: Create a project in GCP with the following APIs enabled:
  1. Dataform API
  2. Analytics Hub API
  3. BigQuery API
  4. BigQuery Connection API
  5. BigQuery Data Policy API
  6. BigQuery Migration API
  7. BigQuery Reservation API
  8. BigQuery Storage API
  9. Cloud Dataplex API
  10. Google Cloud Data Catalog API
  11. Google Cloud Storage JSON API
  12. Storage Insights API
- GCP Roles: The following roles must be assigned to the user on the GCP project:
  1. BigQuery Data Editor
  2. BigQuery Job User

BigQuery ML Slot Reservation Setup

Before creating the retail recommender model, you must set up a slot reservation in your Google Cloud project. Matrix factorization models cannot be trained with on-demand usage, so model creation fails without one.

Enable BigQuery Reservation API
- Navigate to: https://console.cloud.google.com/apis/library
- Search for "BigQuery Reservation API"
- Click "Enable" if not already enabled
Create Slot Reservation
- Access BigQuery Admin Console
- Navigate to "Capacity Management"
- Click "Create Reservation"
- Configure:
  - Location: US (must match your dataset location)
  - Reservation name: (e.g., "ml-training-reservation")
Create Assignment
- In the Reservations page, click "Create Assignment"
- Configure:
  - Project: ga-ignite-test
  - Job type: QUERY

Cost Considerations:

- Flex slots are billed by the second.
- Minimum commitment: 100 slots.
- The reservation can be deleted after model training is complete.
- Consider monthly or annual commitments for production workloads.

Installation

Install project dependencies using pip:

pip install pygridgain s3fs pandas==2.2.2 numpy==1.26.4 google-cloud-storage google-cloud-bigquery fastapi==0.111.0 pydantic==2.7.4 uvicorn==0.30.1 pyarrow==16.1.0 requests==2.32.3

Running the Project

Authenticate to GCP: The gcloud CLI credentials expire after some time, so you need to reauthenticate regularly. Run gcloud auth application-default login to do so.

Start FastAPI Server:

cd src
uvicorn api:app --reload

Access the Swagger UI at http://localhost:8000/docs to explore and test the API endpoints.

API Documentation

This application provides endpoints for managing BigQuery datasets, creating recommendation models, and interacting with GridGain and AWS S3.

Some important points to note:

We do not necessarily need to load data from AWS to GCS; the GridGain cache can be kept empty at the start and populated with each call to the /get_recommendations API.

API Endpoints

1. Create Dataset

Endpoint: /bigquery/create_dataset
Method: POST
Description: Creates a BigQuery dataset.

Parameters:

{
  "project_id": "ga-ignite-test",
  "dataset_id": "ga_dataset"
}
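
As a rough illustration, the request above could be prepared from Python using only the standard library. This sketch assumes the server is running locally at http://localhost:8000, as described under Running the Project; `BASE_URL` and `build_post_request` are hypothetical names and are not part of the project's source.

```python
import json
import urllib.request

# Assumed local address of the FastAPI server (see "Running the Project").
BASE_URL = "http://localhost:8000"


def build_post_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an API endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_post_request(
    "/bigquery/create_dataset",
    {"project_id": "ga-ignite-test", "dataset_id": "ga_dataset"},
)

# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

The same helper would work for the other POST endpoints by changing the path and payload.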

2. Create Aggregate Web Stats Table

Endpoint: /bigquery/create_aggregate_web_stats_table
Method: POST
Description: Creates or replaces the aggregate_web_stats table.

Parameters:

{
  "project_id": "ga-ignite-test",
  "dataset_id": "ga_dataset"
}

3. Create Retail Recommender Model

Endpoint: /bigquery/create_retail_recommender_model
Method: POST
Description: Creates or replaces the retail_recommender matrix factorization model.

Note: Requires proper BigQuery ML reservation setup to avoid the following error:

google.api_core.exceptions.BadRequest: 400 Training Matrix Factorization models is not available for on-demand usage. To train, please set up a reservation (flex or regular) based on instructions in BigQuery public docs.

Parameters:

{
  "project_id": "ga-ignite-test",
  "dataset_id": "ga_dataset"
}
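
A client script may want to recognize the reservation error quoted above and point the user back to the slot reservation setup. The sketch below matches on the message text only; `needs_reservation` is a hypothetical helper, and a plain `Exception` stands in for the real `google.api_core.exceptions.BadRequest` so the example runs without GCP libraries.

```python
# Fragment of the error message quoted in this section.
RESERVATION_HINT = (
    "Training Matrix Factorization models is not available for on-demand usage"
)


def needs_reservation(error: Exception) -> bool:
    """Return True if the error text matches the on-demand training failure."""
    return RESERVATION_HINT in str(error)


# Stand-in for the exception raised by the BigQuery client (illustration only).
example_error = Exception(
    "400 Training Matrix Factorization models is not available for "
    "on-demand usage. To train, please set up a reservation."
)

if needs_reservation(example_error):
    print("Set up a BigQuery slot reservation before training the model.")
```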

4. Generate Recommendations

Endpoint: /bigquery/generate_recommendations
Method: POST
Description: Generates recommendations and stores them in the recommend_content table.

Parameters:

{
  "project_id": "ga-ignite-test",
  "dataset_id": "ga_dataset"
}

5. Create All Recommendations Table

Endpoint: /bigquery/create_all_recommendations_table
Method: POST
Description: Creates or replaces the all_recommendations table with unique IDs.

Parameters:

{
  "project_id": "ga-ignite-test",
  "dataset_id": "ga_dataset"
}
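
Endpoints 1 to 5 all accept the same request body, so the BigQuery stage can be scripted as an ordered list of calls. The sketch below only prepares that list and does not send anything. Running the steps in their numbered order is an assumption based on the numbering in this document, and `bigquery_pipeline` is a hypothetical helper.

```python
# Endpoint paths for steps 1-5, in the order they are listed in this document.
BIGQUERY_STEPS = [
    "/bigquery/create_dataset",
    "/bigquery/create_aggregate_web_stats_table",
    "/bigquery/create_retail_recommender_model",
    "/bigquery/generate_recommendations",
    "/bigquery/create_all_recommendations_table",
]


def bigquery_pipeline(project_id: str, dataset_id: str) -> list[tuple[str, dict]]:
    """Return (path, payload) pairs for the five BigQuery steps."""
    payload = {"project_id": project_id, "dataset_id": dataset_id}
    # Copy the payload per step so changing one does not affect the others.
    return [(path, dict(payload)) for path in BIGQUERY_STEPS]


for path, payload in bigquery_pipeline("ga-ignite-test", "ga_dataset"):
    print("POST", path, payload)
```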

6. Create All Recommendations Table in GridGain

Endpoint: /gridgain/create_table
Method: POST
Description: Creates or replaces the all_recommendations table in GridGain.

Parameters:

{
  "username": "<your gridgain cluster username>",
  "password": "<your gridgain cluster password>",
  "url": "<your gridgain cluster url>",
  "port": 10800
}

7. Get Recommendations

Endpoint: /get_recommendations
Method: POST
Description: Gets a recommendation from the GridGain cache. If none is found, gets a recommendation from the BigQuery model and stores it in the GridGain cache.

Parameters:

{
  "gg": {
    "username": "<your gridgain cluster username>",
    "password": "<your gridgain cluster password>",
    "url": "<your gridgain cluster url>",
    "port": 10800
  },
  "gcp": {
    "project_id": "ga-ignite-test",
    "visitor_id": "8016003971239765913-2"
  }
}
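
The lookup order described for this endpoint follows a cache-aside pattern: read from the cache, fall back to the model on a miss, then write the result back. The sketch below illustrates the control flow only. A plain dict stands in for the GridGain cache and a callable stands in for the BigQuery ML query; none of these names come from the project's source.

```python
from typing import Callable


def get_recommendation(
    visitor_id: str,
    cache: dict[str, list[str]],
    predict: Callable[[str], list[str]],
) -> tuple[list[str], str]:
    """Return (recommendations, source), where source is 'cache' or 'model'."""
    cached = cache.get(visitor_id)
    if cached is not None:
        return cached, "cache"
    # Cache miss: ask the (slower) model, then store the result for next time.
    result = predict(visitor_id)
    cache[visitor_id] = result
    return result, "model"


# Stand-ins for illustration only.
fake_cache: dict[str, list[str]] = {}


def fake_model(visitor_id: str) -> list[str]:
    return ["item-a", "item-b"]


first = get_recommendation("8016003971239765913-2", fake_cache, fake_model)
second = get_recommendation("8016003971239765913-2", fake_cache, fake_model)
# The first call is served by the model, the second by the cache.
print(first[1], second[1])
```

This is also why the cache can start empty: each miss populates it for later requests from the same visitor.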

Optional Endpoints

1. Get Predicted Recommendations

Endpoint: /bigquery/get_predicted_recommendations
Method: POST
Description: Gets a recommendation directly from the BigQuery model. This is the older method; it does not cache the recommendation in GridGain.

Parameters:

{
  "project_id": "ga-ignite-test",
  "visitor_id": "8016003971239765913-2"
}

2. Get Cached Recommendations

Endpoint: /gridgain/get_cached_recommendations
Method: POST
Description: Gets a recommendation from the cache.

Parameters:

{
  "username": "<your gridgain cluster username>",
  "password": "<your gridgain cluster password>",
  "url": "<your gridgain cluster url>",
  "port": 10800,
  "visitor_id": "8016003971239765913-2"
}
