These are the containers I'm running:
- React (Frontend) Container
- Spring Boot (Backend) Container
- MySQL (preloaded Database) Container
- Elasticsearch (SearchEngine) Container
- MinIO (preloaded FileStorage) Container
- Traefik (reverse proxy) Container
For CI/CD I use GitHub Actions workflows.
I created an entity-relationship diagram to simplify schema creation. I then installed mysql-server on my machine, processed the dataset, and imported it into the database.
For the processing I used Pandas, a Python library well suited to large tabular datasets. All steps can be reproduced with a Jupyter notebook.
- download `title.basics.tsv.gz` and `title.ratings.tsv.gz` from IMDb
- process dataset using Python, Pandas, NumPy:
  - replace empty values by '\N'
  - remove incorrect values (consistent datatype per column)
  - merge Rating-, Movie- and image/description dataframes
  - set `tconst` as index
Instead of rerunning the Jupyter notebook, you can also just download the Processed Dataset.
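The processing steps can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the helper names (`load_tsv`, `to_int_column`, `build_movies`), the choice of a left merge, and the tiny in-memory sample rows are assumptions, and the image/description dataframe is left out because its columns are not described here (it would be merged on `tconst` in the same way).

```python
import csv
import io

import pandas as pd

# Tiny in-memory stand-ins for title.basics.tsv.gz and title.ratings.tsv.gz.
# With the real files, pass the .gz paths; read_csv infers the compression.
BASICS_TSV = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t1\n"
    "tt0000002\tshort\tLe clown et ses chiens\t1892\t\\N\n"
    "tt0000003\tmovie\tBroken Row\tnot-a-year\t90\n"
)
RATINGS_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2000\n"
    "tt0000002\t5.8\t270\n"
)

# Columns assumed to hold integers (hypothetical selection for this sketch).
INT_COLUMNS = ["startYear", "runtimeMinutes", "numVotes"]


def load_tsv(source):
    """Read a TSV as strings, treating empty fields and \\N as missing."""
    return pd.read_csv(
        source,
        sep="\t",
        dtype=str,
        na_values=["", "\\N"],
        keep_default_na=False,
        quoting=csv.QUOTE_NONE,  # titles may contain bare quote characters
    )


def to_int_column(series):
    """Keep only all-digit strings; anything else becomes missing (<NA>)."""
    valid = series.where(series.str.fullmatch(r"\d+", na=False))
    return pd.to_numeric(valid, errors="coerce").astype("Int64")


def build_movies(basics_source, ratings_source):
    """Merge basics and ratings on tconst and enforce one dtype per column."""
    basics = load_tsv(basics_source)
    ratings = load_tsv(ratings_source)
    # Left merge (an assumption): keep every title, even without a rating.
    movies = basics.merge(ratings, on="tconst", how="left")
    for column in INT_COLUMNS:
        movies[column] = to_int_column(movies[column])
    movies["averageRating"] = pd.to_numeric(movies["averageRating"], errors="coerce")
    return movies.set_index("tconst")


movies = build_movies(io.StringIO(BASICS_TSV), io.StringIO(RATINGS_TSV))
print(movies)
```

In this sketch invalid values are turned into missing values rather than dropping the whole row, so they can later be written out as '\N'.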
- execute `create table` statements and `load data infile` using the `init.sql` file
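For the import step, the processed dataframe has to be written in a form MySQL can load. A hedged sketch of that export is shown below; the function name `to_mysql_tsv`, the sample rows, and the decision to omit the header row are assumptions, not taken from the notebook or from `init.sql`. It relies on the fact that, with default field escaping, MySQL's `LOAD DATA` reads a field value of `\N` as `NULL`.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the processed dataset.
movies = pd.DataFrame(
    {
        "primaryTitle": ["Carmencita", "Le clown et ses chiens"],
        "startYear": pd.array([1894, None], dtype="Int64"),
    },
    index=pd.Index(["tt0000001", "tt0000002"], name="tconst"),
)


def to_mysql_tsv(df):
    """Serialize to tab-separated text without a header; missing values become \\N."""
    buffer = io.StringIO()
    df.to_csv(buffer, sep="\t", na_rep="\\N", header=False)
    return buffer.getvalue()


print(to_mysql_tsv(movies))
```

If the header row is kept instead, the load statement would need to skip it, for example with `IGNORE 1 LINES`.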