
Using Selenium and other scraping libraries, together with the PGVector vector database and Docker, I will create an ETL that can be run once a week to populate the vector database with that week's releases.


ilanaliouchouche/WeeklyMovies-VDB


SensCritique WeeklyReal Database 🎬


Overview

The SensCritique WeeklyReal Database project is an ETL (Extract, Transform, Load) application written in Python. It gathers data on each week's cinema releases from SensCritique. In the transformation phase, a Large Language Model (LLM) and Text Embeddings Inference (TEI) are used to vectorize the reviews. The transformed data is then stored in a PGVector database, i.e. PostgreSQL with the pgvector extension for storing and querying vectors. This choice is motivated by the need to embed movie reviews and categorize them as positive or negative, which the later data analysis and visualization depend on.
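To make the vectorization step more concrete, here is a minimal sketch of how review texts could be sent to a TEI server. It relies on TEI's `/embed` route, which accepts a JSON body with an `inputs` field and returns one embedding per input. The address `http://localhost:8080` and the function names are assumptions for illustration, not the project's actual code.

```python
import json
from urllib import request

# Assumption: a TEI container is reachable at this address; adjust to your setup.
TEI_URL = "http://localhost:8080"


def build_embed_request(texts, base_url=TEI_URL):
    """Build (but do not send) the HTTP request for TEI's /embed route."""
    payload = json.dumps({"inputs": list(texts)}).encode("utf-8")
    return request.Request(
        url=f"{base_url.rstrip('/')}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed_reviews(texts, base_url=TEI_URL, timeout=30):
    """Send review texts to TEI and return one embedding (list of floats) per text."""
    req = build_embed_request(texts, base_url)
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `embed_reviews(["A moving, beautifully shot film."])` requires a running TEI instance; `build_embed_request` can be inspected without one.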

Key Features

  • Automated ETL Pipeline: Extracts data from SensCritique, transforms it, and loads it into a PGVector database.
  • Review Analysis: Captures and categorizes movie reviews, enabling detailed sentiment analysis.
  • Vector Database Utilization: Leverages PGVector for efficient handling and querying of vector data.
  • Dashboard Compatibility: Designed to support data visualization and dashboard creation in Power BI.
  • Scheduled and On-Demand Execution: The process can be executed at any time, with checks that prevent the current week's data from being processed twice.
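The check against reprocessing the current week could look roughly like the sketch below. The README does not say how the project implements it; this version assumes weeks are identified by ISO week number and that the set of already processed weeks is available (for example, read from the database). The names `week_key` and `should_run` are hypothetical.

```python
from datetime import date


def week_key(day):
    """Return an ISO week identifier such as '2024-W03' for the given date."""
    iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    return f"{iso[0]}-W{iso[1]:02d}"


def should_run(processed_weeks, today=None):
    """Return True only if the week containing `today` has not been processed yet."""
    today = today or date.today()
    return week_key(today) not in processed_weeks
```

With this guard the ETL can be launched at any time: a second run in the same week simply finds the week key already recorded and exits early.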

Important Note

🚨🚨🚨

  • Code Maintenance: The code may fall out of date when SensCritique changes its website structure. Adapting the code to such changes is straightforward, but regular updates are not guaranteed.

Technology Stack

  • Text Embeddings Inference (TEI): For processing and embedding review texts.
  • PGVector: A vector database for efficient data storage and retrieval.
  • Docker: For containerizing the ETL process.
  • Selenium: For web scraping and data extraction.
  • Power BI: For reporting.
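As an illustration of the kind of query PGVector makes possible, the sketch below builds a nearest-neighbour search using pgvector's cosine distance operator `<=>`. The table name `reviews` and the columns `id`, `content`, and `embedding` are assumptions; the project's real schema may differ. The `%s` placeholders follow the psycopg parameter style.

```python
def nearest_reviews_query(embedding, limit=5):
    """Return (sql, params) for a cosine-distance search with pgvector.

    Assumes a hypothetical table `reviews(id, content, embedding vector)`.
    """
    # pgvector accepts vectors in the text form '[x1,x2,...]'.
    vector_literal = "[" + ",".join(str(float(x)) for x in embedding) + "]"
    sql = (
        "SELECT id, content, embedding <=> %s::vector AS distance "
        "FROM reviews "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (vector_literal, limit)
```

With a psycopg cursor, the result can be passed on as `cur.execute(sql, params)`; smaller distances mean more similar reviews.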

Repository Structure

| Directory/File | Description |
|---|---|
| etl/ | Package containing the Extract, Transform, Load modules. |
| docker-compose.yml | Docker Compose file linking the vector database, the app, and TEI. |
| Dockerfile | Dockerfile for building the application's image. |
| main.py | Script that executes the ETL process. |
| setup_vcb.py | Script for initial database setup (if running without volumes). |
| bddr-sc-env.yml | Conda environment file used to set up the environment. |
| requirements.txt | Dependency list for installation with pip. |
| reporting/ | Folder containing everything related to reporting. |

Usage

  • Docker Setup: Fetch the docker-compose.yml, the required volumes, and the project image. Run main.py within the container. If running without volumes, execute setup_vcb.py first.
  • Conda Environment: Set up a Conda environment and execute main.py, or use the classes within a Notebook. In this case, set the required environment variables.
    Note: Don't forget to start the pgvector and TEI containers.

HANDBOOK available here

  • Reporting: For reporting purposes, retrieve only the volume, launch a PGVector instance, and connect to the database from Power BI. See reporting/.
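When running outside Docker, the environment variables mentioned above have to be supplied by hand. The sketch below shows one way such settings could be collected in Python. The variable names (`PG_HOST`, `PG_PORT`, `PG_USER`, `PG_PASSWORD`, `PG_DATABASE`, `TEI_URL`) and the defaults are purely hypothetical; the handbook lists the names the project actually expects.

```python
import os
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Settings:
    pg_dsn: str   # PostgreSQL connection URI for the pgvector instance
    tei_url: str  # Base URL of the TEI server


def load_settings(env=None):
    """Build connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DATABASE", "postgres")
    # Required values: a KeyError here means the variable was not set.
    user = quote(env["PG_USER"], safe="")
    password = quote(env["PG_PASSWORD"], safe="")
    dsn = f"postgresql://{user}:{password}@{host}:{port}/{database}"
    return Settings(pg_dsn=dsn, tei_url=env.get("TEI_URL", "http://localhost:8080"))
```

Failing early on a missing variable makes a misconfigured Notebook or Conda session easier to diagnose than a connection error later in the ETL run.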

🎬
