- Overview
- Key Features
- Important Note
- Technology Stack
- Repository Structure
- Usage
- Reporting and visualization
The SensCritique WeeklyReal Database project is an ETL (Extract, Transform, Load) application written in Python. It gathers weekly cinema release data from sens-critique. The transformation phase uses a Large Language Model (LLM) and the TEI project to vectorize the reviews. The extracted movie data is then stored in PGVector, a vector database. This choice is motivated by the need to embed movie reviews and categorize them as positive or negative, which underpins the subsequent data analysis and visualization.
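The extract, transform, load flow described above can be sketched as three pluggable steps. This is only an illustrative outline, not the project's actual code: the `Review` class, `run_pipeline`, and the stand-in functions are hypothetical names, and the fake embedder simply returns the text length in place of a real TEI call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Review:
    """Hypothetical record for one scraped review."""
    movie_title: str
    text: str
    embedding: Optional[List[float]] = None


def run_pipeline(
    extract: Callable[[], Iterable[Review]],
    embed: Callable[[List[str]], List[List[float]]],
    load: Callable[[List[Review]], None],
) -> int:
    """Run extract -> embed -> load and return the number of reviews handled."""
    reviews = list(extract())
    vectors = embed([review.text for review in reviews])
    # Attach each vector to the review it was computed from.
    for review, vector in zip(reviews, vectors):
        review.embedding = vector
    load(reviews)
    return len(reviews)


# Stand-ins for the real scraper, TEI client, and database writer.
def fake_extract() -> List[Review]:
    return [Review("Film A", "Great"), Review("Film B", "Dull plot")]


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Placeholder "embedding": a one-dimensional vector holding the text length.
    return [[float(len(text))] for text in texts]


stored: List[Review] = []
count = run_pipeline(fake_extract, fake_embed, stored.extend)
print(count, [review.embedding for review in stored])
```

Passing the three steps as callables keeps the sketch independent of Selenium, TEI, and PGVector, so each stage can be swapped or tested on its own.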
- Automated ETL Pipeline: Extracts data from sens-critique, transforms it, and loads it into a PGVector database.
- Review Analysis: Captures and categorizes movie reviews, enabling detailed sentiment analysis.
- Vector Database Utilization: Leverages PGVector for efficient handling and querying of vector data.
- Dashboard Compatibility: Designed to support data visualization and dashboard creation in PowerBI.
- Scheduled and On-Demand Execution: The process can be executed at any time, with checks to prevent reprocessing of current week's data.
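The check that prevents reprocessing the current week's data could look roughly like the sketch below. It assumes a week is identified by its ISO year and week number and that the set of already processed weeks is available in memory. Both are assumptions made for illustration, and `iso_week_key` and `should_run` are hypothetical names.

```python
from datetime import date
from typing import Set, Tuple

WeekKey = Tuple[int, int]  # (ISO year, ISO week number)


def iso_week_key(day: date) -> WeekKey:
    """Return the ISO (year, week) pair that identifies the week of `day`."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def should_run(today: date, processed_weeks: Set[WeekKey]) -> bool:
    """True when the week containing `today` has not been processed yet."""
    return iso_week_key(today) not in processed_weeks


# Pretend the pipeline already ran on Wednesday 3 January 2024.
processed = {iso_week_key(date(2024, 1, 3))}

print(should_run(date(2024, 1, 5), processed))   # same ISO week: False
print(should_run(date(2024, 1, 10), processed))  # following week: True
```

In a real run, the processed weeks would more likely be read from the database than kept in a Python set.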
🚨🚨🚨
- Code Maintenance: The code might not always be up-to-date due to possible changes in sens-critique's website structure. While re-adaptation of the code is straightforward, regular updates may not be feasible.
- Text Embedding Inference (TEI): For processing and embedding review texts.
- PGVector: A vector database for efficient data storage and retrieval.
- Docker: For containerizing the ETL process.
- Selenium: For web scraping and data extraction.
- PowerBI: For reporting.
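To show how TEI and PGVector fit together, the sketch below builds (without sending) an embedding request for TEI's `/embed` endpoint and formats a vector as a pgvector text literal for a cosine-distance query using the `<=>` operator. The URL, the `reviews` table, its column names, and the `%s` placeholder style are assumptions, not taken from this repository.

```python
import json
from typing import List
from urllib import request

# Assumption: a TEI container is reachable on this host and port.
TEI_URL = "http://localhost:8080/embed"


def build_embed_request(texts: List[str], url: str = TEI_URL) -> request.Request:
    """Build a POST request asking TEI to embed `texts` (the request is not sent)."""
    payload = json.dumps({"inputs": texts}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def to_pgvector_literal(vector: List[float]) -> str:
    """Format a vector in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


# Hypothetical table and columns; `<=>` is pgvector's cosine distance operator.
NEAREST_REVIEWS_SQL = """
SELECT id, review_text
FROM reviews
ORDER BY embedding <=> %s::vector
LIMIT 5;
"""

req = build_embed_request(["A wonderful film."])
print(req.get_method(), req.full_url)
print(to_pgvector_literal([0.1, -0.25, 3]))
```

Sending the request with `urllib.request.urlopen(req)` and running the query through a PostgreSQL driver are left out so that the sketch runs without any service started.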
| Directory/File | Description |
|---|---|
| `etl/` | Package containing the Extract, Transform, Load modules. |
| `docker-compose.yml` | Docker Compose file that links the VDB, the app, and TEI. |
| `Dockerfile` | Dockerfile for building the application's image. |
| `main.py` | Script that executes the ETL process. |
| `setup_vcb.py` | Script for initial database setup (if running without volumes). |
| `bddr-sc-env.yml` | Environment file for setting up the Conda environment. |
| `requirements.txt` | Dependency list for installing with pip. |
| `reporting/` | Folder containing everything related to reporting. |
- Docker Setup: Fetch the `docker-compose.yml`, the required volumes, and the project image. Run `main.py` within the container. If running without volumes, execute `setup_vcb.py` first.
- Conda Environment: Set up a Conda environment and execute `main.py`, or use the classes within a notebook. In this case, set the right environment variables.

Note: Don't forget to launch the pgvector and TEI images.
HANDBOOK available here
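When running outside Docker, the environment variables mentioned above could be read and validated in one place, as in the sketch below. The variable names (`PG_HOST`, `PG_PORT`, `PG_DATABASE`, `TEI_URL`) and the `Settings` class are hypothetical. The names the project actually expects should be taken from its code or handbook.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names, used only for illustration.
REQUIRED_VARS = ("PG_HOST", "PG_PORT", "PG_DATABASE", "TEI_URL")


@dataclass(frozen=True)
class Settings:
    pg_host: str
    pg_port: int
    pg_database: str
    tei_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read connection settings from `env` (defaults to the process environment)."""
    env = os.environ if env is None else env
    # Fail early, listing every variable that is absent or empty.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        pg_host=env["PG_HOST"],
        pg_port=int(env["PG_PORT"]),
        pg_database=env["PG_DATABASE"],
        tei_url=env["TEI_URL"],
    )


# Example values only; a real run would call load_settings() with no argument.
example_env = {
    "PG_HOST": "localhost",
    "PG_PORT": "5432",
    "PG_DATABASE": "movies",
    "TEI_URL": "http://localhost:8080",
}
settings = load_settings(example_env)
print(settings)
```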
- Reporting: For reporting purposes, retrieve only the volume, launch a PGVector instance, and connect to the database from PowerBI. See `reporting/`.