This project scrapes cost-of-living data from the Numbeo website and stores it in a PostgreSQL database.
It first extracts country names from Numbeo, then scrapes cost-of-living information for each country and city and writes the results to the database. Docker and Docker Compose provide containerization and environment management.
Here's a brief overview of the project structure:
    .
    ├── Dockerfile
    ├── LICENSE
    ├── README.md
    ├── docker-compose.yaml
    ├── notebooks
    │   ├── numbeo-v2.ipynb
    │   ├── numbeo-v3.ipynb
    │   ├── numbeo-v4.ipynb
    │   └── numbeo.ipynb
    ├── requirements.txt
    └── src
        ├── country_name_extractor.py
        ├── numbeo_web_crawler.py
        ├── run.py
        └── utils
            └── db.py

- Dockerfile: Defines the Docker image for the project.
- docker-compose.yaml: Configuration for Docker Compose to set up services.
- requirements.txt: Python package dependencies.
- src/: Source code directory.
  - country_name_extractor.py: Extracts country names from the Numbeo website.
  - numbeo_web_crawler.py: Scrapes cost of living data.
  - run.py: Entry point for executing the project.
  - utils/db.py: Utility functions for database operations.
- notebooks/: Jupyter notebooks for analysis and experimentation.
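
To illustrate the country-name extraction stage, here is a minimal sketch that pulls names out of an HTML page using only the standard library. It assumes the names appear as `<option>` elements inside a `<select>`. That markup, the sample HTML, and the names `CountryOptionParser` and `extract_country_names` are assumptions for illustration; they are not taken from `country_name_extractor.py` or from Numbeo's actual pages.

```python
from html.parser import HTMLParser


class CountryOptionParser(HTMLParser):
    """Collect the text of <option> tags that carry a non-empty value.

    Hypothetical helper: the real page layout may differ.
    """

    def __init__(self):
        super().__init__()
        self.countries = []
        self._in_option = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # Placeholder entries such as <option value=""> are skipped.
            self._in_option = bool(dict(attrs).get("value"))
            self._buffer = []

    def handle_data(self, data):
        if self._in_option:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._in_option:
            name = "".join(self._buffer).strip()
            if name:
                self.countries.append(name)
            self._in_option = False


def extract_country_names(html):
    """Return the country names found in the given HTML string."""
    parser = CountryOptionParser()
    parser.feed(html)
    parser.close()
    return parser.countries


# Made-up sample input, used instead of a live request.
SAMPLE_HTML = """
<select id="country">
  <option value="">---Select country---</option>
  <option value="Canada">Canada</option>
  <option value="Japan">Japan</option>
</select>
"""

if __name__ == "__main__":
    print(extract_country_names(SAMPLE_HTML))
```

Parsing a string rather than fetching a URL keeps the sketch runnable offline; a real extractor would download the page first and may well use a different parsing library.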
Before setting up the project, ensure you have the following installed on your system:
- Docker
- Docker Compose
- Python 3.11 or later
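
The prerequisites above can be checked with a short script. This is a sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it only looks for the standalone `docker-compose` executable (the form used in the commands in this README), so an installation that provides Compose solely as a Docker plugin would be reported as missing.

```python
import shutil
import sys

# Minimum Python version listed in the prerequisites.
MIN_PYTHON = (3, 11)


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False."""
    return {
        # shutil.which returns None when the executable is not on PATH.
        "docker": shutil.which("docker") is not None,
        "docker-compose": shutil.which("docker-compose") is not None,
        # Compare (major, minor) of the running interpreter.
        "python>=3.11": sys.version_info[:2] >= MIN_PYTHON,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```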
Follow these steps to set up the project:
- Clone the Repository

      git clone https://github.com/sinanazem/numbeo-web-crawling.git
      cd numbeo-web-crawling

- Build and Start Docker Containers

  Build the Docker image and start the containers using Docker Compose:

      docker-compose up --build

  This will create and start the necessary containers for the project, including the PostgreSQL database.

- Access the Docker Container

  You can access the running container to interact with the application:

      docker-compose exec app /bin/bash

- Install Python Dependencies

  Inside the container, install the required Python packages:

      pip install -r requirements.txt

- Extract Country Names

  Run the script to extract country names:

      python src/country_name_extractor.py

- Scrape Cost of Living Data

  Execute the web crawler to scrape the cost of living data:

      python src/numbeo_web_crawler.py

- Run the Application

  To run the full application:

      python src/run.py
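
Because the scripts above run next to a PostgreSQL container, they need connection settings from somewhere. The sketch below shows one common approach: building a connection URL from environment variables. It is an assumption about how the settings might be supplied, not a description of `utils/db.py`. `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `POSTGRES_DB` are the variables read by the official `postgres` Docker image; `POSTGRES_HOST`, `POSTGRES_PORT`, and the function name `build_postgres_url` are hypothetical.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=None):
    """Build a postgresql:// connection URL from environment-style settings."""
    env = os.environ if env is None else env
    user = env.get("POSTGRES_USER", "postgres")
    password = env.get("POSTGRES_PASSWORD", "")
    host = env.get("POSTGRES_HOST", "localhost")  # hypothetical variable
    port = env.get("POSTGRES_PORT", "5432")       # hypothetical variable
    # The official image names the default database after the user.
    database = env.get("POSTGRES_DB", user)
    # Percent-encode credentials so characters like "@" or ":" stay valid.
    auth = quote(user, safe="")
    if password:
        auth += ":" + quote(password, safe="")
    return f"postgresql://{auth}@{host}:{port}/{database}"


if __name__ == "__main__":
    print(build_postgres_url())
```

Inside a Compose network the host would typically be the name of the database service rather than `localhost`. The resulting string follows the libpq connection URI format, so it can be handed to a libpq-based PostgreSQL driver.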