- This application allows you to get some information and download images from a provided array of websites. This project tries to use asynchronous functions, libraries, and tools (`aiohttp`, `aiokafka`, `asyncio`, `aiobotocore`, `asyncpg`) where possible, to speed up parsing.
- There are two main microservices - `web` to interact with the client and `parser` to parse data from webpages (for now, getting the HTML length) and upload webpage images to MinIO storage. The microservices communicate with each other via Kafka topics (both have a producer and a consumer). You can monitor the Kafka cluster using the `provectuslabs/kafka-ui` dashboard.
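  A minimal sketch of that producer side with aiokafka (the topic name and message shape here are hypothetical, not the project's actual code):

  ```python
  import asyncio
  import json

  from aiokafka import AIOKafkaProducer


  async def send_parse_task(website_id: int, url: str) -> None:
      # Producer on the `web` side: hands a parsing task over to `parser`.
      producer = AIOKafkaProducer(bootstrap_servers="localhost:9092")
      await producer.start()
      try:
          payload = json.dumps({"website_id": website_id, "url": url}).encode()
          # Topic name "parse_tasks" is hypothetical.
          await producer.send_and_wait("parse_tasks", payload)
      finally:
          await producer.stop()


  asyncio.run(send_parse_task(1, "https://example.com"))
  ```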
- For each website, a row will be created in the PostgreSQL database using the asyncpg driver. You will get the website data as well as the parsing status (which will be `UPDATED` through the statuses `pending` -> `in_progress` -> `finished`/`failed` while parsing).
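  A rough sketch of one step in that status lifecycle with asyncpg (the table and column names below are assumptions, not the project's real schema):

  ```python
  import asyncpg


  async def mark_in_progress(dsn: str, website_id: int) -> None:
      # Hypothetical `websites` table with a `status` column holding
      # one of: pending, in_progress, finished, failed.
      conn = await asyncpg.connect(dsn)
      try:
          await conn.execute(
              "UPDATE websites SET status = $1 WHERE id = $2",
              "in_progress",
              website_id,
          )
      finally:
          await conn.close()
  ```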
- All database connections are made by the `web` microservice only, while both `web` and `parser` interact with MinIO. You will be able to `POST` a website entity (which will trigger data parsing as well), `GET` info for each website (with updated data), and `DELETE` website entities (all MinIO data for the website will be deleted as well).
- After the images are uploaded to MinIO, you can request S3 `presigned urls` to download them.
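  For reference, generating such a presigned URL with aiobotocore looks roughly like this (the endpoint, credentials, bucket, and key below are placeholders):

  ```python
  from aiobotocore.session import get_session


  async def make_presigned_url(bucket: str, key: str) -> str:
      session = get_session()
      async with session.create_client(
          "s3",
          endpoint_url="http://localhost:9000",   # MinIO endpoint (placeholder)
          region_name="us-east-1",                # required for request signing
          aws_access_key_id="minio_user",         # placeholder credentials
          aws_secret_access_key="minio_password",
      ) as client:
          # 300 seconds matches the 5-minute expiry described later in this README.
          return await client.generate_presigned_url(
              "get_object",
              Params={"Bucket": bucket, "Key": key},
              ExpiresIn=300,
          )
  ```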
- Make sure that you have the latest versions of `python` and `pip` installed on your computer. You also have to install Docker and Docker Compose. Note: each microservice - `parser` and `web` - has its own `Dockerfile` and `.dockerignore` with the appropriate build stages.
- By default, this project uses poetry for dependency and virtual environment management. Make sure to install it too. Note: each microservice - `parser` and `web` - has its own poetry files and dependencies specified.
- Make sure to provide all required environment variables (via a `.env` file, the `export` command, secrets, etc.) before running the application. Note: each microservice - `parser` and `web` - should have its own .env/secrets variables specified.
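  The exact variable names depend on each microservice's configuration; purely as an illustration, a service might read its settings like this (all names below are hypothetical):

  ```python
  import os

  # Hypothetical variable names - check each microservice's config for the real ones.
  DATABASE_URL = os.environ["DATABASE_URL"]
  KAFKA_BOOTSTRAP_SERVERS = os.environ["KAFKA_BOOTSTRAP_SERVERS"]
  MINIO_BUCKET = os.environ.get("MINIO_BUCKET", "websites")
  ```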
- For managing pre-commit hooks this project uses pre-commit.
- For import sorting this project uses isort.
- For code format checking this project uses black.
- For type checking this project uses mypy.
- For creating commits and linting commit messages this project uses commitizen. Run `make commit` to use commitizen during commits.
- There is a special `build_dev` stage in the Dockerfile to build the dev version of the application image.
- Because there are two separate microservices, all `pre-commit` and Docker `test` build stage checks run for both the `parser` and `web` microservices from the repo root.
- The new application version should be specified in the `web/version.txt` file to update the `web` microservice's OpenAPI documentation.
- This project uses GitHub Actions to run all checks and unit tests on `push` to the remote repository.
- Two jobs run in one workflow - for the `parser` and `web` microservices (built from each directory separately via `strategy.matrix`).
There are lots of useful commands in the Makefile included in this project's repo. Use the `make <some_command>` syntax to run each of them.
If your system doesn't support make commands, you may copy the commands from the Makefile directly into the terminal.
Note: many commands perform actions for both the `parser` and `web` microservices. Even so, all Makefile commands should be run from the repo root directory only.
- To install all the required dependencies and set up a virtual environment, run in the cloned repository directory: `poetry install`. You can also install the project dependencies using `pip install -r requirements.txt` from the repo root directory. Note: this command will install ALL dependencies for the project - both for the `parser` and `web` microservices. Separate dependencies will be installed automatically during the Docker image build (or GitHub Actions run).
- To configure pre-commit hooks for code linting, code format checking, and commit message linting, run in the cloned directory: `poetry run pre-commit install`
- Build the app images (for `parser` and `web`) using `make build`. To run a reloadable application locally, use `make build_dev` to build the images for the development environment.
- Run all the necessary Docker containers together using `make up`. Containers will start depending on each other and taking health checks into account. Note: this will also create and attach a persistent named volume `logs` for the Docker containers. Containers will use this volume to store the application `app.log` file.
- Stop and remove the Docker containers using `make down`. If you also want to remove the log volume, use `make down_volume`.
- For managing migrations this project uses alembic.
- The Dockerfile for the `web` microservice already includes the `alembic upgrade head` command to run all the revision migrations required by the current version of the application.
- Run `make upgrade` to manually upgrade the database tables' state. You can also manually upgrade to a specific revision with a `.py` script (from `web/alembic/versions/`) by running: `alembic upgrade <revision id number>`
- You can also downgrade one revision down with the `make downgrade` command, downgrade to a specific revision by running `alembic downgrade <revision id number>`, or do a full downgrade to the initial database state with `make downgrade_full`.
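  If you have never seen an alembic revision, each `.py` script in `web/alembic/versions/` follows this general shape (the table below is illustrative, not the project's real schema):

  ```python
  """create websites table (illustrative example)

  Revision ID: abc123456789
  Revises:
  """
  import sqlalchemy as sa
  from alembic import op

  revision = "abc123456789"
  down_revision = None


  def upgrade() -> None:
      # Applied by `alembic upgrade head` (or `make upgrade`).
      op.create_table(
          "websites",
          sa.Column("id", sa.Integer, primary_key=True),
          sa.Column("url", sa.String, nullable=False),
          sa.Column("status", sa.String, nullable=False, server_default="pending"),
      )


  def downgrade() -> None:
      # Applied by `alembic downgrade` (or `make downgrade`).
      op.drop_table("websites")
  ```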
- By default, the web application will be accessible at http://localhost:8080, the MinIO storage console at http://localhost:9001, the database at http://localhost:5432, and the Kafka cluster UI at http://localhost:9093. You can try all endpoints with the Swagger documentation at http://localhost:8080/docs. Note: the `parser` microservice will run at http://localhost:8081, but users don't need to interact with it directly.
- Make sure to create the MinIO bucket (specified in your .env/secrets) before interacting with the web application resources.
- Use the `/websites` resource with the `POST` method to create database entities for each URL and start parsing. The created entities with their ids will be returned in the response body.
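  For example, with aiohttp as the client (the request body shape is an assumption based on the description above - check the Swagger docs for the exact schema):

  ```python
  import asyncio

  import aiohttp


  async def create_websites() -> None:
      async with aiohttp.ClientSession() as session:
          # Assumed request body: a list of website URLs to parse.
          payload = {"urls": ["https://example.com", "https://example.org"]}
          async with session.post("http://localhost:8080/websites", json=payload) as resp:
              print(resp.status, await resp.json())


  asyncio.run(create_websites())
  ```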

- Use the `/websites/{website_id}` resource with the `GET` or `DELETE` method to get or delete the row in the database respectively. Use GET to monitor the parsing status and data updates. DELETE also clears the MinIO storage objects associated with the URL (the website URL is used as the prefix for this webpage's picture keys).
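  A simple status-polling loop might look like this (the `status` field name is assumed from the lifecycle described above):

  ```python
  import asyncio

  import aiohttp


  async def wait_for_parsing(website_id: int) -> dict:
      url = f"http://localhost:8080/websites/{website_id}"
      async with aiohttp.ClientSession() as session:
          while True:
              async with session.get(url) as resp:
                  data = await resp.json()
              # `status` field assumed; see the Swagger docs for the real schema.
              if data.get("status") in ("finished", "failed"):
                  return data
              await asyncio.sleep(1)
  ```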
- Use `/websites/{website_id}/picture_links` to get the website database entity with a generated array of S3 presigned URLs in the response body. You can use some tool (e.g. Postman, wget, curl, etc.) to download these images via the generated URLs. The URLs will expire after 5 minutes.
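  Downloading via the returned links could then be as simple as the sketch below (the response field name `picture_links` and the `.jpg` extension are assumptions):

  ```python
  import asyncio

  import aiohttp


  async def download_pictures(website_id: int) -> None:
      url = f"http://localhost:8080/websites/{website_id}/picture_links"
      async with aiohttp.ClientSession() as session:
          async with session.get(url) as resp:
              data = await resp.json()
          # Field name `picture_links` is assumed - check the response body schema.
          for i, link in enumerate(data.get("picture_links", [])):
              async with session.get(link) as img:  # presigned URL, valid 5 minutes
                  with open(f"picture_{i}.jpg", "wb") as f:
                      f.write(await img.read())


  asyncio.run(download_pictures(1))
  ```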
- A description of all the project's endpoints and the API may be viewed without running any services in the `documentation/openapi.yaml` file.
- You can update the `web/documentation/openapi.yaml` API documentation at any time by using the `make openapi` command.
- All warnings and info messages will be shown in the containers' stdout and saved to the `web.log` and `parser.log` files.
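  A rough sketch of that dual logging setup (handler details here are illustrative, not the project's exact configuration):

  ```python
  import logging
  import sys

  logger = logging.getLogger("web")
  logger.setLevel(logging.INFO)
  # One handler for the container's stdout, one for the log file.
  logger.addHandler(logging.StreamHandler(sys.stdout))
  logger.addHandler(logging.FileHandler("web.log"))

  logger.info("visible in `docker logs` output and appended to web.log")
  logger.warning("warnings are captured the same way")
  ```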
- To minimize the chances of being blocked, `parser` uses custom user-agents for request headers from the `parser/documentation/user_agents.txt` file.
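  Rotating user-agents from such a file can be done along these lines (a sketch, not the project's actual implementation):

  ```python
  import random
  from pathlib import Path

  # One user-agent string per line in the file.
  USER_AGENTS = Path("parser/documentation/user_agents.txt").read_text().splitlines()


  def random_headers() -> dict:
      # Pick a different user-agent per request to look less like a bot.
      return {"User-Agent": random.choice(USER_AGENTS)}
  ```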
- Use `make test` to build the test images for the `parser` and `web` microservices and run all linter checks and unit tests for each of them.
- After all tests, a coverage report will also be shown.
- Staged changes will be checked during commits via the pre-commit hook.
- All checks and tests (both for the `parser` and `web` microservices) will run on code push to the remote repository as part of GitHub Actions.