PDF Text Extractor

Objective

An application that extracts text from PDF files using libraries such as PyMuPDF and llama-parse.

Architecture

The high-level architecture diagram shows a monolithic application deployed inside a Docker container, along with the interaction between the user, the frontend, and the backend.

`Sequence Diagram`

Basic flow for the sequence diagram:

The User sends a PDF to the Frontend.
The Frontend receives the PDF.
The Frontend sends the PDF to the Backend for processing.
The Backend uses PyMuPDF to extract text from the PDF.
The Backend uses Llama-Parse to process (parse or analyze) the extracted text.
The Backend returns the processed text to the Frontend.
The Frontend displays the text to the User.
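
The backend part of this flow can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `process_pdf` and `post_process` are hypothetical names, and `post_process` is only a placeholder for the Llama-Parse step, not the llama-parse API. The PyMuPDF calls (`pymupdf.open`, `page.get_text`) are real; older PyMuPDF releases expose the module as `fitz`.

```python
from typing import Callable


def extract_text_pymupdf(pdf_path: str) -> str:
    """Return the plain text of every page, joined by newlines.

    Requires the PyMuPDF package. It is imported lazily so the rest of
    this sketch can be exercised without it installed.
    """
    import pymupdf  # older PyMuPDF releases expose this module as `fitz`

    with pymupdf.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def process_pdf(
    pdf_path: str,
    extract: Callable[[str], str] = extract_text_pymupdf,
    post_process: Callable[[str], str] = str.strip,
) -> str:
    """Hypothetical backend handler: extract the text, then post-process it.

    `post_process` stands in for the Llama-Parse step. It is a placeholder,
    not the llama-parse API.
    """
    raw_text = extract(pdf_path)
    return post_process(raw_text)
```

Passing the extractor and post-processor as arguments keeps the handler independent of any one library, so either step can be swapped or stubbed in tests.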

Tech Stack

Project Structure

pdf_text_extractor/
├── .dockerignore        # Files and folders to exclude from Docker builds
├── .env                 # Environment variables (keep secret)
├── .env.example         # Example environment variables
├── .gitignore           # Git ignore rules
├── .python-version      # Python version used in the project
├── .venv/               # Local virtual environment (ignored in git)
├── data/                # Folder for input/output data (PDFs, extracted text, etc.)
├── dist/                # Distribution or build files
├── docs/                # Documentation files
├── main.py              # Entry point of the application
├── notebook/            # Jupyter notebooks for experiments or testing
├── pyproject.toml       # Project dependencies and metadata
├── README.md            # Project README file
├── Dockerfile           # Dockerfile to build the container
├── src/                 # Source code for the project
└── uv.lock              # Dependency lock file for uv

Installation

✅ 1. Clone the repository

git clone https://github.com/estelacode/pdf_text_extractor.git
cd pdf_text_extractor

✅ 2. Create and activate a virtual environment

py -3.13 -m venv .venv
.venv\Scripts\activate  # Windows
# or
source .venv/bin/activate # Linux/macOS

✅ 3. Install UV

pip install uv

✅ 4. Install dependencies from the pyproject.toml file

uv pip install -e .

✅ 5. Configure the .env file (copy .env.example to .env and fill in your values)
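
For illustration, a simple `.env` file can be read with the standard library alone. This is a sketch under assumptions: the file is assumed to hold plain `KEY=VALUE` lines, `load_env_file` and `apply_env` are hypothetical helpers, and the project may load its configuration differently (for example through a dedicated dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Copy parsed values into os.environ without overriding existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Because `apply_env` uses `setdefault`, variables already set in the shell or passed to the container take precedence over the file.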

Project Setup

☑️ Step 1: Create Virtual Environment

py -3.13 -m venv .venv

☑️ Step 2: Activate Virtual Environment

.venv\Scripts\activate

☑️ Step 3: Install UV

pip install uv

☑️ Step 4: Create a project with uv

uv init

☑️ Step 5: Link the local repository to the remote GitHub repository

git remote add origin https://github.com/estelacode/pdf_text_extractor.git
git remote -v  # Verify the remote repository is added

☑️ Step 6: Add first commit and push the current branch and set the remote as upstream

git add README.md
git commit -m "README.md"
git push --set-upstream origin master  # short form: git push -u origin master

☑️ Step 7: Add and remove dependencies

uv add [OPTIONS] <PACKAGES>...  # Add dependencies to the project
uv remove [OPTIONS] <PACKAGES>... # Remove dependencies from the project.

Usage

cd pdf_text_extractor
uv run main.py
# then open http://localhost:7860/ in a browser
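
To confirm from a script that the app is answering, a small standard-library check like the one below can help. This is only a sketch: `is_app_up` is a hypothetical helper, and the default URL is taken from the usage note above.

```python
import urllib.error
import urllib.request


def is_app_up(url: str = "http://localhost:7860/", timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # The server replied, but with an error status (4xx or 5xx).
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False


if __name__ == "__main__":
    print("app reachable:", is_app_up())
```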

Build the artifact

uv build

Requirements

uv pip freeze > requirements.txt

Developer mode

uv pip install -e .  # short form
uv pip install --editable .  # install the project in editable mode from the local path

DevOps

Generate the wheel (.whl) file in the dist folder

uv build

Build a Docker image with the code and dependencies from my project.

docker build -t pdf_text_extractor . # build your Docker image
docker images # list docker images
docker rmi <docker_image-id> # remove the Docker image with the given image ID (e.g. 1bec6217270e)

Create and run a docker container

docker run -d -p 8080:8080 pdf_text_extractor
docker ps -a # list all the docker containers
docker rm -f <container-id> # remove the Docker container with the given ID (e.g. cf99422731ef)

Create a Docker container with environment variables

docker run -d -p 8080:8080 -e LLAMA_CLOUD_API_KEY="XXXXXXXXXXXXXXXXXXXXXXXXXX" pdf_text_extractor
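
Inside the container, a variable passed with `-e` appears as an ordinary environment variable. One way to read it and fail early when it is missing is sketched below. `require_env` is a hypothetical helper, not part of the project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


if __name__ == "__main__":
    # LLAMA_CLOUD_API_KEY is the variable name used in the docker run command.
    api_key = require_env("LLAMA_CLOUD_API_KEY")
    print("API key loaded, length:", len(api_key))
```

Failing at startup gives a clearer error than a rejected API call later on.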

Tech Stack

PDF Processing Libraries

👋 Author

Estela Madariaga


