The PDF Extractor API is a FastAPI-based application that extracts text and metadata from PDF files. It authenticates requests with JWT tokens and applies rate limiting to manage API usage. Users can upload PDF files, extract headers and items based on the keywords they provide, and receive the results in a user-friendly format.
- Authentication: Secure API access with JWT tokens.
- File Upload: Upload PDF files in base64 format.
- PDF Extraction: Extract headers and items from PDF files.
- Rate Limiting: Protect the API from excessive usage.
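This README does not say how rate limiting is implemented, so the following is only an illustrative sketch of one common approach, a per-client sliding window. The class name `SlidingWindowLimiter`, its parameters, and the idea of keying on a client identifier are all assumptions, not the project's actual code.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most max_requests per client in any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id):
        """Return True and record the request if the client is under its limit."""
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A server would typically call `allow()` once per request and answer with HTTP 429 when it returns `False`.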
To get started with the PDF Extractor API, follow these instructions to set up your development environment and run the application.
- Python 3.11+
- Docker (optional, for containerized deployment)
- Clone the Repository
git clone https://github.com/yourusername/pdf-extractor-api.git
cd pdf-extractor-api
- Set Up a Virtual Environment
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install Dependencies
pip install -r requirements.txt
- Configure Environment
Create a `config.json` file in the root directory with the following content:
{
  "client_id": "your_client_id",
  "client_secret": "your_client_secret",
  "url_auth": "your_auth_url",
  "api_url": "your_api_url",
  "access_token": "",
  "expires_at": ""
}
Replace the placeholders with your actual configuration values.
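Before starting the server, it can help to confirm that the file parses and that no placeholder was left blank. The sketch below is not taken from the project's code: the `load_config` helper is hypothetical, and the split between keys that must be filled in and token fields that may stay empty is an assumption based on the template above.

```python
import json
from pathlib import Path

# Keys that appear in the config.json template.
EXPECTED_KEYS = {
    "client_id", "client_secret", "url_auth", "api_url", "access_token", "expires_at",
}
# Assumption: the token fields start empty, while the remaining keys must be set.
REQUIRED_NON_EMPTY = {"client_id", "client_secret", "url_auth", "api_url"}


def load_config(path="config.json"):
    """Load the config file and raise ValueError if keys are missing or blank."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(config, dict):
        raise ValueError("config.json must contain a JSON object")
    missing = EXPECTED_KEYS - config.keys()
    if missing:
        raise ValueError(f"config.json is missing keys: {sorted(missing)}")
    blank = sorted(key for key in REQUIRED_NON_EMPTY if not config[key])
    if blank:
        raise ValueError(f"config.json has empty values for: {blank}")
    return config
```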
- Start the Server
uvicorn main:app --host 0.0.0.0 --port 8000
- Access the API
Open your browser or API client and navigate to http://localhost:8000/docs to access the interactive API documentation provided by FastAPI.
- API Endpoints
- POST /token: Obtain an access token.
- GET /users/me: Get information about the current user.
- POST /upload: Upload a PDF file in base64 format.
- POST /extract-header: Extract header information from a PDF.
- POST /extract-items: Extract item information from a PDF.
- Authenticate and Get a Token
curl -X POST "http://localhost:8000/token" -H "Content-Type: application/x-www-form-urlencoded" -d "username=your_username&password=your_password"
- Upload a PDF File
curl -X POST "http://localhost:8000/upload" -H "Content-Type: application/json" -d '{"base64_string": "your_base64_encoded_pdf"}'
- Extract Header
curl -X POST "http://localhost:8000/extract-header" -H "Authorization: Bearer your_access_token" -H "Content-Type: application/json" -d '{"file_id": "your_file_id", "keywords": ["keyword1", "keyword2"], "prompt": "Extract the header from the PDF."}'
- Extract Items
curl -X POST "http://localhost:8000/extract-items" -H "Authorization: Bearer your_access_token" -H "Content-Type: application/json" -d '{"file_id": "your_file_id", "keywords": ["keyword1", "keyword2"], "prompt": "Extract the items from the PDF."}'
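The same calls can be made from Python. The following is a minimal sketch using only the standard library; it mirrors the curl commands above, including sending `/upload` without an `Authorization` header. The helper names (`token_request`, `upload_request`, `extract_request`, `send`) are hypothetical, and because this README does not document the response bodies, `send` simply returns whatever JSON the server replies with.

```python
import base64
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the uvicorn command in this README


def token_request(username, password, base_url=BASE_URL):
    """Build the form-encoded POST /token request."""
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(
        f"{base_url}/token",
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def json_request(path, payload, token=None, base_url=BASE_URL):
    """Build a JSON POST request, adding a Bearer token when one is given."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(f"{base_url}{path}", data=data, headers=headers, method="POST")


def upload_request(pdf_bytes, base_url=BASE_URL):
    """Base64-encode raw PDF bytes and wrap them in the /upload payload."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json_request("/upload", {"base64_string": encoded}, base_url=base_url)


def extract_request(kind, token, file_id, keywords, prompt, base_url=BASE_URL):
    """Build a POST /extract-header or /extract-items request; kind is 'header' or 'items'."""
    if kind not in ("header", "items"):
        raise ValueError("kind must be 'header' or 'items'")
    payload = {"file_id": file_id, "keywords": list(keywords), "prompt": prompt}
    return json_request(f"/extract-{kind}", payload, token=token, base_url=base_url)


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(token_request("your_username", "your_password"))` would perform the authentication call. Note that `urlencode` percent-encodes special characters such as `@` in a password, which curl's `-d` flag does not do on its own (`--data-urlencode` does).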
This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for more details.
We welcome contributions to improve the PDF Extractor API. Please follow these steps to contribute:
- Fork the repository.
- Create a new branch for your changes.
- Make your changes and test them.
- Submit a pull request with a detailed description of your changes.
For any questions or support, please open an issue in the repository.