Agropontos Regex is a small Python program that extracts geolocation coordinates from PDF files, such as rural property registration documents.
It supports two types of coordinates, UTM and Lat-Long, and generates a CSV file that can be imported directly into GIS software such as QGIS.
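As a rough illustration of the idea, the sketch below pulls UTM pairs out of OCR text with a regular expression and writes them as CSV. The pattern, the `E ... N ...` text layout, the column names, and the function names are all assumptions made for this example, not the patterns the program actually uses. Only the UTM case is shown.

```python
import csv
import io
import re

# Hypothetical layout: "E 234567,89 m and N 7654321,12 m".
# Assumes a 6-digit easting and a 7-digit northing, with "." or "," as the decimal mark.
UTM_PATTERN = re.compile(r"E\s*(\d{6}[.,]\d+)\D+?N\s*(\d{7}[.,]\d+)")


def to_float(number_text):
    """Convert '234567,89' or '234567.89' to a float."""
    return float(number_text.replace(",", "."))


def extract_utm(text):
    """Return a list of (easting, northing) tuples found in the text."""
    return [(to_float(e), to_float(n)) for e, n in UTM_PATTERN.findall(text)]


def points_to_csv(points):
    """Serialize the points as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["easting", "northing"])  # illustrative column names
    writer.writerows(points)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = (
        "vertex P1 at E 234567,89 m and N 7654321,12 m; "
        "vertex P2 at E 234600.00 m and N 7654400.50 m"
    )
    print(points_to_csv(extract_utm(sample)))
```

Real documents vary a lot in how they write coordinates, so a pattern like this would likely need adjusting for each document style.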
The program interface can be used like a notepad to correct any errors or wrong characters introduced by OCR scanning. It also generates a new PDF file with the page tilt and rotation corrected.
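The tilt and rotation correction can be reproduced by hand with OCRmyPDF, which is one of the dependencies. The sketch below builds a command line using OCRmyPDF's `--deskew`, `--rotate-pages` and `-l` options. How the program itself calls OCRmyPDF is not documented here, so treat the helper names and the default language as assumptions.

```python
import subprocess


def build_ocrmypdf_command(src, dst, language="eng"):
    """Build an ocrmypdf command that deskews and auto-rotates pages.

    `language` is a Tesseract language code; "eng" is only a placeholder default.
    """
    return [
        "ocrmypdf",
        "--deskew",        # straighten tilted pages
        "--rotate-pages",  # fix pages scanned sideways or upside down
        "-l", language,    # OCR language (needs the matching trained data)
        str(src),
        str(dst),
    ]


def straighten_pdf(src, dst, language="eng"):
    """Run ocrmypdf; raises CalledProcessError if the command fails."""
    subprocess.run(build_ocrmypdf_command(src, dst, language), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without input files.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_fixed.pdf", "por")))
```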
![](https://private-user-images.githubusercontent.com/2325925/241585939-af976dc1-9415-4b6a-b2bb-71f8e9642555.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNzU5MDMsIm5iZiI6MTczOTE3NTYwMywicGF0aCI6Ii8yMzI1OTI1LzI0MTU4NTkzOS1hZjk3NmRjMS05NDE1LTRiNmEtYjJiYi03MWY4ZTk2NDI1NTUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTBUMDgyMDAzWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YjcxZDYwNjAxYTM5ZTUxZDcxNTNkYTc5ZjNmNzEyZGM3YTNjNzI3NzMyZGQ2ZDNhMzE1ZjVhNzU4YzYzZmRmOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.ZM7JfWSuqrC7VQdYOo97Q85hrBKHdCZkfRjACwHrppQ)
On Windows, you need to install the packages listed below.
I recommend using the Chocolatey package manager to install some of them (run the `choco` commands in an Administrator command prompt):
- Python 3.8 (64-bit) or later: `choco install python3`
- Tesseract 4.1.1 (64-bit) or later: `choco install --pre tesseract`
  - You'll also need the trained data files for Tesseract, according to your language
- Ghostscript 9.50 (64-bit) or later: `choco install ghostscript`
- OCRmyPDF 14.2.0 (64-bit) or later: `pip install ocrmypdf`
- pypdf 3.9.0 (64-bit) or later: `pip install pypdf`
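After installing, a quick check like the one below can confirm that the tools are reachable. This is a minimal sketch and not part of the program: it assumes the executables are on `PATH` under their usual names (`tesseract`, and `gswin64c` or `gs` for Ghostscript), and it does not verify the minimum versions of anything except Python.

```python
import importlib.util
import shutil
import sys

# Executable names are assumptions about a typical install.
EXECUTABLES = {
    "Tesseract": ("tesseract",),
    "Ghostscript": ("gswin64c", "gs"),
}
PACKAGES = ("ocrmypdf", "pypdf")


def check_environment():
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    report = {"Python >= 3.8": sys.version_info >= (3, 8)}
    for name, candidates in EXECUTABLES.items():
        # Found if any of the candidate executable names is on PATH.
        report[name] = any(shutil.which(exe) is not None for exe in candidates)
    for package in PACKAGES:
        # find_spec returns None when a top-level package is not installed.
        report[package] = importlib.util.find_spec(package) is not None
    return report


if __name__ == "__main__":
    for requirement, found in check_environment().items():
        print(f"{requirement}: {'OK' if found else 'MISSING'}")
```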