PDF to JSON-ified Text Converter with Efficient Wiktionary Search
This repository contains a web application that lets users upload their own PDF files and converts them into JSON-ified text. Each PDF is sent to a Python Flask server hosted on Render.com. The server breaks the PDF down into individual words and then finds the basic form of each word by making requests to Wiktionary. The resulting JSON-ified text of the PDF, along with a dictionary mapping each word to its basic form, is then stored in a MongoDB database.
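As a rough illustration of the output described above, the sketch below splits extracted page text into words and builds a JSON document together with the word-to-basic-form dictionary. The field names (`pages`, `words`, `basic_forms`), the `to_json_document` helper, and the `lookup_basic_form` callback are all assumptions made for this example; the actual server may structure its output differently.

```python
import json
import re
from typing import Callable, Dict, List

# Matches runs of word characters other than digits and underscores,
# so accented and other non-ASCII letters are kept.
WORD_RE = re.compile(r"[^\W\d_]+")


def split_words(text: str) -> List[str]:
    """Return the lowercase words found in a piece of text."""
    return [w.lower() for w in WORD_RE.findall(text)]


def to_json_document(pages: List[str], lookup_basic_form: Callable[[str], str]) -> str:
    """Build a JSON string with per-page words and a word -> basic form map."""
    page_words = [split_words(page) for page in pages]
    basic_forms: Dict[str, str] = {}
    for words in page_words:
        for word in words:
            # Look each distinct word up only once.
            if word not in basic_forms:
                basic_forms[word] = lookup_basic_form(word)
    document = {
        "pages": [{"number": i + 1, "words": words} for i, words in enumerate(page_words)],
        "basic_forms": basic_forms,
    }
    return json.dumps(document, ensure_ascii=False)
```

Here `lookup_basic_form` stands in for whatever performs the Wiktionary request, so the sketch can be tried with any function that maps a word to a string.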
- PDF Upload and Conversion: Users can upload their PDF files through the web interface, and the Flask server handles the conversion process.
- Efficient Word Processing: The Flask server breaks the PDF down into individual words and then searches for the basic form of each word on Wiktionary. This approach is intended to avoid unnecessary requests.
- Data Storage: The JSON-ified text of the PDF and the associated dictionary (each word and its basic form) are saved to a MongoDB database, providing a scalable and flexible solution for data management.
- User-Friendly Interface: The website lets users view all uploaded files in their JSON-ified form, providing a clear and organized representation of the converted content.
- On-Demand Word Description: Each uploaded file comes with an associated dictionary generated by the Flask server. The backend uses it to search the Wiktionary dump in real time for the full description of any word the user clicks on.
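
The click-to-describe flow in the last feature can be pictured roughly as follows. This is only a sketch: `describe_clicked_word` is a hypothetical helper, and plain dictionaries stand in for both the stored per-file dictionary and the Wiktionary dump.

```python
from typing import Dict, Optional


def describe_clicked_word(
    clicked: str,
    basic_forms: Dict[str, str],
    descriptions: Dict[str, str],
) -> Optional[str]:
    """Resolve a clicked word to its basic form, then fetch that form's description."""
    word = clicked.lower()
    # Step 1: the per-file dictionary maps the word as it appears in the PDF
    # to its basic form. Fall back to the word itself if it is not listed.
    basic = basic_forms.get(word, word)
    # Step 2: look the basic form up in the description source.
    # Returns None when no description is available.
    return descriptions.get(basic)
```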
Check out the live application at http://jakubgrad.ddns.net:2227/frontend/about and the source code on GitHub. Feel free to explore the codebase and contribute to the project. The live application currently runs on my private server.
The hybrid approach of the current application has serious disadvantages, especially in terms of memory efficiency and speed when making requests to Wiktionary for each word in a PDF. To address this problem, I'm in the process of implementing a better approach:
- Using a Flask Python server with direct access to the Wiktionary dump.
This new approach will allow a more efficient search for the basic form of a word in the Wiktionary dump, without the need for an additional search for the full description. Accessing the relevant information directly will save processing time and improve the overall performance of the application.
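One way to picture the planned lookup is an in-memory index built once from the dump, so that a single lookup returns both the basic form and the description. The sketch below assumes the dump has already been preprocessed into tab-separated lines of word form, basic form, and description. That layout and the names `DumpEntry` and `build_index` are assumptions for illustration, not the real Wiktionary dump format.

```python
from typing import Dict, Iterable, NamedTuple


class DumpEntry(NamedTuple):
    basic_form: str
    description: str


def build_index(lines: Iterable[str]) -> Dict[str, DumpEntry]:
    """Index preprocessed dump lines by word form (hypothetical TSV layout)."""
    index: Dict[str, DumpEntry] = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # Skip blank or malformed lines.
        form, basic_form, description = parts
        # Keep the first entry seen for each word form.
        index.setdefault(form.lower(), DumpEntry(basic_form, description))
    return index
```

With such an index, `index.get(word)` would yield the basic form and the description together, with no network request per word.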
If you have any suggestions, ideas, or would like to contribute, feel free to open an issue or submit a pull request!
Thank you for your interest in the project!
- Is this a serious attempt at creating a language reading app? Semi-serious. The UI definitely needs improvement, and so does the method for finding words. The current speed of about a minute for 1 page of a PDF is prohibitively slow, and there is no personalization for users of the website, like saving words or marking progress in a PDF, though this might come soon. There is also no UI option for flipping pages o_o