Init commit

KeithTheEE · Nov 23, 2019 · commit 35284ae

Showing 4 changed files with 174 additions and 0 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,54 @@
+
+
+# CHANGELOG: pdfn2tex 0.0.00
+All notable changes to this project will be documented in this file.
+
+The format is adapted from [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
+It won't adhere perfectly, but it's a start.
+
+
+Dates follow YYYY-MM-DD format
+
+
+> “What do you think?” he demanded impetuously.
+
+> “About what?” He waved his hand toward the book-shelves.
+
+> “About that. As a matter of fact you needn’t bother to ascertain. I ascertained. They’re real.”
+
+> “The books?”
+
+> He nodded.
+
+> “Absolutely real — have pages and everything. I thought they’d be a nice durable cardboard. Matter of fact, they’re absolutely real. Pages and — Here! Lemme show you.”
+
+> Taking our scepticism for granted, he rushed to the bookcases and returned with Volume One of the “Stoddard Lectures.”
+
+> “See!” he cried triumphantly. “It’s a bona-fide piece of printed matter. It fooled me. This fella’s a regular Belasco. It’s a triumph. What thoroughness! What realism! Knew when to stop, too — didn’t cut the pages. But what do you want? What do you expect?”
+
+-- Owl Eyes in The Great Gatsby by F. Scott Fitzgerald
+
+
+## pdfn2tex [0.0.00] 2019-XX-XX
+In Progress
+### Contributors
+Keith Murray
+
+email: kmurrayis@gmail.com |
+twitter: [@keithTheEE](https://twitter.com/keithTheEE) |
+github: [CrakeNotSnowman](https://github.com/CrakeNotSnowman)
+
+Unless otherwise noted, all changes by @kmurrayis
+
+#### Big Picture: What happened, what was worked on
+
+
+#### Added
+#### Changed
+#### Deprecated
+#### Removed
+#### Fixed
+#### Security
+#### Documentation
+#### Testing
+
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,7 @@
+Copyright 2019, Keith Murray
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,38 @@
+# pdfn2tex Version 0.0.00
+
+A Skeleton Outline for a PDF->latex converter (in development)
+
+This project makes assumptions in order to make development easier. While the assumptions do not hold for PDFs in general, they should hold reasonably well for PDFs of research publications, dissertations, and theses, provided those PDFs avoid fancy formatting.
+
+This early version is going to play with Tesseract OCR v4 to aid the interpretation process. 
+
+It is a portion of a larger 'Document to eReader' project. 
+
+
+## Style Traits 
+This project takes the following approach to mapping PDF contents to TeX.
+
+It assumes text flows top to bottom, and all sizable blocks of text can be mapped to a top-down order and position.
+
+Images and tables are treated as additional info, to be displayed 'alongside' the text, so their position is mutable. Diagrams and algorithm blocks are treated as part of this class.
+
+Equations and special characters are treated as inline text, and their position is not mutable.
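The style traits above can be sketched as a small data model. This is only an illustration: the `Block` class, its fields, and the `FLOAT_KINDS` set are hypothetical names, not part of this commit. Text and equations keep their top-down reading order, while images, tables, diagrams, and algorithm blocks are collected separately so they can be placed later.

```python
from dataclasses import dataclass

# Hypothetical block kinds whose position is allowed to move ("floats").
FLOAT_KINDS = {"image", "table", "diagram", "algorithm"}


@dataclass
class Block:
    kind: str      # e.g. "text", "equation", "image", "table"
    page: int      # zero-based page index
    top: float     # distance from the top of the page
    content: str


def split_blocks(blocks):
    """Return (anchored, floats), each sorted in top-down reading order."""
    ordered = sorted(blocks, key=lambda b: (b.page, b.top))
    anchored = [b for b in ordered if b.kind not in FLOAT_KINDS]
    floats = [b for b in ordered if b.kind in FLOAT_KINDS]
    return anchored, floats
```

Sorting on `(page, top)` only works under the README's assumption that text flows top to bottom; a two-column layout would need a column index in the sort key.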
+
+## System Structure
+Right now Tesseract is simply going to be used to grab text and drop it into a flat file, likely a .txt, so I can explore how to use it. Once I've got a handle on it, we can start the proper program.
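A minimal sketch of that exploratory step, assuming the pages have already been rendered to image files and that Tesseract, `pytesseract`, and Pillow are installed. The function names `tesseract_ocr` and `pages_to_flat_file` are hypothetical. The OCR call is passed in as a parameter so the file-writing logic can be tried without Tesseract.

```python
from pathlib import Path


def tesseract_ocr(image_path):
    """OCR a single page image with pytesseract (imported lazily)."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def pages_to_flat_file(image_paths, out_path, ocr=tesseract_ocr):
    """Write the OCR text of each page, in order, to one .txt file."""
    texts = [ocr(path) for path in image_paths]
    Path(out_path).write_text("\n".join(texts), encoding="utf-8")
    return len(texts)
```

For example, `pages_to_flat_file(["page-1.png", "page-2.png"], "paper.txt")` would OCR two page images and write their text to `paper.txt`.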
+
+
+The first pass through the PDF should generate predicted document formatting flags:
+
+ - \documentclass
+ - \usepackage
+ - \newcommand (?)
+ - \pagenumbering (?)
+
+ These are the preamble commands and packages needed to render the document.
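One way the predicted flags could be turned into a preamble is sketched below. The function `build_preamble` and its parameters are assumptions made for illustration; the commit does not define how the flags are stored.

```python
def build_preamble(documentclass, class_options=(), packages=(),
                   newcommands=(), pagenumbering=None):
    """Assemble LaTeX preamble lines from predicted formatting flags."""
    opts = f"[{','.join(class_options)}]" if class_options else ""
    lines = [f"\\documentclass{opts}{{{documentclass}}}"]
    for pkg in packages:
        lines.append(f"\\usepackage{{{pkg}}}")
    # Each new command is a (name, body) pair, e.g. ("R", "\\mathbb{R}").
    for name, body in newcommands:
        lines.append(f"\\newcommand{{\\{name}}}{{{body}}}")
    if pagenumbering:
        lines.append(f"\\pagenumbering{{{pagenumbering}}}")
    return "\n".join(lines)
```

Calling `build_preamble("article", ["twocolumn"], ["amsmath"])` gives a two-line preamble with `\documentclass[twocolumn]{article}` followed by `\usepackage{amsmath}`.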
+
+
+## Error Measures
+The generated LaTeX document is not intended to be a faithful reproduction of the source (assuming the training process goes LaTeX->PDF->LaTeX). The reader of the material (the important element in this project) should not care whether the text is formatted in one column or two, whether a word occurs on page 11 or 12, or exactly where an image sits.
+
+What does matter is that sections are grouped and nested correctly, internal document links are maintained (figure 1, reference 3, etc), and equations and tables are preserved.
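A rough sketch of a structure-based check in that spirit, assuming both the source and the generated document are available as LaTeX strings. It compares only the section outline and the `\ref`/`\cite` keys; equations and tables are not covered. The names `outline` and `structure_matches` are hypothetical.

```python
import re

LEVELS = {"section": 1, "subsection": 2, "subsubsection": 3}
HEADING = re.compile(r"\\(section|subsection|subsubsection)\*?\{([^}]*)\}")
REF = re.compile(r"\\(?:ref|cite)\{([^}]*)\}")


def outline(tex):
    """List (nesting level, title) for each sectioning command, in order."""
    return [(LEVELS[m.group(1)], m.group(2).strip())
            for m in HEADING.finditer(tex)]


def structure_matches(source_tex, generated_tex):
    """True when section nesting and internal link keys both agree."""
    same_outline = outline(source_tex) == outline(generated_tex)
    same_refs = (sorted(REF.findall(source_tex))
                 == sorted(REF.findall(generated_tex)))
    return same_outline and same_refs
```

Layout details such as column count or page breaks never enter the comparison, which matches the intent that they should not count as errors.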
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -0,0 +1,75 @@
+
+
+# ROADMAP: pdfn2tex 0.0.00
+Future expansions are considered in this file. 
+Their presence is not a promise that they'll exist; rather, this file serves as an
+early outline of features this project hopes to add, as well as changes in direction.
+
+
+The format is adapted from [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
+It won't adhere perfectly, but it's a start.
+
+
+Dates follow YYYY-MM-DD format
+
+
+> “What do you think?” he demanded impetuously.
+
+> “About what?” He waved his hand toward the book-shelves.
+
+> “About that. As a matter of fact you needn’t bother to ascertain. I ascertained. They’re real.”
+
+> “The books?”
+
+> He nodded.
+
+> “Absolutely real — have pages and everything. I thought they’d be a nice durable cardboard. Matter of fact, they’re absolutely real. Pages and — Here! Lemme show you.”
+
+> Taking our scepticism for granted, he rushed to the bookcases and returned with Volume One of the “Stoddard Lectures.”
+
+> “See!” he cried triumphantly. “It’s a bona-fide piece of printed matter. It fooled me. This fella’s a regular Belasco. It’s a triumph. What thoroughness! What realism! Knew when to stop, too — didn’t cut the pages. But what do you want? What do you expect?”
+
+-- Owl Eyes in The Great Gatsby by F. Scott Fitzgerald
+
+
+
+
+## [0.0.00] pdfn2tex 2019-XX-XX
+In Progress.
+### Contributors
+Keith Murray
+
+email: kmurrayis@gmail.com |
+twitter: [@keithTheEE](https://twitter.com/keithTheEE) |
+github: [CrakeNotSnowman](https://github.com/CrakeNotSnowman)
+
+Unless otherwise noted, all changes by @kmurrayis
+
+### Short Term Roadmap
+Convert the PDF into images
+
+pytesseract to get text
+
+OpenCV to get format type?
+
+
+#### Add
+ - Make a flag to distinguish PDFs generated directly by LaTeX from PDFs generated by LaTeX plus artifacts, where artifacts come from actions like scanning a paper copy of the PDF.
+#### Change
+#### Deprecate
+#### Remove
+#### Fix
+#### Security
+ - Because the 'convert' function is vulnerable to remote code execution, and because this tool is meant to be paired with a browser extension, I need to search for another method to render the PDF as an image.
+ - Workaround is pdftoppm
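A minimal sketch of the pdftoppm workaround, assuming Poppler's `pdftoppm` is on the PATH. The helper names `pdftoppm_command` and `render_pdf` are hypothetical. The command is passed as an argument list, so no shell is involved in building it; this does not by itself protect against bugs in the PDF renderer.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm argument list: one PNG per page at the given DPI."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def render_pdf(pdf_path, out_prefix, dpi=300):
    """Run pdftoppm; raises CalledProcessError if it exits non-zero."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, dpi), check=True)
```

`render_pdf("paper.pdf", "pages/page")` would write numbered PNG files whose names start with `pages/page`.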
+#### Documentation
+#### Consider
+#### Testing
+
+---
+
+### General to Long Term Expansion
+
+ - Consider exporting equations from LaTeX to Python via SymPy.
+   - See https://stackoverflow.com/questions/1381741/converting-latex-code-to-images-or-other-displayble-format-with-python
+ - Also consider rendering equations as PNG using the same function (useful when the build target is an old eReader like the touch)