Init commit
KeithTheEE committed Nov 23, 2019
0 parents commit 35284ae
Showing 4 changed files with 174 additions and 0 deletions.
54 changes: 54 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,54 @@


# CHANGELOG: pdfn2tex 0.0.00
All notable changes to this project will be documented in this file.

The format is adapted from [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
It won't adhere perfectly, but it's a start.


Dates follow YYYY-MM-DD format


> “What do you think?” he demanded impetuously.
> “About what?” He waved his hand toward the book-shelves.
> “About that. As a matter of fact you needn’t bother to ascertain. I ascertained. They’re real.”
> “The books?”
> He nodded.
> “Absolutely real — have pages and everything. I thought they’d be a nice durable cardboard. Matter of fact, they’re absolutely real. Pages and — Here! Lemme show you.”
> Taking our scepticism for granted, he rushed to the bookcases and returned with Volume One of the “Stoddard Lectures.”
> “See!” he cried triumphantly. “It’s a bona-fide piece of printed matter. It fooled me. This fella’s a regular Belasco. It’s a triumph. What thoroughness! What realism! Knew when to stop, too — didn’t cut the pages. But what do you want? What do you expect?”
-- Owl Eyes in The Great Gatsby by F. Scott Fitzgerald


## pdfn2tex [0.0.00] 2019-XX-XX
In Progress
### Contributors
Keith Murray

email: kmurrayis@gmail.com |
twitter: [@keithTheEE](https://twitter.com/keithTheEE) |
github: [CrakeNotSnowman](https://github.com/CrakeNotSnowman)

Unless otherwise noted, all changes by @kmurrayis

#### Big Picture: What happened, what was worked on


#### Added
#### Changed
#### Deprecated
#### Removed
#### Fixed
#### Security
#### Documentation
#### Testing

7 changes: 7 additions & 0 deletions LICENSE
@@ -0,0 +1,7 @@
Copyright 2019, Keith Murray

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
38 changes: 38 additions & 0 deletions README.md
@@ -0,0 +1,38 @@
# pdfn2tex Version 0.0.00

A skeleton outline for a PDF-to-LaTeX converter (in development).

This project makes simplifying assumptions in order to make development easier. While the assumptions do not hold for PDFs in general, they should be reasonable for PDFs made for research publications, dissertations, and theses, provided the PDFs do not play with fancy formatting.

This early version experiments with Tesseract OCR v4 to aid the interpretation process.

It is a portion of a larger 'Document to eReader' project.


## Style Traits
This project takes a simplified approach to mapping PDF contents to TeX, built on the following assumptions.

It assumes text flows top to bottom, and all sizable blocks of text can be mapped to a top-down order and position.

Images and tables are treated as additional info, to be displayed 'alongside' the text, so their position is mutable. Diagrams and algorithm blocks are treated as part of this class.

Equations and special characters are treated as inline text, and their position is not mutable.
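
As a rough sketch of these rules (not the project's actual data model), the block below sorts detected blocks top to bottom and separates the movable "float" class from the fixed text flow. The `Block` fields, the `kind` labels, and the function name are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical block kinds whose position is mutable (shown 'alongside' the text).
FLOAT_KINDS = {"image", "table", "diagram", "algorithm"}


@dataclass
class Block:
    kind: str      # e.g. "text", "equation", "image", "table"
    page: int      # 1-based page number
    top: float     # distance from the top of the page
    content: str


def split_flow_and_floats(blocks):
    """Order blocks top-down, then split fixed flow from movable floats."""
    ordered = sorted(blocks, key=lambda b: (b.page, b.top))
    flow = [b for b in ordered if b.kind not in FLOAT_KINDS]
    floats = [b for b in ordered if b.kind in FLOAT_KINDS]
    return flow, floats
```

Equations stay in the flow list because they are treated as inline text, while anything in `FLOAT_KINDS` can be re-placed later without changing the reading order.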

## System Structure
Right now Tesseract is simply going to be used to grab text and drop it into a flat file, likely a .txt, so that I can explore how to use it. Once I've got a handle on it, we can start the proper program.
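
A minimal sketch of that exploratory step, assuming the pages have already been rendered to image files and that `pytesseract` and Pillow are installed. The function names are hypothetical, and the OCR call is passed in as a parameter so it can be swapped out while experimenting.

```python
from pathlib import Path


def _tesseract_ocr(image_path):
    # Imported lazily so the rest of the module works without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))


def ocr_pages(image_paths, ocr=_tesseract_ocr):
    """Run OCR on each page image and join the results with blank lines."""
    return "\n\n".join(ocr(path).strip() for path in image_paths)


def dump_text(image_paths, out_path, ocr=_tesseract_ocr):
    """Write the OCR text of all pages to one flat .txt file."""
    Path(out_path).write_text(ocr_pages(image_paths, ocr), encoding="utf-8")
```

Something like `dump_text(["page-1.png", "page-2.png"], "paper.txt")` would then produce the flat file; the page file names here are placeholders.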


The first pass through the PDF should generate the predicted document formatting flags:

- \documentclass
- \usepackage
- \newcommand (?)
- \pagenumbering (?)

These are the preamble declarations and packages needed to render the PDF.
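
To make the idea concrete, here is one possible way to turn predicted flags into a LaTeX preamble. The function name, its parameters, and the example values are assumptions for illustration; how the flags get predicted is not covered.

```python
def build_preamble(documentclass, class_options=(), packages=(),
                   newcommands=(), pagenumbering=None):
    """Assemble preamble lines from predicted formatting flags."""
    options = "[" + ",".join(class_options) + "]" if class_options else ""
    lines = ["\\documentclass" + options + "{" + documentclass + "}"]
    for package in packages:
        lines.append("\\usepackage{" + package + "}")
    # Each new command is a (name, body) pair, e.g. ("R", "\\mathbb{R}").
    for name, body in newcommands:
        lines.append("\\newcommand{\\" + name + "}{" + body + "}")
    if pagenumbering:
        lines.append("\\pagenumbering{" + pagenumbering + "}")
    return "\n".join(lines)
```

For example, `build_preamble("article", ["twocolumn"], ["amsmath"])` yields a two-line preamble with `\documentclass[twocolumn]{article}` followed by `\usepackage{amsmath}`.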


## Error Measures
The generated LaTeX document is not intended to be a faithful reproduction of the source (assuming the training process goes LaTeX->PDF->LaTeX). The reader of the material (the important element in this project) should not care whether the text is formatted in one column or two, whether a word occurs on page 11 or 12, or exactly where an image is positioned.

What does matter is that sections are grouped and nested correctly, internal document links are maintained (figure 1, reference 3, etc.), and equations and tables are preserved.
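
One possible measure along these lines, offered only as a sketch: extract the section outline from the source and generated LaTeX and compare the two sequences, ignoring layout entirely. The function names are hypothetical, the regex only handles simple headings without nested braces, and `difflib.SequenceMatcher` is just one convenient choice of similarity score.

```python
import re
from difflib import SequenceMatcher

_HEADING = re.compile(r"\\(section|subsection|subsubsection)\*?\{([^}]*)\}")
_DEPTH = {"section": 1, "subsection": 2, "subsubsection": 3}


def outline(tex):
    """Return the document outline as a list of (depth, title) pairs."""
    return [(_DEPTH[m.group(1)], m.group(2).strip())
            for m in _HEADING.finditer(tex)]


def structure_similarity(source_tex, generated_tex):
    """Score in [0, 1]; 1.0 means identical section grouping and nesting."""
    matcher = SequenceMatcher(None, outline(source_tex), outline(generated_tex))
    return matcher.ratio()
```

A heading that is recovered at the wrong depth counts as a mismatch, while a change from one column to two does not affect the score at all.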
75 changes: 75 additions & 0 deletions ROADMAP.md
@@ -0,0 +1,75 @@


# ROADMAP: pdfn2tex 0.0.00
Future expansions are considered in this file.
Their presence is not a promise that they'll exist; rather, this file serves as an
early outline of features this project hopes to add, as well as changes in direction.


The format is adapted from [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
It won't adhere perfectly, but it's a start.


Dates follow YYYY-MM-DD format


> “What do you think?” he demanded impetuously.
> “About what?” He waved his hand toward the book-shelves.
> “About that. As a matter of fact you needn’t bother to ascertain. I ascertained. They’re real.”
> “The books?”
> He nodded.
> “Absolutely real — have pages and everything. I thought they’d be a nice durable cardboard. Matter of fact, they’re absolutely real. Pages and — Here! Lemme show you.”
> Taking our scepticism for granted, he rushed to the bookcases and returned with Volume One of the “Stoddard Lectures.”
> “See!” he cried triumphantly. “It’s a bona-fide piece of printed matter. It fooled me. This fella’s a regular Belasco. It’s a triumph. What thoroughness! What realism! Knew when to stop, too — didn’t cut the pages. But what do you want? What do you expect?”
-- Owl Eyes in The Great Gatsby by F. Scott Fitzgerald




## [0.0.00] pdfn2tex 2019-XX-XX
In Progress.
### Contributors
Keith Murray

email: kmurrayis@gmail.com |
twitter: [@keithTheEE](https://twitter.com/keithTheEE) |
github: [CrakeNotSnowman](https://github.com/CrakeNotSnowman)

Unless otherwise noted, all changes by @kmurrayis

### Short Term Roadmap
- Convert the PDF into images
- Use pytesseract to get the text
- Use OpenCV to get the format type?


#### Add
- Make a flag to distinguish a PDF generated directly by LaTeX from a PDF generated by LaTeX plus artifacts, where the artifacts are caused by actions like scanning a paper copy of the PDF.
#### Change
#### Deprecate
#### Remove
#### Fix
#### Security
- Because the 'convert' function is vulnerable to remote code execution, and because this tool is meant to be paired with a browser extension, I need to find another method to render the PDF as an image.
  - Workaround: pdftoppm
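
A minimal sketch of the pdftoppm route, assuming the `pdftoppm` binary from poppler-utils is on the PATH. The function names and the default resolution are assumptions; `-png` and `-r` are standard pdftoppm options. Passing an argument list to `subprocess.run` (no shell) keeps file names from being interpreted by a shell.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the argument list that renders each PDF page to a PNG."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, out_prefix]


def render_pdf(pdf_path, out_prefix, dpi=300):
    """Run pdftoppm; raises CalledProcessError if rendering fails."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, dpi), check=True)
```

Calling `render_pdf("paper.pdf", "page")` would write one numbered PNG per page using `page` as the file name prefix; both names are placeholders.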
#### Documentation
#### Consider
#### Testing

---

### General to Long Term Expansion

- Consider exporting equations from LaTeX to Python via SymPy.
  - See https://stackoverflow.com/questions/1381741/converting-latex-code-to-images-or-other-displayble-format-with-python
  - Also consider rendering equations as PNG using the same function (useful when the build target is an old e-reader like the touch)
