Tasks with unclassified time period

Pass receiver through to spatials calls, or construct receiver from method calls in spatials calls.

Write methods to create spatial objects.

Use show_text_with_positioning to construct text runs.

Record spatial objects that result from spatials method calls.

Complete the convert method - write spatial hashes out to XML.

Write Spatial class with alter method.

Small tasks, 10 minutes or so

Improve inclusion of spatial object modules. Shouldn’t need to call, for example, include_text_runs.

Pass set of previously constructed spatial objects to spatials calls, via a new method in parser - parser.previous :text_runs { … }. Spatial objects such as margins depend on the positions of text runs.

Determine the units of :rise and :leading and apply them correctly.

Handle UserUnit transformations.

Reset global text state at the end of each page.

Position coords appear a bit above and to the right of where they should. Graphics state or page translation?

Spaces appear one character before where they should in text_chunks.

Apostrophes cause a chunk break.

Split views into View, PdfView, PngView. Pass only explicit spatials in an easier form for Views. Move into lib/view.

Move analysis modules into lib/analysis.

Non-ASCII chars that are transliterated are appearing in output one place before they should. Transliterated chars also cause a word break. Not getting their width correct in glyph width dict? ! Looks like this occurs for chars whose codes are in the font’s @differences map.

pdf_view.rb shouldn’t call doc.go_to_page more than once per page. Causes objects to be rendered on the wrong page.

pdf_view.rb should keep the same auto colour for object types on different pages.

Some margins calculated with negative x or y. Because of characters incorrectly calculated to be out of the mediabox?

Some characters don’t get a correct width.

Should merge with region above from right to left (or, more generally, in the direction opposite to the writing direction). Causes last two lines of paragraphs to merge incorrectly.

Figure out which state should be pushed/popped, and when (text object start/end, page start/end, text show ops).

Move parsing parameters to the Pdf class. Settable via pdf-extract.
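
Sketch for the "same auto colour for object types on different pages" task above. A minimal Ruby sketch, assuming colours are hex strings and object types are symbols. The AutoColours class, its PALETTE and the colour_for method are hypothetical names for illustration, not what pdf_view.rb currently does.

```ruby
# Hypothetical helper: remembers which colour was handed out for each
# object type so that later pages reuse it.
class AutoColours
  # Assumed palette; any list of distinct colours would do.
  PALETTE = %w[ff0000 00ff00 0000ff ff00ff 00ffff ffa500].freeze

  def initialize
    @assigned = {}
  end

  # Returns the colour for an object type, assigning the next palette
  # entry the first time the type is seen. The palette wraps round when
  # there are more types than colours.
  def colour_for(type)
    @assigned[type] ||= PALETTE[@assigned.size % PALETTE.size]
  end
end

colours = AutoColours.new
colours.colour_for(:characters) # first type seen, first palette entry
colours.colour_for(:regions)    # second type seen, second palette entry
colours.colour_for(:characters) # same colour as before, on any page
```

The point of the sketch is that one AutoColours instance would live for the whole document rather than being rebuilt per page.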

Tasks, up to 3 hours

Handle new line operators, and all show text operators.

Handle font metrics correctly, including glyph widths, displacement vectors and bounding boxes.

Handle text matrix when it is applying a rotation.

Handle font matrices of type 3 fonts.

Handle writing mode selection for composite fonts (type 0) (different font metrics).

Some way of splitting SpatialObjects by page.

!! Handle type 3 font operators. These may not be supported by pdf-reader!

Add spatials parser.post { }, use in text_runs to sort and merge adjacent runs. Or split text_runs into characters and text_runs.

Implement JSON output.

For some PDFs, character width and height not detected correctly.

In some PDFs, ascent, descent and bbox info for fonts is not available. Seems to be those fonts whose base font is one of the base 14.

Prawn doesn’t render over some PDFs.

Assign colour, font, font size to character objects. Pass on to text chunks and regions.

Characters appear too wide in some3.pdf test PDF.

Characters on pages with images are sometimes not detected. Graphics state issue?

When --margins and --zones are both specified, duplicate margins appear in output.

Pass chunk locations through to resolved references.

Long tasks, greater than 3 hours

Examine text_runs spatial definition and determine processing that is generic. Move into Parser methods. E.g. handling global / object-specific state.

Rewrite pdf.rb.

Better organise pre/object/post call storage in pdf.rb. Perhaps a pre and post per object type.

To version 0.1

Join regions into sections only when their textual attributes match, such as letter ratio.

Allow skipping when joining regions, or try to join vertically, then horizontally. Should allow for merging of disparate regions such as at the end of an article and beginning of another.

Bug: First line sometimes not appearing in lines or --no-lines content.

Probably region construction is creating regions with line and non-line content.

Bug: Some section headers being merged into their body as a single region.

Either move pdf-reader changes into pdf-extract or get patches committed.

Split by line spacing and split by margin

Bug: Section objects appear within <page> elements. Should be pageless.

Bug: When merging chunks left to right, sometimes they merge out of order.

Could be occurring in chunk generation or in region generation.

Bug: Regions appear in reverse y sort order in output. Likely causing problems with section header analysis.

Catalog system, download OAI-PMH metadata and PDFs

Apply name service data to section detection

Check for sequential delimiters first

Partition refs on delimiter type frequency

e.g. for x_offset (margin) delimiters, partition on the second most frequent x_offset.

Sections still get split by one-liner regions

Such as the refs section of one-column.pdf. Those one-liners should really get merged into regions; they are often part of a line.

No longer an issue due to ignoring regions that are less than almost column width when joining sections.

Headers joining onto section bodies. Thus appear as refs!

Was due to not ignoring one-liners that were far less than the column width. Though will still have an issue with multi-line section headers.
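
Sketch for the "Partition refs on delimiter type frequency" task above, x_offset case. A minimal Ruby sketch under stated assumptions: each line of a refs section is a hash with hypothetical :x_offset and :content keys (not the actual pdf-extract data model), offsets have already been rounded so equal margins compare equal, and the most frequent x_offset belongs to continuation lines so the second most frequent marks the first line of each ref. Method names are made up for illustration.

```ruby
# Returns the second most frequent x_offset, or nil when fewer than two
# distinct offsets occur. Ties are broken by the smaller offset.
def second_most_frequent_offset(lines)
  counts = lines.each_with_object(Hash.new(0)) do |line, acc|
    acc[line[:x_offset]] += 1
  end
  ranked = counts.sort_by { |offset, count| [-count, offset] }
  ranked.length >= 2 ? ranked[1][0] : nil
end

# Starts a new partition at every line whose x_offset equals the
# delimiter offset. With no usable delimiter, everything stays together.
def partition_refs(lines)
  delimiter = second_most_frequent_offset(lines)
  return [lines] if delimiter.nil?

  lines.slice_before { |line| line[:x_offset] == delimiter }.to_a
end

# Hanging-indent example: ref first lines at 20, continuation lines at 30.
lines = [
  { x_offset: 20, content: "[1] First ref" },
  { x_offset: 30, content: "continued" },
  { x_offset: 30, content: "continued" },
  { x_offset: 20, content: "[2] Second ref" },
  { x_offset: 30, content: "continued" },
  { x_offset: 30, content: "continued" }
]
refs = partition_refs(lines)
```

Here refs holds two partitions of three lines each. If the first lines of refs turn out to be the most frequent offset in a given document, the ranking index would need to change, which is why checking for sequential delimiters first is listed as a separate task.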
