This repository has been archived by the owner on Nov 29, 2019. It is now read-only.
Pass receiver through to spatials calls, or construct receiver
from method calls in spatials calls.
Write methods to create spatial objects.
Use show_text_with_positioning to construct text runs.
Record spatial objects that result from spatials method calls.
Complete the convert method - write spatial hashes out to XML.
Write Spatial class with alter method.
Small tasks, 10 minutes or so
Improve inclusion of spatial object modules. Shouldn’t need to
call, for example, include_text_runs.
Pass the set of previously constructed spatial objects to spatials calls,
via a new method in parser - parser.previous :text_runs { … }.
Spatial objects such as margins depend on the positions of text
runs.
Determine the units of :rise and :leading, and apply them correctly.
Handle UserUnit transformations.
Reset global text state at the end of each page.
Position coords appear a bit above and to the right of where they
should. Graphics state or page translation?
Spaces appear one character before where they should in
text_chunks.
Apostrophes cause a chunk break.
Split views into View, PdfView, PngView. Pass only explicit
spatials in an easier form for Views. Move into lib/view.
Move analysis modules into lib/analysis.
Non-ascii chars that are transliterated are appearing in output one
place before they should. Transliterated chars also cause a word
break. Not getting their width correct in glyph width dict?
! Looks like this occurs for chars whose codes are in the font’s
@differences map.
pdf_view.rb shouldn’t call doc.go_to_page more than once per page.
Causes objects to be rendered on the wrong page.
pdf_view.rb should keep the same auto colour for object types on
different pages.
Some margins calculated with negative x or y. Because of characters
incorrectly calculated to be out of the mediabox?
Some characters don’t get a correct width.
Regions should merge with the region above from right to left (that
is, in the opposite direction to the writing direction). Currently the
last two lines of paragraphs merge incorrectly.
Figure out which state, and when (text object start/end, page
start/end, text show ops) should be pushed/popped.
Move parsing parameters to the Pdf class. Settable via pdf-extract.
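For the pdf_view.rb colour item above, one possible approach is to remember the colour given to each object type and reuse it on later pages. This is only a sketch: the AutoColours class, its palette and the object type names are hypothetical and are not taken from pdf_view.rb.

```ruby
# Hypothetical helper, not part of pdf-extract: hands out one colour per
# object type and remembers it, so a type keeps its colour on every page.
class AutoColours
  # Assumed palette of hex colours; any list would do.
  PALETTE = %w[ff0000 00aa00 0000ff ff9900 990099 009999].freeze

  def initialize
    @assigned = {}
  end

  # Returns the colour for +type+, taking the next palette entry on first use.
  def colour_for(type)
    @assigned[type] ||= PALETTE[@assigned.size % PALETTE.size]
  end
end

colours = AutoColours.new
page_one = [:characters, :text_chunks, :regions].map { |t| colours.colour_for(t) }
page_two = [:regions, :characters].map { |t| colours.colour_for(t) }
```

Because the store lives outside the per-page loop, the order in which object types turn up on a later page does not change their colours.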
Tasks, up to 3 hours
Handle new line operators, and all show text operators.
Handle font metrics correctly, including glyph widths, displacement
vectors and bounding boxes.
Handle text matrix when it is applying a rotation.
Handle font matrices for type 3 fonts.
Handle writing mode selection for composite fonts (type 0)
(different font metrics).
Some way of splitting SpatialObjects by page.
!! Handle type 3 font operators. These may not be supported by
pdf-reader!
Add spatials parser.post { }, use in text_runs to sort and merge
adjacent runs. Or split text_runs into characters and text_runs.
Implement json output.
For some PDFs, character width and height are not detected correctly.
In some PDFs, ascent, descent and bbox info for fonts is not
available. Seems to be those fonts whose base font is one of the
base 14.
Prawn doesn’t render over some PDFs.
Assign colour, font, font size to character objects. Pass on to
text chunks and regions.
Characters appear too wide in some3.pdf test PDF.
Characters on pages with images are sometimes not detected. Graphics
state issue?
When --margins and --zones are both specified, duplicate margins
appear in output.
Pass chunk locations through to resolved references.
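For the rotated text matrix item above, a starting point might be to detect the rotation before deciding how to handle it. A PDF text matrix is written [a b c d e f], and a rotation by theta sets a = cos(theta) and b = sin(theta), possibly scaled. The function names below are hypothetical, and skewed matrices are not considered.

```ruby
# Rotation angle in degrees encoded in a PDF text matrix [a b c d e f].
# For a (possibly scaled) rotation, a = s*cos(theta) and b = s*sin(theta),
# so atan2(b, a) recovers theta. Skew is ignored in this sketch.
def text_matrix_rotation(matrix)
  a, b = matrix
  Math.atan2(b, a) * 180.0 / Math::PI
end

# True when the matrix rotates text by more than +tolerance+ degrees.
def rotated?(matrix, tolerance = 1e-6)
  text_matrix_rotation(matrix).abs > tolerance
end

identity = [1, 0, 0, 1, 0, 0]       # upright text
quarter_turn = [0, 1, -1, 0, 0, 0]  # text rotated 90 degrees counter-clockwise
```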
Long tasks, greater than 3 hours
Examine text_runs spatial definition and determine processing that
is generic. Move into Parser methods. E.g. Handling global /
object-specific state.
Rewrite pdf.rb.
Better organise pre/object/post call storage in pdf.rb. Perhaps
a pre and post per object type.
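The "pre and post per object type" idea could be sketched as a small store keyed by object type and phase. The CallbackStore class and its method names are hypothetical. They do not reflect how pdf.rb stores its calls today.

```ruby
# Hypothetical store: keeps :pre, :object and :post callbacks separately
# for each spatial object type.
class CallbackStore
  PHASES = [:pre, :object, :post].freeze

  def initialize
    @callbacks = Hash.new do |hash, type|
      hash[type] = { pre: [], object: [], post: [] }
    end
  end

  # Registers +block+ to run in +phase+ for objects of +type+.
  def add(type, phase, &block)
    raise ArgumentError, "unknown phase #{phase}" unless PHASES.include?(phase)
    @callbacks[type][phase] << block
  end

  # Runs every callback registered for +type+ and +phase+, in order,
  # and returns their results.
  def run(type, phase, *args)
    raise ArgumentError, "unknown phase #{phase}" unless PHASES.include?(phase)
    return [] unless @callbacks.key?(type)
    @callbacks[type][phase].map { |callback| callback.call(*args) }
  end
end

store = CallbackStore.new
log = []
store.add(:text_runs, :pre) { log << 'text_runs pre' }
store.add(:text_runs, :post) { |runs| log << "merged #{runs.size} runs" }
store.run(:text_runs, :pre)
store.run(:text_runs, :post, [1, 2, 3])
```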
To version 0.1
Join regions into sections only when their textual attributes match,
such as letter ratio.
Allow skipping when joining regions, or try to join vertically, then
horizontally. Should allow for merging of disparate regions such
as at the end of an article and beginning of another.
Bug: First line sometimes not appearing in lines or --no-lines content.
Probably region construction is creating regions with line and
non-line content.
Bug: Some section headers being merged into their body as a single region.
Either move pdf-reader changes into pdf-extract or get patches committed.
Split by line spacing and split by margin
Bug: Section objects appear within <page> elements. Should be pageless.
Bug: When merging chunks left to right, sometimes they merge out of order.
Could be occurring in chunk generation or in region generation.
Bug: Regions appear in reverse y sort order in output. Likely causing
problems with section header analysis.
Catalog system, download OAI-PMH metadata and PDFs
Apply name service data to section detection
Check for sequential delimiters first
Partition refs on delimiter type frequency
e.g. for x_offset (margin) delimiters, partition on the second
most frequent x_offset.
Sections still get split by one-liner regions
Such as refs section of one-column.pdf
Those one-liners should really get merged into regions, they
are often part of a line.
No longer an issue due to ignoring regions that are less than
almost column width when joining sections.
Headers joining onto section bodies. Thus appear as refs!
Was due to not ignoring one-liners that were far less than the
column width. Though will still have an issue with multi-line
section headers.
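The x_offset partitioning idea above (partition on the second most frequent x_offset) might look like the sketch below. It assumes a reference list with hanging indents, where continuation lines share the most frequent offset and each new reference starts at the second most frequent one. The line hashes, their keys and the function name are hypothetical, and offsets are assumed to be already rounded.

```ruby
# Splits +lines+ into groups, starting a new group at every line whose
# :x_offset equals the second most frequent offset in the list.
def partition_on_x_offset(lines)
  counts = lines.each_with_object(Hash.new(0)) { |line, h| h[line[:x_offset]] += 1 }
  # Rank offsets by descending frequency; ties are broken by the smaller offset.
  ranked = counts.sort_by { |offset, count| [-count, offset] }.map(&:first)
  delimiter = ranked[1]
  return [lines] if delimiter.nil? # only one offset (or none): nothing to split on

  lines.slice_before { |line| line[:x_offset] == delimiter }.to_a
end

lines = [
  { x_offset: 10, text: '[1] First reference,' },
  { x_offset: 20, text: 'continued on a second line' },
  { x_offset: 20, text: 'and a third.' },
  { x_offset: 10, text: '[2] Second reference,' },
  { x_offset: 20, text: 'continued here' },
  { x_offset: 20, text: 'and here' },
  { x_offset: 20, text: 'and here.' }
]
refs = partition_on_x_offset(lines)
```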
Problems
How do I determine the correct reference parsing scheme?
How do I distinguish between junk references and real references?
Reference section detection does not always work, and may misidentify
non-ref sections as reference sections.
Proper name ratio
Punctuation ratio
Capitalised words ratio
Year literals ratio
Are all of these above average for the content as a whole?
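The "above average" test could be sketched as follows, comparing a candidate section with the document as a whole. The regular expressions are rough assumptions, the function names are hypothetical, and the proper name ratio is left out because it would need name service data.

```ruby
# Rough per-text ratios. Proper name ratio is omitted: it needs a name list.
def word_ratios(text)
  words = text.split
  return { capitalised: 0.0, year: 0.0, punctuation: 0.0 } if words.empty?

  total = words.size.to_f
  {
    # Words starting with an upper-case letter followed by a lower-case one.
    capitalised: words.count { |w| w.match?(/\A[A-Z][a-z]/) } / total,
    # Words containing a four-digit year from 1800 to 2099.
    year: words.count { |w| w.match?(/\b(1[89]|20)\d{2}\b/) } / total,
    # Share of characters that are common citation punctuation.
    punctuation: text.count('.,;:()') / text.length.to_f
  }
end

# True when every ratio of the section exceeds that of the whole content.
def above_average?(section_text, whole_text)
  section = word_ratios(section_text)
  whole = word_ratios(whole_text)
  section.all? { |name, value| value > whole[name] }
end

body = 'this is the body of the article and it talks about many things. ' \
       'it goes on for a while in lower case.'
refs = 'Smith, J. (2009). Parsing PDFs. Journal of Things, 12(3). ' \
       'Jones, A. (2011). More Parsing.'
whole = "#{body} #{refs}"
```

Requiring all ratios to beat the document average is strict. A weighted score may work better for documents whose body text is itself dense with names and years.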