All notable changes to this project will be documented in this file. Currently goes back to `v0.4.3`.
The format is based on Keep a Changelog.
- Fix issue #53, in which non-decimalize-able (non_)stroking_color properties were raising errors.
`.travis.yml`, but failing on `.to_image()`
- Move from defunct `pycrypto` to `pycryptodome`
- Update `pdfminer.six` to `20170720`
- Fix issue #41, in which PDF-object-referenced cropboxes/mediaboxes weren't being fully resolved.
- Access to `__version__` from main namespace
- Fix issue #33, by checking `decode_text`'s argument type
- Pin `pdfminer.six` to version `20151013` (for now), fixing incompatibility
- Allow `import pdfplumber` even if ImageMagick is not installed.
- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
- Ability for `.draw_line` to draw `curve` points.
- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.
- Fixed typo bug when `.rect_edges` is called before `.edges`
- Quick-draw `PageImage` methods: `.draw_vline`, `.draw_vlines`, `.draw_hline`, and `.draw_hlines`.
- Boolean parameter `keep_blank_chars` for `.extract_words(...)` and `TableFinder` settings.
- Increased default `text_tolerance` and `intersection_tolerance` TableFinder values from 1 to 3.
- Properly handle conversion of PDFs with transparency to `pillow` images.
- Properly handle `pandas` DataFrames as inputs to multi-draw commands (e.g., `PageImage.draw_rects(...)`).
- Visual debugging features, via `Page.to_image(...)` and `PageImage`. (Introduces `wand` and `pillow` as package requirements.)
- More powerful options for extracting data from tables. See changes below.
- Entirely overhaul the table-extraction methods. Now based on Anssi Nurminen's master's thesis.
- Disentangle `.crop` from `.intersects_bbox` and `.within_bbox`.
- Change default `x_tolerance` and `y_tolerance` for word extraction from `5` to `3`
- Fix bug stemming from non-decimalized page heights. [h/t @jsfenfen]
- Provide access to `Page.page_number`
- Use `.page_number` instead of `.page_id` as primary identifier. [h/t @jsfenfen]
- Change default `x_tolerance` and `y_tolerance` for word extraction from `0` to `5`
- Provide proper support for rotated pages
- Fix bug stemming from when metadata includes a PostScript literal. [h/t @boblannon]
Whoops.
- When extracting table cells, use chars' midpoints instead of top-points.
- Fix find_gutters, which should ignore `" "` chars
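
The midpoint change noted above can be illustrated with a minimal sketch. This is not the library's actual implementation: the helpers `vertical_midpoint` and `chars_in_row`, and the char dictionaries with only `text`, `top`, and `bottom` keys, are assumptions made for illustration. It shows why testing a char's vertical midpoint against a row's bounds can give a different (and often more forgiving) result than testing its top edge.

```python
def vertical_midpoint(char):
    """Return the vertical midpoint of a char dict with 'top' and 'bottom' keys."""
    return (char["top"] + char["bottom"]) / 2


def chars_in_row(chars, row_top, row_bottom):
    """Keep chars whose vertical midpoint falls inside [row_top, row_bottom)."""
    return [c for c in chars if row_top <= vertical_midpoint(c) < row_bottom]


# Hypothetical chars; coordinates are made up for the example.
chars = [
    # Top edge (9.0) sits just above the row, but the midpoint (14.0) is inside.
    {"text": "A", "top": 9.0, "bottom": 19.0},
    # Top edge (18.0) is inside the row, but the midpoint (23.0) is below it.
    {"text": "B", "top": 18.0, "bottom": 28.0},
]

row = chars_in_row(chars, row_top=10.0, row_bottom=20.0)
row_text = "".join(c["text"] for c in row)
```

With a top-point test, "A" would be dropped from the row and "B" kept; with the midpoint test the assignment is reversed, which matches where most of each glyph actually sits.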