Releases · aphp/edspdf

12 Feb 14:09

percevalw

v0.10.0

7177a29

v0.10.0 Latest

Changelog

Added

Support packaging models made in setuptools based projects

Fixed

Support packaging with poetry 2.0

Changed

Handle cases like a distant superscript ("³ something"), where the superscript and the rest of the text are parsed as two lines, one above the other, when they should be on the same line.
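
As a rough illustration of the situation described above, the sketch below merges a small box that sits just above and to the left of a text line into that line. The `Box` class, the `merge_superscripts` helper and both thresholds are hypothetical; they are not taken from the edspdf code base and only show one possible heuristic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical text box; y grows downward, as in page coordinates."""

    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def _looks_like_superscript(upper: Box, lower: Box, height_ratio: float, max_gap: float) -> bool:
    """Return True when `upper` is small, close to `lower`, and ends before it starts."""
    upper_height = upper.y1 - upper.y0
    lower_height = lower.y1 - lower.y0
    is_small = upper_height < height_ratio * lower_height
    is_close = (lower.y0 - upper.y1) <= max_gap * lower_height
    is_left = upper.x1 <= lower.x0
    return is_small and is_close and is_left


def merge_superscripts(boxes: List[Box], height_ratio: float = 0.6, max_gap: float = 0.5) -> List[Box]:
    """Merge superscript-like boxes into the line directly below them."""
    merged: List[Box] = []
    for box in sorted(boxes, key=lambda b: (b.y0, b.x0)):
        prev = merged[-1] if merged else None
        if prev is not None and _looks_like_superscript(prev, box, height_ratio, max_gap):
            # Replace the superscript by the union of both boxes, text joined on one line.
            merged[-1] = Box(
                text=f"{prev.text} {box.text}",
                x0=min(prev.x0, box.x0),
                y0=min(prev.y0, box.y0),
                x1=max(prev.x1, box.x1),
                y1=max(prev.y1, box.y1),
            )
        else:
            merged.append(box)
    return merged
```

With these assumed thresholds, two regular lines of equal height are left untouched, because the upper one is not smaller than the lower one.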

Pull Requests

Handle cases like distant superscripts by @percevalw in #32
chore: bump version to 0.10.0 by @percevalw in #33

Full Changelog: v0.9.3...v0.10.0

Contributors

percevalw

21 Nov 12:05

percevalw

v0.9.3

5778c0e

v0.9.3

Changelog

Support pydantic v2

Pull Requests

Support pydantic v2 by @percevalw in #31

Full Changelog: v0.9.2...v0.9.3

Contributors

percevalw

20 Nov 13:38

percevalw

v0.9.2

cb974f6

v0.9.2

Changelog

Changed

Default to fp16 when inferring with gpu
Support inputs parameter in TrainablePipe.postprocess(...) method (as in edsnlp)
We now check that the user isn't trying to write a single file in a split fashion (when write_in_worker is True or num_rows_per_file is not None) and raise an error if they do
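
The check described in the last item could look roughly like the sketch below. Only the `write_in_worker` and `num_rows_per_file` parameter names come from the changelog; the `check_single_file_write` function and the rule used to decide that a path is a single file (it has a file extension) are assumptions made for illustration, not the library's actual implementation.

```python
import os
from typing import Optional


def check_single_file_write(
    path: str,
    write_in_worker: bool = False,
    num_rows_per_file: Optional[int] = None,
) -> None:
    """Raise a ValueError if a single output file is combined with split writing."""
    # Assumption: a path with an extension (e.g. "out.parquet") designates one file,
    # while a path without one (e.g. "out_dir/") designates a directory of parts.
    is_single_file = os.path.splitext(path)[1] != ""
    writes_in_parts = write_in_worker or num_rows_per_file is not None
    if is_single_file and writes_in_parts:
        raise ValueError(
            f"Cannot write to the single file {path!r} in a split fashion: "
            "unset write_in_worker and num_rows_per_file, or pass a directory."
        )
```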

Fixed

Batches full of empty content boxes no longer crash the huggingface-embedding component
Ensure models are always loaded in non training mode
Improved performance of edspdf.data methods over a filesystem (fs parameter)

Pull Requests

Fix empty batches & update data API by @percevalw in #28
chore: bump version to 0.9.2 by @percevalw in #30

Full Changelog: v0.9.1...v0.9.2

Contributors

percevalw

19 Mar 13:37

percevalw

v0.9.1

b2caa6f

v0.9.1

Changelog

Fixed

It is now possible to recursively retrieve pdf files in a directory using edspdf.data.read_files
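
For readers unfamiliar with recursive file discovery, here is a minimal standard-library sketch of the idea. The `iter_pdf_files` helper is hypothetical and does not reproduce the signature or the implementation of `edspdf.data.read_files`; it only shows what "recursively retrieve pdf files in a directory" means in practice.

```python
from pathlib import Path
from typing import Iterator


def iter_pdf_files(root: str, recursive: bool = True) -> Iterator[Path]:
    """Yield the PDF files found under `root`, in sorted order."""
    base = Path(root)
    # rglob descends into sub-directories, glob only looks at the top level.
    matches = base.rglob("*.pdf") if recursive else base.glob("*.pdf")
    yield from sorted(p for p in matches if p.is_file())
```

Calling `list(iter_pdf_files("some/dir"))` would then return files from nested folders as well, while `recursive=False` restricts the search to the top-level directory.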

What's Changed

fix: allow recursive pdf file searching by @percevalw and @acalliger in #26

Full Changelog: v0.9.0...v0.9.1

Contributors

percevalw and acalliger

26 Feb 10:42

percevalw

v0.9.0

0680f51

v0.9.0

What's Changed

Added

New unified edspdf.data API (PDF files, pandas, parquet) and LazyCollection object to efficiently read / write data from / to different formats & sources. This API has been heavily inspired by the edsnlp.data API.
New unified processing API to select the execution backend via data.set_processing(...), replacing the old accelerators API (which is now deprecated, but still available).
huggingface-embedding now supports quantization and other AutoModel.from_pretrained kwargs
It is now possible to convert a label to multiple labels in the simple-aggregator component:

# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
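
To make the effect of such a mapping concrete, the sketch below applies a label map of this shape to a list of labelled lines. The `aggregate_lines` function and the `(label, text)` input format are hypothetical and much simpler than the actual simple-aggregator component; the point is only that a line labelled "title" can end up in both the "text" and the "title" fields.

```python
from typing import Dict, List, Tuple, Union

LabelMap = Dict[str, Union[str, List[str]]]


def aggregate_lines(lines: List[Tuple[str, str]], label_map: LabelMap) -> Dict[str, str]:
    """Build one text field per key of `label_map`.

    `lines` holds (label, text) pairs in reading order. Each output field joins
    the lines whose label is listed for that field, so a label may feed several fields.
    """
    fields: Dict[str, str] = {}
    for field, labels in label_map.items():
        # A field may be mapped to a single label or to a list of labels.
        accepted = {labels} if isinstance(labels, str) else set(labels)
        fields[field] = "\n".join(text for label, text in lines if label in accepted)
    return fields


label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
lines = [
    ("title", "Report"),
    ("body", "Hello"),
    ("footer", "p.1"),
    ("table", "1 2"),
]
fields = aggregate_lines(lines, label_map)
```

Here `fields["text"]` contains the title, body and table lines, `fields["title"]` contains only the title line, and the "footer" line is dropped because no field lists that label.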

Fixed

huggingface-embedding now resizes bbox features for large PDFs, instead of making the model crash
huggingface-embedding and sub-box-cnn-pooler now handle empty PDFs correctly

Pull Requests

API update (data & processing) by @percevalw in #25

Full Changelog: v0.8.1...v0.9.0

Contributors

percevalw

26 Sep 08:42

percevalw

v0.8.1

1f1b89e

v0.8.1

Changelog

Fixed

Fix typing to allow passing an accelerator dict to Pipeline.pipe(...)
Removed multiprocessing accelerator debug output
Fixed absolute links in github-pages docs (e.g. image assets)

Changed

Added auto-links to components in the docs (by comparing span contents with entry points)

Pull Requests

v0.8.1 by @percevalw in #23

Full Changelog: v0.8.0...v0.8.1

Contributors

percevalw

07 Sep 16:04

percevalw

v0.8.0

ec3c5ce

v0.8.0

What's changed

Added

Add multi-modal transformers (huggingface-embedding) with windowing options
Add render_page option to pdfminer extractor, for multi-modal PDF features
Add inference utilities (accelerators), with simple mono process support and multi gpu / cpu support
Packaging utils (pipeline.package(...)) to make a pip installable package from a pipeline

Changed

Updated API to follow EDS-NLP's refactoring
Updated confit to 0.4.2 (better errors) and foldedtensor to 0.3.0 (better multiprocess support)
Removed pipeline.score. You should use pipeline.pipe, a custom scorer and pipeline.select_pipes instead.
Better test coverage
Use hatch instead of setuptools to build the package / docs and run the tests

Fixed

Fixed attrs dependency only being installed in dev mode

Pull Requests

Huggingface multi-modal transformers by @percevalw in #15
Dev install documentation and dependencies fix by @ian-fox in #16
Huggingface by @percevalw in #17
Accelerators by @percevalw in #19
Scoring by @percevalw in #20
Packaging utils by @percevalw in #18
chore: bump version to 0.8.0 by @percevalw in #21
feat: switch to hatch package manager by @percevalw in #22

New Contributors

@ian-fox made their first contribution in #16

Full Changelog: v0.7.0...v0.8.0

Contributors

ian-fox and percevalw

09 Jun 14:11

percevalw

v0.7.0

ded336d

v0.7.0

What's changed

This public release comes with a major overhaul of the library since v0.5.3

Core features

new pipeline system whose API is inspired by spaCy
first-class support for pytorch
hybrid model inference and training (rules + deep learning)
moved from pandas DataFrame to attrs dataclasses (PDFDoc, Page, Box, ...) for representing PDF documents
new configuration system based on confit, with support for instantiation of complex deep learning models, off-the-shelf CLI, ...

Functional features

new extractors: pymupdf and poppler (separate packages for licensing reasons)
many deep learning layers (box-transformer, 2d attention with relative position information, ...)
trainable deep learning classifier
training recipes for deep learning models

Full Changelog: v0.5.3...v0.7.0

31 Aug 10:01

percevalw

v0.5.3

677ea9d

v0.5.3

What's Changed

Added

Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
Improved line aggregation formula

Full Changelog: v0.5.2...v0.5.3

30 Aug 09:50

percevalw

v0.5.2

655efc4

v0.5.2

What's Changed

ci: remove unnecessary poppler dependency by @bdura in #7
Fix aggregation for empty documents by @percevalw in #8

Full Changelog: v0.5.1...v0.5.2

Contributors

percevalw and bdura
