Releases: aphp/edspdf

v0.10.0

12 Feb 14:09

Changelog

Added

  • Support packaging models made in setuptools based projects

Fixed

  • Support packaging with poetry 2.0

Changed

  • Handle cases like a distant superscript ("³ something"), where the superscript and the rest of the text are parsed as two lines, one above the other, when they should be on the same line.
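
As an illustration of the kind of merge described above, here is a minimal sketch. It is not edspdf's implementation: the Line class, the merge_superscripts helper and both thresholds are hypothetical, chosen only to show how a small line sitting just above a taller one could be folded back into it.

```python
from dataclasses import dataclass


@dataclass
class Line:
    # Hypothetical line box: y0 is the top edge, y1 the bottom edge
    x0: float
    y0: float
    y1: float
    text: str


def merge_superscripts(lines, max_height_ratio=0.7, max_gap_ratio=0.1):
    """Hypothetical helper: merge a small line with the taller line next to it."""
    merged = []
    for line in sorted(lines, key=lambda l: l.y0):
        prev = merged[-1] if merged else None
        if prev is not None:
            small, big = sorted((prev, line), key=lambda l: l.y1 - l.y0)
            small_h = small.y1 - small.y0
            big_h = big.y1 - big.y0
            # Vertical gap between the two boxes (negative when they overlap)
            gap = max(prev.y0, line.y0) - min(prev.y1, line.y1)
            if small_h <= max_height_ratio * big_h and gap <= max_gap_ratio * big_h:
                # Rebuild a single line, reading its parts from left to right
                left, right = sorted((prev, line), key=lambda l: l.x0)
                merged[-1] = Line(
                    x0=left.x0,
                    y0=min(prev.y0, line.y0),
                    y1=max(prev.y1, line.y1),
                    text=f"{left.text} {right.text}",
                )
                continue
        merged.append(line)
    return merged


lines = [
    Line(x0=10, y0=100, y1=106, text="³"),
    Line(x0=16, y0=104, y1=116, text="something"),
    Line(x0=10, y0=130, y1=142, text="next line"),
]
texts = [line.text for line in merge_superscripts(lines)]
```

With these example boxes, the superscript and "something" end up on one line, while "next line" stays separate.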

Full Changelog: v0.9.3...v0.10.0

v0.9.3

21 Nov 12:05

Changelog

  • Support pydantic v2

Full Changelog: v0.9.2...v0.9.3

v0.9.2

20 Nov 13:38

Changelog

Changed

  • Default to fp16 when inferring with gpu
  • Support inputs parameter in TrainablePipe.postprocess(...) method (as in edsnlp)
  • Writing a single file in a split fashion (when write_in_worker is True or num_rows_per_file is not None) now raises an error
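
The check in the last item could look roughly like the sketch below. This is a hypothetical helper, not the library's code: check_write_target is an invented name, and treating "path has a file extension" as "single output file" is an assumption made for the example.

```python
import os


def check_write_target(path, write_in_worker=False, num_rows_per_file=None):
    """Hypothetical helper: refuse split writes that target a single file."""
    # Assumption: a path with an extension designates a single output file
    is_single_file = os.path.splitext(path)[1] != ""
    is_split = write_in_worker or num_rows_per_file is not None
    if is_single_file and is_split:
        raise ValueError(
            f"Cannot write to the single file {path!r} in a split fashion: "
            "pass a directory instead, or disable write_in_worker / num_rows_per_file."
        )


# A directory target accepts split writes, and a single file accepts a plain write
check_write_target("out_dir", num_rows_per_file=100)
check_write_target("out.parquet")
```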

Fixed

  • Batches full of empty content boxes no longer crash the huggingface-embedding component
  • Ensure models are always loaded in non training mode
  • Improved performance of edspdf.data methods over a filesystem (fs parameter)

Full Changelog: v0.9.1...v0.9.2

v0.9.1

19 Mar 13:37

Changelog

Fixed

  • It is now possible to recursively retrieve pdf files in a directory using edspdf.data.read_files
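
For readers unfamiliar with recursive file discovery, the standard-library sketch below shows the general idea. It does not call edspdf.data.read_files and makes no claim about its parameters; find_pdfs is a hypothetical helper built on pathlib.

```python
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Hypothetical helper: list every PDF file under `root`, at any depth."""
    return sorted(p for p in Path(root).rglob("*.pdf") if p.is_file())


# Build a small throwaway tree to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub" / "deeper").mkdir(parents=True)
    (Path(tmp) / "a.pdf").write_bytes(b"%PDF-1.4")
    (Path(tmp) / "sub" / "deeper" / "b.pdf").write_bytes(b"%PDF-1.4")
    (Path(tmp) / "sub" / "notes.txt").write_text("not a pdf")
    found = [p.relative_to(tmp).as_posix() for p in find_pdfs(tmp)]
```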

Full Changelog: v0.9.0...v0.9.1

v0.9.0

26 Feb 10:42

What's Changed

Added

  • New unified edspdf.data API (pdf files, pandas, parquet) and LazyCollection object
    to efficiently read / write data from / to different formats & sources. This API
    has been heavily inspired by the edsnlp.data API.
  • New unified processing API to select the execution backend via data.set_processing(...)
    to replace the old accelerators API (which is now deprecated, but still available).
  • huggingface-embedding now supports quantization and other AutoModel.from_pretrained kwargs
  • It is now possible to convert a label to multiple labels in the simple-aggregator component:
    # To build the "text" field, we will aggregate "title", "body" and "table" lines,
    # and output "title" lines in a separate field as well.
    label_map = {
        "text": ["title", "body", "table"],
        "title": "title",
    }
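
To make the effect of such a mapping concrete, here is a minimal pure-Python sketch. It is not the simple-aggregator implementation: aggregate_lines and the (label, text) tuples are hypothetical and only show how one line label can feed several output fields.

```python
from collections import defaultdict

label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}


def aggregate_lines(lines, label_map):
    """Hypothetical helper: group (label, text) lines into output fields."""
    # Normalize each mapping value to a list of source labels
    sources = {
        field: [labels] if isinstance(labels, str) else list(labels)
        for field, labels in label_map.items()
    }
    fields = defaultdict(list)
    for label, text in lines:
        for field, accepted in sources.items():
            if label in accepted:
                fields[field].append(text)
    # Lines whose label is mapped nowhere (e.g. "footer") are dropped
    return {field: "\n".join(fields[field]) for field in sources}


lines = [
    ("title", "Report"),
    ("body", "Hello"),
    ("footer", "Page 1"),
    ("table", "a | b"),
]
result = aggregate_lines(lines, label_map)
```

Here the "title" line appears both in result["text"] and in result["title"], which is the behavior the mapping above asks for.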

Fixed

  • huggingface-embedding now resizes bbox features for large PDFs, instead of making the model crash
  • huggingface-embedding and sub-box-cnn-pooler now handle empty PDFs correctly

Full Changelog: v0.8.1...v0.9.0

v0.8.1

26 Sep 08:42

Changelog

Fixed

  • Fix typing to allow passing an accelerator dict to Pipeline.pipe(...)
  • Removed multiprocessing accelerator debug output
  • Fixed absolute links in github-pages docs (e.g. image assets)

Changed

  • Added auto-links to components in the docs (by comparing span contents with entry points)

Full Changelog: v0.8.0...v0.8.1

v0.8.0

07 Sep 16:04

What's changed

Added

  • Add multi-modal transformers (huggingface-embedding) with windowing options
  • Add render_page option to pdfminer extractor, for multi-modal PDF features
  • Add inference utilities (accelerators), with simple single-process support and multi-GPU / multi-CPU support
  • Packaging utils (pipeline.package(...)) to make a pip installable package from a pipeline

Changed

  • Updated API to follow EDS-NLP's refactoring
  • Updated confit to 0.4.2 (better errors) and foldedtensor to 0.3.0 (better multiprocess support)
  • Removed pipeline.score. You should use pipeline.pipe, a custom scorer and pipeline.select_pipes instead.
  • Better test coverage
  • Use hatch instead of setuptools to build the package / docs and run the tests

Fixed

  • Fixed attrs dependency only being installed in dev mode

Full Changelog: v0.7.0...v0.8.0

v0.7.0

09 Jun 14:11

What's changed

This public release comes with a major overhaul of the library since v0.5.3

Core features

  • new pipeline system whose API is inspired by spaCy
  • first-class support for pytorch
  • hybrid model inference and training (rules + deep learning)
  • moved from pandas DataFrame to attrs dataclasses (PDFDoc, Page, Box, ...) for representing PDF documents
  • new configuration system based on confit, with support for instantiation of complex deep learning models, off-the-shelf CLI, ...

Functional features

  • new extractors: pymupdf and poppler (separate packages for licensing reasons)
  • many deep learning layers (box-transformer, 2d attention with relative position information, ...)
  • trainable deep learning classifier
  • training recipes for deep learning models

Full Changelog: v0.5.3...v0.7.0

v0.5.3

31 Aug 10:01

What's Changed

Added

  • Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
  • Improved line aggregation formula

Full Changelog: v0.5.2...v0.5.3

v0.5.2

30 Aug 09:50

What's Changed

  • ci: remove unnecessary poppler dependency by @bdura in #7
  • Fix aggregation for empty documents by @percevalw in #8

Full Changelog: v0.5.1...v0.5.2