Skip to content

Latest commit

 

History

History
150 lines (81 loc) · 6.86 KB

Planning.md

File metadata and controls

150 lines (81 loc) · 6.86 KB

Tesseract release planning

Here we can plan the next releases of Tesseract.

Future releases

Here are some ideas for future Tesseract releases.

  • Modernize the code using C++11 (see discussions here and here).

  • Use llvm's tools: clang-format, clang-tidy, scan-build, sanitizers.

  • Replace more Tesseract data types by C++ standard types (GenericVector, ...), especially for the API.

  • Add json (or xml) output format. It will be used for full ocr and for psm 2 - layout info only.

  • Add option to use alternative binarization methods from leptonica.

  • Add an option to output separate files for multipage input (out1.hocr, out2.hocr ...).

  • Add multi-threading option to the command line (openmp will be disabled at runtime in this mode).

  • Explore the option to use Protocol Buffers or FlatBuffers for the traineddata.

  • Improve error handling and don't ignore return values from functions (see discussion).

  • Replace tprintf etc. by advanced logging API with log levels.

5.0.0

Advanced logging

Requirements (see also discussion):

Log levels:

  • trace
  • debug
  • info
  • warning
  • error
  • fatal

Related issues:

Useful links:

4.0.0

See the release notes.

See also the discussion for issue #1423.

Open issues which should be fixed

  • Issues with the "bug" label (see list here)

  • Noise characters recognized with bbox as the entire page #1192

  • Segmentation fault when using integer models for LSTM training #1573

  • Report a warning when the Tesseract initialisation code detects an unsupported locale setting. (See comment)

  • Insufficient error message when output file cannot be created Issue 1424

  • “no best words!!” on mixed language (fra+ara) items (see issue 235)

  • mgr_.Init(traineddata_path.c_str()):Error:Assert failed: #1075 (see issue 1075)

Features wanted for this release

To be discussed

Depending on available resources and opinions, these suggestions will either be added to the planning for the next or a future release or abandoned.

  • Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version

    This will make the command slower, because each file must be opened and parsed. Add this as --list-langs-details or as --list-lang-details for one language file based on lang-code?

  • --list-langs should also display the directory it is using

  • Fix the autotools build so that the debug mode uses -O0 as intended

  • Add option to optionally select implementation for dot product (CPU, SSE, AVX, ...)

  • Relative includes for traineddata

    tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.

  • More fixes for compiler warnings and issues reported by Coverity Scan

  • Add a simple bash script for building tesseract

  • New traineddata format

    In addition to the current proprietary format Tesseract could also support ZIP archives (see discussion). A possible implementation using libarchive is available, but needs more testing.

  • "Training light" - Learning by doing (see issue)

  • Modify text2image to use PrepareDistortedPix() #1052

  • Schedule date

Regression of features from 3.0x

Tesseract 4.0 should be a full replacement for Tesseract 3.05 and have the same features when used with the old OCR engine (--oem 0). The following regressions still need verification (are they really regressions, or are they just missing features for LSTM):

Fixed in 4.1.0

Features from 3.0x which are missing for LSTM

These features still work with the old OCR engine (--oem 0), but are missing and desired for LSTM.

Future release

Here we collect important issues and features for the release(s) following 4.0.0.

  • New LSTM-based OSD detector (see comment).

  • Remove Legacy Tesseract Engine (see issue)

  • Better Multi-language implementation for training (See comment)

  • ARM SIMD support for dot product #519

  • Using OpenMP for dot product #983

  • Remove deprecated code

    This does not include OpenCL or the old Tesseract engine.

  • Tesseract creates output for missing input (see issue 1023).

    Mostly solved, but could be improved.

  • Issue 1353: Patch for /training/tessopt.cpp (see pull request 13)

    It looks like it is not possible to run more than one training in the same process. The pull request describes a possible fix, but does not include a complete implementation (low priority).