Hocr to JSON - WIP

WORK IN PROGRESS

A simple tool that converts .hocr files to JSON for further processing.

This is a work in progress. So far it has been tested with Tesseract output.

Example result JSON

The result contains some meta information and a JSON representation of the hOCR file.

Example output

{
  "contentType": "text/html;charset=utf-8",
  "ocrCapabilities": "ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf",
  "ocrSystem": "tesseract 4.1.0-rc1-752-g8b69",
  "pages": [
    {
      "bbox": [[ 0, 0 ], [ 640, 480 ]],
      "careas": [
        {
          "bbox": [[ 36, 92], [ 618, 361 ]],
          "id": "block_1_1",
          "pars": [
            {
              "id": "par_1_1",
              "lang": "eng",
              "lines": [
                {
                  "baseline": [ 0, -6 ],
                  "bbox": [[ 36, 92 ], [ 580, 122 ]],
                  "id": "line_1_1",
                  "words": [
                    {
                      "bbox": [[ 36, 92 ], [ 96, 116 ]],
                      "content": "This",
                      "id": "word_1_1",
                      "xWconf": 91
                    },
                    {
                      "bbox": [[ 109, 92 ], [ 129, 116 ]],
                      "content": "is",
                      "id": "word_1_2",
                      "xWconf": 92
                    },
                    {
                      "bbox": [[ 141, 98 ], [ 156, 116 ]],
                      "content": "a",
                      "id": "word_1_3",
                      "xWconf": 92
                    },
                    {
                      "bbox": [[ 169, 92 ], [ 201, 116 ]],
                      "content": "lot",
                      "id": "word_1_4",
                      "xWconf": 90
                    },
                    {
                      "bbox": [[ 212, 92 ], [ 240, 116 ]],
                      "content": "of",
                      "id": "word_1_5",
                      "xWconf": 93
                    },
                    // ...

Further example content can be found in the /stub/ directory.
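
For further processing, the nested structure shown in the example output (pages, careas, pars, lines, words) can be walked with a few loops. The following is a minimal Python sketch, not part of this tool. The helper names `iter_words`, `bbox_size` and `confident_text` are hypothetical, the sample is a shortened but valid copy of the example above, and it assumes each `bbox` is a pair of corner points `[[x0, y0], [x1, y1]]` and that `xWconf` is the word confidence reported by the OCR engine.

```python
import json

# Shortened, valid copy of the example output (only the fields used below).
SAMPLE = """
{
  "pages": [{
    "bbox": [[0, 0], [640, 480]],
    "careas": [{
      "id": "block_1_1",
      "pars": [{
        "id": "par_1_1",
        "lang": "eng",
        "lines": [{
          "id": "line_1_1",
          "bbox": [[36, 92], [580, 122]],
          "words": [
            {"bbox": [[36, 92], [96, 116]], "content": "This",
             "id": "word_1_1", "xWconf": 91},
            {"bbox": [[109, 92], [129, 116]], "content": "is",
             "id": "word_1_2", "xWconf": 92}
          ]
        }]
      }]
    }]
  }]
}
"""


def iter_words(doc):
    """Yield every word object: pages -> careas -> pars -> lines -> words."""
    for page in doc.get("pages", []):
        for carea in page.get("careas", []):
            for par in carea.get("pars", []):
                for line in par.get("lines", []):
                    yield from line.get("words", [])


def bbox_size(bbox):
    """Return (width, height), assuming bbox is [[x0, y0], [x1, y1]]."""
    (x0, y0), (x1, y1) = bbox
    return x1 - x0, y1 - y0


def confident_text(doc, min_conf):
    """Join the content of all words whose xWconf is at least min_conf."""
    return " ".join(
        w["content"] for w in iter_words(doc) if w.get("xWconf", 0) >= min_conf
    )


if __name__ == "__main__":
    doc = json.loads(SAMPLE)
    for word in iter_words(doc):
        print(word["id"], word["content"], bbox_size(word["bbox"]))
    print(confident_text(doc, 92))
```

To use it on a real result, replace `json.loads(SAMPLE)` with `json.load(...)` on the file written by the converter. The `.get(..., [])` lookups are a defensive choice so that pages or lines without children are skipped rather than raising an error.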