boilerpipe_flow.md

File metadata and controls

40 lines (40 loc) · 844 Bytes
raw HTML
   |
   |
  SAX input -> SAX parser (HTML parser) -> HTML content handler -> tokenizer ---------
                                                                                     |
    -------------------------------------<------------------------------------<------|
    |              |            |
text blocks    text blocks  text blocks
    |              |            |
    |              |            |
    -----------------------------
          |
          |
     text document
          |
          |
        filter
          |
        filter
          |
        filter
          |
        filter
          |
        filter
          |
        filter
          |
        filter
          |
        filter
          |
        filter
          |
          |
     text document
          |
  outputs extracted text
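
The flow above can be sketched in a few lines of Python. This is only an illustration of the stages (parser events -> content handler -> tokenizer -> text blocks -> text document -> chain of filters -> extracted text), not boilerpipe's actual code or API. It assumes Python's standard `html.parser` in place of a SAX parser, and every name in it (`TextBlock`, `BlockHandler`, `min_words_filter`, `drop_non_content`, `extract_text`) is hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Hypothetical names throughout; this is NOT boilerpipe's API.
# Tags assumed to start or end a text block in this sketch.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td", "br"}


@dataclass
class TextBlock:
    text: str
    num_words: int
    is_content: bool = True


class BlockHandler(HTMLParser):
    """Content handler: turns parser events into a list of text blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Tokenizer step: split the buffered text into words.
        words = " ".join(self._buf).split()
        if words:
            self.blocks.append(TextBlock(" ".join(words), len(words)))
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text as a final block


def min_words_filter(min_words):
    """Build a filter that marks short blocks as non-content."""
    def apply(blocks):
        for block in blocks:
            if block.num_words < min_words:
                block.is_content = False
        return blocks
    return apply


def drop_non_content(blocks):
    """Filter that keeps only blocks still marked as content."""
    return [block for block in blocks if block.is_content]


def extract_text(html, filters):
    handler = BlockHandler()
    handler.feed(html)
    handler.close()
    document = handler.blocks      # the "text document"
    for apply_filter in filters:   # filter -> filter -> ...
        document = apply_filter(document)
    return "\n".join(block.text for block in document)
```

A short usage example, with a made-up page:

```python
page = (
    "<html><body><div>Home | About</div>"
    "<p>This is the main article text with enough words to keep.</p>"
    "<script>var x=1;</script></body></html>"
)
print(extract_text(page, [min_words_filter(5), drop_non_content]))
```

Each filter takes the text document and returns it, possibly relabelled or shortened, which is why the filters can be stacked one after another as in the diagram.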