raw html
|
|
sax input -> sax parser(html parser) -> HTML Content handler -> tokenizer ---------
|
-------------------------------------<------------------------------------<------|
| | |
text blocks text blocks text blocks
| | |
| | |
-----------------------------
|
|
text document
|
|
filter
|
filter
|
filter
|
filter
|
filter
|
filter
|
filter
|
filter
|
filter
|
|
text document
|
outputs extracted text