This file has a .doc extension, but it is actually an HTML file. When yomu processes it, yomu runs for a very long time without any response, until I force-kill the thread.
The behavior is the same even if I rename the file to *.html, so the file itself may be unusual.
When I parse it with Tika directly, the text is extracted correctly.
@sherllochen I don't believe this project is maintained. I suggest trying a newer version of Tika (v1.14). I've forked this project and updated Tika; see https://github.com/abrom/henkei
fake_doc_but_htm.doc.zip
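Until the hang itself is resolved, one possible workaround is to put a time limit around the extraction call instead of killing the thread by hand. The following is only a minimal sketch: `extract_with_timeout` is a hypothetical helper (not part of yomu), the 30-second limit and the file path are placeholders, and the usage lines assume the yomu gem and Java are installed. Note that the Java process started for the extraction may keep running after the timeout fires.

```ruby
require 'timeout'

# Hypothetical helper, not part of yomu: runs the given block and returns
# its value, or nil if the block takes longer than `seconds`.
def extract_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  nil
end

# Usage with yomu (assumption: the gem is installed and the path exists):
#   require 'yomu'
#   text = extract_with_timeout(30) { Yomu.new('path/to/file.doc').text }
#   warn 'extraction timed out' if text.nil?
```

A `nil` result here only means the call was abandoned; it says nothing about why the file makes the extraction stall.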