Use libzim IndexData::getContent to provide curated content to index. #282

Open
mgautierfr opened this issue Mar 14, 2023 · 3 comments

Comments

@mgautierfr

libzim provides a way for scrapers to supply content for indexing that differs from the content actually stored.

This allows better indexing when much of a page's content is not relevant to the subject of the page itself.

mwoffliner should parse the HTML content and extract only the relevant information (that is, remove things such as menus, footers, user information, links to other questions...)
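
As a rough illustration of the extraction step described above, here is a minimal sketch in Python using only the standard library. It is not mwoffliner code (mwoffliner is a Node.js project) and it does not call libzim; the set of skipped tags and the names `SKIPPED_TAGS`, `IndexTextExtractor`, and `extract_index_text` are hypothetical choices made for this example. The returned string is the kind of text a scraper might hand back through `IndexData::getContent`.

```python
from html.parser import HTMLParser

# Hypothetical list of tags whose text is treated as non-content.
# A real scraper would tune this to its own page templates.
SKIPPED_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class IndexTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested in SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_index_text(html: str) -> str:
    """Return the text that would be offered to the indexer."""
    parser = IndexTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body>"
    '<nav><a href="/">Home</a></nav>'
    "<h1>Title</h1><p>Body text.</p>"
    "<footer>Edited by user42</footer>"
    "</body></html>"
)
print(extract_index_text(sample))  # prints "Title Body text."
```

The stored entry would keep the full HTML; only the text offered to the indexer is reduced.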

See comments in openzim/libzim#653

@stale

stale bot commented May 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label May 26, 2023
@stale stale bot removed the stale label Jul 16, 2023
@kelson42
Contributor

@rgaudin @benoit74 Would this approach bring a real improvement here? If yes, which one?

@rgaudin
Member

rgaudin commented Jul 21, 2023

The improvement would be marginal, I think, because we don't include much non-content text in the HTML.
A side effect would be parsing all our output with an in-scraper HTML parser instead of letting libzim do it.


3 participants