Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for removing HTML elements using XPath selectors #2632

Conversation

michaelmcmillan
Copy link
Contributor

@michaelmcmillan michaelmcmillan commented Sep 15, 2024

Similarly to how XPath selectors can be used to extract matching HTML elements, it would also be useful to remove HTML elements that match an XPath selector.

Screenshot 2024-09-16 at 21 51 03 Screenshot 2024-09-16 at 21 49 51 Screenshot 2024-09-16 at 21 48 03

@dgtlmoon
Copy link
Owner

thanks so much :)

@michaelmcmillan
Copy link
Contributor Author

@dgtlmoon: Added support for those now. Mind taking another look?

@dgtlmoon
Copy link
Owner

sorry about the test failure, it is unrelated to your changes

@dgtlmoon
Copy link
Owner

sorry, one more change, can you change the logic so its the same as this

  • use futures so that we capture any memory flood caused by anything that used LXML
  • use the same logic for finding what xpath filter to use (because xpath 2+ uses a different implementation)

xpath: or // assume xpath2+ compat

xpath1: should be only xpath1

for filter_rule in include_filters_rule:
# For HTML/XML we offer xpath as an option, just start a regular xPath "/.."
if filter_rule[0] == '/' or filter_rule.startswith('xpath:'):
with ProcessPoolExecutor() as executor:
# Use functools.partial to create a callable with arguments - anything using bs4/lxml etc is quite "leaky"
future = executor.submit(partial(html_tools.xpath_filter, xpath_filter=filter_rule.replace('xpath:', ''),
html_content=self.fetcher.content,
append_pretty_line_formatting=not watch.is_source_type_url,
is_rss=is_rss))
html_content += future.result()
elif filter_rule.startswith('xpath1:'):
with ProcessPoolExecutor() as executor:
# Use functools.partial to create a callable with arguments - anything using bs4/lxml etc is quite "leaky"
future = executor.submit(partial(html_tools.xpath1_filter, xpath_filter=filter_rule.replace('xpath1:', ''),
html_content=self.fetcher.content,
append_pretty_line_formatting=not watch.is_source_type_url,
is_rss=is_rss))
html_content += future.result()
else:
with ProcessPoolExecutor() as executor:
# Use functools.partial to create a callable with arguments - anything using bs4/lxml etc is quite "leaky"
# CSS Filter, extract the HTML that matches and feed that into the existing inscriptis::get_text
future = executor.submit(partial(html_tools.include_filters, include_filters=filter_rule,
html_content=self.fetcher.content,
append_pretty_line_formatting=not watch.is_source_type_url))
html_content += future.result()
if not html_content.strip():

@dgtlmoon
Copy link
Owner

@michaelmcmillan please update your fork with our master and repush

@michaelmcmillan
Copy link
Contributor Author

sorry, one more change, can you change the logic so its the same as this

  • use futures so that we capture any memory flood caused by anything that used LXML
  • use the same logic for finding what xpath filter to use (because xpath 2+ uses a different implementation)

xpath: or // assume xpath2+ compat

xpath1: should be only xpath1

That was easier said than done. Been going at it for two hours, just getting errors... Can you give it a shot?

@dgtlmoon dgtlmoon merged commit dc936a2 into dgtlmoon:master Sep 17, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants