Add support for removing HTML elements using XPath selectors #2632

michaelmcmillan · 2024-09-15T19:57:27Z

Similarly to how XPath selectors can be used to extract matching HTML elements, it would also be useful to remove HTML elements that match an XPath selector.
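To illustrate the idea, here is a minimal sketch of XPath-based removal. It is not the code from this PR: the project uses LXML, while this stand-in uses the standard library's xml.etree.ElementTree, which only supports a limited XPath subset and needs well-formed markup. The function name remove_matching is hypothetical.

```python
import xml.etree.ElementTree as ET


def remove_matching(markup: str, path: str) -> str:
    """Return markup with every element matching `path` removed.

    Assumption: `markup` is well-formed XML/XHTML and `path` is a relative
    ElementTree path such as ".//div[@class='ad']" (ElementTree rejects
    absolute paths like "//div" on an element).
    """
    root = ET.fromstring(markup)
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    for element in root.findall(path):
        parent = parent_of.get(element)
        if parent is None:
            continue  # defensive: the root has no parent and is never removed

        # Text that follows the element (its "tail") belongs to the parent's
        # content, so move it before detaching the element.
        if element.tail:
            siblings = list(parent)
            index = siblings.index(element)
            if index > 0:
                previous = siblings[index - 1]
                previous.tail = (previous.tail or "") + element.tail
            else:
                parent.text = (parent.text or "") + element.tail

        parent.remove(element)

    return ET.tostring(root, encoding="unicode")
```

Calling remove_matching("<div><p>keep</p><div class='ad'>buy</div> tail</div>", ".//div[@class='ad']") drops the inner div and keeps the trailing text.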

changedetectionio/html_tools.py

dgtlmoon · 2024-09-15T20:37:45Z

thanks so much :)

michaelmcmillan · 2024-09-16T15:48:02Z

@dgtlmoon: Added support for those now. Mind taking another look?

dgtlmoon · 2024-09-16T20:37:53Z

sorry about the test failure, it is unrelated to your changes

dgtlmoon · 2024-09-16T20:40:21Z

sorry, one more change, can you change the logic so it's the same as this

- use futures so that we capture any memory flood caused by anything that uses LXML
- use the same logic for finding what xpath filter to use (because xpath 2+ uses a different implementation)
  - xpath: or // assume xpath2+ compat
  - xpath1: should be only xpath1

changedetection.io/changedetectionio/processors/text_json_diff/processor.py

Lines 174 to 202 in 19f3851

    
    for filter_rule in include_filters_rule:
        # For HTML/XML we offer xpath as an option, just start a regular xPath "/.."
        if filter_rule[0] == '/' or filter_rule.startswith('xpath:'):
            with ProcessPoolExecutor() as executor:
                # Use functools.partial to create a callable with arguments - anything using bs4/lxml etc is quite "leaky"
                future = executor.submit(partial(html_tools.xpath_filter, xpath_filter=filter_rule.replace('xpath:', ''),
                                                 html_content=self.fetcher.content,
                                                 append_pretty_line_formatting=not watch.is_source_type_url,
                                                 is_rss=is_rss))
                html_content += future.result()
        elif filter_rule.startswith('xpath1:'):
            with ProcessPoolExecutor() as executor:
                # Use functools.partial to create a callable with arguments - anything using bs4/lxml etc is quite "leaky"
                future = executor.submit(partial(html_tools.xpath1_filter, xpath_filter=filter_rule.replace('xpath1:', ''),
                                                 html_content=self.fetcher.content,
                                                 append_pretty_line_formatting=not watch.is_source_type_url,
                                                 is_rss=is_rss))
                html_content += future.result()
        else:
            with ProcessPoolExecutor() as executor:
                # Use functools.partial to create a callable with arguments - anything using bs4/lxml etc is quite "leaky"
                # CSS Filter, extract the HTML that matches and feed that into the existing inscriptis::get_text
                future = executor.submit(partial(html_tools.include_filters, include_filters=filter_rule,
                                                 html_content=self.fetcher.content,
                                                 append_pretty_line_formatting=not watch.is_source_type_url))
                html_content += future.result()
    if not html_content.strip():
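The two requests above (prefix-based selection and running the filter in a separate process) can be sketched independently of the project's code. This is only an illustration: classify_filter and run_isolated are hypothetical names, not functions from changedetection.io, and the prefix rules are taken from the comment above.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def classify_filter(rule: str) -> tuple[str, str]:
    """Map a filter rule to (kind, expression) using the prefixes discussed above.

    "xpath1:" selects the XPath 1 implementation; "xpath:" or a leading "/"
    selects the XPath 2+ compatible one; anything else is treated as CSS.
    """
    if rule.startswith("xpath1:"):
        return "xpath1", rule[len("xpath1:"):]
    if rule.startswith("xpath:"):
        return "xpath", rule[len("xpath:"):]
    if rule.startswith("/"):
        return "xpath", rule
    return "css", rule


def run_isolated(func, **kwargs):
    """Run func(**kwargs) in a short-lived worker process and return its result.

    Any memory the call leaks is released when the worker exits. `func` must
    be picklable (for example, defined at module level), as must its
    arguments and return value.
    """
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(partial(func, **kwargs))
        return future.result()
```

A removal path following the same pattern would call classify_filter on each subtractive rule and pass the matching removal function to run_isolated, in the same way the embedded snippet does for include filters.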

dgtlmoon · 2024-09-17T17:08:16Z

@michaelmcmillan please update your fork with our master and repush

michaelmcmillan · 2024-09-17T19:48:45Z

> sorry, one more change, can you change the logic so it's the same as this
>
> - use futures so that we capture any memory flood caused by anything that uses LXML
> - use the same logic for finding what xpath filter to use (because xpath 2+ uses a different implementation)
>   - xpath: or // assume xpath2+ compat
>   - xpath1: should be only xpath1

That was easier said than done. Been going at it for two hours, just getting errors... Can you give it a shot?

add support for removing html elements using xpath selectors

a0d6f67

dgtlmoon reviewed Sep 15, 2024

changedetectionio/html_tools.py (outdated)

support //, xpath1: and xpath: selectors

0b89e12

michaelmcmillan added 2 commits September 16, 2024 21:52

update form description to include support for xpath queries

c0417a4

allow xpath when validating global subtractive filters

60376d6

fetuffani and others added 5 commits September 17, 2024 21:50

Restock/Price detection - Fix duplicated prices with different data type on single page product dgtlmoon#2636 (dgtlmoon#2638)

0f25e80

Testing - Fix false filter missing check alerts

e1b9a63

Testing - Fixing Restock test dgtlmoon#2641

6342d06

browser_steps: add "click element containing text if exists" (dgtlmoon#2629)

8ef23c7

Update AppRise notification library to 1.9.0 (dgtlmoon#2624)

3204a19

dgtlmoon merged commit dc936a2 into dgtlmoon:master Sep 17, 2024
5 checks passed
