Description
Bug Description
Open Web Crawler extraction_rulesets cannot extract Swiftype-style meta tags (for example, a meta element with class="swiftype" and name="boost" whose value is carried in its content attribute) because extraction rules return only node text and do not support HTML attribute extraction. As a result, Swiftype metadata fields such as boost and published_at are unreachable during crawl-time extraction, even though they exist in the HTML and can be extracted externally.
This behavior blocks migrations from Swiftype/App Search metadata conventions to Open Web Crawler without custom post-processing.
To Reproduce
Steps to reproduce the behavior:
- Configure Open Web Crawler with extraction_rules targeting Swiftype meta tags, for example:
  selector: "//meta[@Class='swiftype' and @name='boost']/@content"
- Crawl a page that contains valid Swiftype meta tags in the HTML head.
- Inspect the indexed document fields.
- Observe that Swiftype fields (boost, published_at, etc.) are empty or missing.
Expected behavior
Extraction rules should be able to retrieve meta tag attribute values (for example, the content attribute), or the crawler should provide a supported mechanism to extract Swiftype-style metadata, similar to existing helpers like meta_keywords and meta_description.
Screenshots
Not applicable. Validation was performed via indexed documents and debug fields.
Environment
- OS: Linux (Docker-based Open Web Crawler)
- Browser: N/A
- Version: Elastic Open Web Crawler (current main branch as of Jan 2026), Elasticsearch 9.x
Additional context
- Standard crawler fields (title, body_content, links, headings) populate correctly.
- extraction_rules execute for CSS/XPath selectors that return text nodes.
- Swiftype meta tags do not contain inner text, only attributes, so extract_by_css_selector and extract_by_xpath_selector return empty results.
- Ruby call path:
Crawler::ContentEngine::Extractor.extract
-> execute_rules
-> extract_from_crawl_result
-> extract_by_css_selector / extract_by_xpath_selector
- These methods only return node text and do not expose attributes.
- External validation using a standalone Python stdlib script confirms the HTML contains valid Swiftype meta tags and values.
- Ingest pipelines cannot fully solve this because raw_html is not exposed by default and extraction_rules cannot access attributes directly.
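The external validation mentioned above can be approximated with a short standard-library script. This is a minimal sketch, not the script used for this report: the class name, the function name, and the sample HTML (including its values and data-type attributes) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class SwiftypeMetaParser(HTMLParser):
    """Collect name -> content for every <meta class="swiftype"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # A valueless attribute is reported as None, so fall back to "".
        classes = (attr_map.get("class") or "").split()
        if "swiftype" in classes and attr_map.get("name"):
            self.fields[attr_map["name"]] = attr_map.get("content")


def extract_swiftype_meta(html):
    """Return a dict of Swiftype meta tag names to their content values."""
    parser = SwiftypeMetaParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Illustrative page only; the field values are made up for this sketch.
SAMPLE_HTML = """
<html><head>
  <title>Example</title>
  <meta class="swiftype" name="boost" data-type="integer" content="5">
  <meta class="swiftype" name="published_at" data-type="date" content="2026-01-15">
  <meta name="description" content="Not a Swiftype tag">
</head><body><p>Hello</p></body></html>
"""

if __name__ == "__main__":
    print(extract_swiftype_meta(SAMPLE_HTML))
```

Run against the sample page, this prints {'boost': '5', 'published_at': '2026-01-15'}, which shows the values live entirely in attributes and never in node text.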
This appears to be a crawler capability gap rather than a configuration issue. Potential enhancements include:
- Supporting attribute extraction in extraction_rules.
- Adding configurable meta tag helpers similar to meta_tags_elastic.
- Exposing selected meta blocks or raw_html for downstream processing.
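To make the first enhancement concrete, the sketch below contrasts text-only extraction with attribute extraction. It is written in Python with the standard library rather than against the crawler's Ruby code, and the extract function and its attribute parameter are hypothetical names, not an existing or proposed Open Web Crawler API. The input is a well-formed fragment so the stdlib XML parser can read it; real pages would need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Well-formed head fragment (illustrative values only).
HEAD = ET.fromstring(
    "<head>"
    "<title>Example</title>"
    '<meta class="swiftype" name="boost" content="5"/>'
    "</head>"
)


def extract(root, selector, attribute=None):
    """Hypothetical rule evaluator.

    Returns node text by default, or the named attribute's value
    when `attribute` is given. Empty or missing values are skipped.
    """
    values = []
    for node in root.findall(selector):
        value = node.get(attribute) if attribute else node.text
        if value:
            values.append(value.strip())
    return values


# ElementTree supports only a subset of XPath, so the two conditions are
# written as chained predicates instead of using `and`.
SELECTOR = ".//meta[@class='swiftype'][@name='boost']"

text_only = extract(HEAD, SELECTOR)                       # [] (meta has no inner text)
with_attr = extract(HEAD, SELECTOR, attribute="content")  # ['5']
```

The same selector matches the element in both calls; only the choice of what to read from the matched node differs, which is the gap described in this issue.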