Problem Description
We are migrating content from a legacy Swiftype/App Search–based implementation to the Elastic Open Web Crawler.
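Our existing content relies on Swiftype-style meta tags embedded in HTML: <meta> elements with the class swiftype whose name and content attributes carry fields such as boost and published_at.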
During testing, we confirmed:
- These meta tags exist in the HTML and are valid.
- The Open Web Crawler successfully indexes standard fields such as title, body_content, links, headings, description, keywords, and site_name (via og:site_name or ingest fallbacks).
- However, Swiftype-style meta tags are not extractable via extraction_rulesets.
Root cause identified through code review and testing:
- extraction_rulesets only extract node text content.
- <meta> tags do not contain inner text, only attributes (e.g., content=).
- extract_by_css_selector and extract_by_xpath_selector return text nodes only.
- Attribute values are therefore unreachable by design.
- raw_html is not exposed in a way that reliably enables ingest pipeline–based extraction at crawl scale.
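The text-versus-attribute distinction above can be reproduced outside the crawler. The following is a minimal sketch, not crawler code: it uses Python's standard XML parser on a hypothetical head fragment written XHTML-style (self-closed tags) so that the parser accepts it. The data-type attributes and the values 5 and 2020-01-15 are invented for illustration, and text_only is a stand-in for a text-only extraction rule, not the crawler's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; attribute values are invented for illustration.
HEAD = """
<head>
  <title>Example page</title>
  <meta class="swiftype" name="boost" data-type="integer" content="5"/>
  <meta class="swiftype" name="published_at" data-type="date" content="2020-01-15"/>
</head>
"""

root = ET.fromstring(HEAD)


def text_only(element):
    """Stand-in for a text-only rule: concatenate the element's text nodes."""
    return "".join(element.itertext()).strip()


title = root.find("title")
boost = root.find("meta[@name='boost']")

# The title has a text node, so a text-only rule can read it.
print("title text:", repr(text_only(title)))
# The meta element has no text node, so the same rule yields an empty string.
print("boost text:", repr(text_only(boost)))
# The value is only reachable through the attribute.
print("boost content attribute:", repr(boost.get("content")))
```

The same selector that finds the element returns nothing useful as text, which matches the behavior we observed with extraction_rulesets.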
As a result, fields such as boost and published_at cannot be extracted at crawl time, even though the data is present and accessible externally.
This creates a migration gap for users moving from Swiftype/App Search metadata conventions to the Open Web Crawler.
Proposed Solution
Support attribute-level extraction in extraction_rulesets, for example:
Option A:
Allow extraction rules to specify an attribute:
selector: meta.swiftype[name=boost]
attribute: content
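To make the intended semantics of Option A concrete, here is a rough Python sketch of how such a rule might be evaluated. It is an illustration only, not a proposal for the crawler's internals: the function name extract_attribute and the class AttributeRule are hypothetical, and the selector parsing is deliberately limited to the tag.class[attr=value] shape used in the example above (a real implementation would rely on a proper CSS selector engine).

```python
import re
from html.parser import HTMLParser

# Assumption: only selectors shaped like tag.class[attr=value] are supported here.
SELECTOR_RE = re.compile(r"^(\w+)\.([\w-]+)\[([\w-]+)=([\w-]+)\]$")


class AttributeRule(HTMLParser):
    """Applies one hypothetical rule of the form {"selector": ..., "attribute": ...}."""

    def __init__(self, rule):
        super().__init__()
        match = SELECTOR_RE.match(rule["selector"])
        if match is None:
            raise ValueError(f"unsupported selector: {rule['selector']}")
        self.tag, self.css_class, self.attr_name, self.attr_value = match.groups()
        self.attribute = rule["attribute"]
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if (
            tag == self.tag
            and self.css_class in classes
            and attrs.get(self.attr_name) == self.attr_value
            and self.attribute in attrs
        ):
            # Return the attribute value instead of the (empty) node text.
            self.values.append(attrs[self.attribute])


def extract_attribute(html, rule):
    """Return every value of rule["attribute"] on elements matching the selector."""
    parser = AttributeRule(rule)
    parser.feed(html)
    return parser.values


# Usage with an invented boost value:
page = '<head><meta class="swiftype" name="boost" data-type="integer" content="5"></head>'
rule = {"selector": "meta.swiftype[name=boost]", "attribute": "content"}
print(extract_attribute(page, rule))
```

When no attribute is given, a rule could keep today's text-only behavior, so existing configurations would be unaffected.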
Option B:
Provide a generic meta-tag extraction helper similar to existing helpers (meta_keywords, meta_description, meta_tags_elastic), but configurable:
- Support arbitrary meta tag classes (e.g., swiftype)
- Support name/content pairs dynamically
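A configurable helper along the lines of Option B might behave like the sketch below. This is an assumption about the desired behavior, written in Python for illustration; the names MetaCollector and meta_fields are hypothetical and do not exist in the crawler. It collects every name/content pair from meta tags carrying a given class, and keeps repeated names as lists.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if self.css_class not in (attrs.get("class") or "").split():
            return
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            # Group values by name so repeated tags are not lost.
            self.fields.setdefault(name, []).append(content)


def meta_fields(html, css_class="swiftype"):
    """Return {name: content}, or {name: [contents]} when a name repeats."""
    collector = MetaCollector(css_class)
    collector.feed(html)
    return {k: v[0] if len(v) == 1 else v for k, v in collector.fields.items()}


# Usage with invented values:
sample = """
<meta class="swiftype" name="boost" data-type="integer" content="5">
<meta class="swiftype" name="published_at" data-type="date" content="2020-01-15">
<meta class="swiftype" name="tags" data-type="string" content="news">
<meta class="swiftype" name="tags" data-type="string" content="release">
<meta name="description" content="ignored: no swiftype class">
"""
print(meta_fields(sample))
```

Type handling (for example, honoring a data-type attribute to cast boost to an integer) is left out here and would need a decision of its own.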
Option C:
Optionally expose raw_html (or selected meta blocks) in a supported, documented way for ingest pipelines, with clear guidance on performance and storage implications.
Any one of these would allow users to extract structured metadata currently stored in meta attributes.
Alternatives
- Manual post-processing using external scripts (e.g., Python crawlers) to re-fetch pages and update documents after indexing.
- Re-implementing boost and published_at semantics using Elastic-native scoring, ranking, and date fields instead of Swiftype metadata.
- Hard-coding additional meta helpers in the crawler for specific legacy conventions.
These alternatives are workable but add operational complexity and make migration more difficult for large sites.
Additional Context
- This limitation appears to be intentional given the crawler’s current design, but it is not obvious from the documentation that attribute extraction is unsupported.
- The issue primarily affects customers migrating from Swiftype/App Search who have years of metadata encoded in meta tags.
- We can provide reproducible test URLs, crawler.yml configuration, and extracted HTML examples if needed.
- This request is intended as an enhancement, not a bug report.
We are seeking confirmation on whether:
- Attribute extraction support is planned.
- This is an intentional long-term limitation.
- There is a recommended migration pattern for Swiftype-style metadata.
@anish Mathur