Open Web Crawler cannot extract Swiftype-style meta tags via extraction_rulesets (meta class filtering limitation) #413

@lcrane777

Problem Description

We are migrating content from a legacy Swiftype/App Search–based implementation to the Elastic Open Web Crawler.
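Our existing content relies on Swiftype-style meta tags embedded in HTML: <meta class="swiftype"> elements whose values are carried in name and content attributes (for example, boost and published_at).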

During testing, we confirmed:

  • These meta tags exist in the HTML and are valid.
  • The Open Web Crawler successfully indexes standard fields such as title, body_content, links, headings, description, keywords, and site_name (via og:site_name or ingest fallbacks).
  • However, Swiftype-style meta tags are not extractable via extraction_rulesets.

Root cause identified through code review and testing:

  • extraction_rulesets only extract node text content.
  • <meta> tags do not contain inner text, only attributes (e.g., content=).
  • extract_by_css_selector and extract_by_xpath_selector return text nodes only.
  • Attribute values are therefore unreachable by design.
  • raw_html is not exposed in a way that reliably enables ingest pipeline–based extraction at crawl scale.
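
To illustrate the text-versus-attribute distinction outside the crawler, here is a minimal Python sketch using only the standard library. It is not the crawler's code; the sample markup and the MetaProbe class are hypothetical and exist only to show that a <meta> element yields no text node while its content attribute holds the value.

```python
from html.parser import HTMLParser


class MetaProbe(HTMLParser):
    """Records every non-blank text node and the attributes of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []  # text nodes found anywhere in the markup
        self.meta_attrs = []   # one attribute dict per <meta> tag

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_attrs.append(dict(attrs))

    def handle_data(self, data):
        if data.strip():
            self.text_chunks.append(data.strip())


# Hypothetical sample markup in the Swiftype style.
SAMPLE = (
    '<head><meta class="swiftype" name="boost" '
    'data-type="integer" content="5"></head>'
)

probe = MetaProbe()
probe.feed(SAMPLE)
probe.close()

print(probe.text_chunks)               # no text nodes to extract
print(probe.meta_attrs[0]["content"])  # the value lives in the attribute
```

A text-only selector applied to this markup has nothing to return, which matches what we observed with extraction_rulesets.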

As a result, fields such as boost and published_at cannot be extracted at crawl time, even though the data is present and accessible externally.

This creates a migration gap for users moving from Swiftype/App Search metadata conventions to the Open Web Crawler.


Proposed Solution

Support attribute-level extraction in extraction_rulesets, for example:

Option A:
Allow extraction rules to specify an attribute:
selector: meta.swiftype[name=boost]
attribute: content

Option B:
Provide a generic meta-tag extraction helper similar to existing helpers (meta_keywords, meta_description, meta_tags_elastic), but configurable:

  • Support arbitrary meta tag classes (e.g., swiftype)
  • Support name/content pairs dynamically
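
As a rough sketch of the behaviour we have in mind for Option B (not a proposal for the actual implementation, and not the crawler's API), the following Python collects name/content pairs from every <meta> tag carrying a given class. The function name extract_meta_by_class and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaClassExtractor(HTMLParser):
    """Collects name/content pairs from <meta> tags that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.fields = {}  # name -> list of content values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        name, content = attr.get("name"), attr.get("content")
        if self.css_class in classes and name and content is not None:
            # A name may repeat, so values are kept as a list.
            self.fields.setdefault(name, []).append(content)


def extract_meta_by_class(html, css_class):
    """Hypothetical helper: return {name: [content, ...]} for matching tags."""
    parser = MetaClassExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical sample markup.
SAMPLE = """
<head>
  <meta name="description" content="ignored: no matching class">
  <meta class="swiftype" name="boost" data-type="integer" content="5">
  <meta class="swiftype" name="published_at" data-type="date" content="2024-01-15">
</head>
"""

fields = extract_meta_by_class(SAMPLE, "swiftype")
print(fields)
```

Values are returned as strings; converting them according to data-type would be a separate step that this sketch leaves out.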

Option C:
Optionally expose raw_html (or selected meta blocks) in a supported, documented way for ingest pipelines, with clear guidance on performance and storage implications.

Any one of these would allow users to extract structured metadata currently stored in meta attributes.


Alternatives

  • Manual post-processing using external scripts (e.g., Python crawlers) to re-fetch pages and update documents after indexing.
  • Re-implementing boost and published_at semantics using Elastic-native scoring, ranking, and date fields instead of Swiftype metadata.
  • Hard-coding additional meta helpers in the crawler for specific legacy conventions.

These alternatives are workable but add operational complexity and make migration more difficult for large sites.
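
For reference, the first alternative might look roughly like the Python sketch below, which builds the request body for Elasticsearch _bulk partial updates from metadata that an external script has already extracted. The index name, document IDs, and field values are placeholders, and the sketch assumes the script already knows which document ID corresponds to each page.

```python
import json


def bulk_update_lines(index, docs):
    """Yield NDJSON lines for _bulk partial updates.

    docs maps a document ID to the fields to merge into that document.
    """
    for doc_id, fields in docs.items():
        yield json.dumps({"update": {"_index": index, "_id": doc_id}})
        yield json.dumps({"doc": fields})


def bulk_update_body(index, docs):
    """Join the action/source lines; the _bulk body must end with a newline."""
    return "\n".join(bulk_update_lines(index, docs)) + "\n"


# Placeholder index name, document ID, and values.
body = bulk_update_body(
    "my-crawler-index",
    {"example-doc-id": {"boost": 5, "published_at": "2024-01-15"}},
)
print(body)
```

Even in this reduced form, the approach requires re-fetching every page and keeping document IDs in sync, which is the operational overhead we would like to avoid.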


Additional Context

  • This limitation appears to be intentional given the crawler’s current design, but it is not obvious from the documentation that attribute extraction is unsupported.
  • The issue primarily affects customers migrating from Swiftype/App Search who have years of metadata encoded in meta tags.
  • We can provide reproducible test URLs, crawler.yml configuration, and extracted HTML examples if needed.
  • This request is intended as an enhancement, not a bug report.

We are seeking confirmation on whether:

  • Attribute extraction support is planned.
  • This is an intentional long-term limitation.
  • There is a recommended migration pattern for Swiftype-style metadata.

@anish Mathur


Labels: enhancement (New feature or request)
