Open Web Crawler cannot extract Swiftype-style meta tags via extraction_rulesets (meta class filtering limitation) #412

@lcrane777

Bug Description

Open Web Crawler extraction_rulesets cannot extract Swiftype-style meta tags (for example <meta class="swiftype" name="boost" content="...">) because extraction rules return only node text and do not support HTML attribute extraction. As a result, Swiftype metadata fields such as boost and published_at are unreachable during crawl-time extraction, even though they exist in the HTML and can be extracted externally.

Without custom post-processing, this blocks migrations from Swiftype/App Search metadata conventions to Open Web Crawler.

To Reproduce

Steps to reproduce the behavior:

  1. Configure Open Web Crawler with extraction_rules targeting Swiftype meta tags, for example:
    selector: "//meta[@class='swiftype' and @name='boost']/@content"
  2. Crawl a page that contains valid Swiftype meta tags in the HTML head.
  3. Inspect the indexed document fields.
  4. Observe that Swiftype fields (boost, published_at, etc.) are empty or missing.

Expected behavior

Extraction rules should be able to retrieve meta tag attribute values (for example the value of the content attribute), or the crawler should provide a supported mechanism for extracting Swiftype-style metadata, similar to existing helpers like meta_keywords and meta_description.

Screenshots

Not applicable. Validation was performed via indexed documents and debug fields.

Environment

  • OS: Linux (Docker-based Open Web Crawler)
  • Browser: N/A
  • Version: Elastic Open Web Crawler (current main branch as of Jan 2026), Elasticsearch 9.x

Additional context

  • Standard crawler fields (title, body_content, links, headings) populate correctly.
  • extraction_rules execute for CSS/XPath selectors that return text nodes.
  • Swiftype meta tags do not contain inner text, only attributes, so extract_by_css_selector and extract_by_xpath_selector return empty results.
  • Ruby call path:
    Crawler::ContentEngine::Extractor.extract
    -> execute_rules
    -> extract_from_crawl_result
    -> extract_by_css_selector / extract_by_xpath_selector
  • These methods only return node text and do not expose attributes.
  • External validation using a standalone Python stdlib script confirms the HTML contains valid Swiftype meta tags and values.
  • Ingest pipelines cannot fully solve this because raw_html is not exposed by default and extraction_rules cannot access attributes directly.
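
For reference, the external validation mentioned above can be approximated with a few lines of stdlib Python. This is a minimal sketch and not the exact script used for this report. The class name SwiftypeMetaParser, the helper swiftype_fields, and the sample values (5, 2026-01-15) are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class SwiftypeMetaParser(HTMLParser):
    """Collect name -> content pairs from <meta class="swiftype"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "meta":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "swiftype" in classes and attr.get("name"):
            # The value lives in the content attribute, not in inner text.
            self.fields[attr["name"]] = attr.get("content")


def swiftype_fields(html):
    """Return a dict of Swiftype meta fields found in an HTML string."""
    parser = SwiftypeMetaParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical sample page; the field values are invented for this sketch.
SAMPLE = """<html><head>
<title>Example</title>
<meta name="description" content="not a Swiftype tag">
<meta class="swiftype" name="boost" content="5">
<meta class="swiftype" name="published_at" content="2026-01-15">
</head><body><p>Hello</p></body></html>"""

print(swiftype_fields(SAMPLE))
```

The plain description tag is skipped because it has no swiftype class, which mirrors the class filtering the crawler rules would need.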

This appears to be a crawler capability gap rather than a configuration issue. Potential enhancements include:

  • Supporting attribute extraction in extraction_rules.
  • Adding configurable meta tag helpers similar to meta_tags_elastic.
  • Exposing selected meta blocks or raw_html for downstream processing.
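
To make the first option concrete, here is a rough sketch of how an attribute-aware rule could behave. It is written in Python with the stdlib ElementTree module purely for illustration and is not the crawler's Ruby implementation. The "attribute" rule key, the apply_rule function, and the sample page are all hypothetical and do not exist in Open Web Crawler today. ElementTree needs well-formed markup, so the sample uses a self-closing meta tag.

```python
import xml.etree.ElementTree as ET


def apply_rule(root, rule):
    """Evaluate one extraction rule against a parsed document.

    `rule` is a hypothetical dict: "selector" is an ElementTree path, and the
    optional "attribute" key (not an existing crawler option) selects an
    attribute value instead of the node text.
    """
    values = []
    for node in root.findall(rule["selector"]):
        if "attribute" in rule:
            value = node.get(rule["attribute"])
        else:
            value = (node.text or "").strip()
        if value:
            values.append(value)
    return values


# Hypothetical page fragment; the boost value is invented for this sketch.
PAGE = """<html><head>
<meta class="swiftype" name="boost" content="5" />
</head><body></body></html>"""

root = ET.fromstring(PAGE)
selector = ".//meta[@class='swiftype'][@name='boost']"

# Text-only rule: the meta element has no inner text, so nothing is extracted.
text_only = apply_rule(root, {"selector": selector})

# Attribute-aware rule: reads the content attribute of the same node.
with_attribute = apply_rule(root, {"selector": selector, "attribute": "content"})

print(text_only, with_attribute)
```

The same selector yields nothing in text mode and the attribute value once the rule names an attribute, which is the gap described in this issue.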

Labels: bug
