Open Web Crawler cannot extract Swiftype-style meta tags via extraction_rulesets (meta class filtering limitation) #412

## Bug Description
Open Web Crawler extraction_rulesets cannot extract Swiftype-style meta tags (for example, `<meta class="swiftype" name="boost" content="4">`) because extraction rules return only node text and do not support HTML attribute extraction. As a result, Swiftype metadata fields such as boost and published_at cannot be captured at crawl time, even though they are present in the HTML and can be extracted by external tools.

This blocks migrations from Swiftype/App Search metadata conventions to Open Web Crawler unless custom post-processing is added.

## To Reproduce
Steps to reproduce the behavior:
1. Configure Open Web Crawler with an extraction_rulesets rule that targets a Swiftype meta tag, for example:
   selector: "//meta[@class='swiftype' and @name='boost']/@content"
2. Crawl a page that contains valid Swiftype meta tags in the HTML head.
3. Inspect the indexed document fields.
4. Observe that Swiftype fields (boost, published_at, etc.) are empty or missing.
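
The empty result in step 4 is consistent with the difference between a node's inner text and its attribute values. The following is a minimal illustration of that difference only, written in Python with `xml.etree.ElementTree`; it is not the crawler's Ruby code path, and the sample page, the `published_at` value, and the `node_text` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for a page head (illustrative data, not a real page).
# ElementTree needs well-formed XML, so the meta tags are self-closed here.
PAGE = """
<html>
  <head>
    <title>Example</title>
    <meta class="swiftype" name="boost" content="4" />
    <meta class="swiftype" name="published_at" content="2026-01-15" />
  </head>
</html>
"""


def node_text(element):
    """Return the concatenated inner text of an element (hypothetical helper)."""
    return "".join(element.itertext()).strip()


root = ET.fromstring(PAGE)
boost = root.find(".//meta[@class='swiftype'][@name='boost']")

text_value = node_text(boost)      # what a text-only extractor would see
attr_value = boost.get("content")  # the value the migration actually needs

print(repr(text_value), repr(attr_value))
```

A meta element has no inner text, so a text-only extractor returns an empty string even though the selector matched the right node.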

## Expected behavior
Extraction rules should be able to retrieve meta tag attribute values (for example, the content attribute), or the crawler should provide a supported mechanism for extracting Swiftype-style metadata, similar to existing helpers like meta_keywords and meta_description.

## Screenshots
Not applicable. Validation was performed via indexed documents and debug fields.

## Environment
- OS: Linux (Docker-based Open Web Crawler)
- Browser: N/A
- Version: Elastic Open Web Crawler (current main branch as of Jan 2026), Elasticsearch 9.x

## Additional context
- Standard crawler fields (title, body_content, links, headings) populate correctly.
- extraction_rules execute for CSS/XPath selectors that return text nodes.
- Swiftype meta tags do not contain inner text, only attributes, so extract_by_css_selector and extract_by_xpath_selector return empty results.
- Ruby call path:
  Crawler::ContentEngine::Extractor.extract
    -> execute_rules
      -> extract_from_crawl_result
        -> extract_by_css_selector / extract_by_xpath_selector
- These methods only return node text and do not expose attributes.
- External validation using a standalone Python stdlib script confirms the HTML contains valid Swiftype meta tags and values.
- Ingest pipelines cannot fully solve this because raw_html is not exposed by default and extraction_rules cannot access attributes directly.
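
For reference, the external validation mentioned above can be approximated with a short stdlib-only script along these lines. This is a sketch rather than the exact script used; the class name `SwiftypeMetaParser`, the function `extract_swiftype_fields`, and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class SwiftypeMetaParser(HTMLParser):
    """Collect name -> [content, ...] from <meta class="swiftype" ...> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # A valueless attribute is reported as None, so fall back to "".
        classes = (attributes.get("class") or "").split()
        name = attributes.get("name")
        if "swiftype" in classes and name:
            # Repeated names are kept in document order as a list.
            self.fields.setdefault(name, []).append(attributes.get("content"))


def extract_swiftype_fields(html):
    """Parse an HTML string and return the Swiftype meta fields found in it."""
    parser = SwiftypeMetaParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Illustrative page; the values are sample data.
SAMPLE = """<!DOCTYPE html>
<html><head>
<meta name="description" content="ignored">
<meta class="swiftype" name="boost" data-type="integer" content="4">
<meta class="swiftype" name="published_at" data-type="date" content="2026-01-15">
</head><body><p>Hello</p></body></html>"""

fields = extract_swiftype_fields(SAMPLE)
print(fields)
```

Running a script like this against the crawled page's HTML shows whether the Swiftype tags and their content values are present in the source, independently of the crawler.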

This appears to be a crawler capability gap rather than a configuration issue. Potential enhancements include:
- Supporting attribute extraction in extraction_rules.
- Adding configurable meta tag helpers similar to meta_tags_elastic.
- Exposing selected meta blocks or raw_html for downstream processing.

