Fix excluded tags lookup to use correct key type #417

Merged
seanstory merged 2 commits into main from seanstory/416-fix-exclude-tags-lookup
Feb 6, 2026

@seanstory seanstory commented Feb 6, 2026

Closes #416

The exclude_tags configuration was not being applied correctly. The config stores exclude_tags keyed by domain URL strings (e.g., "https://example.com"), but the lookup in get_body_tag was using the URL object directly as the hash key instead of url.site.

This fix changes the lookup to use url.site (which returns the scheme + host as a string) to match how the config stores the keys.
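
The key mismatch described above can be reproduced with a minimal sketch. `SketchURL`, `EXCLUDE_TAGS`, and the two lookup helpers are hypothetical stand-ins written for illustration, not the crawler's actual classes; only the `site` method (scheme + host as a string) and the string-keyed config come from the description above.

```ruby
require 'uri'

# Hypothetical stand-in for the crawler's URL object. Only `site`
# (scheme + host as a string) is taken from the PR description.
class SketchURL
  def initialize(raw)
    @uri = URI.parse(raw)
  end

  def site
    "#{@uri.scheme}://#{@uri.host}"
  end
end

# Assumed config shape: exclude_tags keyed by domain URL strings.
EXCLUDE_TAGS = { 'https://example.com' => %w[header footer] }.freeze

# Buggy lookup: the URL object itself is used as the hash key,
# which never equals a String key, so the result is nil.
def lookup_by_object(url)
  EXCLUDE_TAGS[url]
end

# Fixed lookup: url.site yields the same string the config uses as its key.
def lookup_by_site(url)
  EXCLUDE_TAGS[url.site]
end

url = SketchURL.new('https://example.com/blog/post')
puts lookup_by_object(url).inspect # prints nil
puts lookup_by_site(url).inspect   # prints ["header", "footer"]
```

Because a Ruby Hash compares keys with `eql?` and `hash`, an arbitrary object never matches a String key, so the lookup fails silently rather than raising an error.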

Checklists

Pre-Review Checklist

  • This PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check crawler.yml.example and elasticsearch.yml.example)
  • This PR has a meaningful title
  • This PR links to all relevant GitHub issues that it fixes or partially addresses
  • This PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v0.1.0)
  • Considered corresponding documentation changes
    • N/A - this is a bug fix, no documentation changes needed
  • Contributed any configuration settings changes to the configuration reference
    • N/A - no configuration changes
  • Ran make notice if any dependencies have been added
    • N/A - no dependencies added

Changes Requiring Extra Attention

N/A - This is a straightforward bug fix with no security implications or new dependencies.

Release Note

Fixed exclude_tags domain configuration not being applied during crawl. Tags specified in exclude_tags for a domain are now correctly excluded from the document body.

@seanstory commented:

Customer tested the changes and confirmed the fix. I think we're good to merge, after an approval.

@seanstory seanstory merged commit e22528b into main Feb 6, 2026
5 checks passed
@seanstory seanstory deleted the seanstory/416-fix-exclude-tags-lookup branch February 6, 2026 20:48
github-actions bot pushed a commit that referenced this pull request Feb 6, 2026
### Closes #416

The `exclude_tags` configuration was not being applied correctly. The
config stores exclude_tags keyed by domain URL strings (e.g.,
`"https://example.com"`), but the lookup in `get_body_tag` was using the
URL object directly as the hash key instead of `url.site`.

This fix changes the lookup to use `url.site` (which returns the scheme
+ host as a string) to match how the config stores the keys.

### Checklists

#### Pre-Review Checklist
- [x] This PR does NOT contain credentials of any kind, such as API keys
or username/passwords (double check `crawler.yml.example` and
`elasticsearch.yml.example`)
- [x] This PR has a meaningful title
- [x] This PR links to all relevant GitHub issues that it fixes or
partially addresses
    - Fixes #416
- [x] This PR has a thorough description
- [x] Covered the changes with automated tests
- [ ] Tested the changes locally
- [x] Added a label for each target release version (example: `v0.1.0`)
- [x] Considered corresponding documentation changes
    - N/A - this is a bug fix, no documentation changes needed
- [x] Contributed any configuration settings changes to the
configuration reference
    - N/A - no configuration changes
- [x] Ran `make notice` if any dependencies have been added
    - N/A - no dependencies added

#### Changes Requiring Extra Attention

N/A - This is a straightforward bug fix with no security implications or
new dependencies.

### Release Note

Fixed `exclude_tags` domain configuration not being applied during
crawl. Tags specified in `exclude_tags` for a domain are now correctly
excluded from the document body.

github-actions bot commented Feb 6, 2026

💚 Backport PR(s) successfully created

Branch Result
0.4 #418

This backport PR will be merged automatically after passing CI.

seanstory added a commit that referenced this pull request Feb 6, 2026
Backports the following commits to 0.4:
 - Fix excluded tags lookup to use correct key type (#417)

Co-authored-by: Sean Story <sean.story@elastic.co>

Development

Successfully merging this pull request may close these issues.

exclude_tags not working?
