exclude_tags not working? #416

@nb71

Bug Description

We are trying to use exclude_tags: to keep the header and footer out of the indexed content, but their text is still inserted into the document body.

To Reproduce

Steps to reproduce the behavior:

  1. Create config including:
domains:
  - url: https://test.example.com
    exclude_tags:
      - address
      - header
  1. Create a test page:
<html>
<head>
    <title> Eclude tag test </title>
<meta name="keywords" content="excludetag"/>
</head>
<body>
<header>
HEADER TEXT Should not be indexed
</header>
<nav>
    <ul>
        <li><a href="/about">About</a></li>
        <li><a href="/contact">Contact</a></li>
    </ul>
</nav>
<h2 >tittle</h2>
BODY
<article>
    <h1>Introduction to HTML</h1>
    <p>HTML is a markup language that is used for creating web pages.</p>
</article>
<address>
main street 123 to be ignored too
</address>
<footer>
FOOOOOOOOOOOOOOOOOOOOOOTERRRRRRRRR
</footer>
</body>
</html>
  1. Confirm the config appears in the logs:
domains=[{:url=>\"https://test.example.com\", :exclude_tags=>[\"address\", \"header\"]}]; 
  1. Crawl and check the document in Elasticsearch:
 "body": [
      "HEADER TEXT Should not be indexed About Contact tittle BODY Introduction to HTML HTML is a markup language that is used for creating web pages. main street 123 to be ignored too FOOOOOOOOOOOOOOOOOOOOOOTERRRRRRRRR"
    ],

Expected behavior

Header and address text should be excluded from body text.
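
To make the expectation concrete, here is a rough sketch of the behavior in plain Python using the standard-library html.parser. It is not the crawler's implementation; `TextExtractor`, `extract_body`, and the trimmed `PAGE` are hypothetical names used only for illustration, and the sketch assumes excluded tags are ordinary elements with closing tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text inside <body>, skipping elements whose tag is excluded."""

    def __init__(self, exclude_tags):
        super().__init__()
        self.exclude_tags = set(exclude_tags)
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.exclude_tags:
            # Assumption: excluded tags are non-void and have a closing tag.
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.exclude_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def extract_body(html, exclude_tags=()):
    """Return whitespace-normalized body text with excluded tags dropped."""
    parser = TextExtractor(exclude_tags)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Trimmed version of the test page from the reproduction steps.
PAGE = """<html><head><title>Exclude tag test</title></head>
<body>
<header>HEADER TEXT Should not be indexed</header>
<h2>tittle</h2>
BODY
<address>main street 123 to be ignored too</address>
<footer>FOOTER</footer>
</body></html>"""

print(extract_body(PAGE, ["address", "header"]))
```

With `address` and `header` excluded, only the heading, the bare body text, and the footer remain. The footer stays because it is not listed in exclude_tags in the config above.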

Additional context

Using <address data-elastic-exclude> on the element itself works as expected; only the exclude_tags config setting has no effect.

product_version
0.4.2
