Snippet generator documentation is incorrect #420

Open
kevinhu opened this issue Jan 28, 2025 · 0 comments
Labels: bug, help wanted

The snippet generator example suggests that the offsets produced by snippet.highlighted() can be used for slicing the text of the corresponding document:

highlights = snippet.highlighted()
first_highlight = highlights[0]
assert first_highlight.start == 93
assert first_highlight.end == 97
assert hit_text[first_highlight.start:first_highlight.end] == "days"

However, looking at the source implementation of to_html, these offsets are relative to the snippet's fragment and not the document text: https://docs.rs/tantivy/latest/src/tantivy/snippet/mod.rs.html#149
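To make the point concrete, here is a rough Python paraphrase of what the linked `to_html` appears to do. It is a simplified sketch, not the library code: `render_html` is a hypothetical name, overlapping ranges are not collapsed, and the escaping via `html.escape` only approximates the Rust implementation. It also assumes the offsets are UTF-8 byte offsets, as Rust string ranges are. The thing to notice is that every slice is taken from the fragment, never from the document text.

```python
import html
from typing import List, Tuple


def render_html(fragment: str, ranges: List[Tuple[int, int]]) -> str:
    """Wrap each highlighted range of `fragment` in <b> tags.

    `ranges` are (start, end) byte offsets into the fragment itself.
    Assumption: ranges are sorted and do not overlap.
    """
    data = fragment.encode("utf-8")
    parts = []
    cursor = 0
    for start, end in ranges:
        # Plain text between the previous highlight and this one.
        parts.append(html.escape(data[cursor:start].decode("utf-8"), quote=False))
        # The highlighted term, sliced from the fragment.
        term = html.escape(data[start:end].decode("utf-8"), quote=False)
        parts.append("<b>" + term + "</b>")
        cursor = end
    # Remainder of the fragment after the last highlight.
    parts.append(html.escape(data[cursor:].decode("utf-8"), quote=False))
    return "".join(parts)


rendered = render_html("Teach a man to fish and he will eat", [(15, 19)])
print(rendered)
```

The offsets only match document positions when the fragment happens to start at the beginning of the document, which is the case in the documentation example.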

Because the ranges are relative to the fragment rather than the document, slicing the document text with them returns the wrong text whenever the fragment does not start at the beginning of the document. The second document in the reproduction below shows this:

# %%
from tantivy import (
    Document,
    Index,
    SchemaBuilder,
    SnippetGenerator,
)

doc_schema = SchemaBuilder().add_text_field("text", stored=True).build()
index = Index(doc_schema)
writer = index.writer()

doc_1 = Document()
doc_1.add_text("text", "Teach a man to fish and he will eat for the rest of his life.")
_ = writer.add_document(doc_1)

doc_2 = Document()
doc_2.add_text(
    "text",
    """He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish. In the first forty days a boy had been with him. But after forty days without a fish the boy's parents had told him that the old man was now definitely and finally salao, which is the worst form of unlucky, and the boy had gone at their orders in another boat which caught three good fish the first week. It made the boy sad to see the old man come in each day with his skiff empty and he always went down to help him carry either the coiled lines or the gaff and harpoon and the sail that was furled around the mast. The sail was patched with flour sacks and, furled, it looked like the flag of permanent defeat.

The old man was thin and gaunt with deep wrinkles in the back of his neck. The brown blotches of the benevolent skin cancer the sun brings from its reflection on the tropic sea were on his cheeks. The blotches ran well down the sides of his face and his hands had the deep-creased scars from handling heavy fish on the cords. But none of these scars were fresh. They were as old as erosions in a fishless desert.""",
)
_ = writer.add_document(doc_2)

_ = writer.commit()
_ = writer.wait_merging_threads()
index.reload()


def search(query_string: str) -> None:
    query = index.parse_query(query_string, ["text"])
    searcher = index.searcher()

    doc_results = searcher.search(query, limit=10).hits

    snippet_generator = SnippetGenerator.create(searcher, query, doc_schema, "text")

    for _, doc_address in doc_results:
        doc = searcher.doc(doc_address)

        doc_text = doc.get_first("text")

        if not doc_text:
            raise ValueError("Doc text not found")

        snippet = snippet_generator.snippet_from_doc(doc)

        print("Snippet HTML: ", snippet.to_html())

        for snippet_range in snippet.highlighted():
            print("Highlighted: ", doc_text[snippet_range.start : snippet_range.end])


search("fish")
"""
Snippet HTML:  Teach a man to <b>fish</b> and he will eat for the rest of his life
Highlighted:  fish
Snippet HTML:  He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a <b>fish</b>. In the first forty days a
Highlighted:  fish
"""

search("heavy fish")
"""
Snippet HTML:  the tropic sea were on his cheeks. The blotches ran well down the sides of his face and his hands had the deep-creased scars from handling <b>heavy</b> <b>fish</b>
Highlighted:  orty 
Highlighted:  ays 
Snippet HTML:  Teach a man to <b>fish</b> and he will eat for the rest of his life
Highlighted:  fish
"""
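
Until the documentation or the API is changed, one possible workaround is to locate the fragment inside the document text and shift the ranges by that position. The sketch below is plain Python and does not call tantivy. `to_document_ranges` and `slice_utf8` are hypothetical helpers, and it assumes that the fragment text is available (for example from a `fragment()` accessor on the snippet, if the installed version exposes one), that the fragment is an exact substring of the stored text, and that the offsets are UTF-8 byte offsets.

```python
from typing import List, Tuple


def to_document_ranges(
    doc_text: str,
    fragment: str,
    fragment_ranges: List[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Shift fragment-relative byte ranges so they index into the document."""
    doc_bytes = doc_text.encode("utf-8")
    # Caveat: find() returns the first match, so a fragment whose text
    # occurs more than once in the document may be located at the wrong place.
    base = doc_bytes.find(fragment.encode("utf-8"))
    if base < 0:
        raise ValueError("fragment not found in document text")
    return [(base + start, base + end) for start, end in fragment_ranges]


def slice_utf8(text: str, start: int, end: int) -> str:
    """Slice `text` by UTF-8 byte offsets rather than by code points."""
    return text.encode("utf-8")[start:end].decode("utf-8")


doc_text = (
    "They were old scars. His hands had the deep-creased scars "
    "from handling heavy fish on the cords."
)
fragment = "scars from handling heavy fish"
fragment_ranges = [(20, 25), (26, 30)]  # "heavy" and "fish" within the fragment

# Slicing the document with fragment-relative offsets gives the wrong text.
naive = [doc_text[start:end] for start, end in fragment_ranges]

# Shifting by the fragment's position in the document gives the right text.
doc_ranges = to_document_ranges(doc_text, fragment, fragment_ranges)
highlighted = [slice_utf8(doc_text, start, end) for start, end in doc_ranges]
print(highlighted)
```

A cleaner fix would be for the snippet to expose the fragment's offset in the document directly, which would avoid the substring search altogether.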

cjrh added the bug and help wanted labels on Jan 28, 2025