[FR] Defining custom text analyzers #408
At work I use a customized build of tantivy-py that includes a custom tokenizer, so this is very doable. Are you proposing to add a simple whitespace tokenizer to the default build of tantivy-py? Or are you proposing to add some kind of feature to enable custom tokenizers? The problem with custom tokenizers as a general feature is that if they are written in rust, there is no good plugin facility for connecting them to tantivy-py at runtime. I have been working on exactly this kind of problem in #200. That work started out as just supporting an additional build option that would produce an additional package containing the Lindera tokenizer support, but I stalled on that because I wanted instead to try to solve the "runtime plugin" problem. If you are proposing to add just a basic whitespace tokenizer, that is probably fine. Make sure to include all the different unicode whitespace symbols, not only ascii spaces and newlines.
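As an aside on the unicode point: Rust's `char::is_whitespace` already covers the full Unicode `White_Space` property, so a whitespace tokenizer built on it handles non-breaking and ideographic spaces too, not just ASCII. A minimal standalone sketch (plain Rust, not wired into tantivy's `Tokenizer` trait):

```rust
// Minimal sketch: split on any Unicode whitespace, not just ' ', '\t', '\n'.
// This is plain Rust, not a tantivy Tokenizer implementation.
fn whitespace_split(text: &str) -> Vec<&str> {
    text.split(char::is_whitespace)    // char::is_whitespace covers U+00A0, U+3000, etc.
        .filter(|tok| !tok.is_empty()) // collapse runs of whitespace
        .collect()
}

fn main() {
    // U+3000 (ideographic space) separates "hello" and "world" here.
    let toks = whitespace_split("hello\u{3000}world  foo\tbar");
    assert_eq!(toks, vec!["hello", "world", "foo", "bar"]);
}
```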
Interesting! So, what I'd like to see is something in between: custom tokenizers/text analyzers built from the built-in Tantivy tokenizer/filter structs. Here's a snippet showing my 'ideal' library API design, inspired by the Rust example in the tokenizer documentation. I'm able to define and register a custom tokenizer that splits strings using Tantivy's RegexTokenizer and filters tokens with LowerCaser. Just an example. This design can probably be simplified further, e.g. with fewer struct instantiations.

```python
import tantivy

my_text_analyzer_builder = tantivy.TextAnalyzerBuilder()
my_text_analyzer_builder.add_tokenizer(tantivy.RegexTokenizer(r"'[^\s]+'"))
# simpler maybe: my_text_analyzer_builder.add_tokenizer(type="RegexTokenizer", pattern=r"'[^\s]+'")
# or: my_text_analyzer_builder.add_regex_tokenizer(r"'[^\s]+'")
my_text_analyzer_builder.add_token_filter(tantivy.LowerCaser())
my_text_analyzer = my_text_analyzer_builder.build()

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("text", tokenizer_name="custom_tokenizer", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)
index.register_tokenizer("custom_tokenizer", my_text_analyzer)
```

So I need a way to instantiate Tokenizer structs and a TextAnalyzerBuilder, and I need a method that lets me register my text analyzers with the index. As for the how, I'm afraid you'll need to forgive me my absolute lack of experience with creating Rust bindings. What are the implementation implications in your opinion?
Ok yep I understand. This is doable. Your pseudocode explains the idea clearly. The Rust sequence of calls to make and use a tokenizer is this (thinking out loud): we'd programmatically make a TextAnalyzer like

```rust
let ta = TextAnalyzer::builder(WhitespaceTokenizer::default())
    .filter(LowerCaser)
    .filter(OuterPunctuationFilter::new(vec!['#', '@']))
    .filter(PossessiveContractionFilter)
    .build();
```

So you would set up a Python interface to collect each of the filter parts, exactly as you have described with your TextAnalyzerBuilder pseudocode. The second part, as you also show, is to register it.

I think this is quite tractable, even if you have little rust experience. To a large extent you can mimic the existing code to get quite far along. No doubt you will eventually get stuck in rust esoterica, but this is nothing special, everybody does. There are true rust experts (not me) that we can call in for help if we get stuck, so I'm not too worried. The most valuable tip I can give you is to read through the pyo3 docs at least once, just so you know what kinds of information can be found where. But with what you've laid out I don't expect any complicated problems. PyO3 makes it super easy to make thin python wrappers for rust structs.
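To make "thin python wrappers for rust structs" concrete, here is a hypothetical PyO3 sketch; the `PyTextAnalyzer` class and its constructor are made-up names, while the builder chain and the `tokenizers().register(...)` call are the tantivy APIs used above:

```rust
use pyo3::prelude::*;
use tantivy::tokenizer::{LowerCaser, TextAnalyzer, WhitespaceTokenizer};

// Hypothetical thin wrapper: a #[pyclass] that simply owns a tantivy TextAnalyzer.
#[pyclass(name = "TextAnalyzer")]
struct PyTextAnalyzer {
    analyzer: TextAnalyzer,
}

#[pymethods]
impl PyTextAnalyzer {
    // Example constructor: whitespace splitting + lowercasing, using the same
    // builder chain as the snippet above.
    #[staticmethod]
    fn whitespace_lowercase() -> Self {
        let analyzer = TextAnalyzer::builder(WhitespaceTokenizer::default())
            .filter(LowerCaser)
            .build();
        PyTextAnalyzer { analyzer }
    }
}

// On the rust side, registering the analyzer with an index is then one call:
//     index.tokenizers().register("custom_tokenizer", analyzer);
```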
Thanks for the tips! I'll take a stab at this. I've started familiarizing myself with PyO3 over the weekend. Lots of new concepts for a Rust newbie like myself. My plan is to see how far I can get by following the example of the existing bindings (and asking Claude for help) until my tests pass.

In terms of design, I'm going back and forth between sticking close to the Tantivy API and simplifying things so as to keep the API surface small. Which of these two would you prefer to see?

Design 1

```python
my_text_analyzer = (
    tantivy.TextAnalyzerBuilder(WhitespaceTokenizer())
    .filter(LowerCaser())
    .build()
)
index.register_tokenizer("custom_tokenizer", my_text_analyzer)
```

Pros: Close correspondence with the Tantivy API; opens a path for truly custom tokenizers (as in, implemented in Rust) later on.

Cons: Each Tokenizer and Filter struct needs Python bindings. (Perhaps we should keep the new classes in a separate tantivy.tokenizer module.)

Design 2

```python
my_text_analyzer = (
    tantivy.TextAnalyzerBuilder()
    .tokenizer("regex", pattern=r"[^\s]+")  # OR: .tokenizer("regex", args={"pattern": r"[^\s]+"}) -- how feasible are variadic functions in Rust?
    .filter("lower")
    .build()
)
index.register_tokenizer("custom_tokenizer", my_text_analyzer)
```

Pros: Simple Python interface. Only two new Python classes: TextAnalyzerBuilder and TextAnalyzer. Less work to implement.

What criteria would you apply to this decision?
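On the variadic-functions question in Design 2: rust has no variadic functions, but PyO3 can expose optional keyword arguments, so per-tokenizer options (like `pattern`) would arrive as ordinary `Option` parameters and get matched on by name. A hedged sketch of how the string names might be dispatched to tantivy structs on the rust side (the function name is hypothetical, and the lowercase filter is hard-coded here rather than collected from `.filter("lower")`):

```rust
use tantivy::tokenizer::{LowerCaser, RegexTokenizer, TextAnalyzer, WhitespaceTokenizer};

// Hypothetical dispatch for Design 2: map a tokenizer name (plus optional
// per-tokenizer options, here just `pattern`) onto concrete tantivy structs.
fn analyzer_for(name: &str, pattern: Option<&str>) -> Option<TextAnalyzer> {
    match name {
        "whitespace" => Some(
            TextAnalyzer::builder(WhitespaceTokenizer::default())
                .filter(LowerCaser)
                .build(),
        ),
        "regex" => {
            // RegexTokenizer::new validates the pattern and can fail.
            let tokenizer = RegexTokenizer::new(pattern?).ok()?;
            Some(TextAnalyzer::builder(tokenizer).filter(LowerCaser).build())
        }
        _ => None, // unknown tokenizer name
    }
}

fn main() {
    assert!(analyzer_for("regex", Some(r"[^\s]+")).is_some());
    assert!(analyzer_for("unknown", None).is_none());
}
```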
I prefer Design 1. It is closer to how it is done in the rust code, and also the path to custom tokenizers is valuable. The con of Design 1, that each tokenizer and filter struct needs its own binding, seems acceptable to me.
As a general design principle, so far I have preferred making the python interface look as similar to the underlying rust interface as possible, for better or worse. My experience with other python wrappers, such as APSW (the sqlite wrapper), is that keeping the general structure and API the same makes it easy to know how to use the wrapper if you know the underlying library, and vice versa. However, I'm not against deviation if there is a valuable reason to do so.
Got a PR ready! (Sorry for going quiet. I knew Rust would be hard to learn, but I wasn't quite expecting to be smacked over the head with the realization that, as a sheltered Python dev of 8 years, I still know nothing, haha. It's been quite the fun crisis of confidence in my own ability.) But I'm back, and I'm happy to report that I have something that feels relatively idiomatic, passes tests, and does what I sketched out above.
The PR looks good 👍🏼 Yeah, rust is a tricky one to learn. But to be fair, learning both rust and pyo3 wrapping at the same time is kind of hard mode. It does get easier over time.
Hi, thanks for maintaining these bindings.
When defining my index schema I'd like to be able to specify a custom tokenizer/text analyzer using these core concepts in the Tokenizer documentation. The problem I'm trying to solve is that the default tokenizer does too much (e.g. dropping long terms), the raw tokenizer does too little (nothing), and en_stem is only appropriate for English. In the simplest case, all I need is a whitespace tokenizer, with me taking care of string normalization in Python.
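For reference, the built-in analyzers mentioned here roughly correspond to the following tantivy chains; this is reconstructed from memory of tantivy's defaults, so treat the exact filter set and the 40-character limit as an assumption rather than a spec:

```rust
use tantivy::tokenizer::{
    Language, LowerCaser, RemoveLongFilter, SimpleTokenizer, Stemmer, TextAnalyzer,
};

fn main() {
    // "default" (approximately): word splitting, drop very long tokens, lowercase.
    let _default_like = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveLongFilter::limit(40)) // the "dropping long terms" behaviour
        .filter(LowerCaser)
        .build();

    // "en_stem" (approximately): the same chain plus an English stemmer.
    let _en_stem_like = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveLongFilter::limit(40))
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::English))
        .build();

    // "raw" simply emits the entire field value as a single token (RawTokenizer).
}
```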
What are your thoughts about this? I'm new to Rust and unfamiliar with pyo3, but if this seems like a good first issue, I'll try taking a stab at it.