[FR] Defining custom text analyzers #408
At work I use a customized build of tantivy-py that includes a custom tokenizer, so this is very doable. Are you proposing to add a simple whitespace tokenizer to the default build of tantivy-py? Or are you proposing to add some kind of feature to enable custom tokenizers? The problem with custom tokenizers as a general feature is that if they are written in rust, there is no good plugin facility for connecting them to tantivy-py at runtime. I have been working on exactly this kind of problem in #200. That work started out as just supporting an additional build option that would produce an additional package containing the Lindera tokenizer support, but I stalled on that because I wanted instead to try to solve the "runtime plugin" problem. If you are proposing to add just a basic whitespace tokenizer, that is probably fine. Make sure to include all the different unicode whitespace symbols, not only ascii spaces and newlines.
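As an aside on the unicode point: Rust's `char::is_whitespace` already covers the full Unicode `White_Space` property, so a whitespace tokenizer built on it handles non-breaking and ideographic spaces too, not just ASCII. A minimal standalone sketch (plain Rust, not wired into tantivy's `Tokenizer` trait):

```rust
// Minimal sketch: split on any Unicode whitespace, not just ' ', '\t', '\n'.
// This is plain Rust, not a tantivy Tokenizer implementation.
fn whitespace_split(text: &str) -> Vec<&str> {
    text.split(char::is_whitespace)    // char::is_whitespace covers U+00A0, U+3000, etc.
        .filter(|tok| !tok.is_empty()) // collapse runs of whitespace
        .collect()
}

fn main() {
    // U+3000 (ideographic space) separates "hello" and "world" here.
    let toks = whitespace_split("hello\u{3000}world  foo\tbar");
    assert_eq!(toks, vec!["hello", "world", "foo", "bar"]);
}
```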
Interesting! So, what I'd like to see is something in between: custom tokenizers/text analyzers built from the built-in Tantivy tokenizer/filter structs. Here's a snippet showing my 'ideal' library API design, inspired by the Rust example in the tokenizer documentation. I'm able to define and register a custom tokenizer that splits strings using Tantivy's RegexTokenizer and filters tokens with LowerCaser. Just an example. This design can probably be simplified further, e.g. with fewer struct instantiations.

```python
import tantivy

my_text_analyzer_builder = tantivy.TextAnalyzerBuilder()
my_text_analyzer_builder.add_tokenizer(tantivy.RegexTokenizer(r"'[^\s]+'"))
# simpler maybe: my_text_analyzer_builder.add_tokenizer(type="RegexTokenizer", pattern=r"'[^\s]+'")
# or: my_text_analyzer_builder.add_regex_tokenizer(r"'[^\s]+'")
my_text_analyzer_builder.add_token_filter(tantivy.LowerCaser())
my_text_analyzer = my_text_analyzer_builder.build()

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("text", tokenizer_name="custom_tokenizer", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)
index.register_tokenizer("custom_tokenizer", my_text_analyzer)
```

So I need a way to instantiate Tokenizer structs and a TextAnalyzerBuilder, and I need a method that lets me register my text analyzers with the index. As for the how, I'm afraid you'll need to forgive me my absolute lack of experience with creating Rust bindings. What are the implementation implications in your opinion?
Ok yep I understand. This is doable. Your pseudocode explains the idea clearly. The Rust sequence of calls to make and use a tokenizer is this (thinking out loud): we'd programmatically make a TextAnalyzer like

```rust
let ta = TextAnalyzer::builder(WhitespaceTokenizer::default())
    .filter(LowerCaser)
    .filter(OuterPunctuationFilter::new(vec!['#', '@']))
    .filter(PossessiveContractionFilter)
    .build();
```

So you would set up a Python interface to collect each of the filter parts, exactly as you have described with your TextAnalyzerBuilder pseudocode. The second part, as you also show, is to register it.

I think this is quite tractable, even if you have little rust experience. To a large extent you can mimic the existing code to get quite far along. No doubt you will eventually get stuck in rust esoterica, but this is nothing special, everybody does. There are true rust experts (not me) that we can call in for help if we get stuck, so I'm not too worried. The most valuable tip I can give you is to read through the pyo3 docs at least once, just so you know what kinds of information can be found where. But with what you've laid out I don't expect any complicated problems. PyO3 makes it super easy to make thin python wrappers for rust structs.
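To make "thin python wrappers for rust structs" concrete, here is a hypothetical PyO3 sketch; the `PyTextAnalyzer` class and its constructor are made-up names, while the builder chain and the `tokenizers().register(...)` call are the tantivy APIs used above:

```rust
use pyo3::prelude::*;
use tantivy::tokenizer::{LowerCaser, TextAnalyzer, WhitespaceTokenizer};

// Hypothetical thin wrapper: a #[pyclass] that simply owns a tantivy TextAnalyzer.
#[pyclass(name = "TextAnalyzer")]
struct PyTextAnalyzer {
    analyzer: TextAnalyzer,
}

#[pymethods]
impl PyTextAnalyzer {
    // Example constructor: whitespace splitting + lowercasing, using the same
    // builder chain as the snippet above.
    #[staticmethod]
    fn whitespace_lowercase() -> Self {
        let analyzer = TextAnalyzer::builder(WhitespaceTokenizer::default())
            .filter(LowerCaser)
            .build();
        PyTextAnalyzer { analyzer }
    }
}

// On the rust side, registering the analyzer with an index is then one call:
//     index.tokenizers().register("custom_tokenizer", analyzer);
```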
Thanks for the tips! I'll take a stab at this. I've started familiarizing myself with PyO3 over the weekend. Lots of new concepts for a Rust newbie like myself. My plan is to see how far I can get by following the example of the existing bindings (and asking Claude for help) until my tests pass.

In terms of design, I'm going back and forth between sticking close to the Tantivy API and simplifying things so as to keep the API surface small. Which of these two would you prefer to see?

Design 1

```python
my_text_analyzer = (
    tantivy.TextAnalyzerBuilder(WhitespaceTokenizer())
    .filter(LowerCaser())
    .build()
)
index.register_tokenizer("custom_tokenizer", my_text_analyzer)
```

Pros: Close correspondence with the Tantivy API; opens a path for truly custom tokenizers (as in, implemented in Rust) later on.

Cons: Each Tokenizer and Filter struct needs Python bindings. (Perhaps we should keep the new classes in a separate tantivy.tokenizer module.)

Design 2

```python
my_text_analyzer = (
    tantivy.TextAnalyzerBuilder()
    .tokenizer("regex", pattern=r"[^\s]+")  # OR: .tokenizer("regex", args={"pattern": r"[^\s]+"}) -- how feasible are variadic functions in Rust?
    .filter("lower")
    .build()
)
index.register_tokenizer("custom_tokenizer", my_text_analyzer)
```

Pros: Simple Python interface. Only two new Python classes: TextAnalyzerBuilder and TextAnalyzer. Less work to implement.

What criteria would you apply to this decision?
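On the variadic-functions question in Design 2: rust has no variadic functions, but PyO3 can expose optional keyword arguments, so per-tokenizer options (like `pattern`) would arrive as ordinary `Option` parameters and get matched on by name. A hedged sketch of how the string names might be dispatched to tantivy structs on the rust side (the function name is hypothetical, and the lowercase filter is hard-coded here rather than collected from `.filter("lower")`):

```rust
use tantivy::tokenizer::{LowerCaser, RegexTokenizer, TextAnalyzer, WhitespaceTokenizer};

// Hypothetical dispatch for Design 2: map a tokenizer name (plus optional
// per-tokenizer options, here just `pattern`) onto concrete tantivy structs.
fn analyzer_for(name: &str, pattern: Option<&str>) -> Option<TextAnalyzer> {
    match name {
        "whitespace" => Some(
            TextAnalyzer::builder(WhitespaceTokenizer::default())
                .filter(LowerCaser)
                .build(),
        ),
        "regex" => {
            // RegexTokenizer::new validates the pattern and can fail.
            let tokenizer = RegexTokenizer::new(pattern?).ok()?;
            Some(TextAnalyzer::builder(tokenizer).filter(LowerCaser).build())
        }
        _ => None, // unknown tokenizer name
    }
}

fn main() {
    assert!(analyzer_for("regex", Some(r"[^\s]+")).is_some());
    assert!(analyzer_for("unknown", None).is_none());
}
```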
I prefer Design 1. It is closer to how it is done in the rust code, and also the path to custom tokenizers is valuable. The con of Design 1, that each tokenizer and filter struct needs its own binding, seems acceptable to me.
As a general design principle, so far I have preferred making the python interface look as similar to the underlying rust interface as possible, for better or worse. My experience with other python wrappers, such as APSW (the sqlite wrapper), is that keeping the general structure and API the same makes it easy to know how to use the wrapper if you know the underlying library, and vice versa. However, I'm not against deviation if there is a valuable reason to do so.
Got a PR ready! (Sorry for going quiet. I knew Rust would be hard to learn, but I wasn't quite expecting to be smacked over the head with the realization that, as a sheltered Python dev of 8 years, I still know nothing, haha. It's been quite the fun crisis of confidence in my own ability.) But I'm back, and I'm happy to report that I have something that feels relatively idiomatic, passes tests, and does what I sketched out above.
The PR looks good 👍🏼 Yeah, rust is a tricky one to learn. But to be fair, learning both rust and pyo3 wrapping at the same time is kind of hard mode. It does get easier over time.
Hi, thanks for maintaining these bindings.
When defining my index schema I'd like to be able to specify a custom tokenizer/text analyzer using these core concepts in the Tokenizer documentation. The problem I'm trying to solve is that the default tokenizer does too much (e.g. dropping long terms), the raw tokenizer does too little (nothing), and en_stem is only appropriate for English. In the simplest case, all I need is a whitespace tokenizer, with me taking care of string normalization in Python.
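For reference, the built-in analyzers mentioned here roughly correspond to the following tantivy chains; this is reconstructed from memory of tantivy's defaults, so treat the exact filter set and the 40-character limit as an assumption rather than a spec:

```rust
use tantivy::tokenizer::{
    Language, LowerCaser, RemoveLongFilter, SimpleTokenizer, Stemmer, TextAnalyzer,
};

fn main() {
    // "default" (approximately): word splitting, drop very long tokens, lowercase.
    let _default_like = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveLongFilter::limit(40)) // the "dropping long terms" behaviour
        .filter(LowerCaser)
        .build();

    // "en_stem" (approximately): the same chain plus an English stemmer.
    let _en_stem_like = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveLongFilter::limit(40))
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::English))
        .build();

    // "raw" simply emits the entire field value as a single token (RawTokenizer).
}
```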
What are your thoughts about this? I'm new to Rust and unfamiliar with pyo3, but if this seems like a good first issue, I'll try taking a stab at it.