jusquci -- a french tokenizer for postgresql text search and spacy.
text | tokens |
---|---|
jusqu'ici=> | jusqu' ici => |
celle-ci-->ici | celle -ci --> ici |
lecteur-rice-x-s | lecteur-rice-x-s |
peut-être--là | peut-être -- là |
correcteur·rices | correcteur·rices |
mais.maintenant | mais . maintenant |
[re]lecteur.rice.s | [re]lecteur.rice.s |
autre(s) | autre(s) |
(autres) | ( autres ) |
(autre(s)) | ( autre(s) ) |
www.on-tenk.com | www.on-tenk.com |
[@becker_1982,p.12] | [ @becker_1982 , p. 12 ] |
oui..? | oui ..? |
dedans/dehors | dedans / dehors |
:happy: :) pour: | :happy: :) pour : |
ô.ô^^=):-)xd | ô.ô ^^ =) :-) xd |
the primary role of this tokenizer is to be used as a text search parser in postgresql, hence it's proposed here as a postgresql extension.
make install
create extension jusquci;
select to_tsvector(
    'jusquci',
    'le quotidien,s''invente-t-il par mille.manière de braconner???'
);
the python module (jusqucy) provides a single function, `tokenize`, which returns four lists:
- tokens: a list of strings.
- token types: a list of token type IDs; the types are defined as an Enum (`jusqucy.ttypes.TokenType`).
- spaces: a list of boolean values that indicate whether each token is followed by a space (mostly for spaCy).
- is_sent_start: a list of boolean values used to set `Token.is_sent_start` (based on the token types).
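
as an illustration of how these parallel lists fit together, here is a minimal sketch that rebuilds the original text from tokens and spaces, and groups tokens into sentences using is_sent_start. the sample values are written by hand from the `mais.maintenant` row of the table above (they are not captured output of `tokenize`), the is_sent_start flags are assumed, and the helper names `rebuild_text` and `split_sentences` are hypothetical.

```python
def rebuild_text(tokens, spaces):
    """Join tokens, adding a space after each token flagged in `spaces`."""
    return "".join(tok + (" " if sp else "") for tok, sp in zip(tokens, spaces))


def split_sentences(tokens, is_sent_start):
    """Group tokens into sentences; a True flag opens a new sentence."""
    sentences = []
    for tok, starts in zip(tokens, is_sent_start):
        if starts or not sentences:
            sentences.append([])
        sentences[-1].append(tok)
    return sentences


# hand-written sample modelled on the "mais.maintenant" table row;
# the is_sent_start values are an assumption made for this illustration
tokens = ["mais", ".", "maintenant"]
spaces = [False, False, False]
is_sent_start = [True, False, True]

print(rebuild_text(tokens, spaces))
print(split_sentences(tokens, is_sent_start))
```

keeping the lists parallel (one entry per token) means any of them can be zipped with the tokens without further bookkeeping.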
the tokenizer can be used in a spacy pipeline. it tokenizes the text and adds an attribute to the resulting `Doc` object, `Doc._.ttypes`, in which the token types are stored (assigning a type to each token individually takes much more time).
import spacy
import jusqucy
nlp = spacy.blank('fr')
nlp.tokenizer = jusqucy.JusqucyTokenizer(nlp.vocab)
# or:
nlp = spacy.load(your_model, config={
    "nlp": {"tokenizer": {"@tokenizers": "jusqucy_tokenizer"}}
})
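to get the token types:
from jusqucy.ttypes import TokenType
doc = nlp("jusqu'ici=>")  # nlp: the pipeline configured above
for token, ttype in zip(doc, doc._.jusqucy_ttypes):
    # ttype is a type ID, so look the Enum member up by value
    print(token, TokenType(ttype))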
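
assuming the type IDs are integers, they can be mapped back to Enum members by value and then counted. the sketch below uses a hypothetical stand-in enum, `DemoType`, with invented members and IDs; the real members of `jusqucy.ttypes.TokenType` are not listed here and may differ.

```python
from collections import Counter
from enum import IntEnum


class DemoType(IntEnum):
    """Hypothetical stand-in for TokenType; names and IDs are invented."""
    WORD = 0
    PUNCT = 1
    EMOTICON = 2


def count_types(ttypes, enum_cls=DemoType):
    """Count how many tokens carry each type, keyed by member name."""
    return Counter(enum_cls(t).name for t in ttypes)


# invented type IDs for five tokens
sample_ttypes = [0, 1, 0, 2, 0]
counts = count_types(sample_ttypes)
print(counts.most_common())
```

with the real module, `doc._.jusqucy_ttypes` and `TokenType` would presumably take the place of the sample list and the stand-in enum.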
to use jusquci as a simple command line tokenizer (that reads from `stdin`), just compile it with the makefile in the `cli` directory.
the program reads text from standard input and outputs tokens separated by spaces. it also adds a newline after strong punctuation signs (`.`, `?`, `!`).
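
this output format (tokens separated by spaces, a newline after strong punctuation) is easy to consume from another program. a minimal sketch, assuming the output has already been captured as a string; the function name `parse_cli_output` and the sample string are illustrative and were not produced by the actual program.

```python
def parse_cli_output(output):
    """Turn space-separated token lines into one token list per line."""
    return [line.split() for line in output.splitlines() if line.strip()]


# hand-written sample in the described format (not real program output)
sample_output = "mais .\nmaintenant\n"

for sentence_tokens in parse_cli_output(sample_output):
    print(sentence_tokens)
```

in a shell pipeline, the same function could be fed with `sys.stdin.read()` instead of the sample string.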
- presquci, a dictionary for postgresql to be used with the parser.
- jusqucy, a python module.
only tested on linux (debian) and postgresql 16