# jusquci

a french tokenizer for postgresql text search and spacy.

| text | tokens |
|------|--------|
| jusqu'ici=> | jusqu' ici => |
| celle-ci-->ici | celle -ci --> ici |
| lecteur-rice-x-s | lecteur-rice-x-s |
| peut-être--là | peut-être -- là |
| correcteur·rices | correcteur·rices |
| mais.maintenant | mais . maintenant |
| [re]lecteur.rice.s | [re]lecteur.rice.s |
| autre(s) | autre(s) |
| (autres) | ( autres ) |
| (autre(s)) | ( autre(s) ) |
| www.on-tenk.com | www.on-tenk.com |
| [@becker_1982,p.12] | [ @becker_1982 , p. 12 ] |
| oui..? | oui ..? |
| dedans/dehors | dedans / dehors |
| :happy: :) pour: | :happy: :) pour : |
| ô.ô^^=):-)xd | ô.ô ^^ =) :-) xd |

## postgresql extension

the primary role of this tokenizer is to be used as a text search parser in postgresql, hence it is proposed here as a postgresql extension.

```sh
make install
```

```sql
create extension jusquci;

select to_tsvector(
    'jusquci',
    'le quotidien,s''invente-t-il par mille.manière de braconner???'
);
```

## in python

the single provided function (tokenize) returns four lists (see the sketch after this list):

- tokens: a list of strings.
- token types: a list of token type IDs; the types are defined as an Enum (jusqucy.ttypes.TokenType).
- spaces: a list of booleans indicating whether each token is followed by a space (mostly for spaCy).
- is_sent_start: a list of booleans used to set Token.is_sent_start (based on the token types).
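
a minimal sketch of calling it directly, assuming tokenize is importable from the top of the jusqucy package and that the lists come back in the order listed above (the sample sentence is made up):

```python
import jusqucy

# unpack the four lists returned by the tokenizer
# (assumed order: tokens, token types, spaces, is_sent_start)
tokens, ttypes, spaces, is_sent_start = jusqucy.tokenize("jusqu'ici, tout va bien.")

print(tokens)         # token strings
print(ttypes)         # token type IDs (see jusqucy.ttypes.TokenType)
print(spaces)         # True where the token is followed by a space
print(is_sent_start)  # True where a token starts a sentence
```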

the tokenizer can be used in a spacy pipeline. it tokenizes the text and adds an attribute to the resulting Doc object, Doc._.ttypes, in which the token types are stored (assigning a type to each token individually would take much more time).

```python
import spacy
import jusqucy

nlp = spacy.blank('fr')
nlp.tokenizer = jusqucy.JusqucyTokenizer(nlp.vocab)

# or:
nlp = spacy.load(your_model, config={
    "nlp": {"tokenizer": {"@tokenizers": "jusqucy_tokenizer"}}
})
```
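
once plugged in, the tokenizer is used like any other spacy pipeline; a quick check (the sentence is just an example):

```python
# run the pipeline: the custom tokenizer splits the text
doc = nlp("jusqu'ici, tout va bien.")
print([token.text for token in doc])
```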

to get the token types:

```python
from jusqucy.ttypes import TokenType
for token, ttype in zip(doc, doc._.jusqucy_ttypes):
    print(token, TokenType[ttype])
```

## as a command line tool

to use jusquci as a simple command line tokenizer, just compile it with the makefile in the cli directory. the program reads a text from standard input and outputs tokens separated by spaces. it also adds newlines after strong punctuation marks (., ?, !).

## sources

## todo

- presquci, a dictionary for postgresql to be used with the parser.
- jusqucy, a python module.

## os

only tested on linux (debian) and postgresql 16.