tokenyze is a Python tokenizer.
It uses generators to tokenize an input string with look-ahead.
Tokens are either names or strings, and can be nested using brackets. A name is a run of consecutive non-whitespace characters. Brackets are special single-character tokens. Strings are delimited by either single or double quotes.
A backslash escapes a bracket or quote character so that it is treated literally.
The text
"fr33(the p1zza c@t)n0w_"
will result in the following (generated) token list:
['fr33', '(', 'the', 'p1zza', 'c@t', ')', 'n0w_']
The code uses a generator, getchars, to deliver characters from the text to the gettokens consumer. The consumer passes responsibility for parsing the text to either a whitespace consumer, eatwhitespace, or a token consumer, which in turn defers to a name consumer, eatname, or a string consumer, eatstring.
The gettokens consumer is itself a generator, which yields each token it finds in turn until there are no more tokens left.
$ python
>>> import tokenyze
>>> for token in tokenyze.gettokens("fr33(the p1zza c@t)n0w_"):
...     print(token)
...
fr33
(
the
p1zza
c@t
)
n0w_
>>>
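The generator pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the actual tokenyze source: the function names match the ones mentioned in this README, but their signatures, the Lookahead helper class, the set of bracket characters, and the choice to strip the quotes from string tokens are all assumptions made for the example.

```python
BRACKETS = "()[]{}"  # assumption: the README only shows "(" and ")"
QUOTES = "'\""
ESCAPE = "\\"


class Lookahead:
    """Hypothetical helper: peek at the next item without consuming it."""

    _EMPTY = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buf = self._EMPTY

    def peek(self):
        # Returns None once the underlying iterator is exhausted.
        if self._buf is self._EMPTY:
            self._buf = next(self._it, None)
        return self._buf

    def take(self):
        ch = self.peek()
        self._buf = self._EMPTY
        return ch


def getchars(text):
    """Deliver the characters of the text one at a time."""
    for ch in text:
        yield ch


def eatwhitespace(chars):
    """Consume and discard a run of whitespace."""
    while chars.peek() is not None and chars.peek().isspace():
        chars.take()


def eatname(chars):
    """Consume characters up to whitespace, a bracket, a quote, or the end."""
    out = []
    while True:
        ch = chars.peek()
        if ch is None or ch.isspace() or ch in BRACKETS or ch in QUOTES:
            break
        chars.take()
        if ch == ESCAPE and chars.peek() is not None:
            ch = chars.take()  # keep the escaped character literally
        out.append(ch)
    return "".join(out)


def eatstring(chars):
    """Consume a quoted string; the quotes are dropped (an assumption)."""
    quote = chars.take()
    out = []
    while True:
        ch = chars.take()
        if ch is None:
            raise ValueError("unterminated string")
        if ch == quote:
            break
        if ch == ESCAPE and chars.peek() is not None:
            ch = chars.take()
        out.append(ch)
    return "".join(out)


def gettokens(text):
    """Yield tokens one by one until the text is exhausted."""
    chars = Lookahead(getchars(text))
    while True:
        eatwhitespace(chars)
        ch = chars.peek()
        if ch is None:
            return
        if ch in BRACKETS:
            yield chars.take()
        elif ch in QUOTES:
            yield eatstring(chars)
        else:
            yield eatname(chars)
```

With this sketch, list(gettokens("fr33(the p1zza c@t)n0w_")) gives the same seven tokens as the session above. The single-character peek is what lets eatname stop just before a bracket without swallowing it.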
I have been using Python's shlex for a bit, but while it is fine for parsing text into names and strings, it falls short once brackets are added to the mix.
I needed something with a bit more lookahead, and writing generators in Python is always fun.
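To make the bracket limitation concrete, here is a quick check with shlex.split from the standard library, using the example text from above. It only covers shlex.split; other shlex configurations behave differently.

```python
import shlex

text = "fr33(the p1zza c@t)n0w_"

# shlex.split only breaks on whitespace here, so the brackets stay
# attached to the neighboring names instead of becoming tokens.
tokens = shlex.split(text)
print(tokens)  # ['fr33(the', 'p1zza', 'c@t)n0w_']
```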