-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Contributing to your re1.5 project #9
Comments
Amir, I should start with saying that I'd appreciate contributions to this project. At the same time, the reason it exists is foremost because we needed very lean regexp implementation for https://github.com/micropython/micropython . So, the idea was to take re1, which is mostly demo/teaching/experimentation project, and make it a bit more suitable for real-world usage, while still keeping it as minimal as reasonably possible (I do accept that code size would grow, just the growth should be controlled). So, if you think you appreciate that aim, and other traits of re1.5 look good for you, getting patches would be very nice. |
Let's go over specific points you list:
Well, I'm indifferent to this, as long as it doesn't affect library size. If you implemented them, would be nice to merge.
If these are optional (require uncommenting/special make target), sounds good.
If this decrease, not increase code size, sounds good.
Would need to look into this. But for MicroPython, we need to have dependency injection for memory allocation, i.e. be able to pre-size area needed for compiled regex, allocate it in client, and then pass pointer to fill in to compile API.
Well, I guess this can go in "level 2 patches" category, to be done after "level 1" ;-). I'd be glad to make re1.5 be better in some aspects than other engines, but if that increases code size, I'd ask it to be #ifdef'able.
Well, multiple backends is a nice feature inherited from re1, and I'd like to maintain it, as that's clearly a differentiating point. But of course, it requires extra maintenance effort. |
That's very nice, feel free to submit first thing ;-).
Level 2 feature.
Yes, apparently that's the way to do it. Key point is that these "counters" should be part of data state, not code state, and should be maintained per-match (otherwise it's not thread-safe, etc.). Other aspects need experimenting, like if it would make sense to have common data "area" and common indexes for sub's and counters, or they should be separate.
Would be very nice to implement, but would need to be lean and optional (i.e. make "nextchar" function to be either a define/inline function/normal function).
Funny corner case, didn't even think about it ;-). Are you sure empty charclasses are banned?
Not sure I got this, do you mean treating unknown escape sequences as a normal char?
Would be nice to have this #ifdef'able
Well, that will cost us precious bytes in MicroPython. I have a magic spell for all such cases: #ifdef ;-)
Well, initial parser was hand-written to be as small as possible, without too much attention to error handling. It might improve since that, but I still guess it may miss to report errors or even crash. Original parser was kinda left for reference, and indeed I'm not sure I updated it consistently. So, well, if your independent review shows that hand-written parser is good enough up to dropping original one, we can do it ;-).
Again, our idea is that memory is dependency-injected into all functions. Would need to look at specific cases to see what you mean.
I'm positive about this, and if it's only a single global var which is culprit, it doesn't even require #ifdef'ing. |
Summing up, I'm positive about almost all changes you propose, but they would have different priorities for merging, and some may need more detailed look to see if they worth to have them in re1.5 upstream or if they can be specific to your project. Thanks! |
Just some concentrated comments on some of your points:
I will check if I can reimplement it in a compatible way.
BTW, I found that the code in thompson.c is redundant. The code of pick.c is essentially the same, plus the added saved submatches. I say "essentially" because there are two minor differences:
So to ease future maintenance, thompson.c can be removed, and a comment added to pike.c, like:
Empty I have also implemented a fix to support '
I mean being able to put
I think such an implementation has the potential to save code (no need to implement named class in the VM)... I may try it to see. BTW, in main.c the ifdef's are indented, and in compile.c it is not. Can I use non-indented ifdef's?
In MicroPython uses only recursiveloop.c, which doesn't contain memory allocation, so it doesn't matter to it. On the other hand, pike.c malloc's memory, some of it is not freed (or reused on later invocations). main.c also alloc's memory that is not getting freed. Of course this doesn't really matter,. |
Yes. Indented #ifdef likely leaked from micropython, where we have many ifdef's, and having them indented really helps readability. (will answer rest of question as time permits) |
Well, about such case I know, but always took care to allow it to be specified at the beginning.
So, as our discussion in the latest PRs showed, it may be not immediately obvious how this or that change affects code size, especially if you're not get used to think in terms of mere bytes. But rule of thumb is that data-driven approach makes it more easier to extend, but likely code size bigger. Hardcoding simple cases makes them smaller and also more performant (CPU branch predictor works better). @dpgeorge implemented that code, see #2 and #3, and as you can see, my intuitive assessment of what would be shorter faulted there. All in all, I wrote the above not to block you to add table-driven, extensible implementation if you really need extensibility, just please check code sizes, and add #ifdef between old and new code then if warranted. |
Sorry for the delay with looking into this matter, I pooled time to look into code myself to skip asking "dumb" question, but so far wasn't able, so let's discuss it speculatively. First of all, I hope you read Russ Cox' articles which discuss regex engines implementation choices/strategies, they talk about Thompson algo, and then about improvement over it, Pike algo. re1 is demonstration of all these algos, and by default I wanted to preserve it while makes sense. Then, how much "they're are essentially the same"? Can for example pike.c be made to produce the same code as thompson.c with simple, clean #ifdef's? If so, I'm in favor of merging them too. But just dropping thompson.c doesn't seem too good - it's simpler and uses less resources. Granted, most high-level APIs requires not just boolean match/nomatch status, but captures as result, but there're still good usages of yes/no algo, and then it's usually high-volume data, where getting highest performance is important. That's ideas I have about it, if you have other arguments, please share them. |
Of course.
I realized they are essentially the same by adding, as an exercise, sub capturing to thompson.c. When I then did a diff (ignoring whitespace), to my surprise the diff was very small, and it was in addthread() in these lines:
In pike.c they are under
Yes. The sub capturing just needs to be ifdef'ed.
There are 10 such lines in pike.c.
Several other lines need to be ifdef'ed too.
It uses less resources only because it doesn't do sub capturing, and because it does a capturing of the start of the whole match in a buggy way (that fixing it will add more code). Note that any of the other VM implementations can be fixed to take less resources this way, by just omitting capturing!
BTW, one of the differences between those VM implementations is the amount of stack space that each uses. Hence I wonder why recursiveloop() was chosen for MicroPython(), since in the pikevm the needed "repetition" space is allocated on the heap. However, for simple regexes it indeed doesn't matter. |
I'd think that addthread() example above could be massaged into such form too (i.e. not ifdef each invocation, but the declaration).
Yes, except that yet needs to be done, while for thompson/pike case it's already done, and as I mentioned above, it's all above preserving somebody's (hard) work while it makes sense. Amir, you now know code very well, so I'll trust your opinion if you think it makes sense to just drop thompson.c. |
Deciding factor for initial code-drop into uPy then was pure code size. That's something to revisit in due time for sure, along with other pending issues on uPy side. |
Since the placement of the cleanmarks() code has been resolved, I think that for now there is no need to do any extra work regarding the thompson.c. file (even not fixing the known bug in the reported match start position). This means not removing it, and only maybe eventually adding it a comment saying that since it is similar to pike.c without capturing, further work has not been done to add fixes and features to it (e.g vm-trace and {n,m}). |
Hi Paul,
For another project that I contribute to, I need a flexible regexp package that I can easily change for my needs (natural-language sentence tokenizing).
It looks like "re1" is appropriate for that. Since I liked your modified VM code more than the original one, I plan to base my modifications on your re1.5.
Before I add the very specific features that I need, I plan to add some more general ones that I also need.
Some of my changes may seem to be too specific for a your purpose, but maybe some are not.
While preparing my specific version, I can send you pull requests for features you think that are appropriate (or may be appropriate) for your re1.5.
Here is what I plan to do (some I already have done, as indicated):
The rational was to ease the debug cycle.
Instead, I used _compilecode() also for code sizing.
The rational was to prevent the need to make parallel changes in re1_5_sizecode() when features are added in _compilecode(), to keep both synchronized regarding the generated code size.
The idea is to be similar to existing regex packages, i which you can pre-compile a list of regexp strings, and use the compiled ones whenever needed, like objects.
But I guess such changes may be inappropriate for your project.
Such a feature is needed for my said other project. Most (if not all) of the existing free regex packages lack it (although it can be done, with some effort, using PCRE.)
For now I implemented it only for "backtrack" , and I still need to write an API for it (the implementation for the other methods is similar.)
This is also a feature I must have for my other project. AFAIK, only .NET has such a feature (see: http://www.rexegg.com/regex-csharp.html#quantgroups)
I already figured out a way to implement that, but I'm not sure it is the ideal one.
The idea is to add count "registers" (similar to the "sub" implementation) and new VM commands to inspect and jump on it.
This is also a must-have feature for my said project.
My idea is to use a table like:
that will be used by the code generator, also for things like [\d\s], instead of hard-coding all classes in _re1_5_namedclassmatch().
This way it will be easy to add any additional desired class.
(Such single-char definitions, like \t, will of course generate char and not class.)
BTW, currently \s in re1.5 is missing 0x0C (formfeed).
(For not including obsolete code, or have a need to maintain it for new functionality that is added.)
BTW, in the current program
?:
anywhere in the regexp (when compiled with DEBUG) gives an error due to the old code which has an incomplete support for it.Currently some memory remained allocated after functions are called.
(This is a must for my said other project.)
Currently the global variable
Sub *freesub
is problematic in this regard.If you indicate which feature is desired also for your code, I will try to isolate it and send you a pull request.
The text was updated successfully, but these errors were encountered: