NUL-char clean #22
Comments
That means the subject string is NUL-clean, not that the regexp pattern is NUL-clean. |
The Python3 manual says: However, in both Python3 and uPy, \x00 in a string is translated to NUL-char, as expected:
So it looks like there is a real NUL char there, and the strings are NUL-clean, because b is printed as well. But Python3 supports NUL chars in regexes, while uPy doesn't, because compilecode() is not NUL-clean:
Only the last one matches. Note that Python3 matches the NUL char in the pattern to the NUL char in the subject string. But:
So we see that compilecode() stopping at a NUL char causes an incompatibility with Python3, and makes it impossible to search for a NUL char in the subject string. In other words, the NUL-clean subject string feature cannot be used directly (a NUL can only be matched by a dot). Making compilecode() NUL-clean is easy, makes uPy regex matching compatible with Python3 in this respect, and is a useful feature in its own right (for searching binary data), so why not do it? |
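To illustrate the point above, here is a minimal sketch in C of why a NUL-terminated scan drops the rest of a pattern while a length-bounded scan does not. Both functions are hypothetical stand-ins, not code from re1.5; the assumption is only that compilecode() walks its pattern with a loop that ends at the first NUL byte.

```c
#include <stddef.h>

/* Hypothetical stand-in for a pattern scanner that stops at the first NUL,
   as any loop of the form `for (; *re; re++)` does. */
static size_t scan_nul_terminated(const char *re)
{
    size_t n = 0;
    for (; *re; re++)
        n++;
    return n;
}

/* Length-bounded variant: the end pointer, not a NUL byte, ends the scan. */
static size_t scan_bounded(const char *begin, const char *end)
{
    size_t n = 0;
    for (const char *p = begin; p < end; p++)
        n++;
    return n;
}
```

For the three-byte pattern a, NUL, b, the first function sees only the leading a, while the second sees all three bytes.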
The README says:
And indeed they are:
I also thought re1.5 supported quoted hex syntax, but it doesn't. Adding that feature is certainly more important than making re1.5 NUL-in-regex-clean. Note that even PCRE doesn't support NULs in a regex, necessitating patches such as micropython/micropython-lib@0373045 (PCRE via FFI is an alternative regex engine supported by uPy, and the one used for real-world cases like the stdlib). I'm not against making it "better than PCRE", but what about devising a way to do generalized assertions for finite-automaton matchers? ;-)
I'm not against it, but making re1.5 able to have \0 in a regex, at the price of a less convenient API, won't make it "better". Adding missing features would. If this lies on your critical path to those new features, let's fix it, but otherwise I'd deprioritize it until later. |
Yes, in uPy you can currently match a NUL char in the subject string only with a dot, as opposed to Python3, in which a NUL char in the pattern matches a NUL char in the subject string, as I demonstrated.
In the above example, the \x00 is directly translated by uPy (and also Python3) to a NUL char in the regex, which actually causes compilecode() to terminate the pattern processing when seeing it, ignoring the
It will make it "better" in the sense of being compatible with what Python3 does in the exact same situation.
But Python3 actually can, according to the examples I gave, which compare the same search in Python3 and uPy (unless I missed something, in which case please point it out). Isn't the goal of uPy to be compatible with Python3?
I have no problem adding this feature.
Do you mean look-ahead/behind? BTW, currently re1.5 doesn't implement $ correctly. It implements it like \Z.
I can fix that and add \Z. |
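As a sketch of the distinction discussed above, the two end-of-subject assertions could be expressed as the following predicates. The function names and the begin/end pointer style are assumptions for illustration, not re1.5 code. The `$` rule follows Python 3's documented behavior without the MULTILINE flag: it matches at the end of the string or just before a newline at the end of the string, while \Z matches only at the very end.

```c
#include <stdbool.h>

/* \Z semantics: true only at the very end of the subject. */
static bool at_end_only(const char *sp, const char *end)
{
    return sp == end;
}

/* Python 3 `$` semantics without MULTILINE: true at the end of the subject,
   or just before a single trailing newline. */
static bool at_dollar(const char *sp, const char *end)
{
    return sp == end || (sp + 1 == end && *sp == '\n');
}
```

With the subject "ab\n", the position just before the newline satisfies at_dollar() but not at_end_only(), which is the case where the two assertions differ.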
Yes, but the goal of re1.5 is to be an easily reusable, small regex library ;-). It won't be able to support all Python3 features anyway, and when choosing between spending the next 50 bytes on feature X or feature Y, X and Y should be prioritized. That was the hint: from my outside view, you seem to be working on elaborating corner cases instead of adding something exciting (which was in your original list). But well, if you're keen to work on this, please do. All of this would need to be done eventually of course, and if you feel like doing it now, well, maybe that's only better ;-).
Yes, that's a known TODO from the README too:
So, if you feel like implementing it, please do. |
To support that, I would like to add a flags argument to the matching functions, but I don't want to add just another argument to the recursive functions. To that end I need your input on my proposal of a parameter block (in #18), which is also needed to support a recursion-limit counter and a regex loop-prevention mechanism without adding more function arguments (both are already implemented in my development code). |
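A rough sketch of what such a parameter block might look like follows. Every name here (MatchCtx, its fields, recurse()) is hypothetical and is not taken from re1.5 or from the proposal in #18; the point is only that flags and a recursion counter can travel through one pointer instead of extra arguments.

```c
/* Hypothetical parameter block; none of these names come from re1.5. */
typedef struct MatchCtx {
    unsigned flags;   /* room for future option bits, e.g. multiline */
    int depth;        /* current recursion depth */
    int max_depth;    /* recursion limit */
} MatchCtx;

/* Stand-in for a recursive matcher: recurses `steps` times.
   Returns 0 on success, -1 when the recursion limit would be exceeded. */
static int recurse(MatchCtx *ctx, int steps)
{
    int rc;

    if (steps == 0)
        return 0;
    if (ctx->depth >= ctx->max_depth)
        return -1;
    ctx->depth++;
    rc = recurse(ctx, steps - 1);
    ctx->depth--;     /* restore the counter on the way back up */
    return rc;
}
```

Adding another option later then means adding a field to the struct rather than changing the signature of every recursive function.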
More on:
A flag parameter to support some of these features may need to be added to compilecode(). |
From the README:
However, this feature cannot actually be used, because compilecode() is not NUL-char clean!
I propose to change it to be NUL-char clean.
To that end, I propose to change its first argument from char *re to Subject *re. This will also allow it to return a negative int on errors (as I proposed in issue #18).
If this sounds fine, I will send a pull request.
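A minimal sketch of the shape of this proposal, under stated assumptions: the Span struct below is a stand-in modeled on the begin/end idea behind Subject (its field names are assumptions), and count_atoms() is a hypothetical pre-pass, not the real compilecode(). It shows a pattern walked by explicit bounds, so an embedded NUL is just another byte, with a negative int reporting an error.

```c
#include <stddef.h>

/* Stand-in for a begin/end pattern descriptor; field names are assumptions. */
typedef struct Span {
    const char *begin;
    const char *end;
} Span;

/* Hypothetical pre-pass: counts pattern atoms, treating a backslash plus the
   following byte as one atom. Returns a negative int on a dangling backslash. */
static int count_atoms(const Span *re)
{
    int n = 0;

    for (const char *p = re->begin; p < re->end; p++) {
        if (*p == '\\') {
            if (p + 1 == re->end)
                return -1;  /* error: pattern ends in a lone backslash */
            p++;            /* skip the escaped byte, even if it is NUL */
        }
        n++;
    }
    return n;
}
```

Because the loop is bounded by re->end rather than by a NUL byte, the caller has to supply the pattern length, which is the less convenient API mentioned in the replies.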