DcmCharString: add some support for multi-byte characters #124
base: master
7547784 to
446136b
Compare
dcmdata/libsrc/dcchrstr.cc
Outdated
| OFCondition DcmCharString::getIndexOfPosition(const long pos, const char*& start, const char*& end, unsigned long& vm) | ||
| { | ||
| OFBool result = OFFalse; | ||
| // the vast majority of values have VM 0 or 1, so optimize for these |
I'm not too happy with this helper method, but I also didn't want to copy the relevant code into both getVM() and getOFString().
I'm also not sure about the performance impact. My assumption is that the actual character set checks are only needed for very few tags (multi-valued tags with a single-value multi-byte encoding or with code extensions).
I've also tried to cache the positions of the values in the string, but this got ugly and I reverted it. Another idea would be to always convert the string to UTF-8 before calling any of the involved functions, and to cache the converted value. That would make the handling easier. I'm not sure about this either, though.
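To illustrate why the value boundaries need character-set awareness at all, here is a minimal sketch (not the code from this PR). It assumes a simplified GBK-like encoding in which every byte in the range 0x81..0xFE is a lead byte followed by exactly one trail byte, and that trail byte may be 0x5C, the same byte as the backslash delimiter. The function names `countValuesNaive` and `countValuesTwoByte` are hypothetical; four-byte GB18030 sequences and ISO 2022 code extensions are deliberately ignored.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of DCMTK: treats every 0x5C byte as a
// value delimiter, which is only correct for single-byte character sets.
std::size_t countValuesNaive(const std::string& value)
{
    if (value.empty())
        return 0;  // an empty value has VM 0
    std::size_t vm = 1;
    for (std::size_t i = 0; i < value.size(); ++i)
    {
        if (value[i] == '\\')
            ++vm;
    }
    return vm;
}

// Hypothetical helper, not part of DCMTK: assumes a simplified two-byte
// encoding (lead byte 0x81..0xFE, then exactly one trail byte).
std::size_t countValuesTwoByte(const std::string& value)
{
    if (value.empty())
        return 0;  // an empty value has VM 0
    std::size_t vm = 1;
    for (std::size_t i = 0; i < value.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(value[i]);
        if (c >= 0x81 && c <= 0xFE)
            ++i;   // skip the trail byte, which may happen to be 0x5C
        else if (c == '\\')
            ++vm;  // a real delimiter outside any two-byte character
    }
    return vm;
}
```

For the byte sequence `'A' 0x5C 0x81 0x5C 'B'`, the naive count reports three values, while the two-byte-aware count reports two, because the second 0x5C is the trail byte of a two-byte character.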
| // CHECK_BAD ( "LO-07", DcmLongString::checkStringValueu("OFFIS e.V., Escherweg 2, 26121 Oldenburg, Germany, http://www.offis.de/", "1") ) | ||
| CHECK_GOOD( "LO-08", DcmLongString::checkStringValue("\\ _2_ \\ _3_ \\ _4_ \\ _5_ \\", "6") ) | ||
| CHECK_GOOD( "LO-09", DcmLongString::checkStringValue("ESC\033aping", "1") ) | ||
| CHECK_BAD( "LO-09", DcmLongString::checkStringValue("ESC only allowed for charset extension \033", "1") ) |
I changed these tests, as the ESCAPE character is only allowed for the escape sequence, as per PS 3.5, 6.1.3:
The ESC character shall be used only for ISO 2022 character set control sequences, in accordance with Section 6.1.2.5.
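A rough sketch of what such a rule could look like in isolation, assuming only the general ISO 2022 shape of a designation escape sequence: ESC, then one or more intermediate bytes in 0x20..0x2F, then one final byte in 0x30..0x7E. The function name `escOnlyInEscapeSequences` is hypothetical, and this is not the check implemented in the PR. In particular, it does not verify that the sequence belongs to a character set declared in SpecificCharacterSet (0008,0005).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of DCMTK: returns true if every ESC (0x1B)
// in the value starts a structurally plausible ISO 2022 designation
// sequence (ESC, 1..n intermediate bytes 0x20..0x2F, final byte 0x30..0x7E).
bool escOnlyInEscapeSequences(const std::string& value)
{
    const std::size_t len = value.size();
    std::size_t i = 0;
    while (i < len)
    {
        if (value[i] != '\033')
        {
            ++i;
            continue;
        }
        // count the intermediate bytes that follow the ESC
        std::size_t j = i + 1;
        std::size_t intermediates = 0;
        while (j < len)
        {
            const unsigned char c = static_cast<unsigned char>(value[j]);
            if (c < 0x20 || c > 0x2F)
                break;
            ++j;
            ++intermediates;
        }
        // at least one intermediate byte and a final byte are required
        if (intermediates == 0 || j >= len)
            return false;
        const unsigned char fin = static_cast<unsigned char>(value[j]);
        if (fin < 0x30 || fin > 0x7E)
            return false;
        i = j + 1;  // continue after the complete escape sequence
    }
    return true;
}
```

With this sketch, a value containing `ESC $ B` or `ESC ( B` passes, while a value that ends in a lone ESC is rejected. An ESC followed directly by a letter is rejected as well, since the sketch requires at least one intermediate byte.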
| /* only check if parameter is true since derived VRs are not affected | ||
| by the attribute SpecificCharacterSet (0008,0005) */ | ||
| if (checkAllStrings) | ||
| if (checkAllStrings || isAffectedBySpecificCharacterSet()) |
this check makes the separate method in DcmCharString obsolete
cbfa232 to
c1479ed
Compare
I keep getting unrelated changes in the commit due to the RCS tags, so I have rebased it a couple of times.
c1479ed to
f00883c
Compare
|
@mrbean-bremen Thank you for your PR. As I already wrote in the DCMTK forum, I like the idea and your approach in principle. And, it seems to work, which is also nice :-) However, what I don't like is the "bulkiness" of the There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK. |
I completely agree, I didn't like it either. I started with refactoring it, but didn't get to a point where it was much better yet (and I have been traveling meanwhile and didn't get back to it).
I did this in the beginning, then moved it back as this has been only very trivial stuff used privately, but this certainly makes sense.
Sure. I tend to forget that this still has to support some ancient compilers, and while I'm trying to use the same format as the rest of the code, I sometimes slip. As for naming conventions and autoconf support - I have to check that, and may need some support/suggestions at a later point. I hope to be able to put something together over the weekend. |
|
@jriesmeier - I did some refactoring without any functional change, but this is still work in progress. I removed the bulky I could not find a nice way to move stuff to About formatting: Thank you for any feedback and ideas! |
f00883c to
b61e80b
Compare
Thanks to GitHub user "mrbean-bremen" for the report (see PR #124).
|
Thank you for the revision. We'll have a look at it (as time permits).
Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style. Our "golden rules" are.
This is far from perfect, of course, but it's not the most important thing when it comes to a toolkit that has been developed and maintained for about 30 years. Ideally, we should have a "development howto", but an up-to-date version is also still in the pipe... |
Thank you - there's no rush, of course.
Thanks - I forgot that I changed that :)
Uh, formatting wars... I generally don't care much about a concrete formatting style (maybe with the exception of Python, where an official style exists), but I try to get some style guide agreed on, which can be enforced automatically (for example using
That's pretty much what I'm trying to do (though I may not always succeed). |
b61e80b to
5fecb35
Compare
- add DcmCharString::getVM(), getOFString() and putOFStringAtPos(), which handle multi-byte charsets - DcmByteString::containsExtendedCharacters(): add check for ESCAPE characters (only allowed in code extensions) - removed obsolete DcmCharString::containsExtendedCharacters()
5fecb35 to
6bfea47
Compare
|
I've got back to this PR after ignoring it for some time and made a few changes, notably added support for multi-byte strings in I'm still ignoring potential performance impacts (don't want to do premature optimization if not needed), though would appreciate any feedback. |
|
CC @jriesmeier |
|
Thank you for the reminder, but I already receive notification mails on this discussion :-) It's not that I didn't want to respond more quickly, but I'm currently very busy with other things. So I ask for your patience (once again). And thank you again for your contribution! |
|
Sure, no problem - I wasn't sure about the notifications, sometimes people switch them on only for mentions. Didn't want to pressure you, sorry for the spam :) |
This is not finished, there are some more methods that probably need to be adapted (checkStringValue, verify, getOFStringArray, putOFStringAtPos, writeJson), but I wanted to get some feedback on whether this is the correct way to do this.
EDIT: added support for putOFStringAtPos; getOFStringArray and writeJson should work out of the box, and the static functions checkStringValue and verify are somewhat out of scope here.
As mentioned in a comment, another way to handle this would be to always convert the string value before calling these methods (and cache the converted value). This would probably make the handling easier, but I'm somewhat reluctant because of the needed caching and cache invalidation, and the possible performance impact (though I haven't done any performance tests).
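For reference, the caching alternative could look roughly like the following sketch. Everything in it is hypothetical: `CachedUtf8Value` is not a DCMTK class, and `convertToUtf8` is a placeholder that merely copies its input where a real implementation would call a character set converter.

```cpp
#include <string>

// Hypothetical stand-in for a real character set conversion; it only
// copies the input so that the sketch stays self-contained.
static std::string convertToUtf8(const std::string& raw)
{
    return raw;
}

// Hypothetical class, not part of DCMTK: keeps the raw value and a
// lazily converted UTF-8 copy that is invalidated on every write.
class CachedUtf8Value
{
public:
    CachedUtf8Value() : cacheValid_(false), conversions_(0) {}

    // any modification of the raw value invalidates the cache
    void putString(const std::string& raw)
    {
        raw_ = raw;
        cacheValid_ = false;
    }

    // convert on first access only, then reuse the cached result
    const std::string& utf8()
    {
        if (!cacheValid_)
        {
            utf8_ = convertToUtf8(raw_);
            cacheValid_ = true;
            ++conversions_;
        }
        return utf8_;
    }

    // number of conversions performed so far (for illustration only)
    unsigned int conversions() const { return conversions_; }

private:
    std::string raw_;
    std::string utf8_;
    bool cacheValid_;
    unsigned int conversions_;
};
```

The sketch covers only the simplest invalidation trigger, a write to the value itself. In a real data set the cache would also have to be dropped whenever the SpecificCharacterSet (0008,0005) that governs the element changes, which is part of what makes this approach less attractive.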