Skip to content

Conversation

@mrbean-bremen
Copy link

@mrbean-bremen mrbean-bremen commented Aug 8, 2025

  • add DcmCharString::getVM() and getOFString(), which handle multi-byte charsets
  • DcmByteString::containsExtendedCharacters(): add check for ESCAPE characters (only allowed in code extensions)
  • removed obsolete DcmCharString::containsExtendedCharacters()

This is not finished, there are some more methods that probably need to be adapted (checkStringValue, verify, getOFStringArray, putOFStringAtPos, writeJson), but I wanted to get some feedback if this is the correct way to do this.

EDIT: added support for putOFStringAtPos, getOFStringArray and writeJson should work out of the box, and the static functions checkStringValue and verify are somewhat out of scope here.

As mentioned in a comment - another way to handle this would be to always convert the string value before calling these methods (and cache the converted value). This would probably make the handling easier, but I'm somewhat reluctant because of the needed caching and cache invalidation, and the possible performance impact (though I haven't done any performance tests).

OFCondition DcmCharString::getIndexOfPosition(const long pos, const char*& start, const char*& end, unsigned long& vm)
{
OFBool result = OFFalse;
// the vast majority of values have VM 0 or 1, so optimize for these
Copy link
Author

@mrbean-bremen mrbean-bremen Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too happy with this helper method, but I also didn't want to copy the relevant code into both getVM() and getOFString().
I'm also not sure about the performance impact. My assumption is that the actual character set checks are only needed for very few tags (multi-valued tags with a single-value multi-byte encodings or with code extensions).

I've also tried to cache the positions of the values in the string, but this got ugly and I reverted it. Another idea would be to always convert the string to utf-8 before calling any of the involved functions, and cache this value. That would make the handling easier. Not sure about this, either, though.

// CHECK_BAD ( "LO-07", DcmLongString::checkStringValueu("OFFIS e.V., Escherweg 2, 26121 Oldenburg, Germany, http://www.offis.de/", "1") )
CHECK_GOOD( "LO-08", DcmLongString::checkStringValue("\\ _2_ \\ _3_ \\ _4_ \\ _5_ \\", "6") )
CHECK_GOOD( "LO-09", DcmLongString::checkStringValue("ESC\033aping", "1") )
CHECK_BAD( "LO-09", DcmLongString::checkStringValue("ESC only allowed for charset extension \033", "1") )
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed these tests, as the ESCAPE character is only allowed for the escape sequence, as per PS 3.5, 6.1.3:

The ESC character shall be used only for ISO 2022 character set control sequences, in accordance with Section 6.1.2.5.

/* only check if parameter is true since derived VRs are not affected
by the attribute SpecificCharacterSet (0008,0005) */
if (checkAllStrings)
if (checkAllStrings || isAffectedBySpecificCharacterSet())
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check makes the separate method in DcmCharString obsolete

@mrbean-bremen mrbean-bremen force-pushed the charstr_vm branch 3 times, most recently from cbfa232 to c1479ed Compare August 8, 2025 15:00
@mrbean-bremen
Copy link
Author

mrbean-bremen commented Aug 8, 2025

I keep getting unrelated changes in the commit, due to the RCS tags... have rebased it a couple of times therefore.

@jriesmeier
Copy link
Member

@mrbean-bremen Thank you for your PR. As I already wrote in the DCMTK forum, I like the idea and your approach in principle. And, it seems to work, which is also nice :-)

However, what I don't like is the "bulkiness" of the DcmCharString::getIndexOfPosition() function, i.e. its length and that it serves as a multi-purpose function, although I understand that it helps to avoid code duplication. What I also do not like is that character set-specific code is "hidden" in this bulky function, e.g. the special handling of "GBK" and "GB18030". Maybe, this could be moved to DcmSpecificCharacterSet?

There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK.

@mrbean-bremen
Copy link
Author

However, what I don't like is the "bulkiness" of the DcmCharString::getIndexOfPosition() function, i.e. its length and that it serves as a multi-purpose function, although I understand that it helps to avoid code duplication

I completely agree, I didn't like it either. I started with refactoring it, but didn't get to a point where it was much better yet (and I have been traveling meanwhile and didn't get back to it).
I guess we may have to live with some code duplication, and check for any performance impact afterwards.

What I also do not like is that character set-specific code is "hidden" in this bulky function, e.g. the special handling of "GBK" and "GB18030". Maybe, this could be moved to DcmSpecificCharacterSet?

I did this in the beginning, then moved it back as this has been only very trivial stuff used privately, but this certainly makes sense.

There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK.

Sure. I tend to forget that this still has to support some ancient compilers, and while I'm trying to use the same format as the rest of the code, I sometimes slip. As for naming conventions and autoconf support - I have to check that, and may need some support/suggestions at a later point.

I hope to be able to put something together over the weekend.

@mrbean-bremen
Copy link
Author

@jriesmeier - I did some refactoring without any functional change, but this is still work in progress.

I removed the bulky getIndexOfPosition, and use a few other helper functions instead.
I added the virtual function findNextComponentPosition to do most of the character set specific stuff, with the intention to use it also in a few other places that currently don't support this (like checkStringValue, verify, getOFStringArray), though I haven't done this yet. Maybe this can be done in a separate PR later to keep this one manageable.
I'm not too happy about the interface yet, glad for any ideas.

I could not find a nice way to move stuff to DcmSpecificCharacterSet yet.
I also did not think about optimization yet (well, I did think about it, but decided to wait with that until the implementation is more clear).

About formatting:
I tried to use the formatting as in most in the rest of dcchrstr.cc (there are some other styles in some other files, so I just checked here). There are still different styles used in this file, so I just picked the one used more (indentation 4, not 2, curly brackets on a separate line, and a few others). I didn't see any coding guideline or style guide - if there is one, please point me to it.
A .clang-format file would also be nice, as this is supported by most IDEs.

Thank you for any feedback and ideas!

michaelonken pushed a commit that referenced this pull request Sep 10, 2025
Thanks to GitHub user "mrbean-bremen" for the report (see PR #124).
@jriesmeier
Copy link
Member

jriesmeier commented Sep 10, 2025

Thank you for the revision. We'll have a look at it (as time permits).
By the way, I already fixed the typo that you spotted ;) See commit 7d7c8c5.

About formatting:
I tried to use the formatting as in most in the rest of dcchrstr.cc (there are some other styles in some other files, so I just checked here). There are still different styles used in this file, so I just picked the one used more (indentation 4, not 2, curly brackets on a separate line, and a few others). I didn't see any coding guideline or style guide - if there is one, please point me to it.
A .clang-format file would also be nice, as this is supported by most IDEs.

Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style.

Our "golden rules" are.

  1. When modifying an existing file, try to stick to the existing formatting/style, i.e. the one that is predominantly used.
  2. When adding a new file to an existing module, use the formatting/style that is typically used in the existing files of this module.
  3. When creating a new module, you are "free" to use your preferred formatting/style, but try to be consistent over time and not to change your style for each new module :)

This is far from perfect, of course, but it's not the most important thing when it comes to a toolkit that has been developed and maintained for about 30 years.

Ideally, we should have a "development howto", but an up-to-date version is also still in the pipe...

@mrbean-bremen
Copy link
Author

Thank you for the revision. We'll have a look at it (as time permits).

Thank you - there's no rush, of course.

By the way, I already fixed the typo that you spotted ;)

Thanks - I forgot that I changed that :)

Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style.

Uh, formatting wars... I generally don't care much about a concrete formatting style (maybe with a the exception of Python, where an official style exists), but I try to get some styleguide agreed on, which can be enforced automatically (for example using clang-format). I just don't want to think about formatting, I'm lazy...

Our "golden rules" are. ...

That's pretty much what I'm trying to do (though I may not always succeed).

- add DcmCharString::getVM(), getOFString() and putOFStringAtPos(), which handle multi-byte charsets
- DcmByteString::containsExtendedCharacters():
  add check for ESCAPE characters (only allowed in code extensions)
- removed obsolete DcmCharString::containsExtendedCharacters()
@mrbean-bremen
Copy link
Author

I've got back to this PR after ignoring it for some time and made a few changes, notably added support for multi-byte strings in putOFStringAtPos(). The rest of the member functions should not need changes in this respect, with the exception of the static functions verify() and checkStringValue(), which I consider out of context for this PR.
I also rebased against master, so this should be up to date.

I'm still ignoring potential performance impacts (don't want to do premature optimization if not needed), though would appreciate any feedback.

@mrbean-bremen
Copy link
Author

CC @jriesmeier

@jriesmeier
Copy link
Member

Thank you for the reminder, but I already receive notification mails on this discussion :-)

It's not that I didn't want to respond more quickly, but I'm currently very busy with other things. So I ask for your patience (once again).

And thank you again for your contribution!

@mrbean-bremen
Copy link
Author

mrbean-bremen commented Nov 14, 2025

Sure, no problem - I wasn't sure about the notifications, sometimes people switch them on only for mentions. Didn't want to pressure you, sorry for the spam :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants