DcmCharString: add some support for multi-byte characters #124

mrbean-bremen · 2025-08-08T09:26:25Z

add DcmCharString::getVM() and getOFString(), which handle multi-byte charsets
DcmByteString::containsExtendedCharacters(): add check for ESCAPE characters (only allowed in code extensions)
removed obsolete DcmCharString::containsExtendedCharacters()

This is not finished, there are some more methods that probably need to be adapted (checkStringValue, verify, getOFStringArray, putOFStringAtPos, writeJson), but I wanted to get some feedback if this is the correct way to do this.

EDIT: added support for putOFStringAtPos, getOFStringArray and writeJson should work out of the box, and the static functions checkStringValue and verify are somewhat out of scope here.

As mentioned in a comment - another way to handle this would be to always convert the string value before calling these methods (and cache the converted value). This would probably make the handling easier, but I'm somewhat reluctant because of the needed caching and cache invalidation, and the possible performance impact (though I haven't done any performance tests).

dcmdata/libsrc/dcchrstr.cc

dcmdata/libsrc/dcbytstr.cc

dcmdata/libsrc/dcchrstr.cc

mrbean-bremen · 2025-08-08T09:30:22Z

dcmdata/libsrc/dcchrstr.cc

+OFCondition DcmCharString::getIndexOfPosition(const long pos, const char*& start, const char*& end, unsigned long& vm)
 {
-    OFBool result = OFFalse;
+    // the vast majority of values have VM 0 or 1, so optimize for these


I'm not too happy with this helper method, but I also didn't want to copy the relevant code into both getVM() and getOFString().
I'm also not sure about the performance impact. My assumption is that the actual character set checks are only needed for very few tags (multi-valued tags with a single-value multi-byte encodings or with code extensions).

I've also tried to cache the positions of the values in the string, but this got ugly and I reverted it. Another idea would be to always convert the string to utf-8 before calling any of the involved functions, and cache this value. That would make the handling easier. Not sure about this, either, though.

mrbean-bremen · 2025-08-08T09:41:15Z

dcmdata/tests/tchval.cc

 //  CHECK_BAD ( "LO-07", DcmLongString::checkStringValueu("OFFIS e.V., Escherweg 2, 26121 Oldenburg, Germany, http://www.offis.de/", "1") )
  CHECK_GOOD( "LO-08", DcmLongString::checkStringValue("\\ _2_ \\ _3_ \\ _4_ \\ _5_ \\", "6") )
-  CHECK_GOOD( "LO-09", DcmLongString::checkStringValue("ESC\033aping", "1") )
+  CHECK_BAD( "LO-09", DcmLongString::checkStringValue("ESC only allowed for charset extension \033", "1") )


I changed these tests, as the ESCAPE character is only allowed for the escape sequence, as per PS 3.5, 6.1.3:

The ESC character shall be used only for ISO 2022 character set control sequences, in accordance with Section 6.1.2.5.

mrbean-bremen · 2025-08-08T09:44:10Z

dcmdata/libsrc/dcbytstr.cc

    /* only check if parameter is true since derived VRs are not affected
       by the attribute SpecificCharacterSet (0008,0005) */
-    if (checkAllStrings)
+    if (checkAllStrings || isAffectedBySpecificCharacterSet())


this check makes the separate method in DcmCharString obsolete

mrbean-bremen · 2025-08-08T15:02:52Z

I keep getting unrelated changes in the commit, due to the RCS tags... have rebased it a couple of times therefore.

jriesmeier · 2025-09-05T08:48:00Z

@mrbean-bremen Thank you for your PR. As I already wrote in the DCMTK forum, I like the idea and your approach in principle. And, it seems to work, which is also nice :-)

However, what I don't like is the "bulkiness" of the DcmCharString::getIndexOfPosition() function, i.e. its length and that it serves as a multi-purpose function, although I understand that it helps to avoid code duplication. What I also do not like is that character set-specific code is "hidden" in this bulky function, e.g. the special handling of "GBK" and "GB18030". Maybe, this could be moved to DcmSpecificCharacterSet?

There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK.

mrbean-bremen · 2025-09-05T11:56:51Z

However, what I don't like is the "bulkiness" of the DcmCharString::getIndexOfPosition() function, i.e. its length and that it serves as a multi-purpose function, although I understand that it helps to avoid code duplication

I completely agree, I didn't like it either. I started with refactoring it, but didn't get to a point where it was much better yet (and I have been traveling meanwhile and didn't get back to it).
I guess we may have to live with some code duplication, and check for any performance impact afterwards.

What I also do not like is that character set-specific code is "hidden" in this bulky function, e.g. the special handling of "GBK" and "GB18030". Maybe, this could be moved to DcmSpecificCharacterSet?

I did this in the beginning, then moved it back as this has been only very trivial stuff used privately, but this certainly makes sense.

There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK.

Sure. I tend to forget that this still has to support some ancient compilers, and while I'm trying to use the same format as the rest of the code, I sometimes slip. As for naming conventions and autoconf support - I have to check that, and may need some support/suggestions at a later point.

I hope to be able to put something together over the weekend.

mrbean-bremen · 2025-09-07T18:15:51Z

@jriesmeier - I did some refactoring without any functional change, but this is still work in progress.

I removed the bulky getIndexOfPosition, and use a few other helper functions instead.
I added the virtual function findNextComponentPosition to do most of the character set specific stuff, with the intention to use it also in a few other places that currently don't support this (like checkStringValue, verify, getOFStringArray), though I haven't done this yet. Maybe this can be done in a separate PR later to keep this one manageable.
I'm not too happy about the interface yet, glad for any ideas.

I could not find a nice way to move stuff to DcmSpecificCharacterSet yet.
I also did not think about optimization yet (well, I did think about it, but decided to wait with that until the implementation is more clear).

About formatting:
I tried to use the formatting as in most in the rest of dcchrstr.cc (there are some other styles in some other files, so I just checked here). There are still different styles used in this file, so I just picked the one used more (indentation 4, not 2, curly brackets on a separate line, and a few others). I didn't see any coding guideline or style guide - if there is one, please point me to it.
A .clang-format file would also be nice, as this is supported by most IDEs.

Thank you for any feedback and ideas!

Thanks to GitHub user "mrbean-bremen" for the report (see PR #124).

jriesmeier · 2025-09-10T15:56:01Z

Thank you for the revision. We'll have a look at it (as time permits).
By the way, I already fixed the typo that you spotted ;) See commit 7d7c8c5.

About formatting:
I tried to use the formatting as in most in the rest of dcchrstr.cc (there are some other styles in some other files, so I just checked here). There are still different styles used in this file, so I just picked the one used more (indentation 4, not 2, curly brackets on a separate line, and a few others). I didn't see any coding guideline or style guide - if there is one, please point me to it.
A .clang-format file would also be nice, as this is supported by most IDEs.

Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style.

Our "golden rules" are.

When modifying an existing file, try to stick to the existing formatting/style, i.e. the one that is predominantly used.
When adding a new file to an existing module, use the formatting/style that is typically used in the existing files of this module.
When creating a new module, you are "free" to use your preferred formatting/style, but try to be consistent over time and not to change your style for each new module :)

This is far from perfect, of course, but it's not the most important thing when it comes to a toolkit that has been developed and maintained for about 30 years.

Ideally, we should have a "development howto", but an up-to-date version is also still in the pipe...

mrbean-bremen · 2025-09-10T17:10:04Z

Thank you for the revision. We'll have a look at it (as time permits).

Thank you - there's no rush, of course.

By the way, I already fixed the typo that you spotted ;)

Thanks - I forgot that I changed that :)

Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style.

Uh, formatting wars... I generally don't care much about a concrete formatting style (maybe with a the exception of Python, where an official style exists), but I try to get some styleguide agreed on, which can be enforced automatically (for example using clang-format). I just don't want to think about formatting, I'm lazy...

Our "golden rules" are. ...

That's pretty much what I'm trying to do (though I may not always succeed).

- add DcmCharString::getVM(), getOFString() and putOFStringAtPos(), which handle multi-byte charsets - DcmByteString::containsExtendedCharacters(): add check for ESCAPE characters (only allowed in code extensions) - removed obsolete DcmCharString::containsExtendedCharacters()

mrbean-bremen · 2025-11-09T08:06:46Z

I've got back to this PR after ignoring it for some time and made a few changes, notably added support for multi-byte strings in putOFStringAtPos(). The rest of the member functions should not need changes in this respect, with the exception of the static functions verify() and checkStringValue(), which I consider out of context for this PR.
I also rebased against master, so this should be up to date.

I'm still ignoring potential performance impacts (don't want to do premature optimization if not needed), though would appreciate any feedback.

mrbean-bremen · 2025-11-13T10:10:11Z

CC @jriesmeier

jriesmeier · 2025-11-14T10:31:19Z

Thank you for the reminder, but I already receive notification mails on this discussion :-)

It's not that I didn't want to respond more quickly, but I'm currently very busy with other things. So I ask for your patience (once again).

And thank you again for your contribution!

mrbean-bremen · 2025-11-14T12:03:03Z

Sure, no problem - I wasn't sure about the notifications, sometimes people switch them on only for mentions. Didn't want to pressure you, sorry for the spam :)

michaelonken reviewed Aug 8, 2025

View reviewed changes

dcmdata/libsrc/dcchrstr.cc Outdated Show resolved Hide resolved

dcmdata/libsrc/dcbytstr.cc Outdated Show resolved Hide resolved

dcmdata/libsrc/dcchrstr.cc Outdated Show resolved Hide resolved

mrbean-bremen force-pushed the charstr_vm branch from 7547784 to 446136b Compare August 8, 2025 09:43

mrbean-bremen commented Aug 8, 2025

View reviewed changes

mrbean-bremen force-pushed the charstr_vm branch 3 times, most recently from cbfa232 to c1479ed Compare August 8, 2025 15:00

mrbean-bremen force-pushed the charstr_vm branch from c1479ed to f00883c Compare August 14, 2025 05:34

mrbean-bremen force-pushed the charstr_vm branch from f00883c to b61e80b Compare September 7, 2025 18:30

michaelonken pushed a commit that referenced this pull request Sep 10, 2025

Fixed typo in comment.

7d7c8c5

Thanks to GitHub user "mrbean-bremen" for the report (see PR #124).

mrbean-bremen force-pushed the charstr_vm branch from b61e80b to 5fecb35 Compare November 7, 2025 15:17

mrbean-bremen force-pushed the charstr_vm branch from 5fecb35 to 6bfea47 Compare November 9, 2025 07:56

DcmCharString: add some support for multi-byte characters #124

Are you sure you want to change the base?

DcmCharString: add some support for multi-byte characters #124

Uh oh!

Conversation

mrbean-bremen commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mrbean-bremen Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrbean-bremen Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

mrbean-bremen Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

mrbean-bremen commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jriesmeier commented Sep 5, 2025

Uh oh!

mrbean-bremen commented Sep 5, 2025

Uh oh!

mrbean-bremen commented Sep 7, 2025

Uh oh!

jriesmeier commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mrbean-bremen commented Sep 10, 2025

Uh oh!

mrbean-bremen commented Nov 9, 2025

Uh oh!

mrbean-bremen commented Nov 13, 2025

Uh oh!

jriesmeier commented Nov 14, 2025

Uh oh!

mrbean-bremen commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mrbean-bremen commented Aug 8, 2025 •

edited

Loading

mrbean-bremen Aug 8, 2025 •

edited

Loading

mrbean-bremen commented Aug 8, 2025 •

edited

Loading

jriesmeier commented Sep 10, 2025 •

edited

Loading

mrbean-bremen commented Nov 14, 2025 •

edited

Loading