Add configurable truncation for string columns#19146
Add configurable truncation for string columns#19146jaykanakiya wants to merge 3 commits intoapache:masterfrom
Conversation
| @JsonProperty("createBitmapIndex") Boolean createBitmapIndex, | ||
| @JsonProperty("maxStringLength") @Nullable Integer maxStringLength |
There was a problem hiding this comment.
drive by comment (i'll have a closer look at rest of PR later)
instead of adding additional arguments here, I was hoping to deprecate these arguments in favor of adding a column format spec similar to was done for auto/json columns in #17762, which could serve as a reference for how this should be wired up. I was planning to move the existing createBitmapIndex and multiValueHandling into such a spec, but just haven't got to it yet. I think this would be much cleaner and less disruptive to call sites going forward. It also allows wiring up to IndexSpec to be able to define job level defaults as a middle place between per column and system wide.
There was a problem hiding this comment.
Thanks for taking a look @clintropolis. Adding something like StringCommonFormatColumnFormatSpec would make it cleaner and makes sense to consolidate the configs there. Since it seems like a bigger refactor, does it make sense to do it in a follow up? Let me know what you think.
| */ | ||
| private static DimensionSchema.MultiValueHandling STRING_MV_MODE = DimensionSchema.MultiValueHandling.SORTED_ARRAY; | ||
| private static IndexSpec DEFAULT_INDEX_SPEC = IndexSpec.builder().build(); | ||
| private static int MAX_STRING_LENGTH = 0; |
There was a problem hiding this comment.
Would it make sense to set this to Integer max value? In case this is used elsewhere in the future there wouldn't need to explicit handling for 0 like you have in truncateIfNeeded
| public SideEffectRegisterer initDimensionHandlerAndMvHandlingMode(DefaultColumnFormatConfig formatsConfig) | ||
| { | ||
| setStringMultiValueHandlingModeIfConfigured(formatsConfig.getStringMultiValueHandlingMode()); | ||
| setMaxStringLengthIfConfigured(formatsConfig.getMaxStringLength()); |
There was a problem hiding this comment.
You can take a look at druid.indexing.formats.stringMultiValueHandlingMode in BuiltInTypesModuleTest It would be good to have some test coverage for the new property
| this.stringMultiValueHandlingMode = validateMultiValueHandlingMode(stringMultiValueHandlingMode); | ||
| this.nestedColumnFormatVersion = nestedColumnFormatVersion; | ||
| this.indexSpec = indexSpec; | ||
| this.maxStringLength = maxStringLength; |
There was a problem hiding this comment.
Here we should validate that configured maxStringLength > 0, otherwise we can throw an exception or log that we are falling back to the default value
| ); | ||
| } | ||
|
|
||
| private long verifyEncodedValues( |
There was a problem hiding this comment.
Leaving a comment for us to revisit later:
I'm curious what the truncation behavior would be like for MVD
Summary
Adds a configurable maximum string length for string dimension columns. Strings exceeding the limit are truncated during ingestion.
druid.indexing.formats.maxStringLengthmaxStringLengthfield in the dimension specRelease note
Added a new maxStringLength configuration for string dimensions that truncates values exceeding the specified length during ingestion. Can be set globally via druid.indexing.formats.maxStringLength or per-dimension in the ingestion spec.
Key changed/added classes in this PR
This PR has: