[Java] Different semantic of lengths for CHAR(n) with C++ #1973

Open
SiasDoming opened this issue Jul 11, 2024 · 1 comment

@SiasDoming

I'm migrating from Core-C++ to Core-Java. While reading data of type CHAR(n), I found that BytesColumnVector.length in Java has different semantics from StringVectorBatch.length in C++. In Java, because of the code below, it is the number of bytes after the padding blanks are trimmed, whereas length in C++ is the total number of bytes including the padding blanks. For example, reading the value 'ABC' from a CHAR(10) column yields a length of 3 in Java but 10 in C++. I'm wondering why trimmed lengths are preferred in Java.
PS: Either implementation may be acceptable to you, as long as the semantics are the same across the APIs of the different programming languages, but I have to say that the 'redundant' processing in Java did annoy me. I have to allocate a new byte array and pad the bytes again manually for further use. The trimmed lengths also prevent me from using a direct memory copy (although this is still achievable if I'm willing to depend on the internal implementation).

  public static class CharTreeReader extends StringTreeReader {
  ...
    @Override
    public void nextVector(ColumnVector previousVector,
                           boolean[] isNull,
                           final int batchSize,
                           FilterContext filterContext,
                           ReadPhase readPhase) throws IOException {
      ...
        // TreeReaderFactory.java:2474
        // TreeReaderFactory.java:2483
        // TreeReaderFactory.java:2493
        adjustedDownLen = StringExpr
            .rightTrimAndTruncate(result.vector[i], result.start[i], result.length[i], maxLength);
        if (adjustedDownLen < result.length[i]) {
          result.setRef(i, result.vector[i], result.start[i], adjustedDownLen);
        }
      ...
    }
  }
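
For reference, the manual re-padding mentioned in the PS can be sketched roughly as follows. This is only an illustration, not ORC code: `CharPadding` and `padToCharLength` are hypothetical names, and the sketch assumes single-byte (ASCII) data, so that a CHAR(n) value occupies exactly n bytes. The `buffer`, `start`, and `length` arguments stand in for `result.vector[i]`, `result.start[i]`, and `result.length[i]` from the reader code above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharPadding {

    /**
     * Hypothetical helper: copies a trimmed CHAR value out of a (possibly shared)
     * buffer and right-pads it with blanks to maxLength bytes.
     * Assumption: one byte per character (ASCII), so maxLength is also the byte width.
     */
    static byte[] padToCharLength(byte[] buffer, int start, int length, int maxLength) {
        byte[] padded = new byte[maxLength];
        // Never copy more than the declared width.
        int copyLen = Math.min(length, maxLength);
        System.arraycopy(buffer, start, padded, 0, copyLen);
        // Fill the remainder with blanks to restore the fixed-width form.
        Arrays.fill(padded, copyLen, maxLength, (byte) ' ');
        return padded;
    }

    public static void main(String[] args) {
        // "ABC" sits at offset 2 of a larger buffer, as a value would in a shared vector.
        byte[] shared = "xxABCyy".getBytes(StandardCharsets.US_ASCII);
        byte[] padded = padToCharLength(shared, 2, 3, 10);
        // Prints the 10-byte value between brackets: ABC followed by 7 blanks.
        System.out.println("[" + new String(padded, StandardCharsets.US_ASCII) + "]");
    }
}
```

The extra allocation and copy per value is exactly the overhead complained about above; with untrimmed lengths the bytes could be copied out of the vector directly.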
@ffacs
Contributor

ffacs commented Jul 11, 2024

Hi @SiasDoming, it seems there was a proposal in 2015 to provide an option for this, but it has not been implemented yet. FYI:
https://issues.apache.org/jira/browse/ORC-35
