Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] The date-processor doesn't parse times in AM/PM format #4564

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

FedericoBrignola
Copy link

@FedericoBrignola FedericoBrignola commented May 22, 2024

Description

The date-processor doesn't parse times in AM/PM format. For example the following timestamp format should be successfully parsed by the pattern below, but instead it gets ignored.

Example timestamp in AM/PM format:
05/20/2024 01:11:06:264 PM UTC

That timestamp should be parsed by this pattern:
MM/dd/yyyy KK:mm:ss:SSS a z

So to parse times in AM/PM format you need to use "h" [ clock-hour-of-am-pm (1-12) ] or "K" [ hour-of-am-pm (0-11) ] pattern letters. However, using these pattern letters conflicts with the default configuration (HOUR_OF_DAY = 0) of the
DateTimeFormatterBuilder, throwing the following exception:

Caused by: java.time.DateTimeException: Conflict found: HourOfDay 0 differs from HourOfDay 13 while resolving  AmPmOfDay

Note: The exceptions thrown by formatters are ignored by the date-processor.

This file main.java.tar.gz is a Java class, which I used to study and patch that bug, with the "copy-pasted" method used by date-processor to build the formatters. There is also a patched version of that method getSourceFormatter_patched.

Steps to reporduce

Try to parse the following Json lines:

{ "date": "05/20/2024 01:11:06:264 PM UTC" }
{ "date": "05/20/2024 11:11:06:264 AM UTC" }

with this pipeline:

pipeline:
  source:
    file:
      path: /usr/share/data-prepper/pipelines/logs.txt
      format: json
      record_type: event

  processor:
    - date:
        match:
          - key: date
            patterns: ["MM/dd/yyyy KK:mm:ss:SSS a z"]

  sink:
    - stdout:

I used this docker command:

docker run --rm -v $PWD/pipeline.yaml:/usr/share/data-prepper/pipelines/pipeline.yaml -v $PWD/logs.txt:/usr/share/data-prepper/pipelines/logs.txt opensearchproject/data-prepper

The output should contain the @timestamp field, but it doesn't.

Issues Resolved

No open issue

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Federico Brignola <160846842+federicobrignola@users.noreply.github.com>
@serbozanrevd
Copy link

I am crossing the same bug.
And I am definitely not a Java fluent.

Why did you remove completly the parseDefaulting(ChronoField.HOUR_OF_DAY, 0) ?
It might generate some others issues ?

Something like that might be less problematic ?

            if (pattern.contains("a")) {
              dateTimeFormatterBuilder.parseDefaulting(ChronoField.HOUR_OF_AMPM, 0);
            } else {
              dateTimeFormatterBuilder.parseDefaulting(ChronoField.HOUR_OF_DAY, 0);
            }

@FedericoBrignola
Copy link
Author

I didn't completely remove the line parseDefaulting(ChronoField.HOUR_OF_DAY, 0). I made it conditional to the user's pattern.

Looking at your code:

if (pattern.contains("a")) {
    /* The following line is not necessary because we know that the pattern contains the ‘a’ key. 
    Therefore, the user wants to parse strings that contain an ‘hours’ field in the AM/PM format, 
    so there is no need to set the default hour to 0. */
    dateTimeFormatterBuilder.parseDefaulting(ChronoField.HOUR_OF_AMPM, 0);
} else {
    dateTimeFormatterBuilder.parseDefaulting(ChronoField.HOUR_OF_DAY, 0);
}

Removing the unnecessary line we obtain the same code as mine.

@serbozanrevd
Copy link

@FedericoBrignola Thanks for your clarification. This is now clear for me.

Copy link
Collaborator

@KarstenSchnitter KarstenSchnitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing this fix. Would you mind adding a test case, that forces this change?

@FedericoBrignola
Copy link
Author

I took a look at the tests and they are not so straightforward. I don't have time right now to study how they implemented those test cases and write a new compliant one.
Maybe next month, or i don't know if @serbozanrevd would give it a try.

@dlvenable
Copy link
Member

@FedericoBrignola , I do think that this makes sense overall. Can you add a couple unit test cases? One would be to parse without the a and the other to parse with the a. You can write them to match the scenario you wrote above in the description.

@FedericoBrignola
Copy link
Author

I added 4 test cases:

  1. hh:mm:ss a tests hours in AM/PM 1-12 format
  2. KK:mm:ss a tests hours in AM/PM 0-11 format
  3. kk:mm:ss tests hours in 1-24 format
  4. HH:mm:ss tests hours in 0-23 format

While developing those test cases I found another problem in the default setting of HOUR_OF_DAY to 0 when the pattern contains the key k (clock-time-of-day 1-24), obviously because 0 is not in the range 1-24, so I added the check !pattern.contains("k") to the following code:

if(!pattern.contains("a") && !pattern.contains("k"))
            dateTimeFormatterBuilder.parseDefaulting(ChronoField.HOUR_OF_DAY, 0);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants