[BUG]Incorrect Behavior of Obfuscate Processor with Predefined Pattern "%{CREDIT_CARD_NUMBER}" #4340

Closed
anudasari20 opened this issue Mar 26, 2024 · 7 comments · Fixed by #4476
Labels
bug Something isn't working
anudasari20 commented Mar 26, 2024

Describe the bug
The issue arises when utilizing the predefined pattern "%{CREDIT_CARD_NUMBER}" with the obfuscate processor in the OSI pipeline. The expected behavior is for the processor to exclusively mask credit card information within logs while leaving non-personally identifiable information (non-PII) fields untouched. However, in our current environment, we have observed that the obfuscate processor is erroneously masking non-PII fields such as trackingId and sdsStayGuid. This unintended behavior complicates troubleshooting efforts for application teams as critical data points become obscured.

Attaching some screenshots where the data has been masked:
image

image

Expected behavior
With the patterns configuration option, users expect the predefined obfuscation patterns for common fields to work out of the box. Specifically, the obfuscate processor should apply the predefined pattern "%{CREDIT_CARD_NUMBER}" without errors and mask only credit card values within logs, leaving untouched any other field values that merely resemble credit card patterns.

The trackingId values should not be masked, as shown in this screenshot:
image

Resolution:
To rectify this issue, the implementation of the obfuscate processor requires refinement. The processor should be updated to accurately discern and mask solely credit card numbers within logs, adhering strictly to the predefined "%{CREDIT_CARD_NUMBER}" pattern. This necessitates a thorough review and potential adjustment of the pattern matching algorithm employed by the processor. Furthermore, comprehensive testing is essential to validate the updated processor's efficacy across diverse log scenarios, ensuring that it effectively safeguards credit card information while preserving the integrity of non-PII fields.

Steps to Reproduce:

  1. Configure the obfuscate processor within the OSI pipeline, utilizing the predefined pattern "%{CREDIT_CARD_NUMBER}".
  2. Analyze logs containing a mixture of credit card numbers and non-PII fields.
  3. Observe whether non-PII fields are erroneously masked alongside credit card numbers, impeding the troubleshooting process for application teams.

Example configuration

- obfuscate:
        source: 'data'
        patterns:
          - '%{CREDIT_CARD_NUMBER}'
        action:
          mask:
            mask_character: "&"
            mask_character_length: 10
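
The false positive can be reproduced outside the pipeline. The following is a minimal sketch, not the processor's implementation: it uses Python's `re` module as a stand-in for the Java regex engine, takes the regex quoted later in this thread as the one behind `%{CREDIT_CARD_NUMBER}`, and assumes the mask action replaces each match with `mask_character` repeated `mask_character_length` times. The `tracking_id` value is hypothetical.

```python
import re

# Regex quoted in this thread for CREDIT_CARD_NUMBER (Java string escaping removed).
CREDIT_CARD_NUMBER = re.compile(r"(\d[ -]*?){13,16}")


def mask(text, mask_character="&", mask_character_length=10):
    """Replace every match with a fixed-length run of the mask character."""
    return CREDIT_CARD_NUMBER.sub(mask_character * mask_character_length, text)


# Hypothetical GUID-style identifier: digits separated by hyphens, no card number.
tracking_id = "trackingId=20240326-1234-5678-9012-345678901234"

print(mask(tracking_id))      # the digit groups are masked although no card is present
print(mask("orderCount=42"))  # short numbers are left alone
```

Because the pattern allows spaces and hyphens between digits, any identifier with 13 or more digits in hyphen-separated groups is a candidate for masking.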

Environment (please complete the following information):

  • OS: Amazon EC2 - Linux/UNIX
  • Version : AML 2.0
anudasari20 added the bug and untriaged labels Mar 26, 2024
dlvenable added this to the v2.7.1 milestone Apr 2, 2024
dlvenable modified the milestones: v2.7.1, v2.8 Apr 2, 2024
@dlvenable
Member

@Utkarsh-Aga
Contributor

Hello @dlvenable,
Just wanted to check: would modifying the current pattern "(\\d[ -]*?){13,16}" to "\\b(?:\\d[ -]*?){13,16}\\b" help in this particular scenario?

@Utkarsh-Aga
Contributor

I tested the scenario on my end and observed the following:

Using Pattern - (\\d[ -]*?){13,16}

| Input Data | Output Data |
| --- | --- |
| fd55555069-e7a9-11ee4111111111111111 | fd55555069-e7a9-11ee########## |
| 4111111111111111 | ########## |
| fd55555069-e7a9-11ee-91 | fd55555069-e7a9-11ee-91 |

Using Pattern - \\b(?:\\d[ -]*?){13,16}\\b

| Input Data | Output Data |
| --- | --- |
| fd55555069-e7a9-11ee4111111111111111 | fd55555069-e7a9-11ee4111111111111111 |
| 4111111111111111 | ########## |
| fd55555069-e7a9-11ee-91 | fd55555069-e7a9-11ee-91 |

So, based on the above, I feel that we can update the CREDIT_CARD_NUMBER pattern from (\\d[ -]*?){13,16} to \\b(?:\\d[ -]*?){13,16}\\b.
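
The comparison above can be re-run with a short script. This is a sketch under assumptions: Python's `re` stands in for the Java regex engine (for ASCII input, `\d`, `\b`, and lazy quantifiers should behave alike), and the mask is taken to be ten `#` characters as in the outputs listed. The names `ORIGINAL`, `BOUNDED`, and `compare` are illustrative only.

```python
import re

# The two candidate patterns, with Java string escaping removed.
ORIGINAL = re.compile(r"(\d[ -]*?){13,16}")
BOUNDED = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

INPUTS = [
    "fd55555069-e7a9-11ee4111111111111111",
    "4111111111111111",
    "fd55555069-e7a9-11ee-91",
]


def mask(pattern, text, mask_character="#", length=10):
    """Replace every match of pattern with a fixed-length mask."""
    return pattern.sub(mask_character * length, text)


def compare(inputs):
    """Return (input, output with ORIGINAL, output with BOUNDED) per input."""
    return [(text, mask(ORIGINAL, text), mask(BOUNDED, text)) for text in inputs]


for row in compare(INPUTS):
    print(row)
```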

@dlvenable, any comments on this?

dlvenable removed this from the v2.8 milestone Apr 16, 2024
@dlvenable
Member

@Utkarsh-Aga, thank you for looking into this.

It seems the root of your solution is to add the word boundary (\b). But, what if there is a concatenation?

e.g.

visa4111111111111111

or

creditcard4111111111111111

I believe this would not match.
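
A quick check of the concatenation concern, again using Python's `re` as a stand-in for the Java regex engine (an assumption, not the processor's code). Letters and digits are both word characters, so there is no `\b` between `visa` and the first digit. The space-separated input is an added contrast case, not from the thread.

```python
import re

ORIGINAL = re.compile(r"(\d[ -]*?){13,16}")
BOUNDED = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def matches(pattern, text):
    """True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


for text in ["visa4111111111111111",
             "creditcard4111111111111111",
             "visa 4111111111111111"]:
    # Print the input, then whether each pattern finds a card number in it.
    print(text, matches(ORIGINAL, text), matches(BOUNDED, text))
```

With the word-boundary pattern, the two concatenated inputs go undetected, while the space-separated one is still found.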

One option would be to add a configuration in the obfuscate processor itself to allow for word boundaries (e.g. single_word_only). Then any pattern could have this setting.

- obfuscate:
        source: "log"
        target: "new_log"
        single_word_only: true
        patterns:
          - '%{CREDIT_CARD_NUMBER}'
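
One way such a setting could behave is to wrap whichever pattern is configured in word boundaries. The sketch below illustrates that idea only; it is not Data Prepper's implementation, and `compile_pattern` and `obfuscate` are hypothetical helpers written in Python for brevity.

```python
import re

# Regex quoted in this thread for CREDIT_CARD_NUMBER (Java string escaping removed).
CREDIT_CARD_NUMBER = r"(\d[ -]*?){13,16}"


def compile_pattern(pattern, single_word_only=False):
    """Hypothetical helper: optionally require word boundaries around any pattern."""
    if single_word_only:
        pattern = r"\b(?:" + pattern + r")\b"
    return re.compile(pattern)


def obfuscate(text, pattern, single_word_only=False, mask="#" * 10):
    """Replace every match with the mask; the boundary rule is opt-in."""
    return compile_pattern(pattern, single_word_only).sub(mask, text)


log = "id=fd55555069-e7a9-11ee4111111111111111 card=4111111111111111"
print(obfuscate(log, CREDIT_CARD_NUMBER))
print(obfuscate(log, CREDIT_CARD_NUMBER, single_word_only=True))
```

Keeping the boundary rule as a processor option rather than baking it into the predefined pattern lets users who need to catch concatenated values such as `visa4111111111111111` leave it off.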

@dlvenable
Member

The solution for this will be to use single_word_only: true starting in Data Prepper 2.8.

dlvenable modified the milestones: v2.8, v2.9 May 15, 2024
@dlvenable
Member

We are backporting this to 2.8 to include in that release.

@dlvenable
Member

#4550
