Skip to content

[PLGN-405] ExtractIT - Adding in extra logic to better handle wrapping of lines in pdf#2089

Merged
cmcnally-r7 merged 5 commits intodevelopfrom
PLGN-405
Nov 2, 2023
Merged

[PLGN-405] ExtractIT - Adding in extra logic to better handle wrapping of lines in pdf#2089
cmcnally-r7 merged 5 commits intodevelopfrom
PLGN-405

Conversation

@rbowden-r7
Copy link
Copy Markdown
Collaborator

Proposed Changes

https://issues.corp.rapid7.com/browse/PLGN-405

Description

Describe the proposed changes:

  • To add in logic that as well as checking if a line is wrapped to also check if the pre wrapped or post wrap is also valid

eg for ips

as the new lines is a valid ip do not wrap

here is some text we thing will wrap
1.1.1.1

as the current line is a valid ip do not wrap

her is some text that will wrap 1.1.1.1
some text 

as neither is a valid ip we will combine and try to see if the combined is a valid ip

her is some text that will wrap 1.1.
1.1

note for less exact regex such as for domains, as both this_is_a_test_url.com and test_domain.com there may still be some edge cases wrapping may not be perfect

her is some text that will wrap this_is_a_
test_domain.com
  • to add in new unit tests to cover these new wraps

PR Requirements

Developers, verify you have completed the following items by checking them off:

Testing

Unit Tests

new unit tests have been added and are passing

integration tests have been updated to reflect the new behaviour

Review our documentation on generating and writing plugin unit tests

  • Unit tests written for any new or updated code

In-Product Tests

If you are an InsightConnect customer or have access to an InsightConnect instance, the following in-product tests should be done:

  • Screenshot of job output with the plugin changes
  • Screenshot of the changed connection, actions, or triggers input within the InsightConnect workflow builder

Style

Review the style guide

  • For dependencies, pin OS package and Python package versions
  • For security, set least privileged account with USER nobody in the Dockerfile when possible
  • For size, use the slim SDK images when possible: rapid7/insightconnect-python-3-38-slim-plugin:{sdk-version-num} and rapid7/insightconnect-python-3-38-plugin:{sdk-version-num}
  • For error handling, use of PluginException and ConnectionTestException
  • For logging, use self.logger
  • For docs, use changelog style
  • For docs, validate markdown with insight-plugin validate which calls icon_validate to lint help.md

Functional Checklist

  • Work fully completed
  • Functional
    • Any new actions/triggers include JSON test files in the tests/ directory created with insight-plugin samples
    • Tests should all pass unless it's a negative test. Negative tests have a naming convention of tests/$action_bad.json
    • Unsuccessful tests should fail by raising an exception causing the plugin to die and an object should be returned on successful test
    • Add functioning test results to PR, sanitize any output if necessary
      • Single action/trigger insight-plugin run -T tests/example.json --debug --jq
      • All actions/triggers shortcut insight-plugin run -T all --debug --jq (use PR format at end)
    • Add functioning run results to PR, sanitize any output if necessary
      • Single action/trigger insight-plugin run -R tests/example.json --debug --jq
      • All actions/triggers shortcut insight-plugin run --debug --jq (use PR format at end)

Assessment

You must validate your work to reviewers:

  1. Run insight-plugin validate and make sure everything passes
  2. Run the assessment tool: insight-plugin run -A. For single action validation: insight-plugin run tests/{file}.json -A
  3. Copy (insight-plugin ... | pbcopy) and paste the output in a new post on this PR
  4. Add required screenshots from the In-Product Tests section

@rbowden-r7 rbowden-r7 self-assigned this Oct 30, 2023
@rbowden-r7 rbowden-r7 changed the title Plgn 405 [PLGN-405] - Adding in extra logic to better handle wrapping of lines in pdf Oct 30, 2023
@cmcnally-r7 cmcnally-r7 changed the title [PLGN-405] - Adding in extra logic to better handle wrapping of lines in pdf [PLGN-405] ExtractIT - Adding in extra logic to better handle wrapping of lines in pdf Oct 30, 2023
@rbowden-r7
Copy link
Copy Markdown
Collaborator Author

Bumping validators to validators==0.22.0 also after looking at snyk

note as part of the update in versions the have add in extra logic for how emails/domains are validated in python-validators/validators@0.20.0...0.21.0 that was added on 0.21.0

this looks to not only validate against the django standards as before but also against the following standards

    [1]: https://github.com/django/django/blob/main/django/core/validators.py#L174
    [2]: https://www.rfc-editor.org/rfc/rfc1034
    [3]: https://www.rfc-editor.org/rfc/rfc5321
    [4]: https://www.rfc-editor.org/rfc/rfc5322

as a result the test email that is used throughout the unit tests
"dgsfsdfghcvhdgfhkdhhefrhwfjhfmhtest@usere.com123e4567-e89b-12d3-a456-426614174000" is failing the new stricter validation and is being removed from the tests

ValidationError(func=domain, args={'value': 'usere.com123e4567-e89b-12d3-a456-426614174000', 'rfc_1034': False, 'rfc_2782': False})

@rbowden-r7
Copy link
Copy Markdown
Collaborator Author

Last commit to edit the unit tests was due to the way the unit tests are run on jenkins

python -m pip install --user insightconnect_plugin_runtime --no-cache-dir --no-warn-script-location

as a result there is a conflict with the new version of validators that is added

 icon-integrations-ci 2.11.16 requires validators==0.14.2, but you have validators 0.22.0 which is incompatible.

as this package is not added when making the container via docker and all of the integration tests are passing with the new code I removed that one test to allow the git-hub checks to pass

note when running the unit tests with the correct version of 0.22.0 locally all tests including the removed test were all passing

Copy link
Copy Markdown
Contributor

@cmcnally-r7 cmcnally-r7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small correction but other than that it LGTM 😄

:param page: The PDF page from which to extract wrapped words.
:type: Page

:param provided_regex: The regex for the type of words to be searched for, eg emial/domain format. Defaults to "".
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param provided_regex: The regex for the type of words to be searched for, eg emial/domain format. Defaults to "".
:param provided_regex: The regex for the type of words to be searched for, e.g. email/domain format.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change has been made

@cmcnally-r7 cmcnally-r7 merged commit b56642b into develop Nov 2, 2023
@cmcnally-r7 cmcnally-r7 deleted the PLGN-405 branch November 2, 2023 10:26
rbowden-r7 added a commit that referenced this pull request Nov 2, 2023
…g of lines in pdf (#2089)

* PLGN-405-Adding in extra logic to better handle wrapping of lines in pdf

* PLGN-405-Reformatting to to black format

* PLGN-405-Bumping version of validators and making changes to unit tests to reflect changes

* PLGN-405-Removing unit test, that is not working with validators 2.20.0

* PLGN-405-Updating the docstring message to make it clearer
rbowden-r7 added a commit that referenced this pull request Nov 2, 2023
…g of lines in pdf (#2089) (#2096)

* PLGN-405-Adding in extra logic to better handle wrapping of lines in pdf

* PLGN-405-Reformatting to to black format

* PLGN-405-Bumping version of validators and making changes to unit tests to reflect changes

* PLGN-405-Removing unit test, that is not working with validators 2.20.0

* PLGN-405-Updating the docstring message to make it clearer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants