Add Dataset integration tests - S3 Share requests #1389

Merged
merged 97 commits into data-dot-all:main from share-int-tests
Sep 25, 2024

Conversation

SofiaSazonova (Contributor)

Feature or Bugfix

  • Tests

Detail

  • module share_base
  • bugfix: delete_env requires env_object, not envUri
  • TEMPORARY: hardcoded dataset_uri --> waiting for the dataset module

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on the OWASP Top 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@SofiaSazonova changed the title from "Share int tests" to "Share Integration tests" on Jul 5, 2024
Contributor

The naming does not match any of the other tests; we could call them something closer to the other modules' tests: test_shares and test_shares_backwards_compatibility.



def test_create_share_object(share1):
    assert_that(share1.status).is_equal_to(ShareObjectStatus.Draft.value)
Contributor

We could also add assertions on the principal of the share, and a new test of create_share_object for consumption roles that makes sure the groupUri and the principal are correct; a sketch follows.
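A hedged sketch of what that could look like (fixture and field names mirror this PR's style but are assumptions, not the exact API):

from assertpy import assert_that

# hypothetical test: share_consrole1 would be a share created with a
# ConsumptionRole principal; ShareObjectStatus is imported from the test
# utils as in the existing tests, and the field names are assumed
def test_create_share_object_consumption_role(share_consrole1, consumption_role_1, group5):
    assert_that(share_consrole1.status).is_equal_to(ShareObjectStatus.Draft.value)
    assert_that(share_consrole1.principal.principalType).is_equal_to('ConsumptionRole')
    assert_that(share_consrole1.principal.principalId).is_equal_to(consumption_role_1.consumptionRoleUri)
    assert_that(share_consrole1.groupUri).is_equal_to(group5)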

'ShareItemsFound', 'The request is empty'
)


Contributor

Once the dataset PRs are merged we should add tests on getShareObject with the filters used to list shared items, making sure it lists the tables, folders and bucket, defining them explicitly instead of depending on a list of items. Roughly like the sketch below.
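A hypothetical sketch (get_share_object and its filter argument stand in for this suite's client helpers; the exact shape depends on the dataset PRs):

def test_list_shared_items_explicitly(client5, session_share_1):
    # hypothetical helper call; filter shape is an assumption
    share = get_share_object(client5, session_share_1.shareUri, filter={'isShared': True})
    items = share['items'].nodes
    # name each expected item type explicitly instead of iterating a list
    assert_that(items).extracting('itemType').contains(
        ShareableType.Table.name, ShareableType.StorageLocation.name, ShareableType.S3Bucket.name
    )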

assert_that(items).is_length(1)
assert_that(items[0].shareItemUri).is_equal_to(share_item_uri)
assert_that(items[0].status).is_equal_to(ShareItemStatus.PendingApproval.value)

Contributor

My main concern is that we cannot test test_reject_share without running test_submit_object_no_auto_approval first. The tests are coupled, which in this case might be alright: we are testing that test_share_workflow succeeds in parts. I would just clearly indicate that in a comment and group them all together (all tests on share1 together and all tests on share3 together, to show the path); one way to make the ordering explicit is sketched below. @petrkalos wdyt?
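A sketch, assuming the pytest-dependency plugin were added (plain grouping plus an explanatory comment works just as well):

import pytest

# The tests below intentionally form one workflow on share1: submit -> reject.
@pytest.mark.dependency()
def test_submit_object_no_auto_approval(client5, share1):
    ...

@pytest.mark.dependency(depends=['test_submit_object_no_auto_approval'])
def test_reject_share(client5, share1):
    ...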

items = updated_share['items'].nodes

assert_that(updated_share.status).is_equal_to(ShareObjectStatus.Processed.value)
for item in items:
Contributor

Concern: how do we ensure there are items and this is not an empty list? As a next step after the datasets PR, for the validation we would 1) S3 GetObject against the bucket and access point shares, 2) run an Athena query on the Glue tables; roughly as sketched below.
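A minimal sketch of that validation with plain boto3 clients and placeholder names:

def validate_processed_share(s3_client, athena_client, bucket, key, database, table):
    # 1) bucket / access point share: a plain GetObject must succeed
    s3_client.get_object(Bucket=bucket, Key=key)
    # 2) Glue table share: an Athena query on the shared table must start cleanly
    query_id = athena_client.start_query_execution(
        QueryString=f'SELECT * FROM "{database}"."{table}" LIMIT 1',
        WorkGroup='primary',
    )['QueryExecutionId']
    return query_id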

@SofiaSazonova marked this pull request as ready for review on September 18, 2024, 16:12
Comment on lines 266 to 268
assert_that(items).extracting('itemType').contains(ShareableType.Table.name)
assert_that(items).extracting('itemType').contains(ShareableType.S3Bucket.name)
assert_that(items).extracting('itemType').contains(ShareableType.StorageLocation.name)
Contributor

Maybe assert_that(items).extracting('itemType').contains('foo1', 'foo2', 'foo3') (docs)?
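Applied here, the three assertions would collapse into one, since assertpy's contains accepts multiple items:

assert_that(items).extracting('itemType').contains(
    ShareableType.Table.name, ShareableType.S3Bucket.name, ShareableType.StorageLocation.name
)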

Comment on lines 264 to 265
for item in items:
    assert_that(item.status).is_equal_to(ShareItemStatus.Revoke_Succeeded.value)
Contributor

Maybe assert_that(items).extracting('status').contains_only(ShareItemStatus.Revoke_Succeeded.value)?

Contributor

I'd guess you need that because the account exceeded the number of buckets. Here is a script I wrote that can be run as a Lambda; it will clean up all the "orphan" buckets (buckets that do not belong to any CFN stack). We can later extend this to other resources that get left behind, and maybe run it as part of the pytest teardown or as part of the pipeline.

P.S. One piece that is currently missing: if S3 buckets have access points, the script will fail to delete them (see the sketch after the script).

import logging
import sys
from concurrent.futures.thread import ThreadPoolExecutor

import boto3
from botocore.exceptions import ClientError

logging.getLogger().setLevel(logging.INFO)
if not logging.getLogger().hasHandlers():
    logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))

logger = logging.getLogger(__name__)

session = boto3.session.Session()
s3client = session.client('s3')
s3resource = session.resource('s3')


def is_dataall_bucket(bucket) -> bool:
    try:
        tags = {tag['Key']: tag['Value'] for tag in bucket.Tagging().tag_set}
        return 'testUser' in tags.get('Creator', '') and tags.get('Environment', '').startswith('test')
    except ClientError:
        # buckets without a tag set raise ClientError
        return False


def is_orphan_bucket(bucket):
    region = s3client.get_bucket_location(Bucket=bucket.name)['LocationConstraint'] or 'us-east-1'
    cfnclient = session.client('cloudformation', region_name=region)
    try:
        # the response dict is always truthy; check the resource list itself
        return not cfnclient.describe_stack_resources(PhysicalResourceId=bucket.name)['StackResources']
    except ClientError as e:
        return 'does not exist' in e.response['Error']['Message']


def delete_bucket(bucket):
    bucket_versioning = bucket.Versioning()
    if bucket_versioning.status == 'Enabled':
        bucket.object_versions.delete()
    else:
        bucket.objects.all().delete()
    bucket.delete()


def cleanup_bucket(bucket):
    try:
        logger.info(f'checking {bucket.name=}')
        if is_dataall_bucket(bucket) and is_orphan_bucket(bucket):
            logger.info(f'deleting {bucket.name}')
            delete_bucket(bucket)
    except Exception:
        logger.exception(f'something went wrong when deleting {bucket.name=}')


def run():
    with ThreadPoolExecutor(max_workers=8) as tpe:
        for _ in tpe.map(cleanup_bucket, s3resource.buckets.all()):
            ...


def lambda_handler(event, context):
    run()


if __name__ == '__main__':
    lambda_handler(None, None)
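
A hedged sketch of the missing access-point piece, extending the script above with the s3control API (account_id would come from STS or the Lambda event):

def delete_access_points(account_id, bucket_name, region):
    # access points must be removed before the bucket itself can be deleted
    s3control = session.client('s3control', region_name=region)
    response = s3control.list_access_points(AccountId=account_id, Bucket=bucket_name)
    for access_point in response.get('AccessPointList', []):
        s3control.delete_access_point(AccountId=account_id, Name=access_point['Name'])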

- For this deployment the `config.json` flag `cdk_pivot_role_multiple_environments_same_account` must be set to `true` if an AWS account is going to be reused for multiple environments,
- Second test account is bootstrapped, and first account is added to the trust policy in target regions
Contributor

The terms second and first account are a bit confusing here. We have 3 types of accounts:

  • DevOps/Tooling account
  • Service/Deployment account
  • Environment accounts

All environment accounts must trust the service/deployment account, not the first account. Although the environment and service account can be the same account, we neither encourage this nor use it in our own pipeline.

Would you mind making it clearer in the doc?

Comment on lines 11 to 26
def run_query(self, query, workgroup='primary', output_location=None):
    if output_location:
        result = self._client.start_query_execution(
            QueryString=query, ResultConfiguration={'OutputLocation': output_location}
        )
    else:
        result = self._client.start_query_execution(QueryString=query, WorkGroup=workgroup)
    return result['QueryExecutionId']

def wait_for_query(self, query_id):
    for i in range(self.retries):
        result = self._client.get_query_execution(QueryExecutionId=query_id)
        state = result['QueryExecution']['Status']['State']
        if state not in ['QUEUED', 'RUNNING']:
            return state
        time.sleep(self.timeout)
Contributor

nit: I'd make those two private and provide a higher-level blocking method for queries. Perhaps you could use a boto3 waiter as well; a possible shape is sketched below.
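A sketch of that shape (as far as I know boto3 ships no Athena waiter, so the polling would stay hand-rolled inside the private method):

def _run_query(self, query, workgroup='primary', output_location=None):
    ...  # same body as run_query above

def _wait_for_query(self, query_id):
    ...  # same polling loop as wait_for_query above

def run_query_blocking(self, query, workgroup='primary'):
    """Submit the query and block until it reaches a terminal state."""
    state = self._wait_for_query(self._run_query(query, workgroup))
    if state != 'SUCCEEDED':
        raise RuntimeError(f'query finished in state {state}')
    return state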

Comment on lines 59 to 61
"aws_profiles": {
"second": "second_int_test_profile"
},
Contributor

I think the use of aws_profiles will not play very well with CodeBuild; instead I propose to use the existing infrastructure (see session_env1_aws_client).

As discussed offline you need this account to test consumption roles. By using the integration test account you can solve it with one of the two following patterns (pattern 1 is sketched after this list):

  1. (simpler) add the already created (during env deployment) integration-test role directly as a consumption role. The current (CodeBuild/local) account already has permissions to assume this role, so you can use STS to assume it and then run S3 queries to make sure that the share was successful.
  2. (more complex) use the integration-test role to create new roles in the target account that you will register as consumption roles. Then proceed with assuming those roles and testing for S3 access.
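A minimal sketch of pattern 1, with placeholder role and bucket names:

import boto3

def assert_consumption_role_share(account_id, role_name, bucket, key):
    # assume the integration-test role the env deployment already created
    creds = boto3.client('sts').assume_role(
        RoleArn=f'arn:aws:iam::{account_id}:role/{role_name}',
        RoleSessionName='share-validation',
    )['Credentials']
    s3 = boto3.client(
        's3',
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
    )
    # a successful GetObject proves the share granted read access
    s3.get_object(Bucket=bucket, Key=key)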

Contributor Author

I implemented it via AssumeRole.

Comment on lines 35 to 41
@pytest.mark.parametrize(
    'principal_type',
    ['Group', 'ConsumptionRole'],
)
def test_create_and_delete_share_object(
    client5, persistent_cross_acc_env_1, session_s3_dataset1, consumption_role_1, group5, principal_type
):
Contributor

Instead of parametrizing the tests, you can parametrize a fixture; then every test that uses this fixture runs once per fixture parameter.

For example in this case you can do something like...

@pytest.fixture(params=["Group", "ConsumptionRole"])
def principal1(request, group5, consumption_role_1):
    """
    :return: tuple with (principalUri, principalType)
    """
    if request.param == 'Group':
        yield (group5, request.param)
    else:
        yield (consumption_role_1.consumptionRoleUri, request.param)

Contributor Author

implemented

Comment on lines 160 to 163
@pytest.mark.parametrize(
    'share_fixture_name',
    ['session_share_1', 'session_share_consrole_1'],
)
Contributor

Similar to my previous comment, do you think it's possible to use parametrized fixtures here as well to avoid the getfixturevalue? A possible shape is sketched below.
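For example, a parametrized fixture that selects between the two share fixtures directly (note both fixtures get instantiated either way):

@pytest.fixture(params=['Group', 'ConsumptionRole'])
def share1(request, session_share_1, session_share_consrole_1):
    yield session_share_1 if request.param == 'Group' else session_share_consrole_1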

Contributor Author

implemented

session = boto3.Session()
param_client = session.client('ssm', os.environ.get('AWS_REGION', 'us-east-1'))
parameter_path = f"/dataall/{os.environ.get('ENVNAME', 'dev')}/toolingAccount"
print(parameter_path)
Contributor

nit: use logging instead of print

Contributor Author

removed

"AWS": "arn:aws:iam::{account_id}:root"
"AWS": ["arn:aws:iam::{account_id}:root",
"arn:aws:iam::{IAMClient.get_tooling_account_id()}:root",
"arn:aws:sts::{account_id}:assumed-role/{test_role_name}/{test_role_name}"]
Contributor

I am no IAM expert, but do we need this? I think the first principal (line 46) will allow all roles from account_id to assume this role. Check this.

Contributor Author

No, it doesn't work: an assumed role is processed differently. I tried without it and got AccessDenied, so I had to add it explicitly.

Comment on lines 5 to 7
def __init__(self, session, region):
    if not session:
        session = boto3.Session()
Contributor

nit: shorthand...

    def __init__(self, region, session = boto3.Session()):
       ...

Contributor Author

done

Contributor

this looks awesome 👍

@@ -34,7 +43,9 @@ def create_role(self, account_id, role_name):
{{
Contributor

nit: I'd make this a dict and then do a json.dumps; something like the sketch below.
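A sketch (the statement content is illustrative, not the exact policy from the diff):

import json

def build_trust_policy(account_id):
    # build the document as a dict and serialize once, instead of
    # formatting a raw string with doubled braces
    policy = {
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                'Principal': {'AWS': f'arn:aws:iam::{account_id}:root'},
                'Action': 'sts:AssumeRole',
            }
        ],
    }
    return json.dumps(policy)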

Contributor Author

done

"accountId": "...",
"region": "us-east-1"
},
"persistent_cross_acc_env_1": {
Contributor

did you decide to have a persistent environment for speed, or are there other reasons?

Contributor Author

  1. Speed-up.
  2. Later we will need persistent shares as well.

Contributor

I think @dlpzx made a very good point in another PR about using persistent envs/shares: if we use them without forcing an update (and waiting for it to complete), we might not be testing the latest changes; but if we do force the update, we might as well create a new env every time.

I think we should be able to use persistent envs, with the argument of speed, only for not very significant features AND, obviously, to test backwards compatibility (but even then we should still force an update and wait).

@SofiaSazonova merged commit 2005863 into data-dot-all:main on Sep 25, 2024
9 checks passed
@SofiaSazonova deleted the share-int-tests branch on October 3, 2024, 13:42