
Upgrade from V1.5.6 to V1.6.2 issues: AccessDenied for pivot role in ECS tasks + CodeArtifact vpc endpoint networking timeout in CodeBuild migration stage #826

Closed
idpdevops opened this issue Oct 24, 2023 · 7 comments

Comments

@idpdevops

idpdevops commented Oct 24, 2023

Upgrade from V1.5.6 to V1.6.2 fails if using baseline_codebuild_role.without_policy_updates

This is technically not a bug, as the data.all code was used outside its intended purpose.

The fix #774 was implemented in the main branch, so it didn't help with our upgrade from V1.5.6 to V1.6.2. I therefore merged the pipeline.py changes into the V1.6.2 code (code as per "How to reproduce") and ran the pipeline. This initially worked well: it made it past the quality gate stage, the ecr-stage and also the dev-backend-stage. However, the DB migration stage failed (see logs in "Additional context").

The command that failed (exit status 255 hints at a permissions issue) is:

aws codeartifact login --tool pip --domain ***********-domain-master --domain-owner YYYYYYYYYYYY --repository ***********-pypi-store

Interestingly, this command executed perfectly fine in the QualityGate ValidateDBMigrations stage (see logs).

I tried to execute this manually, assuming the relevant role that is used in the pipeline, and that also worked fine!?

Since the pipeline ran, I have also received tons of emails with data.all alarms for various accounts relating to the ecr-stage (I think), although they stopped after about 2 days.


You are receiving this email because your DATAALL platdev environment in the eu-west-2 region has entered the ALARM state, because it failed to synchronize Dataset AAAAAAAAAAAA-risk-and-control tables from AWS Glue to the Search Catalog.

Alarm Details:
- State Change: OK -> ALARM
- Reason for State Change: An error occurred (AccessDenied) when calling the AssumeRole operation: User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/-platdev-ecs-tasks-role/6ceeec3*********8b08bce9 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::ZZZZZZZZZZZZ:role/dataallPivotRole-cdk
- Timestamp: 2023-10-14 00:55:02.304823
Dataset
- Dataset URI: rnsh6s27
- AWS Account: ZZZZZZZZZZZZ
- Region: eu-west-2
- Glue Database: lbx_member_db

The emails stopped after 2 days; data.all may have given up on whatever it was trying to do.

Here are the CloudTrail logs:

{
"eventVersion": "1.08",
"userIdentity": {
"type": "AssumedRole",
"principalId": ":149dfc83a8ff464abfd0ffd63d62deaf",
"arn": "arn:aws:sts::XXXXXXXXXXXX:assumed-role/-platdev-ecs-tasks-role/149dfc83a8ff464abfd0ffd63d62deaf",
"accountId": "XXXXXXXXXXXX",
"accessKeyId": "",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "****************",
"arn": "arn:aws:iam::XXXXXXXXXXXX:role/-platdev-ecs-tasks-role",
"accountId": "XXXXXXXXXXXX",
"userName": "-platdev-ecs-tasks-role"
},
"webIdFederationData": {},
"attributes": {
"creationDate": "2023-10-13T14:09:31Z",
"mfaAuthenticated": "false"
}
}
},
"eventTime": "2023-10-13T14:10:01Z",
"eventSource": "sts.amazonaws.com",
"eventName": "AssumeRole",
"awsRegion": "eu-west-2",
"sourceIPAddress": "10.0.32.173",
"userAgent": "Boto3/1.24.85 Python/3.8.16 Linux/5.10.192-183.736.amzn2.x86_64 exec-env/AWS_ECS_FARGATE Botocore/1.27.85 data.all/0.5.0",
"errorCode": "AccessDenied",
"errorMessage": "User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/***********-platdev-ecs-tasks-role/149dfc83a8ff464abfd0ffd63d62deaf is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::WWWWWWWWWWWW:role/dataallPivotRole-cdk",
"requestParameters": null,
"responseElements": null
}

Some aspects are still a bit of a mystery; maybe you can figure out what exactly has gone wrong.

DB Migration log:


[Container] 2023/10/16 14:52:25 Entering phase BUILD
[Container] 2023/10/16 14:52:25 Running command mkdir ~/.aws/ && touch ~/.aws/config

[Container] 2023/10/16 14:52:25 Running command echo "[profile buildprofile]" > ~/.aws/config

[Container] 2023/10/16 14:52:25 Running command echo "role_arn = arn:aws:iam::XXXXXXXXXXXX:role/***********-platdev-cb-dbmigration-role" >> ~/.aws/config

[Container] 2023/10/16 14:52:25 Running command echo "credential_source = EcsContainer" >> ~/.aws/config

[Container] 2023/10/16 14:52:25 Running command aws sts get-caller-identity --profile buildprofile
{
"UserId": "*************:botocore-session-1697467958",
"Account": "XXXXXXXXXXXX",
"Arn": "arn:aws:sts::XXXXXXXXXXXX:assumed-role/-platdev-cb-dbmigration-role/botocore-session-1697467958"
}

[Container] 2023/10/16 14:52:39 Running command aws codebuild start-build --project-name ***********-platdev-dbmigration --profile buildprofile --region eu-west-2 > codebuild-id.json

[Container] 2023/10/16 14:52:39 Running command aws codebuild batch-get-builds --ids $(jq -r .build.id codebuild-id.json) --profile buildprofile --region eu-west-2 > codebuild-output.json

[Container] 2023/10/16 14:52:40 Running command while [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" != "SUCCEEDED" ] && [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" != "FAILED" ]; do echo "running migration"; aws codebuild batch-get-builds --ids $(jq -r .build.id codebuild-id.json) --profile buildprofile --region eu-west-2 > codebuild-output.json; echo "$(jq -r .builds[0].buildStatus codebuild-output.json)"; sleep 5; done
running migration
IN_PROGRESS
running migration

IN_PROGRESS
running migration
FAILED

[Container] 2023/10/16 15:08:44 Running command if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi
Failed
{
"builds": [
{
"id": "-platdev-dbmigration:556525ab-375d-481c-8eab-20358cfb3ec8",
"arn": "arn:aws:codebuild:eu-west-2:XXXXXXXXXXXX:build/
-platdev-dbmigration:556525ab-375d-481c-8eab-20358cfb3ec8",
"buildNumber": 12,
"startTime": 1697467959.665,
"endTime": 1697468915.942,
"currentPhase": "COMPLETED",
"buildStatus": "FAILED",
"projectName": "***********-platdev-dbmigration",
"phases": [

{
"phaseType": "BUILD",
"phaseStatus": "FAILED",
"startTime": 1697467990.638,
"endTime": 1697468915.594,
"durationInSeconds": 924,
"contexts": [
{
"statusCode": "COMMAND_EXECUTION_ERROR",
"message": "Error while executing command: aws codeartifact login --tool pip --domain -domain-master --domain-owner YYYYYYYYYYYY --repository -pypi-store. Reason: exit status 255"
}
]
},
{
"phaseType": "POST_BUILD",
"phaseStatus": "SUCCEEDED",
"startTime": 1697468915.594,
"endTime": 1697468915.63,
"durationInSeconds": 0,
"contexts": [
{
"statusCode": "",
"message": ""
}
]
},
{
"phaseType": "UPLOAD_ARTIFACTS",
"phaseStatus": "SUCCEEDED",
"startTime": 1697468915.63,
"endTime": 1697468915.708,
"durationInSeconds": 0,
"contexts": [
{
"statusCode": "",
"message": ""
}
]
},
{
"phaseType": "FINALIZING",
"phaseStatus": "SUCCEEDED",
"startTime": 1697468915.708,
"endTime": 1697468915.942,
"durationInSeconds": 0,
"contexts": [
{
"statusCode": "",
"message": "RequestError: send request failed\ncaused by: Post "[https://logs.eu-west-2.amazonaws.com/](https://logs.eu-west-2.amazonaws.com/)": dial tcp 10.82.2.164:443: i/o timeout"
}
]
},
{
"phaseType": "COMPLETED",
"startTime": 1697468915.942
}
],
"source": {
"type": "NO_SOURCE",
"buildspec": "{\n "version": "0.2",\n "phases": {\n "build": {\n "commands": [\n "aws s3api get-object --bucket -master-code-YYYYYYYYYYYY-eu-west-2 --key source_build.zip source_build.zip",\n "unzip source_build.zip",\n "python -m venv env",\n ". env/bin/activate",\n "aws codeartifact login --tool pip --domain -domain-master --domain-owner YYYYYYYYYYYY --repository -pypi-store",\n "pip install -r backend/requirements.txt",\n "pip install alembic",\n "export PYTHONPATH=backend",\n "export envname=platdev",\n "alembic -c backend/alembic.ini upgrade head"\n ]\n }\n }\n}",
"insecureSsl": false
},
"secondarySources": [],
"secondarySourceVersions": [],
"artifacts": {
"location": ""
},
"cache": {
"type": "NO_CACHE"
},
"environment": {
"type": "LINUX_CONTAINER",
"image": "aws/codebuild/amazonlinux2-x86_64-standard:3.0",
"computeType": "BUILD_GENERAL1_SMALL",
"environmentVariables": [],
"privilegedMode": false,
"imagePullCredentialsType": "CODEBUILD"
},
"serviceRole": "arn:aws:iam::XXXXXXXXXXXX:role/
-platdev-cb-dbmigration-role",
"logs": {
"groupName": "/aws/codebuild/
-platdev-dbmigration",
"streamName": "556525ab-375d-481c-8eab-20358cfb3ec8",
"deepLink": "https://console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/$252Faws$252Fcodebuild$252F
-platdev-dbmigration/log-events/556525ab-375d-481c-8eab-20358cfb3ec8",
"cloudWatchLogsArn": "arn:aws:logs:eu-west-2:XXXXXXXXXXXX:log-group:/aws/codebuild/
-platdev-dbmigration:log-stream:556525ab-375d-481c-8eab-20358cfb3ec8"
},
"timeoutInMinutes": 60,
"queuedTimeoutInMinutes": 480,
"buildComplete": true,
"initiator": "
-platdev-cb-dbmigration-role/botocore-session-1697467958",
"vpcConfig": {
"vpcId": "vpc-000382333791308c1",
"subnets": [
"subnet-01baa5a50fc364b02",
"subnet-0a98df0ce73be5447"
],
"securityGroupIds": [
"sg-0e68460a73d0ac50d"
]
},
"networkInterface": {
"subnetId": "subnet-0a98df0ce73be5447",
"networkInterfaceId": "eni-052532d0cd1bf9110"
},
"encryptionKey": "arn:aws:kms:eu-west-2:XXXXXXXXXXXX:alias/aws/s3"
}
],
"buildsNotFound": []
}

[Container] 2023/10/16 15:08:44 Command did not exit successfully if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi exit status 255
[Container] 2023/10/16 15:08:44 Phase complete: BUILD State: FAILED
[Container] 2023/10/16 15:08:44 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi. Reason: exit status 255
[Container] 2023/10/16 15:08:44 Entering phase POST_BUILD
[Container] 2023/10/16 15:08:44 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2023/10/16 15:08:44 Phase context status code: Message:

Quality Gate DB Validation log:



[Container] 2023/10/13 13:32:05 Entering phase BUILD
[Container] 2023/10/13 13:32:05 Running command aws codeartifact login --tool pip --repository -pypi-store --domain -domain-master --domain-owner YYYYYYYYYYYY
Successfully configured pip to use AWS CodeArtifact repository https://-domain-master-YYYYYYYYYYYY.d.codeartifact.eu-west-2.amazonaws.com/pypi/-pypi-store/
Login expires in 12 hours at 2023-10-14 01:32:18+00:00

[Container] 2023/10/13 13:33:41 Phase complete: BUILD State: SUCCEEDED
[Container] 2023/10/13 13:33:41 Phase context status code: Message:
[Container] 2023/10/13 13:33:41 Entering phase POST_BUILD
[Container] 2023/10/13 13:33:41 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2023/10/13 13:33:41 Phase context status code: Message:

@dlpzx
Contributor

dlpzx commented Oct 24, 2023

Hi @idpdevops, thanks for opening an issue. Even if it is "outside the intended use", we will try to help you fix your deployment. From the logs I can distinguish 2 different issues:

  • Issue 1 --> ECS glue sync task failure
  • Issue 2 --> Migration CodeBuild stage failure

Issue 1 --> ECS glue sync task failure

Root cause of the issue: access denied for ecs-task-role to AssumeRole arn:aws:iam::ZZZZZZZZZZZZ:role/dataallPivotRole-cdk

This type of error (Access Denied for AssumeRole) has 2 possible causes.

  1. the ecs-task-role lacks AssumeRole permissions
  2. the arn:aws:iam::ZZZZZZZZZZZZ:role/dataallPivotRole-cdk lacks permissions for the ecs-task-role in its trust policy

In v1.6 we focused on security hardening features, including the hardening of the trust policies on the pivot role, so that is the first thing that I would verify. We moved the external ID used in the trusted accounts to SSM, so its value has been updated.

In case you don't know what the external ID is, here is some documentation on why cross-account role assumption should use external IDs.
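
For illustration only, a hardened pivot role trust policy has roughly this shape (the trusted principal and the external ID are placeholders, not values from your deployment):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<central-account-id>:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<external-id-from-SSM>" }
      }
    }
  ]
}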

  • Has the environment for account ZZZZZZZZZZZZ been updated after the upgrade?
  • If the environment was not updated, then the trust policy contains an external ID that is outdated. The ECS task is trying to assume the role with a new external ID, but the trust policy only allows assumptions verified with an old external ID.
  • To verify this hypothesis, you need to compare the external ID defined in the trust policy in the environment account with the SSM parameter in the central account (look for a parameter called externalId or something similar) and see if they match (a quick check is sketched below)
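
A minimal sketch of that comparison with boto3; the profile names and the SSM parameter name are placeholders since I don't know your exact setup (search SSM for a parameter containing "externalId"):

# Compare the external ID in the pivot role trust policy (environment account)
# with the external ID stored in SSM in the central account.
# Profile names and the SSM parameter name below are placeholders.
import json

import boto3

env_session = boto3.Session(profile_name="environment-account")       # account ZZZZZZZZZZZZ
central_session = boto3.Session(profile_name="central-account")       # central/infrastructure account

trust_policy = env_session.client("iam").get_role(RoleName="dataallPivotRole-cdk")[
    "Role"
]["AssumeRolePolicyDocument"]
# Look for Condition.StringEquals["sts:ExternalId"] in the printed document
print(json.dumps(trust_policy, indent=2))

param = central_session.client("ssm").get_parameter(
    Name="/dataall/<envname>/pivotRole/externalId",  # placeholder, check SSM for the real name
    WithDecryption=True,
)
print(param["Parameter"]["Value"])  # should match the ExternalId in the trust policy above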

Issue 2 --> Migration CodeBuild stage failure

There are better logs available to debug this issue. Migrations are how data.all updates the RDS table schemas when new features are introduced. We want the RDS update to be part of our CICD pipeline (tooling account), but our RDS database is deployed in the central deployment account. To be able to modify the RDS database we deploy a CodeBuild project in the central deployment account, something like prefix-env-dbmigration; this is the "real" migration CodeBuild project, where the alembic commands run. In the tooling account, the migration stage (the "false" migration stage) just triggers that real CodeBuild project in the other account. Your logs show the polling of the status, but, as you probably noticed, they do not provide much info about the actual error.

What we need is to go to the central account > CodeBuild > Projects, search for prefix-env-dbmigration and check the "real migration" logs.
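
If it helps, something like the following pulls the latest build of the real migration project and prints its status, phase contexts and log location (the project name is a placeholder following the prefix-env-dbmigration pattern):

# Fetch the most recent build of the "real" migration CodeBuild project in the
# central deployment account and print its status, phase details and log link.
import boto3

codebuild = boto3.client("codebuild", region_name="eu-west-2")

build_ids = codebuild.list_builds_for_project(
    projectName="prefix-env-dbmigration", sortOrder="DESCENDING"
)["ids"]
latest = codebuild.batch_get_builds(ids=build_ids[:1])["builds"][0]

print(latest["buildStatus"])
for phase in latest["phases"]:
    print(phase["phaseType"], phase.get("phaseStatus"), phase.get("contexts", []))
print(latest["logs"].get("deepLink"))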

I hope this long text was helpful. Please reach out with any new findings, logs or questions that might arise :)

@dlpzx
Contributor

dlpzx commented Oct 25, 2023

Update from offline troubleshooting

Issue 1: Solved 👍

The first hypothesis was that:

  • The customer upgraded the code, but the environment (and therefore the pivot role trust policy external ID) was not updated
  • Sync tasks during that day failed
  • During the night, an ECS task triggered the environment update
  • They stopped receiving error alarms

You can confirm this theory by going into the ZZZZ account and checking the CloudFormation stack of the environment. In the events, check the last updates to see if there is any update of the pivot role nested stack.

However, they updated the environment stacks in another way: they set the parameter "enable_update_dataall_stacks_in_cicd_pipeline": true in the cdk.json file. After this change, they did not receive any more AccessDenied errors.
This is another way of forcing updates to environment and dataset stacks as part of the CICD pipeline. It ensures the integrity of the application.
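
For reference, the flag goes into the deployment environment block of cdk.json, roughly like this (all other fields omitted; the surrounding structure is only indicative):

{
  "context": {
    "DeploymentEnvironments": [
      {
        "envname": "platdev",
        "account": "XXXXXXXXXXXX",
        "region": "eu-west-2",
        "enable_update_dataall_stacks_in_cicd_pipeline": true
      }
    ]
  }
}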

@dlpzx dlpzx changed the title Upgrade from V1.5.6 to V1.6.2 fails if using baseline_codebuild_role.without_policy_updates Upgrade from V1.5.6 to V1.6.2 issues: AccessDenied for pivot role in ECS tasks + CodeArtifact vpc endpoint networking timeout in CodeBuild migration stage Oct 27, 2023
@dlpzx
Contributor

dlpzx commented Oct 27, 2023

Hi @idpdevops, I renamed the issue to reflect the actual issues, in case someone runs into the same challenges.

Update from offline troubleshooting

Issue 2: Understood - requires custom development for the customer's particular networking 👍

When checking the logs in deployment account > CodeBuild > migration project, we could not see any logs. Instead, the CodeBuild phase details show that the issue is in the networking.
(Screenshot: CodeBuild phase details showing the networking error, 2023-10-27)

From v1.5 to v1.6 there are changes in the way packages are installed: v1.6 ensures that all packages are always installed through AWS CodeArtifact. To log in, there is a command that tries to hit a CodeArtifact VPC endpoint.

Given the logs and this particular change between versions, we could conclude that it was a networking issue between the CodeBuild migration project and the CodeArtifact VPC endpoint.

For the default cdk.json configuration, data.all creates those VPC endpoints and configures the VPC and the CodeBuild security group with outbound rules to the security group of the VPC endpoints. In this case, however, the customer had their own internal process to create VPCs and VPC endpoints. The VPC was created in the CodeBuild account (data.allDeploymentAccount). The VPC endpoints are deployed in a VPC in a different account (SharedVPCEAccount). Traffic between both VPCs is handled by a Transit Gateway.

In cdk.json, the VPC created in data.allDeploymentAccount was introduced as vpc_id in the deployment environment. The problem is that, for the parameter vpc_endpoints_sg, the security group in the SharedVPCEAccount is not usable because it belongs to a VPC in another account. Instead, the customer introduced a generic security group.

We manually added the IP range of the VPC in the SharedVPCEAccount to the outbound rules of the CodeBuild security group, and that solved the issue. Nevertheless, this is a workaround and we want to address this scenario in a more consistent way.

Option 1: VPC-peering and NO changes to data.all

  • We establish peering between the VPCs in the SharedVPCEAccount and in the data.allDeploymentAccount --> docs. From what I see in the pricing section, it would not incur additional costs.
  • We can define the correct VPC endpoints security group in the cdk.json file

Option 2: new cdk.json parameter in data.all for VPC endpoints - VPC range outbound rules

  • We introduce a new cdk.json parameter 'external_vpcendpoints_vpc_ip_ranges'
  • We implement changes in the CodeBuild CDK stack to create outbound rules to those IP ranges if the parameter is present (see the sketch below).
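
A minimal sketch of what Option 2 could look like in CDK (Python); the construct names and the way the parameter reaches the stack are illustrative, not the actual data.all code:

# Illustrative only: open HTTPS egress from the migration CodeBuild security group
# towards externally managed VPC endpoints, driven by a cdk.json context parameter.
import aws_cdk as cdk
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class MigrationCodeBuildNetworkingSketch(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, *, vpc_id: str, vpcendpoint_ip_ranges, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc.from_lookup(self, "DeploymentVpc", vpc_id=vpc_id)
        codebuild_sg = ec2.SecurityGroup(
            self, "DbMigrationCodeBuildSG", vpc=vpc, allow_all_outbound=False
        )
        # Only if 'external_vpcendpoints_vpc_ip_ranges' is present in cdk.json:
        # allow HTTPS towards the VPC endpoints living in the SharedVPCEAccount VPC.
        for cidr in vpcendpoint_ip_ranges or []:
            codebuild_sg.add_egress_rule(
                peer=ec2.Peer.ipv4(cidr),
                connection=ec2.Port.tcp(443),
                description="HTTPS to CodeArtifact VPC endpoints in shared VPC",
            )


app = cdk.App()
MigrationCodeBuildNetworkingSketch(
    app,
    "dbmigration-networking-sketch",
    vpc_id="vpc-000382333791308c1",  # deployment account VPC from the build output above
    vpcendpoint_ip_ranges=app.node.try_get_context("external_vpcendpoints_vpc_ip_ranges"),
    env=cdk.Environment(account="111111111111", region="eu-west-2"),  # placeholder account
)
app.synth()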

I personally like working with security groups more than with IP ranges: it is more restrictive and readable. But maybe you have limitations and Option 1 is not possible.

@idpdevops let me know your thoughts.

@dlpzx dlpzx added type: enhancement Feature enhacement status: in-review This issue has been implemented and is currently in review and waiting for next release priority: medium priority: low effort: low and removed status: needs more info labels Oct 27, 2023
@idpdevops
Author

@dlpzx

Thank you very much for the analysis and the proposed options.

I prefer the VPC peering option but will have to check how this could work for us.

One thing also to note is that this command in the build project:

aws codeartifact login --tool pip --domain ********-domain-master --domain-owner ***************** --repository **********-pypi-store --endpoint-url *****************.api.codeartifact.eu-west-2.vpce.amazonaws.com

needed the addition of the --endpoint-url parameter to work.

However, even with the changes to the code-artifact command and the DB migration security group egress rules, the next command

pip install -r backend/requirements.txt

still failed because pip tries to access -domain-master-*********.d.codeartifact.eu-west-2.amazonaws.com and that request was rejected:

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f2c440b9820>, 'Connection to access -domain-master-.d.codeartifact.eu-west-2.amazonaws.com timed out. (connect timeout=15)')': /pypi/idpdataall-pypi-store/simple/ariadne/
...
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f2c440b9fa0>, 'Connection to access -domain-master-.d.codeartifact.eu-west-2.amazonaws.com timed out. (connect timeout=15)')': /pypi/idpdataall-pypi-store/simple/ariadne/
ERROR: Could not find a version that satisfies the requirement ariadne==0.17.0 (from versions: none)
ERROR: No matching distribution found for ariadne==0.17.0

The DNS name gets resolved as follows:

(Screenshot: PIP networking issue - DNS resolution result)

That is obviously outside the IP address range for the VPC endpoints that we opened up in the DB migration security group egress rules, so the request is rejected:

2 392868065641 eni-*************************** 10.0.32.137 35...143 57952 443 6 1 60 1698403869 1698403895 REJECT OK

Interestingly, a reverse lookup on 35...143 points to ec2-35---143.eu-west-2.compute.amazonaws.com, so this seems to be the EC2 resource that runs the codeartifact stuff.

This resource seems to be separate from the 2 CodeArtifact VPC endpoints.

I then added an outgoing rule to the DB migration SG to let all HTTPS/443 traffic out, and that fixed the problem. (I am aware this is unlikely to be the appropriate solution, but at least it verified that there is a networking issue.)

(Screenshot: build success after the security group change)
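
For reference, the equivalent temporary egress rule via boto3 would be roughly the following (diagnostic only, not a recommended final configuration; the security group ID is the migration CodeBuild SG from the build output above):

# Temporary diagnostic rule: allow all outbound HTTPS from the DB migration SG.
# Do not keep 0.0.0.0/0 open in production; scope it down once the route is understood.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")
ec2.authorize_security_group_egress(
    GroupId="sg-0e68460a73d0ac50d",  # DB migration CodeBuild security group (from the logs above)
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [
                {"CidrIp": "0.0.0.0/0", "Description": "temporary: verify CodeArtifact networking"}
            ],
        }
    ],
)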

So overall, there seem to be 3 issues that need to be addressed:

  1. Access to VPC endpoints that are not in the main data.all VPC (VPC peering or changes to SG rules)
  2. Configuration of the endpoint-url parameter for the codeartifact command and its use in the build script
  3. Routing of the requests to the CodeArtifact EC2s

@dlpzx
Contributor

dlpzx commented Mar 12, 2024

Hi @idpdevops are you still facing issues?

@dlpzx dlpzx added status: closing-soon and removed status: in-review This issue has been implemented and is currently in review and waiting for next release priority: low effort: low type: enhancement Feature enhacement labels Mar 12, 2024
@idpdevops
Author

idpdevops commented Mar 14, 2024 via email

@dlpzx
Contributor

dlpzx commented Mar 14, 2024

Thanks for responding to the issue. Do not hesitate to reach out if you need any support.

@dlpzx dlpzx closed this as completed Mar 14, 2024