Upgrade from V1.5.6 to V1.6.2 issues: AccessDenied for pivot role in ECS tasks + CodeArtifact vpc endpoint networking timeout in CodeBuild migration stage #826
Comments
Hi @idpdevops, thanks for opening an issue. Even if it is "outside the intended use" we will try to help you on fixing your deployment. From the logs I can distinguish 2 different issues:
Issue 1 --> ECS glue sync task failure
Root cause of the issue: access denied for the AssumeRole call. This type of error (Access Denied for AssumeRole) has 2 possible causes.
In v1.6 we focused on security hardening features, including the hardening of the trust policies on the pivot role, so that is the first thing that I would verify. We moved the external ID used in the trusted accounts to SSM, so its value has been updated. In case you don't know what the external ID is, here is some documentation of why cross-account role assumption should use external IDs
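To make the external ID check concrete, here is a minimal sketch of a trust policy that requires an `sts:ExternalId`, plus a helper that tests whether a presented external ID would be accepted. The function names, account ID and external ID values are placeholders for illustration; the trust policy that data.all actually generates for the pivot role may contain further restrictions, and the helper is not a full IAM policy evaluation.

```python
def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Return a trust policy that only allows AssumeRole with a matching external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def external_id_matches(trust_policy: dict, presented_external_id: str) -> bool:
    """Check whether any Allow statement accepts the presented external ID.

    Only the StringEquals / sts:ExternalId condition is inspected. A statement
    without that condition is treated as accepting any external ID.
    """
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        expected = (
            statement.get("Condition", {})
            .get("StringEquals", {})
            .get("sts:ExternalId")
        )
        if expected is None or expected == presented_external_id:
            return True
    return False


# Placeholder values: a role whose trust policy was updated to a new external ID
# rejects a caller that still presents the old one.
policy = build_trust_policy("111111111111", "new-external-id")
stale_caller_allowed = external_id_matches(policy, "old-external-id")
current_caller_allowed = external_id_matches(policy, "new-external-id")
```

If the pivot role in the environment account still carries the old external ID while the central account already sends the new one (or the other way round), the AssumeRole call fails with AccessDenied, which is why updating the environment stacks matters after the upgrade.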
Issue 2 --> Migration CodeBuild stage failure
There are better logs to debug this issue. Migrations are the way data.all updates the RDS table schemas when new features are introduced. We want to include the update of RDS as part of our CICD pipeline (tooling account), but our RDS database is deployed in the central deployment account. To be able to modify the RDS database we deploy a CodeBuild project in the central deployment account, something like
What we need is to go to the central account > CodeBuild > Projects > search for
I hope this long text was helpful, please reach out with any new findings, logs or questions that might arise :)
Update from offline troubleshooting
Issue 1: Solved 👍
The first hypothesis was that:
However, they updated the environment stacks in another way: they set the parameter "enable_update_dataall_stacks_in_cicd_pipeline": true in the cdk.json file. After this change, they no longer received access denied errors.
Hi @idpdevops, I renamed the issue to reflect the actual issues in case someone runs into the same challenges.
Update from offline troubleshooting
Issue 2: Understood - requires custom development for the customer's particular networking 👍
When checking the logs in deployment account > CodeBuild > migration project, we could not see any logs. Instead, the CodeBuild phase details show that the issue is in the networking. From v1.5 to v1.6 there are changes in the way packages are installed: v1.6 ensures that all packages are always installed through AWS CodeArtifact. To log in, there is a command that tries to hit a CodeArtifact VPC endpoint. Given the logs and this particular change between versions, we could conclude that it was a networking issue between the CodeBuild migration project and the CodeArtifact VPC endpoint.
For the default cdk.json configuration, data.all creates those VPC endpoints and configures the VPC and the CodeBuild security group with outbound rules to the security group of the VPC endpoints. In this case, however, the customer had its own internal process to create VPCs and VPC endpoints. The VPC was created in the CodeBuild (data.allDeploymentAccount) account. The VPC endpoints are deployed in a VPC in a different account (SharedVPCEAccount). Traffic between both VPCs is handled by Transit Gateway. In the cdk.json the created VPC in data.allDeploymentAccount was introduced as
We manually added the IP range for the VPC in the SharedVPCEAccount in the outbound rules for the CodeBuild security group and that solved the issue. Nevertheless, this is a workaround and we want to address this scenario in a more consistent way.
Option 1: VPC-peering and NO changes to data.all
Option 2: new cdk.json parameter in data.all for VPC endpoints - VPC range outbound rules
I personally like working with security groups more than with IP ranges. It is more restrictive and readable, but maybe you have limitations and option 1 is not possible. @idpdevops let me know your thoughts.
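For reference, the manual workaround (an outbound HTTPS rule from the CodeBuild security group to the IP range of the VPC that hosts the endpoints) could be sketched with boto3 as follows. This is only an illustration under assumptions: the CIDR, the description and the function names are hypothetical, and `apply_egress_rule` expects an already configured boto3 EC2 client and a real security group ID.

```python
import ipaddress


def build_https_egress_permission(cidr: str, description: str) -> dict:
    """Build an IpPermissions entry allowing outbound TCP 443 to one IPv4 CIDR."""
    # strict=True rejects CIDRs with host bits set, e.g. 10.1.0.5/16.
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4:
        raise ValueError("this sketch only handles IPv4 ranges")
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": str(network), "Description": description}],
    }


def apply_egress_rule(ec2_client, security_group_id: str, permission: dict) -> None:
    """Add the rule to a security group using a boto3 EC2 client."""
    ec2_client.authorize_security_group_egress(
        GroupId=security_group_id,
        IpPermissions=[permission],
    )


# Hypothetical CIDR standing in for the VPC range in the SharedVPCEAccount.
permission = build_https_egress_permission(
    "10.1.0.0/16", "HTTPS to shared VPC endpoints"
)
```

A rule like this opens the whole range rather than only the endpoints, which is the trade-off compared with referencing the endpoint security group directly.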
Thank you very much for the analysis and the proposed options. I prefer the VPC peering option but will have to check how this could work for us.
One thing also to note is that this command in the build project:
aws codeartifact login --tool pip --domain ********-domain-master --domain-owner ***************** --repository **********-pypi-store --endpoint-url *****************.api.codeartifact.eu-west-2.vpce.amazonaws.com
needed the addition of the --endpoint-url parameter to work.
However, even with the changes to the codeartifact command and the DB migration security group egress rules, the next command
pip install -r backend/requirements.txt
still failed because pip tries to access -domain-master-*********.d.codeartifact.eu-west-2.amazonaws.com and that request was rejected:
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f2c440b9820>, 'Connection to access -domain-master-.d.codeartifact.eu-west-2.amazonaws.com timed out. (connect timeout=15)')': /pypi/idpdataall-pypi-store/simple/ariadne/
The DNS name gets resolved as follows:
That is obviously outside the IP address range for the VPC endpoints we opened up in the DB migration security group egress rules, so the request is rejected:
2 392868065641 eni-*************************** 10.0.32.137 35...143 57952 443 6 1 60 1698403869 1698403895 REJECT OK
Interestingly, a reverse lookup on 35...143 points to ec2-35---143.eu-west-2.compute.amazonaws.com, so this seems to be the EC2 resource that runs the CodeArtifact service. This resource seems to be separate from the 2 CodeArtifact VPC endpoints.
I then added an outgoing rule to the DB migration SG to let all HTTPS/443 traffic out and that fixed the problem. (I am aware that this is unlikely to be the appropriate solution, but at least it verified that there was a problem and showed that it is a networking issue.) So overall, there seem to be 3 issues that need to be addressed:
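The check described above (is the rejected destination inside the ranges opened in the egress rules?) can be sketched in a few lines. The parser assumes the default 14-field VPC flow log format seen in the record above; the sample addresses and the allowed CIDR are hypothetical stand-ins, since the real values are redacted.

```python
import ipaddress
from typing import NamedTuple


class FlowRecord(NamedTuple):
    srcaddr: str
    dstaddr: str
    dstport: int
    action: str


def parse_flow_record(line: str) -> FlowRecord:
    """Parse a flow log line in the default format (14 space-separated fields)."""
    fields = line.split()
    if len(fields) != 14:
        raise ValueError(f"expected 14 fields, got {len(fields)}")
    # Default order: version account-id interface-id srcaddr dstaddr srcport
    # dstport protocol packets bytes start end action log-status
    return FlowRecord(
        srcaddr=fields[3],
        dstaddr=fields[4],
        dstport=int(fields[6]),
        action=fields[12],
    )


def is_covered(dstaddr: str, allowed_cidrs: list) -> bool:
    """Return True if the destination falls inside any allowed egress CIDR."""
    address = ipaddress.ip_address(dstaddr)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical record: 203.0.113.143 is a documentation address, not the real one.
sample = (
    "2 111111111111 eni-0123456789abcdef0 10.0.32.137 203.0.113.143 "
    "57952 443 6 1 60 1698403869 1698403895 REJECT OK"
)
record = parse_flow_record(sample)
# Hypothetical range standing in for the VPC that hosts the endpoints.
covered = is_covered(record.dstaddr, ["10.82.0.0/16"])
```

A REJECT record whose destination is not covered by any allowed range, as in this sample, matches the situation described: the name resolved to a public address rather than to the private endpoint addresses.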
Hi @idpdevops, are you still facing issues?
Hi,
our requirement that drove the use of Data.all has gone away, so the issue has gone away, too.
Kind regards,
Steffen
Thanks for responding to the issue. Do not hesitate to reach out if you need any support.
Upgrade from V1.5.6 to V1.6.2 fails if using baseline_codebuild_role.without_policy_updates
This is technically not a bug, as the data.all code was used outside its intended purpose.
The fix #774 was implemented in the main branch, so it didn't help with our upgrade from V1.5.6 to V1.6.2. So I merged the pipeline.py changes into the V1.6.2 code (code as per "How to reproduce") and ran the pipeline. This worked well initially and it made it past the quality gate stage, the ecr-stage and also the dev-backend-stage. However, the DB migration stage failed (see logs in "Additional context").
The command that failed (error 255 hints at a permissions issue) is
aws codeartifact login --tool pip --domain ***********-domain-master --domain-owner YYYYYYYYYYYY --repository ***********-pypi-store
Interestingly, this command executed perfectly fine in the QualityGate ValidateDBMigrations stage (see logs).
I tried to execute this manually, assuming the relevant role that is used in the pipeline and that also worked fine!?
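One detail that may help when comparing the two stages: the login command talks to the CodeArtifact API host, while the pip index it configures points at a separate per-domain repository host (the `.d.codeartifact` URL printed in the Quality Gate log). A small sketch of how the two hostnames are composed, with placeholder domain, owner, repository and token values:

```python
def api_host(region: str) -> str:
    """Host used for CodeArtifact API calls such as the one behind `login`."""
    return f"codeartifact.{region}.amazonaws.com"


def repository_host(domain: str, domain_owner: str, region: str) -> str:
    """Per-domain host that pip contacts to download packages."""
    return f"{domain}-{domain_owner}.d.codeartifact.{region}.amazonaws.com"


def pip_index_url(
    domain: str, domain_owner: str, region: str, repository: str, token: str
) -> str:
    """Index URL in the form pip uses for a CodeArtifact PyPI repository."""
    host = repository_host(domain, domain_owner, region)
    return f"https://aws:{token}@{host}/pypi/{repository}/simple/"


# Placeholder values only; the real names are redacted in this issue.
url = pip_index_url(
    domain="example-domain-master",
    domain_owner="111111111111",
    region="eu-west-2",
    repository="example-pypi-store",
    token="TOKEN",
)
```

Because the two hosts differ, a network path that allows the login call does not by itself guarantee that `pip install` can reach the repository host from the same build environment.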
Since the pipeline ran, I have also received tons of emails with data.all alarms for various accounts relating to the ecr-stage (I think), although they stopped after about 2 days.
You are receiving this email because your DATAALL platdev environment in the eu-west-2 region has entered the ALARM state, because it failed to synchronize Dataset AAAAAAAAAAAA-risk-and-control tables from AWS Glue to the Search Catalog.
Alarm Details:
- State Change: OK -> ALARM
- Reason for State Change: An error occurred (AccessDenied) when calling the AssumeRole operation: User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/-platdev-ecs-tasks-role/6ceeec3*********8b08bce9 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::ZZZZZZZZZZZZ:role/dataallPivotRole-cdk
- Timestamp: 2023-10-14 00:55:02.304823
Dataset
- Dataset URI: rnsh6s27
- AWS Account: ZZZZZZZZZZZZ
- Region: eu-west-2
- Glue Database: lbx_member_db
The emails stopped after 2 days; data.all may have given up on whatever it was trying to do.
Here are the Cloudtrail logs:
{
"eventVersion": "1.08",
"userIdentity": {
"type": "AssumedRole",
"principalId": ":149dfc83a8ff464abfd0ffd63d62deaf",
"arn": "arn:aws:sts::XXXXXXXXXXXX:assumed-role/-platdev-ecs-tasks-role/149dfc83a8ff464abfd0ffd63d62deaf",
"accountId": "XXXXXXXXXXXX",
"accessKeyId": "",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "****************",
"arn": "arn:aws:iam::XXXXXXXXXXXX:role/-platdev-ecs-tasks-role",
"accountId": "XXXXXXXXXXXX",
"userName": "-platdev-ecs-tasks-role"
},
"webIdFederationData": {},
"attributes": {
"creationDate": "2023-10-13T14:09:31Z",
"mfaAuthenticated": "false"
}
}
},
"eventTime": "2023-10-13T14:10:01Z",
"eventSource": "sts.amazonaws.com",
"eventName": "AssumeRole",
"awsRegion": "eu-west-2",
"sourceIPAddress": "10.0.32.173",
"userAgent": "Boto3/1.24.85 Python/3.8.16 Linux/5.10.192-183.736.amzn2.x86_64 exec-env/AWS_ECS_FARGATE Botocore/1.27.85 data.all/0.5.0",
"errorCode": "AccessDenied",
"errorMessage": "User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/***********-platdev-ecs-tasks-role/149dfc83a8ff464abfd0ffd63d62deaf is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::WWWWWWWWWWWW:role/dataallPivotRole-cdk",
"requestParameters": null,
"responseElements": null,
…
}
}
So some aspects are a bit of a mystery, maybe you can figure out what exactly has gone wrong.
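When comparing several alarms and CloudTrail events like the ones above, it can help to pull the caller, action and target role out of the `errorMessage` field. The following is a rough sketch; the regular expression is based only on the message format shown in this issue, and the account IDs and role names in the example are placeholders.

```python
import re

# Pattern derived from the AccessDenied message format shown above.
_ACCESS_DENIED = re.compile(
    r"User: (?P<caller>\S+) is not authorized to perform: (?P<action>\S+) "
    r"on resource: (?P<resource>\S+)"
)


def parse_access_denied(message: str) -> dict:
    """Extract caller ARN, action and resource ARN from an AccessDenied message."""
    match = _ACCESS_DENIED.search(message)
    if match is None:
        raise ValueError("message does not look like an AccessDenied error")
    return match.groupdict()


def account_of(arn: str) -> str:
    """Return the account ID field (fifth colon-separated part) of an ARN."""
    return arn.split(":")[4]


# Placeholder message in the same shape as the CloudTrail errorMessage.
message = (
    "User: arn:aws:sts::111111111111:assumed-role/example-ecs-tasks-role/abc123 "
    "is not authorized to perform: sts:AssumeRole on resource: "
    "arn:aws:iam::222222222222:role/dataallPivotRole-cdk"
)
parsed = parse_access_denied(message)
caller_account = account_of(parsed["caller"])
target_account = account_of(parsed["resource"])
```

Grouping the failures by target account shows which environment accounts still have a pivot role that rejects the central account.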
DB Migration log:
…
[Container] 2023/10/16 14:52:25 Entering phase BUILD
[Container] 2023/10/16 14:52:25 Running command mkdir ~/.aws/ && touch ~/.aws/config
[Container] 2023/10/16 14:52:25 Running command echo "[profile buildprofile]" > ~/.aws/config
[Container] 2023/10/16 14:52:25 Running command echo "role_arn = arn:aws:iam::XXXXXXXXXXXX:role/***********-platdev-cb-dbmigration-role" >> ~/.aws/config
[Container] 2023/10/16 14:52:25 Running command echo "credential_source = EcsContainer" >> ~/.aws/config
[Container] 2023/10/16 14:52:25 Running command aws sts get-caller-identity --profile buildprofile
{
"UserId": "*************:botocore-session-1697467958",
"Account": "XXXXXXXXXXXX",
"Arn": "arn:aws:sts::XXXXXXXXXXXX:assumed-role/-platdev-cb-dbmigration-role/botocore-session-1697467958"
}
[Container] 2023/10/16 14:52:39 Running command aws codebuild start-build --project-name ***********-platdev-dbmigration --profile buildprofile --region eu-west-2 > codebuild-id.json
[Container] 2023/10/16 14:52:39 Running command aws codebuild batch-get-builds --ids $(jq -r .build.id codebuild-id.json) --profile buildprofile --region eu-west-2 > codebuild-output.json
[Container] 2023/10/16 14:52:40 Running command while [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" != "SUCCEEDED" ] && [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" != "FAILED" ]; do echo "running migration"; aws codebuild batch-get-builds --ids $(jq -r .build.id codebuild-id.json) --profile buildprofile --region eu-west-2 > codebuild-output.json; echo "$(jq -r .builds[0].buildStatus codebuild-output.json)"; sleep 5; done
running migration
IN_PROGRESS
running migration
…
IN_PROGRESS
running migration
FAILED
[Container] 2023/10/16 15:08:44 Running command if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi
Failed
{
"builds": [
{
"id": "-platdev-dbmigration:556525ab-375d-481c-8eab-20358cfb3ec8",
"arn": "arn:aws:codebuild:eu-west-2:XXXXXXXXXXXX:build/-platdev-dbmigration:556525ab-375d-481c-8eab-20358cfb3ec8",
"buildNumber": 12,
"startTime": 1697467959.665,
"endTime": 1697468915.942,
"currentPhase": "COMPLETED",
"buildStatus": "FAILED",
"projectName": "***********-platdev-dbmigration",
"phases": [
…
{
"phaseType": "BUILD",
"phaseStatus": "FAILED",
"startTime": 1697467990.638,
"endTime": 1697468915.594,
"durationInSeconds": 924,
"contexts": [
{
"statusCode": "COMMAND_EXECUTION_ERROR",
"message": "Error while executing command: aws codeartifact login --tool pip --domain -domain-master --domain-owner YYYYYYYYYYYY --repository -pypi-store. Reason: exit status 255"
}
]
},
{
"phaseType": "POST_BUILD",
"phaseStatus": "SUCCEEDED",
"startTime": 1697468915.594,
"endTime": 1697468915.63,
"durationInSeconds": 0,
"contexts": [
{
"statusCode": "",
"message": ""
}
]
},
{
"phaseType": "UPLOAD_ARTIFACTS",
"phaseStatus": "SUCCEEDED",
"startTime": 1697468915.63,
"endTime": 1697468915.708,
"durationInSeconds": 0,
"contexts": [
{
"statusCode": "",
"message": ""
}
]
},
{
"phaseType": "FINALIZING",
"phaseStatus": "SUCCEEDED",
"startTime": 1697468915.708,
"endTime": 1697468915.942,
"durationInSeconds": 0,
"contexts": [
{
"statusCode": "",
"message": "RequestError: send request failed\ncaused by: Post \"https://logs.eu-west-2.amazonaws.com/\": dial tcp 10.82.2.164:443: i/o timeout"
}
]
},
{
"phaseType": "COMPLETED",
"startTime": 1697468915.942
}
],
"source": {
"type": "NO_SOURCE",
"buildspec": "{\n "version": "0.2",\n "phases": {\n "build": {\n "commands": [\n "aws s3api get-object --bucket -master-code-YYYYYYYYYYYY-eu-west-2 --key source_build.zip source_build.zip",\n "unzip source_build.zip",\n "python -m venv env",\n ". env/bin/activate",\n "aws codeartifact login --tool pip --domain -domain-master --domain-owner YYYYYYYYYYYY --repository -pypi-store",\n "pip install -r backend/requirements.txt",\n "pip install alembic",\n "export PYTHONPATH=backend",\n "export envname=platdev",\n "alembic -c backend/alembic.ini upgrade head"\n ]\n }\n }\n}",
"insecureSsl": false
},
"secondarySources": [],
"secondarySourceVersions": [],
"artifacts": {
"location": ""
},
"cache": {
"type": "NO_CACHE"
},
"environment": {
"type": "LINUX_CONTAINER",
"image": "aws/codebuild/amazonlinux2-x86_64-standard:3.0",
"computeType": "BUILD_GENERAL1_SMALL",
"environmentVariables": [],
"privilegedMode": false,
"imagePullCredentialsType": "CODEBUILD"
},
"serviceRole": "arn:aws:iam::XXXXXXXXXXXX:role/-platdev-cb-dbmigration-role",
"logs": {
"groupName": "/aws/codebuild/-platdev-dbmigration",
"streamName": "556525ab-375d-481c-8eab-20358cfb3ec8",
"deepLink": "https://console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/$252Faws$252Fcodebuild$252F-platdev-dbmigration/log-events/556525ab-375d-481c-8eab-20358cfb3ec8",
"cloudWatchLogsArn": "arn:aws:logs:eu-west-2:XXXXXXXXXXXX:log-group:/aws/codebuild/-platdev-dbmigration:log-stream:556525ab-375d-481c-8eab-20358cfb3ec8"
},
"timeoutInMinutes": 60,
"queuedTimeoutInMinutes": 480,
"buildComplete": true,
"initiator": "-platdev-cb-dbmigration-role/botocore-session-1697467958",
"vpcConfig": {
"vpcId": "vpc-000382333791308c1",
"subnets": [
"subnet-01baa5a50fc364b02",
"subnet-0a98df0ce73be5447"
],
"securityGroupIds": [
"sg-0e68460a73d0ac50d"
]
},
"networkInterface": {
"subnetId": "subnet-0a98df0ce73be5447",
"networkInterfaceId": "eni-052532d0cd1bf9110"
},
"encryptionKey": "arn:aws:kms:eu-west-2:XXXXXXXXXXXX:alias/aws/s3"
}
],
"buildsNotFound": []
}
[Container] 2023/10/16 15:08:44 Command did not exit successfully if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi exit status 255
[Container] 2023/10/16 15:08:44 Phase complete: BUILD State: FAILED
[Container] 2023/10/16 15:08:44 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi. Reason: exit status 255
[Container] 2023/10/16 15:08:44 Entering phase POST_BUILD
[Container] 2023/10/16 15:08:44 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2023/10/16 15:08:44 Phase context status code: Message:
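The JSON printed by the pipeline is long, and the useful hints sit in the phase contexts. A minimal sketch for summarizing them from the `batch-get-builds` output follows; it assumes the structure shown in the log above, and the sample data is a shortened, hypothetical stand-in.

```python
def phase_findings(batch_get_builds_output: dict) -> list:
    """Return (phaseType, phaseStatus, message) for phases worth inspecting.

    A phase is reported if it failed or if any of its contexts carries a
    non-empty message, even when the phase itself succeeded.
    """
    findings = []
    for build in batch_get_builds_output.get("builds", []):
        for phase in build.get("phases", []):
            status = phase.get("phaseStatus", "")
            # Use a single empty context so failed phases without contexts still show up.
            for context in phase.get("contexts") or [{}]:
                message = context.get("message", "")
                if status == "FAILED" or message:
                    findings.append((phase.get("phaseType", ""), status, message))
    return findings


# Shortened, hypothetical stand-in for codebuild-output.json.
sample_output = {
    "builds": [
        {
            "phases": [
                {
                    "phaseType": "BUILD",
                    "phaseStatus": "FAILED",
                    "contexts": [{"statusCode": "COMMAND_EXECUTION_ERROR",
                                  "message": "exit status 255"}],
                },
                {
                    "phaseType": "POST_BUILD",
                    "phaseStatus": "SUCCEEDED",
                    "contexts": [{"statusCode": "", "message": ""}],
                },
                {
                    "phaseType": "FINALIZING",
                    "phaseStatus": "SUCCEEDED",
                    "contexts": [{"statusCode": "", "message": "i/o timeout"}],
                },
                {"phaseType": "COMPLETED"},
            ]
        }
    ]
}
findings = phase_findings(sample_output)
```

Applied to the real output, a summary like this would list the failed BUILD phase next to the FINALIZING phase, whose context reports a timeout when posting to the CloudWatch Logs endpoint even though the phase is marked as succeeded.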
Quality Gate DB Validation log:
…
[Container] 2023/10/13 13:32:05 Entering phase BUILD
[Container] 2023/10/13 13:32:05 Running command aws codeartifact login --tool pip --repository -pypi-store --domain -domain-master --domain-owner YYYYYYYYYYYY
Successfully configured pip to use AWS CodeArtifact repository https://-domain-master-YYYYYYYYYYYY.d.codeartifact.eu-west-2.amazonaws.com/pypi/-pypi-store/
Login expires in 12 hours at 2023-10-14 01:32:18+00:00
…
[Container] 2023/10/13 13:33:41 Phase complete: BUILD State: SUCCEEDED
[Container] 2023/10/13 13:33:41 Phase context status code: Message:
[Container] 2023/10/13 13:33:41 Entering phase POST_BUILD
[Container] 2023/10/13 13:33:41 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2023/10/13 13:33:41 Phase context status code: Message: