Fix: Service Lambda timeouts cause user-facing 5xx responses (#6284) #6624

Open
wants to merge 14 commits into
base: develop

dsotirho-ucsc
Contributor

@dsotirho-ucsc dsotirho-ucsc commented Oct 9, 2024

Connected issues: #6284

Checklist

Author

  • PR is a draft
  • Target branch is develop
  • Name of PR branch matches issues/<GitHub handle of author>/<issue#>-<slug>
  • On ZenHub, PR is connected to all issues it (partially) resolves
  • PR description links to connected issues
  • PR title matches1 that of a connected issue or comment in PR explains why they're different
  • PR title references all connected issues
  • For each connected issue, there is at least one commit whose title references that issue

1 when the issue title describes a problem, the corresponding PR
title is Fix: followed by the issue title

Author (partiality)

  • Added p tag to titles of partial commits
  • This PR is labeled partial or completely resolves all connected issues
  • This PR partially resolves each of the connected issues or does not have the partial label

Author (chains)

  • This PR is blocked by previous PR in the chain or is not chained to another PR
  • The blocking PR is labeled base or this PR is not chained to another PR
  • This PR is labeled chained or is not chained to another PR

Author (reindex, API changes)

  • Added r tag to commit title or the changes introduced by this PR will not require reindexing of any deployment
  • This PR is labeled reindex:dev or the changes introduced by it will not require reindexing of dev
  • This PR is labeled reindex:anvildev or the changes introduced by it will not require reindexing of anvildev
  • This PR is labeled reindex:anvilprod or the changes introduced by it will not require reindexing of anvilprod
  • This PR is labeled reindex:prod or the changes introduced by it will not require reindexing of prod
  • This PR is labeled reindex:partial and its description documents the specific reindexing procedure for dev, anvildev, anvilprod and prod or requires a full reindex or carries none of the labels reindex:dev, reindex:anvildev, reindex:anvilprod and reindex:prod
  • This PR and its connected issues are labeled API or this PR does not modify a REST API
  • Added a (A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST API
  • Updated REST API version number in app.py or this PR does not modify a REST API

Author (upgrading deployments)

  • Ran make docker_images.json and committed the resulting changes or this PR does not modify azul_docker_images, or any other variables referenced in the definition of that variable
  • Documented upgrading of deployments in UPGRADING.rst or this PR does not require upgrading deployments
  • Added u tag to commit title or this PR does not require upgrading deployments
  • This PR is labeled upgrade or does not require upgrading deployments
  • This PR is labeled deploy:shared or does not modify docker_images.json, and does not require deploying the shared component for any other reason
  • This PR is labeled deploy:gitlab or does not require deploying the gitlab component
  • This PR is labeled deploy:runner or does not require deploying the runner image

Author (hotfixes)

  • Added F tag to main commit title or this PR does not include permanent fix for a temporary hotfix
  • Reverted the temporary hotfixes for any connected issues or none of the stable branches (anvilprod and prod) have temporary hotfixes for any of the issues connected to this PR

Author (before every review)

  • Rebased PR branch on develop, squashed old fixups
  • Ran make requirements_update or this PR does not modify requirements*.txt, common.mk, Makefile and Dockerfile
  • Added R tag to commit title or this PR does not modify requirements*.txt
  • This PR is labeled reqs or does not modify requirements*.txt
  • make integration_test passes in personal deployment or this PR does not modify functionality that could affect the IT outcome

Peer reviewer (after approval)

  • PR is not a draft
  • Ticket is in Review requested column
  • PR is awaiting requested review from system administrator
  • PR is assigned to only the system administrator

System administrator (after approval)

  • Actually approved the PR
  • Labeled connected issues as demo or no demo
  • Commented on connected issues about demo expectations or all connected issues are labeled no demo
  • Decided if PR can be labeled no sandbox
  • A comment to this PR details the completed security design review
  • PR title is appropriate as title of merge commit
  • N reviews label is accurate
  • Moved connected issues to Approved column
  • PR is assigned to only the operator

Operator (before pushing the merge commit)

  • Checked reindex:… labels and r commit title tag
  • Checked that demo expectations are clear or all connected issues are labeled no demo
  • Squashed PR branch and rebased onto develop
  • Sanity-checked history
  • Pushed PR branch to GitHub
  • Ran _select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Ran _select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Checked the items in the next section or this PR is labeled deploy:gitlab
  • PR is assigned to only the system administrator or this PR is not labeled deploy:gitlab

System administrator

  • Background migrations for dev.gitlab are complete or this PR is not labeled deploy:gitlab
  • Background migrations for anvildev.gitlab are complete or this PR is not labeled deploy:gitlab
  • PR is assigned to only the operator

Operator (before pushing the merge commit)

  • Ran _select dev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Ran _select anvildev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Added sandbox label or PR is labeled no sandbox
  • Pushed PR branch to GitLab dev or PR is labeled no sandbox
  • Pushed PR branch to GitLab anvildev or PR is labeled no sandbox
  • Build passes in sandbox deployment or PR is labeled no sandbox
  • Build passes in anvilbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in sandbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in anvilbox deployment or PR is labeled no sandbox
  • Deleted unreferenced indices in sandbox or this PR does not remove catalogs or otherwise causes unreferenced indices in dev
  • Deleted unreferenced indices in anvilbox or this PR does not remove catalogs or otherwise causes unreferenced indices in anvildev
  • Started reindex in sandbox or this PR is not labeled reindex:dev
  • Started reindex in anvilbox or this PR is not labeled reindex:anvildev
  • Checked for failures in sandbox or this PR is not labeled reindex:dev
  • Checked for failures in anvilbox or this PR is not labeled reindex:anvildev
  • The title of the merge commit starts with the title of this PR
  • Added PR # reference to merge commit title
  • Collected commit title tags in merge commit title but only included p if the PR is also labeled partial
  • Moved connected issues to Merged lower column in ZenHub
  • Moved blocked issues to Triage or no issues are blocked on the connected issues
  • Pushed merge commit to GitHub

Operator (chain shortening)

  • Changed the target branch of the blocked PR to develop or this PR is not labeled base
  • Removed the chained label from the blocked PR or this PR is not labeled base
  • Removed the blocking relationship from the blocked PR or this PR is not labeled base
  • Removed the base label from this PR or this PR is not labeled base

Operator (after pushing the merge commit)

  • Pushed merge commit to GitLab dev
  • Pushed merge commit to GitLab anvildev
  • Build passes on GitLab dev
  • Reviewed build logs for anomalies on GitLab dev
  • Build passes on GitLab anvildev
  • Reviewed build logs for anomalies on GitLab anvildev
  • Ran _select dev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Ran _select anvildev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Deleted PR branch from GitHub
  • Deleted PR branch from GitLab dev
  • Deleted PR branch from GitLab anvildev

Operator (reindex)

  • Deindexed all unreferenced catalogs in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed all unreferenced catalogs in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Deindexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Indexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Indexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Started reindex in dev or this PR does not require reindexing dev
  • Started reindex in anvildev or this PR does not require reindexing anvildev
  • Checked for, triaged and possibly requeued messages in both fail queues in dev or this PR does not require reindexing dev
  • Checked for, triaged and possibly requeued messages in both fail queues in anvildev or this PR does not require reindexing anvildev
  • Emptied fail queues in dev or this PR does not require reindexing dev
  • Emptied fail queues in anvildev or this PR does not require reindexing anvildev

Operator

  • Propagated the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels to the next promotion PRs or this PR carries none of these labels
  • Propagated any specific instructions related to the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labels
  • PR is assigned to no one

Shorthand for review comments

  • L line is too long
  • W line wrapping is wrong
  • Q bad quotes
  • F other formatting problem

@github-actions github-actions bot added the orange [process] Done by the Azul team label Oct 9, 2024
@coveralls

coveralls commented Oct 9, 2024

Coverage Status

coverage: 85.591% (-0.02%) from 85.613%
when pulling 8c128b0 on issues/dsotirho-ucsc/6284-lambda-timeouts
into efe016f on develop.


codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 85.91549% with 10 lines in your changes missing coverage. Please review.

Project coverage is 85.57%. Comparing base (efe016f) to head (8c128b0).

Files with missing lines    Patch %    Lines
src/azul/chalice.py         77.41%     7 Missing ⚠️
src/azul/terraform.py       0.00%      3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6624      +/-   ##
===========================================
- Coverage    85.59%   85.57%   -0.03%     
===========================================
  Files          155      154       -1     
  Lines        20903    20899       -4     
===========================================
- Hits         17892    17884       -8     
- Misses        3011     3015       +4     


@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6284-lambda-timeouts branch 2 times, most recently from 47cdf38 to e703ece Compare October 10, 2024 16:52
@dsotirho-ucsc dsotirho-ucsc added the API API change affecting callers label Oct 11, 2024
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6284-lambda-timeouts branch from d54245e to db02ce0 Compare October 11, 2024 21:18
@dsotirho-ucsc
Contributor Author

6624_IT_2024-10-11.txt

Testing 504 response, note the included Retry-After header and custom response message:

$ curl -v 'https://service.daniel.dev.singlecell.gi.ucsc.edu/error-timeout'
…
> GET /error-timeout HTTP/2
> Host: service.daniel.dev.singlecell.gi.ucsc.edu
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/2 504
< content-type: application/json
< content-length: 125
< date: Fri, 11 Oct 2024 22:30:40 GMT
< retry-after: 10
< x-amzn-requestid: a57d9d62-9178-4010-b228-77cdf7620ec0
< referrer-policy: strict-origin-when-cross-origin
< x-xss-protection: 1; mode=block
< strict-transport-security: max-age=63072000; includeSubDomains; preload
< x-frame-options: DENY
< x-amzn-errortype: InternalServerErrorException
< content-security-policy: default-src 'self'
< x-amz-apigw-id: fgcGFErhoAMEW6g=
< x-content-type-options: nosniff
< x-cache: Error from cloudfront
< via: 1.1 51ef2d5f52dad26e2bbdf93520deaaee.cloudfront.net (CloudFront)
< x-amz-cf-pop: SFO53-P3
< x-amz-cf-id: 4_pIJq9jIYDdksEjZmGCygt6G9t9_1x0ei3iSPsOqGmy-0XlB-ckiA==
<
* Connection #0 to host service.daniel.dev.singlecell.gi.ucsc.edu left intact
{"message": "504 Gateway Timeout. Wait the number of seconds given in the `Retry-After` header before retrying the request."}
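
A client could honor this response roughly as follows. This is a minimal sketch and not code from this PR: `fetch_with_retry` and its `send` parameter are hypothetical, and `send` is assumed to return a `(status, headers, body)` tuple with lower-cased header names, as in the curl output above.

```python
import time
from typing import Callable, Mapping

Response = tuple[int, Mapping[str, str], str]


def fetch_with_retry(send: Callable[[], Response],
                     max_attempts: int = 3,
                     sleep: Callable[[float], None] = time.sleep
                     ) -> Response:
    """
    Call `send` until it returns something other than a 504 or the attempts
    are exhausted, waiting between attempts for the number of seconds
    specified in the Retry-After header.
    """
    assert max_attempts > 0, max_attempts
    for attempt in range(max_attempts):
        response = send()
        status, headers, _ = response
        if status != 504 or attempt == max_attempts - 1:
            break
        # Assumption: Retry-After carries a number of seconds, not an HTTP
        # date. Fall back to one second if the header is absent.
        sleep(float(headers.get('retry-after', '1')))
    return response
```

Passing `sleep` as a parameter keeps the function testable without actually waiting.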

Member

@achave11-ucsc achave11-ucsc left a comment

Looking good! ✅

A few observations.

The commit '[a] Add a retry-after header to 504 responses (#6284)' indicates an API change, but no associated API version bump is visible.

You should reconsider your approach in the commits 'Remove metric alarm threshold and period default values' and 'Set indexer and service metric alarm threshold to one per day (#6284)'. To me, removing the defaults seems like an unnecessary change, made only to later introduce a constraint on a generic helper handler. I think these values are better specified at the call site. Imagine going through the routes in the app.py file and trying to determine the alarm rate, period or threshold for a given Lambda: with your current approach, it may take some clicking around to find these values. However, I'm not 100% sure of the intent here, so perhaps I'm missing something.

Finally, your drop! commit is thoughtful, but do consider adding test coverage, perhaps a small unit test that mocks it to time out right away and return the desired status code.

Comment on lines 819 to 849
**{
    f'DEFAULT_{response_type}': {
        'responseParameters': {
            # Static value response header parameters must be enclosed
            # within a pair of single quotes.
            #
            # https://docs.aws.amazon.com/apigateway/latest/developerguide/request-response-data-mappings.html#mapping-response-parameters
            # https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-swagger-extensions-gateway-responses.html
            #
            # Note that azul.strings.single_quote() is not used here
            # since API Gateway allows internal single quotes in the
            # value, which that function would prohibit.
            #
            f'gatewayresponse.header.{k}': f"'{v}'"
            for k, v in AzulChaliceApp.security_headers.items()
        }
    } for response_type in ['4XX', '5XX']
},
**{
    response_type: {
        'responseParameters': {
            **{
                f'gatewayresponse.header.{k}': f"'{v}'"
                for k, v in AzulChaliceApp.security_headers.items()
            },
            'gatewayresponse.header.Retry-After': "'10'"
        },
        'responseTemplates': {
            "application/json": json.dumps({
                'message': '504 Gateway Timeout. Wait the number of'
                           ' seconds given in the `Retry-After`'
                           ' header before retrying the request.'
            })
        }
    } for response_type in ['INTEGRATION_TIMEOUT', 'INTEGRATION_FAILURE']
}
Member

Consider employing the changes in the following patch, make the diff smaller.

Index: src/azul/terraform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/terraform.py b/src/azul/terraform.py
--- a/src/azul/terraform.py	(revision 1f2033d1536b4b60bc2d1632e597fb3c74aa2675)
+++ b/src/azul/terraform.py	(date 1728947754159)
@@ -816,7 +816,7 @@
         openapi_spec[key] = config.minimum_compression_size
         assert 'aws_api_gateway_gateway_response' not in resources, resources
         openapi_spec['x-amazon-apigateway-gateway-responses'] = {
-            **{
+            {
                 f'DEFAULT_{response_type}': {
                     'responseParameters': {
                         # Static value response header parameters must be enclosed
@@ -833,8 +833,7 @@
                         for k, v in AzulChaliceApp.security_headers.items()
                     }
                 } for response_type in ['4XX', '5XX']
-            },
-            **{
+            } | {
                 response_type: {
                     'responseParameters': {
                         **{

Contributor Author

PyCharm wants to format

some_variable = {
    'foo': 'FOO'
} | {
    'bar': 'BAR'
}

as

some_variable = {
                    'foo': 'FOO'
                } | {
                    'bar': 'BAR'
                }

so I wrapped it in parens to get

some_variable = (
    {
        'foo': 'FOO'
    } | {
        'bar': 'BAR'
    }
)
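
For comparison, the two spellings of the merge discussed in this thread produce equal dictionaries. The sketch below assumes Python 3.9 or later for the `|` operator. The `security_headers` value, the helper function names and the abbreviated message are illustrative stand-ins, not the actual code in src/azul/terraform.py.

```python
import json

# Hypothetical stand-in for AzulChaliceApp.security_headers
security_headers = {
    'X-Frame-Options': 'DENY',
    'X-Content-Type-Options': 'nosniff'
}


def static_headers(headers: dict[str, str]) -> dict[str, str]:
    # API Gateway expects static header values enclosed in single quotes
    return {
        f'gatewayresponse.header.{k}': f"'{v}'"
        for k, v in headers.items()
    }


def default_responses() -> dict:
    return {
        f'DEFAULT_{response_type}': {
            'responseParameters': static_headers(security_headers)
        }
        for response_type in ['4XX', '5XX']
    }


def timeout_responses() -> dict:
    return {
        response_type: {
            'responseParameters': {
                **static_headers(security_headers),
                'gatewayresponse.header.Retry-After': "'10'"
            },
            'responseTemplates': {
                # Message abbreviated for the purpose of this example
                'application/json': json.dumps({
                    'message': '504 Gateway Timeout.'
                })
            }
        }
        for response_type in ['INTEGRATION_TIMEOUT', 'INTEGRATION_FAILURE']
    }


# Unpacking both dictionaries into a new dictionary literal
merged_by_unpacking = {
    **default_responses(),
    **timeout_responses()
}

# The union operator, wrapped in parentheses as described above
merged_by_union = (
    default_responses()
    | timeout_responses()
)

assert merged_by_unpacking == merged_by_union
```

Both forms give precedence to the right-hand operand on duplicate keys, so the choice between them here is one of formatting and diff size.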

@achave11-ucsc achave11-ucsc removed their assignment Oct 14, 2024
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6284-lambda-timeouts branch 2 times, most recently from 1cebdf5 to ebee142 Compare October 16, 2024 23:17
@dsotirho-ucsc
Contributor Author

You should reconsider your approach in the commits 'Remove metric alarm threshold and period default values' and 'Set indexer and service metric alarm threshold to one per day (#6284)'. To me, removing the defaults seems like an unnecessary change, made only to later introduce a constraint on a generic helper handler. I think these values are better specified at the call site. Imagine going through the routes in the app.py file and trying to determine the alarm rate, period or threshold for a given Lambda: with your current approach, it may take some clicking around to find these values. However, I'm not 100% sure of the intent here, so perhaps I'm missing something.

These changes (removing the defaults & increasing the period to one day) were requested in the ticket.

Finally, your drop! commit is thoughtful, but do consider adding test coverage, perhaps a small unit test that mocks it to time out right away and return the desired status code.

Such a test wouldn't be able to verify the response headers, since they come from the API Gateway.
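
The limitation described here concerns the live response. The generated gateway response configuration is still a plain dictionary and could be inspected without a deployment. The following is a minimal sketch of such a check, assuming a mapping shaped like the `x-amazon-apigateway-gateway-responses` value in this PR and assuming that every response parameter is a static value. `check_gateway_responses` is a hypothetical helper, not part of Azul.

```python
def check_gateway_responses(responses: dict) -> list[str]:
    """
    Return a list of problems found in a gateway responses mapping. An empty
    list means that no problems were found.
    """
    problems = []
    for response_type, response in responses.items():
        parameters = response.get('responseParameters', {})
        for name, value in parameters.items():
            # Static values must be enclosed in a pair of single quotes
            quoted = len(value) >= 2 and value[0] == value[-1] == "'"
            if not quoted:
                problems.append(f'{response_type}: {name} is not quoted')
        # Assumption: only the integration responses carry Retry-After
        if response_type.startswith('INTEGRATION_'):
            if 'gatewayresponse.header.Retry-After' not in parameters:
                problems.append(f'{response_type}: missing Retry-After')
    return problems
```

Such a check would only guard the configuration that is handed to API Gateway. Whether API Gateway actually emits the headers still needs an integration test or a manual request like the curl invocation earlier in this conversation.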

@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6284-lambda-timeouts branch from ebee142 to 56a65c9 Compare October 16, 2024 23:56
@dsotirho-ucsc
Contributor Author

6624_IT_2024-10-16.txt

achave11-ucsc
achave11-ucsc previously approved these changes Oct 17, 2024
Member

@achave11-ucsc achave11-ucsc left a comment

These changes (removing the defaults & increasing the period to one day) were requested in the ticket (#6284 (comment)).

I see, I originally missed that, apologies.

Approved ✅

@achave11-ucsc achave11-ucsc marked this pull request as ready for review October 17, 2024 17:53
Member

@hannes-ucsc hannes-ucsc left a comment

Next time, no fixups, please.

'responseTemplates': {
"application/json": json.dumps({
'message': '504 Gateway Timeout. Wait the number of'
' seconds given in the `Retry-After`'
Member

Suggested change
- ' seconds given in the `Retry-After`'
+ ' seconds specified in the `Retry-After`'

return {
'504': {
'description': 'Request timed out. When handling this response,'
' clients should wait the number of seconds given in'
Member

Suggested change
- ' clients should wait the number of seconds given in'
+ ' clients should wait the number of seconds specified in'

{
f'DEFAULT_{response_type}': {
'responseParameters': {
# Static value response header parameters must be enclosed
Member

This comment should be moved up so it is evident that it applies to both dictionaries.

@@ -41,3 +41,14 @@ def header(type_: TYPE, **kwargs: PrimitiveJSON) -> JSON:
'schema': schema.make_type(type_),
**kwargs
}


def http_504_response() -> JSON:
Member

Wrong place. There is already precedent for shared specs.

@hannes-ucsc hannes-ucsc added 0 reviews [process] Lead didn't request any changes 1 review [process] Lead requested changes once and removed 0 reviews [process] Lead didn't request any changes labels Oct 18, 2024
return {
'504': {
'description': 'Request timed out. When handling this response,'
' clients should wait the number of seconds'
Member

I don't think we typically place the space at the beginning of the continuation. If you disagree, please provide evidence.
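
For illustration, the two conventions in question differ only in where the separating space sits. Python concatenates adjacent string literals at compile time, so both spellings yield the same string. The variable names below are made up for this example.

```python
# Space at the end of the fragment that precedes the line break
trailing = ('Request timed out. When handling this response, '
            'clients should wait the number of seconds')

# Space at the beginning of the continuation fragment
leading = ('Request timed out. When handling this response,'
           ' clients should wait the number of seconds')

assert trailing == leading
```

The comment is therefore about consistency with the rest of the code base, not about the resulting message.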

'responseTemplates': {
"application/json": json.dumps({
'message': '504 Gateway Timeout. Wait the number of'
' seconds specified in the `Retry-After`'
Member

Same here.

@hannes-ucsc hannes-ucsc removed their assignment Nov 7, 2024
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6284-lambda-timeouts branch from ae04ba7 to 4a4f6ba Compare November 7, 2024 22:25
@dsotirho-ucsc
Contributor Author

6624_IT_2024-11-07.txt

@@ -232,3 +232,15 @@ def version(self) -> JSON:
}
}
}

@classmethod
Member

The fact that this is a class method while the other methods are not, is a smell. The method serves the same purpose as the other methods but the usage pattern is different, and that is the smell.

@@ -267,6 +267,10 @@ def route(self,
methods = kwargs['methods']
self.non_interactive_routes.update((path, method) for method in methods)
methods = kwargs.get('methods', ())
if method_spec:
import azul.openapi.spec
Member

This also smells.

Please dissolve CommonEndpointSpecs into AzulChaliceApp. That should happen as the first commit in this PR. The methods from CommonEndpointSpecs should be appended at the end of AzulChaliceApp.

Member

@hannes-ucsc hannes-ucsc left a comment

Please ignore the previous review.

@hannes-ucsc hannes-ucsc force-pushed the issues/dsotirho-ucsc/6284-lambda-timeouts branch 2 times, most recently from 71002ad to 4a1ae7e Compare November 14, 2024 17:00
@hannes-ucsc hannes-ucsc removed their assignment Nov 14, 2024
Member

@hannes-ucsc hannes-ucsc left a comment

I did the refactoring mentioned in my previous commit review. It passes unit and integration tests. Please rebase your changes on top of mine. I have a backup of your branch if you need it.

@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/dsotirho-ucsc/6284-lambda-timeouts branch from 965a968 to 8c128b0 Compare November 15, 2024 18:37
@dsotirho-ucsc
Contributor Author

6624_IT_2024-11-15.txt

Labels
2 reviews [process] Lead requested changes twice API API change affecting callers orange [process] Done by the Azul team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants