Using Generated Presigned URLs with CRC32C checksums results in 400 from S3 #3216

Closed
richardnpaul opened this issue Jul 3, 2024 · 10 comments
Assignees: tim-finnigan
Labels: bug (This issue is a confirmed bug.), p2 (This is a standard priority issue), response-requested (Waiting on additional info and feedback.), s3

Comments


richardnpaul commented Jul 3, 2024

Describe the bug

When uploading a large object to S3 using the multipart upload process, with presigned URLs and CRC32C checksums, S3 responds with a 400 error.

Expected Behavior

I would expect the provided checksum headers to be included in the signing, so that the expected checksum type would be the supplied checksum type rather than null, and the upload to S3 would succeed.

Current Behavior

The following type of error message is returned instead of success:

Failed to upload part, status: 400, response: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidRequest</Code><Message>Checksum Type mismatch occurred, expected checksum Type: null, actual checksum Type: crc32c</Message><RequestId>SOMEREQID</RequestId><HostId>SOME/HOSTID</HostId></Error>

Reproduction Steps

Replace the AWS credential placeholders with valid values for your testing, and provide a file path on the testfile assignment line (I was using a path in ~/Downloads/):

#!/usr/bin/env python3
import base64
import pathlib
from zlib import crc32

import boto3
import requests


# AWS credentials
access_key_id = 'access_key_here'
secret_access_key = 'secret_key_here'
aws_session_token = 'session_token_here'
region = 'region_here'
bucket_name = 'bucket_name_here'
object_key = 'prefix_here/object_key_here'

# Create a session using your AWS credentials
session = boto3.Session(
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    aws_session_token=aws_session_token,
)

# Create an S3 client with the specified region
s3_client = session.client('s3', region_name=region)

# Initialize a multipart upload
response = s3_client.create_multipart_upload(
    Bucket=bucket_name,
    Key=object_key
)
upload_id = response['UploadId']

part_number = 1
chunk_size = 10 * 1024 * 1024  # 10 MB

testfile = pathlib.Path('file 10MB or greater in size here').expanduser()

with open(testfile, 'rb') as f:
    content = f.read(chunk_size)

# Calculate ChecksumCRC32C   (I'm not 100% certain about this as we use the crc32c package normally)
checksum_crc32c = base64.b64encode(crc32(content).to_bytes(4, byteorder='big')).decode('utf-8')
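# NOTE: zlib's crc32 computes CRC32, not CRC32C. With the third-party
# crc32c package (pip install crc32c), the equivalent would be:
#   from crc32c import crc32c
#   checksum_crc32c = base64.b64encode(crc32c(content).to_bytes(4, byteorder='big')).decode('utf-8')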

# Generate the presigned URL
presigned_url = s3_client.generate_presigned_url(
    'upload_part',
    Params={
        'Bucket': bucket_name,
        'Key': object_key,
        'PartNumber': part_number,
        'UploadId': upload_id,
        'ChecksumCRC32C': checksum_crc32c,
        'ChecksumAlgorithm': 'CRC32C',  # Added after posting, following feedback from Tim
    },
    ExpiresIn=3600
)

headers = {
    'Content-Length': str(len(content)),
    'x-amz-checksum-crc32c': checksum_crc32c,
    'Content-Type': 'application/octet-stream',
}

response = requests.put(presigned_url, data=content, headers=headers)

if response.status_code == 200:
    print("Part uploaded successfully!")
else:
    print(f"Failed to upload part, status: {response.status_code}, response: {response.text}")

Possible Solution

I suspect the checksum header is not being passed into the signing process, but to be honest I got a bit lost in the library's code and couldn't make head or tail of it in the end.
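
(For reference, one way to inspect what actually gets signed is to turn on boto3's debug logging, which prints the canonical request:)

import boto3

# Prints botocore debug logs, including the canonical request used for
# signing, so you can see which headers were included in the signature.
boto3.set_stream_logger('')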

Additional Information/Context

Docs page for generating the urls:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/generate_presigned_url.html
Docs page with acceptable params to be passed to generate_presigned_url when using upload_part as the ClientMethod:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/upload_part.html

SDK version used

1.34.138

Environment details (OS name and version, etc.)

Ubuntu 22.04.4, Python 3.10.12

@richardnpaul richardnpaul added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Jul 3, 2024
@tim-finnigan tim-finnigan self-assigned this Jul 8, 2024
tim-finnigan (Contributor) commented Jul 8, 2024

Thanks for reaching out. In your upload_part request, have you tried setting ChecksumAlgorithm to CRC32C and specifying a string for ChecksumCRC32C? You could also try another approach, like using put_object, although it was noted that installing the CRT was required. Otherwise, if you want to share your debug logs (with any sensitive info redacted) by adding boto3.set_stream_logger('') to your script, we could investigate this further.

@tim-finnigan tim-finnigan added response-requested Waiting on additional info and feedback. s3 p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Jul 8, 2024

richardnpaul commented Jul 8, 2024

Hi @tim-finnigan, thanks for getting back in touch so quickly.

We did try the ChecksumAlgorithm set to CRC32C approach, which I believe then required setting the x-amz-sdk-checksum-crc32c header, but we were getting an error with that method too. (I'll need to check the docs again; we were following the points for the REST API rather than the SDK, and I'll need to check with the person who was testing this with me tomorrow.) The code for this is abstracted behind a set of APIs and a calling CLI (not Python based) installed by our end users.

Our workflow is this: the CLI calls an initiate endpoint to start an upload. On success, the CLI calls a generate-pre-signed-URLs endpoint, which takes the parts and their checksums and returns the part numbers with pre-signed URLs for those parts (this is the call that uses generate_presigned_url with the upload_part client method; a sketch follows below). The CLI then uses the pre-signed URLs to PUT the file parts directly to S3 with the CRC32C checksum in the header, and once that's complete it calls a complete endpoint, submitting the parts, ETags, and CRC32C checksums.
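
For illustration, a minimal sketch of what that endpoint does per part; the function name generate_part_urls and its shape are hypothetical, not our actual API:

def generate_part_urls(s3_client, bucket, key, upload_id, part_checksums):
    # part_checksums maps part_number -> base64-encoded CRC32C, as submitted by the CLI.
    urls = {}
    for part_number, checksum_crc32c in part_checksums.items():
        urls[part_number] = s3_client.generate_presigned_url(
            'upload_part',
            Params={
                'Bucket': bucket,
                'Key': key,
                'PartNumber': part_number,
                'UploadId': upload_id,
                'ChecksumCRC32C': checksum_crc32c,
                'ChecksumAlgorithm': 'CRC32C',
            },
            ExpiresIn=3600,
        )
    return urls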

So, with the description above out of the way: put_object is not suitable for our workflow because the end users are using the CLI package, which is also why we need to use pre-signed URLs. Sorry for the confusion that might have led you to suggest it; the code above was just minimal boilerplate to reproduce the issue we were seeing.

I will note that we do already have awscrt as part of our dependency chain.

We have run this through successfully with the checksums removed and it all works, so worst case we could fall back to the historic approach using ContentMD5. But we were hoping to use the same approach as our smaller unitary uploads, which use presigned POST and which we do have working with CRC32C checksums (sketched below). I'm well aware we seem to be on the outer fringes of what can be achieved with boto here, so all help is greatly appreciated.
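
For reference, the shape of our working presigned POST path for unitary uploads, as a rough sketch (the exact Fields/Conditions may differ from our production code):

post = s3_client.generate_presigned_post(
    Bucket=bucket_name,
    Key=object_key,
    Fields={
        'x-amz-checksum-algorithm': 'CRC32C',
        'x-amz-checksum-crc32c': checksum_crc32c,
    },
    Conditions=[
        {'x-amz-checksum-algorithm': 'CRC32C'},
        {'x-amz-checksum-crc32c': checksum_crc32c},
    ],
    ExpiresIn=3600,
)
# The client then POSTs the file along with the returned fields:
# requests.post(post['url'], data=post['fields'], files={'file': content})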

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. label Jul 9, 2024
richardnpaul (Author) commented:

I've done some testing today; here's a table of what I get back from the PUT to S3. I tested every combination of ChecksumAlgorithm and ChecksumCRC32C on the upload_part (presign) side, and x-amz-checksum-crc32c and x-amz-sdk-checksum-algorithm on the PUT-headers side (we didn't get any different results when also passing content-type and/or content-length):

| PUT headers ▼ / presign params ► | Nothing | ChecksumCRC32C only | ChecksumAlgorithm only | Both |
| --- | --- | --- | --- | --- |
| Nothing | 200 | 403: SignatureDoesNotMatch *1 | 403: SignatureDoesNotMatch *1 | 403: SignatureDoesNotMatch *1 |
| x-amz-checksum-crc32c | 403: AccessDenied *2 | 400: InvalidRequest *3 | 403: AccessDenied *2 | 403: SignatureDoesNotMatch *1 |
| x-amz-sdk-checksum-algorithm | 403: AccessDenied *2 | 403: AccessDenied *2 | 400: InvalidRequest *4 | 403: SignatureDoesNotMatch *1 |
| Both | 403: AccessDenied *2 | 403: AccessDenied *2 | 403: AccessDenied *2 | 400: InvalidRequest *3 |

*1: The request signature we calculated does not match the signature you provided. Check your key and signing method.
*2: There were headers present in the request which were not signed
*3: Checksum Type mismatch occurred, expected checksum Type: null, actual checksum Type: crc32c
*4: x-amz-sdk-checksum-algorithm specified, but no corresponding x-amz-checksum-* or x-amz-trailer headers were found.

tim-finnigan (Contributor) commented:

Hi @richardnpaul, thanks for following up here. Going back to your original snippet: you are using CRC32, not CRC32C (from zlib import crc32). It looks like there are no plans to support CRC32C in zlib: madler/zlib#981. Have you tried any alternatives that support CRC32C?
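
For example, with the third-party crc32c package (pip install crc32c), a minimal sketch of the CRC32C calculation, mirroring your zlib snippet, would be:

import base64
from crc32c import crc32c

def b64_crc32c(content: bytes) -> str:
    # S3 expects the 4-byte big-endian checksum, base64-encoded.
    return base64.b64encode(crc32c(content).to_bytes(4, byteorder='big')).decode('utf-8')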

For using CRC32 I tested this and it works for me:
import boto3
import requests
from zlib import crc32
import base64
import pathlib

bucket_name = 'test-bucket'
object_key = 'test'

s3_client = boto3.client('s3')

response = s3_client.create_multipart_upload(
    Bucket=bucket_name,
    Key=object_key
)
upload_id = response['UploadId']

part_number = 1
chunk_size = 10 * 1024 * 1024  # 10 MB

testfile = pathlib.Path('./11-mb-file.txt').expanduser()

parts = []

with open(testfile, 'rb') as f:
    while True:
        content = f.read(chunk_size)
        if not content:
            break

        checksum_crc32 = base64.b64encode(crc32(content).to_bytes(4, byteorder='big')).decode('utf-8')

        presigned_url = s3_client.generate_presigned_url(
            'upload_part',
            Params={
                'Bucket': bucket_name,
                'Key': object_key,
                'PartNumber': part_number,
                'UploadId': upload_id,
                'ChecksumCRC32': checksum_crc32,
                'ChecksumAlgorithm': 'CRC32',
            },
            ExpiresIn=3600
        )

        response = requests.put(presigned_url, data=content)

        if response.status_code == 200:
            print(f"Part {part_number} uploaded successfully!")
            parts.append({
                'PartNumber': part_number,
                'ETag': response.headers['ETag']
            })
        else:
            print(f"Failed to upload part {part_number}, status: {response.status_code}, response: {response.text}")
            break

        part_number += 1

if len(parts) == part_number - 1:
    s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id,
        MultipartUpload={
            'Parts': parts
        }
    )
    print("Multipart upload completed successfully!")
else:
    s3_client.abort_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id
    )
    print("Multipart upload failed and has been aborted.")

@tim-finnigan tim-finnigan added the response-requested Waiting on additional info and feedback. label Jul 9, 2024

richardnpaul commented Jul 10, 2024

Hi Tim,

Okay, so as noted in my initial comment, we normally use the crc32c package; here we were just trying to test that the checksums work at all, so it doesn't matter which algorithm we use as long as the value is valid.

I've taken your code and made a couple of changes: I added aws_access_key_id etc. to the s3_client instantiation and changed the bucket name, object key, and testfile variables; otherwise I didn't change anything else... and I got Failed to upload part 1, status: 403, which was a SignatureDoesNotMatch response: The request signature we calculated does not match the signature you provided. Check your key and signing method.

I had the bucket deployed in eu-west-2, so I created a bucket in another region, eu-west-1, to see if the issue persisted. After initially thinking it did, and working through some issues, I changed all the region references in my .aws/config file to eu-west-1 (they were set to eu-west-2) and we have success... but not in the region that I'm trying to use 😞
(I realised shortly after that I could have just added region_name="eu-west-1" to the s3 client so that I didn't have to change my config file 🤦)
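
That is, something like:

s3_client = boto3.client('s3', region_name='eu-west-1')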

So, at this point I'm not sure if this is a botocore/boto3 issue or an AWS infrastructure issue 🤔 (...or something else)

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. label Jul 11, 2024
richardnpaul (Author) commented:

Just some additional information: adding an explicit v4 signature_version via botocore.config results in the same error in both eu-west-1 and eu-west-2:

from botocore.config import Config

my_config = Config(signature_version='v4')

s3_client = boto3.client('s3', config=my_config)

tim-finnigan (Contributor) commented:

Thanks for following up and for your patience here. The SignatureDoesNotMatch error can occur for a variety of reasons. Having the incorrect region configured is often the cause, but it can also stem from clock skew, credentials, or headers; there are troubleshooting guides here and here that provide more context. Have you looked into those, or do you have any other updates on your end?

@tim-finnigan tim-finnigan added the response-requested Waiting on additional info and feedback. label Aug 2, 2024
richardnpaul (Author) commented:

Does the script work for you if you use signature_version='v4'?

From the first link:

  • The only thing that really jumps out immediately is that I'm using SSO, and the SSO region is eu-west-1, but I'm setting the region via the config object.
  • The other thing was the time: I tried setting my clock to the UTC timezone but got the same result. My clock is managed by NTP and doesn't really skew far from the upstream time source, and I checked with time.is and there was no noticeable difference.

The second link seems to be for people not using the SDK; we're using botocore/boto3 here. I'm using an administrator role that works for generating the URL and uploading to it, so long as checksums are not used in the generation.

The output for the signature that I see is like this:

PUT
/test-upload
X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA....%2F20240805%2Feu-west-2%2Fs3%2Faws4_request&X-Amz-Date=20240805T101928Z&X-Amz-Expires=3600&X-Amz-Security-Token=<...security_token...>&X-Amz-SignedHeaders=host%3Bx-amz-checksum-crc32c%3Bx-amz-sdk-checksum-algorithm&partNumber=1&uploadId=<...upload_id...>
host:richard-presigned-urls-test-eu-west-2.s3.amazonaws.com
x-amz-checksum-crc32c:
x-amz-sdk-checksum-algorithm:

host;x-amz-checksum-crc32c;x-amz-sdk-checksum-algorithm

What I note from this is that the checksum in the signature is blank, as is the algorithm.


Current Script
#!/usr/bin/env python3
import boto3
import requests
from crc32c import crc32c
import base64
import pathlib
from botocore.config import Config


# boto3.set_stream_logger('')

REGION = 'eu-west-2'

session = boto3.Session(region_name=REGION)

my_config = Config(
    region_name=REGION,
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    },
    signature_version='v4',
)

bucket_name = f'presigned-urls-test-{REGION}'
object_key = 'test-upload'

s3_client = session.client(
    's3',
    config=my_config,
)

response = s3_client.create_multipart_upload(
    Bucket=bucket_name,
    Key=object_key
)
upload_id = response['UploadId']

part_number = 1
chunk_size = 10 * 1024 * 1024  # 10 MB

testfile = pathlib.Path('~/Downloads/bbb_sunflower_2160p_30fps_stereo_abl.mp4').expanduser()

parts = []

with open(testfile, 'rb') as f:
    while True:
        content = f.read(chunk_size)
        if not content:
            break

        checksum_crc32c = base64.b64encode(crc32c(content).to_bytes(4, byteorder='big')).decode('utf-8')

        presigned_url = s3_client.generate_presigned_url(
            'upload_part',
            Params={
                'Bucket': bucket_name,
                'ChecksumAlgorithm': 'CRC32C',
                'ChecksumCRC32C': checksum_crc32c,
                'Key': object_key,
                'PartNumber': part_number,
                'UploadId': upload_id,
            },
            ExpiresIn=3600
        )

        response = requests.put(
            presigned_url,
            data=content,
            # headers={
            #     'x-amz-checksum-crc32c': checksum_crc32c,
            #     'x-amz-sdk-checksum-algorithm': 'CRC32C'
            # }
        )

        if response.status_code == 200:
            print(f"Part {part_number} uploaded successfully!")
            parts.append({
                'PartNumber': part_number,
                'ETag': response.headers['ETag']
            })
        else:
            print(f"Failed to upload part {part_number}, status: {response.status_code}, response: {response.text}")
            break

        part_number += 1

if len(parts) == part_number - 1:
    s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id,
        MultipartUpload={
            'Parts': parts
        }
    )
    print("Multipart upload completed successfully!")
else:
    s3_client.abort_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id
    )
    print("Multipart upload failed and has been aborted.")


richardnpaul commented Aug 5, 2024

Okay, I got it sorted out. I was pretty sure it came down to the missing headers on the PUT request, but adding those headers brought me back to the original Checksum Type mismatch occurred, expected checksum Type: null, actual checksum Type: crc32c message. That message was correct, and now I know why.

The initial create_multipart_upload needs to be called with ChecksumAlgorithm='CRC32C'. Then you can pass ChecksumAlgorithm and ChecksumCRC32C in the Params for generate_presigned_url and in the headers for requests.put. Finally, keeping track of the checksum for each part and including it in the parts list allows complete_multipart_upload to succeed.
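
The key change, lifted from the full script below:

upload_id_request = s3_client.create_multipart_upload(
    Bucket=bucket_name,
    Key=object_key,
    ChecksumAlgorithm='CRC32C',  # declared up front; without this, S3 expects checksum Type: null on the parts
)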


The final script
#!/usr/bin/env python3
import base64
import pathlib

import boto3
import requests
from crc32c import crc32c
from botocore.config import Config


# boto3.set_stream_logger('')

testfile = pathlib.Path('~/Downloads/bbb_sunflower_2160p_30fps_stereo_abl.mp4').expanduser()

REGION = 'eu-west-2'

session = boto3.Session(region_name=REGION)

my_config = Config(
    region_name=REGION,
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    },
    signature_version='v4',
)

bucket_name = f'presigned-urls-test-{REGION}'
object_key = 'test-upload'

s3_client = session.client(
    's3',
    config=my_config,
)

# def resolve_endpoint_ruleset(method):
#     def wrapper(operation_model, params, context, ignore_signing_region=False):
#         (endpoint_url, additional_headers, properties) = method(
#             operation_model, params, context, ignore_signing_region
#         )  # Call the original method

#         if "ContentType" not in params:
#             additional_headers = {
#                 "Content-Type": "binary/octet-stream",
#                 **additional_headers,
#             }

#         return (endpoint_url, additional_headers, properties)

#     return wrapper


# s3_client._resolve_endpoint_ruleset = resolve_endpoint_ruleset(
#     s3_client._resolve_endpoint_ruleset
# )

upload_id_request = s3_client.create_multipart_upload(
    Bucket=bucket_name,
    Key=object_key,
    ChecksumAlgorithm='CRC32C',
)
upload_id = upload_id_request['UploadId']

part_number = 1
chunk_size = 10 * 1024 * 1024  # 10 MB

parts = []

with open(testfile, 'rb') as f:
    while True:
        content = f.read(chunk_size)
        if not content:
            break

        checksum_crc32c = base64.b64encode(crc32c(content).to_bytes(4, byteorder='big')).decode('utf-8')

        presigned_url = s3_client.generate_presigned_url(
            ClientMethod='upload_part',
            Params={
                'Bucket': bucket_name,
                # 'ContentLength': len(content),
                'ChecksumAlgorithm': 'CRC32C',
                'ChecksumCRC32C': checksum_crc32c,
                'Key': object_key,
                'PartNumber': part_number,
                'UploadId': upload_id,
            },
            ExpiresIn=3600
        )

        response = requests.put(
            presigned_url,
            data=content,
            headers={
                # 'content-type': 'binary/octet-stream',
                'x-amz-sdk-checksum-algorithm': 'CRC32C',
                'x-amz-checksum-crc32c': checksum_crc32c,
            }
        )

        if response.status_code == 200:
            print(f"Part {part_number} uploaded successfully!")
            parts.append({
                'PartNumber': part_number,
                'ETag': response.headers['ETag'],
                'ChecksumCRC32C': checksum_crc32c,
            })
        else:
            print(f"Failed to upload part {part_number}, status: {response.status_code}, response: {response.text}")
            break

        part_number += 1

if len(parts) == part_number - 1:
    s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id,
        MultipartUpload={
            'Parts': parts
        },
    )
    print("Multipart upload completed successfully!")
else:
    s3_client.abort_multipart_upload(
        Bucket=bucket_name,
        Key=object_key,
        UploadId=upload_id
    )
    print("Multipart upload failed and has been aborted.")

richardnpaul closed this as not planned (won't fix, can't repro, duplicate, stale) Aug 5, 2024

github-actions bot commented Aug 5, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
