What is the optimal part size? #2
stefansundin started this conversation in General
You can now set the part size on your own using `--part-size`. This is useful if your upload speed is very good and the parts are uploaded very fast; in that case it can make sense to use a larger part size so that the total upload consists of fewer parts. AWS charges you based on the number of requests. For standard storage it is $0.005 per 1,000 requests (see the pricing page), so even a full 10,000-part upload only comes to about $0.05 in request charges, which should be an insignificant amount for almost all use cases, but now you have the ability to optimize this if you choose to.

When writing the code that determines the part size I was thinking about inflating the size by a factor. This might be useful if you append more data to the file and the part count limit (10,000) is almost reached. It would prevent a situation where the upload cannot be completed because the final file produces a part count greater than 10,000. However, this would be tricky, since you would need to abort the upload before it reaches the end of the initial file size. There are other complexities too, so please don't try this if you don't know what you are doing.
In the end I decided to just emulate exactly what the aws cli does. I'm open to changing this if the best practice changes.
shrimp/main.go, lines 185 to 196 in f8e9367
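For illustration only, here is a minimal sketch of the general idea (not the snippet linked above): pick a part size large enough that the whole file fits within the 10,000-part limit, with the aws cli's 8 MiB default as a floor and rounding up to a whole MiB. The constants and rounding choices are assumptions for this example, not necessarily exactly what shrimp does.

```go
package main

import "fmt"

const (
	MiB         = 1024 * 1024
	maxParts    = 10_000  // S3 multipart upload part limit
	defaultPart = 8 * MiB // aws cli default part size (assumption for this sketch)
)

// partSize returns a part size (in bytes) so that fileSize fits in at most
// maxParts parts, never below defaultPart, rounded up to a whole MiB.
// Illustrative sketch only.
func partSize(fileSize int64) int64 {
	size := int64(defaultPart)
	if minSize := (fileSize + maxParts - 1) / maxParts; minSize > size {
		size = minSize
	}
	// Round up to the next MiB for tidy part boundaries.
	if rem := size % MiB; rem != 0 {
		size += MiB - rem
	}
	return size
}

func main() {
	for _, s := range []int64{100 * MiB, 50 * 1024 * MiB, 2 * 1024 * 1024 * MiB} {
		fmt.Printf("file %d bytes -> part size %d bytes\n", s, partSize(s))
	}
}
```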
One thing that is interesting is that the `CompleteMultipartUpload` API call is very fast even for very large files. When I was first writing shrimp I imagined that this call would take some time for big files, because I thought it would actually assemble the parts on S3 and put the data neatly next to each other in a contiguous manner, as if the object had been uploaded in a single request. But it appears that the data is not moved (at least not immediately; perhaps regular S3 operations do move the parts closer to each other over time). This makes sense, since things like the ETag keep the weird multipart format after the upload is done, and there's a feature to download a subset of the data based on the part number.

Here's another interesting snippet from the documentation: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
The interesting part is: "If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance."
Does this mean that a bigger part size is better for fast downloading? I imagine there's some overhead going from part to part on the S3 side, since they're very likely stored on different hard drives.
It also means that if shrimp ever learns how to download files and if it can do it with parallel operations, it would make sense for those operations to fetch individual parts as they were uploaded.
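As a rough illustration of that idea, here is a sketch using the AWS SDK for Go v2: HeadObject with a PartNumber also reports how many parts the object was uploaded in, and GetObject with PartNumber fetches exactly one of those parts, so a downloader could fetch the original parts in parallel. The bucket and key are placeholders, and this is not code from shrimp.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket, key := "my-bucket", "my-big-file" // placeholders

	// Asking HeadObject about part 1 also returns the total number of parts
	// the object was uploaded with.
	head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket:     aws.String(bucket),
		Key:        aws.String(key),
		PartNumber: aws.Int32(1),
	})
	if err != nil {
		log.Fatal(err)
	}
	parts := aws.ToInt32(head.PartsCount)

	// Fetch every part concurrently. A real downloader would bound the
	// concurrency and write each part to the correct file offset.
	var wg sync.WaitGroup
	for n := int32(1); n <= parts; n++ {
		wg.Add(1)
		go func(part int32) {
			defer wg.Done()
			out, err := client.GetObject(ctx, &s3.GetObjectInput{
				Bucket:     aws.String(bucket),
				Key:        aws.String(key),
				PartNumber: aws.Int32(part), // fetch exactly the byte range uploaded as this part
			})
			if err != nil {
				log.Printf("part %d: %v", part, err)
				return
			}
			defer out.Body.Close()
			size, _ := io.Copy(io.Discard, out.Body)
			fmt.Printf("part %d: %d bytes\n", part, size)
		}(n)
	}
	wg.Wait()
}
```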
Discuss below if you have any useful insights. How do other S3-compatible platforms behave?