What is the optimal part size? #2
stefansundin started this conversation in General
You can now set the part size on your own using `--part-size`. This is useful if your upload speed is very good and the parts are uploaded very fast; in that case it can make sense to use a larger part size so that the total upload consists of fewer parts. AWS charges you based on the number of requests. For standard storage it is $0.005 per 1,000 requests (see the pricing page), so even a full 10,000-part upload only comes to about $0.05 in request charges, which should be an insignificant amount for almost all use cases, but now you have the ability to optimize this if you choose to.

When writing the code that determines the part size I was thinking about inflating the size by a factor. This might be useful if you append more data to the file and the part count limit (10,000) is almost reached. It would prevent a situation where the upload cannot be completed because the final file produces a part count greater than 10,000. However, this would be tricky, since you would need to abort the upload before it reaches the end of the initial file size. There are other complexities too, so please don't try this if you don't know what you are doing.
In the end I decided to just emulate exactly what the aws cli does. I'm open to changing this if the best practice changes.
shrimp/main.go, lines 185 to 196 in f8e9367
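For illustration only, here is a minimal sketch of the general idea (not the snippet linked above): pick a part size large enough that the whole file fits within the 10,000-part limit, with the aws cli's 8 MiB default as a floor and rounding up to a whole MiB. The constants and rounding choices are assumptions for this example, not necessarily exactly what shrimp does.

```go
package main

import "fmt"

const (
	MiB         = 1024 * 1024
	maxParts    = 10_000  // S3 multipart upload part limit
	defaultPart = 8 * MiB // aws cli default part size (assumption for this sketch)
)

// partSize returns a part size (in bytes) so that fileSize fits in at most
// maxParts parts, never below defaultPart, rounded up to a whole MiB.
// Illustrative sketch only.
func partSize(fileSize int64) int64 {
	size := int64(defaultPart)
	if minSize := (fileSize + maxParts - 1) / maxParts; minSize > size {
		size = minSize
	}
	// Round up to the next MiB for tidy part boundaries.
	if rem := size % MiB; rem != 0 {
		size += MiB - rem
	}
	return size
}

func main() {
	for _, s := range []int64{100 * MiB, 50 * 1024 * MiB, 2 * 1024 * 1024 * MiB} {
		fmt.Printf("file %d bytes -> part size %d bytes\n", s, partSize(s))
	}
}
```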
One thing that is interesting is that the `CompleteMultipartUpload` API call is very fast even for very large files. When I was first writing shrimp I imagined that this call would take some time for big files, because I thought it would actually assemble the parts on S3 and put the data neatly next to each other in a contiguous manner, as if the object had been uploaded in a single request. But it appears that the data is not moved (at least not immediately; perhaps regular S3 operations do move the parts closer to each other over time). This makes sense, since things like the ETag keep the weird multipart format after the upload is done, and there's a feature to download a subset of the data based on the part number.

Here's another interesting snippet from the documentation: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
The interesting part is: "If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance."
Does this mean that a bigger part size is better for fast downloading? I imagine there's some overhead going from part to part on the S3 side, since they're very likely stored on different hard drives.
It also means that if shrimp ever learns how to download files and if it can do it with parallel operations, it would make sense for those operations to fetch individual parts as they were uploaded.
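As a rough illustration of that idea, here is a sketch using the AWS SDK for Go v2: HeadObject with a PartNumber also reports how many parts the object was uploaded in, and GetObject with PartNumber fetches exactly one of those parts, so a downloader could fetch the original parts in parallel. The bucket and key are placeholders, and this is not code from shrimp.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket, key := "my-bucket", "my-big-file" // placeholders

	// Asking HeadObject about part 1 also returns the total number of parts
	// the object was uploaded with.
	head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket:     aws.String(bucket),
		Key:        aws.String(key),
		PartNumber: aws.Int32(1),
	})
	if err != nil {
		log.Fatal(err)
	}
	parts := aws.ToInt32(head.PartsCount)

	// Fetch every part concurrently. A real downloader would bound the
	// concurrency and write each part to the correct file offset.
	var wg sync.WaitGroup
	for n := int32(1); n <= parts; n++ {
		wg.Add(1)
		go func(part int32) {
			defer wg.Done()
			out, err := client.GetObject(ctx, &s3.GetObjectInput{
				Bucket:     aws.String(bucket),
				Key:        aws.String(key),
				PartNumber: aws.Int32(part), // fetch exactly the byte range uploaded as this part
			})
			if err != nil {
				log.Printf("part %d: %v", part, err)
				return
			}
			defer out.Body.Close()
			size, _ := io.Copy(io.Discard, out.Body)
			fmt.Printf("part %d: %d bytes\n", part, size)
		}(n)
	}
	wg.Wait()
}
```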
Discuss below if you have any useful insights. How do other S3-compatible platforms behave?