Switch S3 writes to use streaming I/O #51
Conversation
Item 4: We also need to decide when to fail out an in-progress operation. We currently pause the handle when we're done with the current buffer -- but it's not clear what happens when the client starts some writes and then never returns. For example, I think the file is automatically closed by XRootD but in our destructor, we never cancel or fail the workflow. Needs investigation.
Force-pushed from 372cbba to 87f5a7c
Ok --- all 4 items outlined are complete!
src/CurlUtil.cc (Outdated)
@@ -223,7 +236,7 @@ void CurlWorker::Run() {
                op->Fail(op->getErrorCode(), op->getErrorMessage());
            }
        } catch (...) {
-           m_logger.Log(LogMask::Debug, "CurlWorker",
+           m_logger.Log(LogMask::Debug, "Run",
                "Unable to setup the curl handle");
"setup" is usually used as a noun, I think we want "set up" here.
// Return whether or not the request has timed out since the last
// call to send more data.
bool Timeout() const { return m_timeout; }
Most of your other setters/getters include a verb and that makes them a little easier to read. Maybe this can be GetTimeout()?
This provides some minimal test cases for the S3 write code.
Most significantly, this switches the debug/dump of the libcurl interaction to use the XRootD logging framework instead of printing directly to stderr.
This refactors the request logic to allow requests to be continued over multiple calls. The calling thread will regain control when the buffer has been completely consumed (even if the full operation will require additional buffers). Note this only works if the client provides the full file size.
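As a rough illustration of the continuation mechanism, the sketch below shows a libcurl read callback that feeds the upload from the current client buffer and pauses the transfer once that buffer is exhausted; the `StreamRequest` type and its members are hypothetical stand-ins, not the actual classes in this PR.

#include <algorithm>
#include <cstring>
#include <string>
#include <curl/curl.h>

// Hypothetical per-request state; the real class names in this PR differ.
struct StreamRequest {
    std::string m_buffer; // data handed over by the client's most recent Write()
    size_t m_offset{0};   // how much of m_buffer libcurl has already consumed
};

// CURLOPT_READFUNCTION callback: copy data from the current client buffer and
// pause the upload once that buffer is fully consumed, returning control to
// the calling thread until the next Write() provides more data and the handle
// is resumed with curl_easy_pause(handle, CURLPAUSE_CONT).
static size_t ReadCallback(char *dest, size_t size, size_t nitems, void *userdata) {
    auto *req = static_cast<StreamRequest *>(userdata);
    size_t remaining = req->m_buffer.size() - req->m_offset;
    if (remaining == 0) {
        return CURL_READFUNC_PAUSE;
    }
    size_t to_copy = std::min(remaining, size * nitems);
    std::memcpy(dest, req->m_buffer.data() + req->m_offset, to_copy);
    req->m_offset += to_copy;
    return to_copy;
}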
We want to always unpause a given operation from the same thread that is handling it. If a separate thread can pick up the operation, there is a race condition where both the original thread and the new one operate on the same `CURL *` handle at the same time, resulting in an observed segfault. This commit introduces a separate "unpause" queue for each curl worker; this queue is notified by the parent when there is additional data available.
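A minimal sketch of such a per-worker unpause queue, assuming the producer only enqueues handles and the owning worker thread is the only one that ever touches libcurl (the class and method names here are illustrative, not the ones in this PR):

#include <deque>
#include <mutex>
#include <curl/curl.h>

// Illustrative per-worker queue; in this PR the queue lives inside the curl
// worker and is filled by the parent object when a client Write() arrives.
class UnpauseQueue {
public:
    // Called from the client-facing thread when more data is available.
    void Produce(CURL *handle) {
        std::lock_guard<std::mutex> lk(m_mutex);
        m_queue.push_back(handle);
        // The worker is assumed to be woken up separately (e.g. via its
        // existing multi-handle wakeup mechanism).
    }

    // Called only from the owning curl worker thread. Because this is the
    // sole thread that touches the handles, there is no race on CURL *.
    void Consume() {
        std::deque<CURL *> ready;
        {
            std::lock_guard<std::mutex> lk(m_mutex);
            ready.swap(m_queue);
        }
        for (auto *handle : ready) {
            curl_easy_pause(handle, CURLPAUSE_CONT);
        }
    }

private:
    std::mutex m_mutex;
    std::deque<CURL *> m_queue;
};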
If an entire part is a single libcurl operation, a client that starts writing - but then leaves for an extended period of time - will leave dangling references inside libcurl (eventually exhausting the number of allowable transfers). This commit adds a background thread for S3File that goes through all pending uploads and checks whether they are still live; if not, it will time out the operation.
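The liveness sweep could look roughly like the following; the `PendingUpload` structure, the ten-second poll interval, and the idle limit are placeholders rather than the values actually used by S3File.

#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical record of an in-progress multipart upload.
struct PendingUpload {
    std::chrono::steady_clock::time_point m_last_write;
    bool m_timed_out{false};
};

// Background sweep: any upload that has not seen a Write() within the idle
// limit is marked as timed out so its transfer can be failed and its curl
// handle released instead of dangling inside libcurl.
void CleanupLoop(std::vector<PendingUpload> &uploads, std::mutex &mtx,
                 std::chrono::seconds idle_limit) {
    while (true) {
        std::this_thread::sleep_for(std::chrono::seconds(10));
        auto now = std::chrono::steady_clock::now();
        std::lock_guard<std::mutex> lk(mtx);
        for (auto &upload : uploads) {
            if (!upload.m_timed_out && now - upload.m_last_write > idle_limit) {
                upload.m_timed_out = true;
            }
        }
    }
}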
After the notification is done, the request may be deleted by the owning S3File instance. Do not call `Notify` from within the curl result processing function as the request object needs to be alive to release the curl handle.
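To make that ordering concrete, here is a small sketch under the stated constraint; apart from `Notify`, every name below is a hypothetical placeholder rather than the actual code in this PR.

#include <curl/curl.h>

// Hypothetical request type; only Notify() corresponds to a method mentioned
// in this PR.
struct Request {
    CURLcode m_result{CURLE_OK};
    void SetResult(CURLcode rc) { m_result = rc; }
    void Notify() {} // in the real code this signals the owning S3File
};

// Sketch of the ordering constraint in the worker's result-processing path:
// the curl handle must be detached and released while the request is still
// alive; only after that may Notify() be called, because the owner is free
// to delete the request as soon as it has been notified.
void FinishRequest(CURLM *multi, CURL *handle, Request *req, CURLcode rc) {
    req->SetResult(rc);
    curl_multi_remove_handle(multi, handle);
    curl_easy_cleanup(handle);
    req->Notify(); // the request must not be touched after this point
}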
If the client doesn't pre-declare the size of the object it will write, we don't know the size of the last part of the upload. Hence, we must switch back to buffer mode in this case.
If the entire object is uploaded during a single `Write` call, then skip the multipart upload and just do a single non-buffered upload.
The unit test refactoring left copy/pasted code. This commit splits the common piece into a single header, allowing `s3_tests.cc` to hold the unit tests that utilize AWS S3 while `s3_unit_tests.cc` uses the minio instance started up by ctest.
Force-pushed from 7e160ee to 4ca3895
LGTM!
We currently buffer the data from the client until enough is in memory to form a full "part" that we send to S3 in one call.
As long as we know the full size of the object, we can simply stream the data as it comes in from the client - no need to hold it in a buffer. This PR starts the process of transitioning to streaming mode.
Note that if the client doesn't provide the full file size, then we don't know the size of the part we are uploading (will it be a full 100MB? Will it be shorter?); this is a no-no from the AWS side.
Hence, a final solution is unfortunately likely to require both approaches. We can, however, impose a requirement like "if you don't provide the final output data size, we will buffer at most 10MB and you'll have a corresponding object size limit".
TODO items to finish off this branch: