feat: Add archive extraction support for http(s) #24
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: pvaneck. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
These helper functions can be leveraged for future archive extraction support for other providers.
Thanks for this @pvaneck.
Couple of additional comments, probably not for this PR though:
- As you mentioned, it would be good to have this be storage-type agnostic. I know it's most useful for http, and it's fine to just add it for that first. I guess performance-wise it's good to have some provider-specific logic so we can unzip/untar a stream rather than writing to disk first.
- I wonder if blind extraction could cause a problem for runtimes that might expect an archive themselves. I think some also autodetect and accept either, so it wouldn't be a problem for those. But per some of the prior internal discussion, it would be nice to add a parameter on the `ServingRuntime` with a list of archive types natively supported, so that the extraction would be bypassed if appropriate/necessary.
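The bypass idea could look something like the sketch below. The `ServingRuntimeSpec` struct, `SupportedArchiveTypes` field, and `shouldExtract` helper are all hypothetical names for illustration; they are not part of the actual `ServingRuntime` CRD or pullman's API.

```go
package main

import "fmt"

// Hypothetical sketch: a ServingRuntime spec could declare which archive
// formats the runtime consumes natively, so the puller can skip extraction.
type ServingRuntimeSpec struct {
	// Archive types the runtime accepts as-is, e.g. "tar.gz", "zip".
	SupportedArchiveTypes []string
}

// shouldExtract reports whether the puller should extract an archive of the
// given type, or pass it through untouched for this runtime.
func shouldExtract(spec ServingRuntimeSpec, archiveType string) bool {
	for _, t := range spec.SupportedArchiveTypes {
		if t == archiveType {
			return false // runtime handles this format itself
		}
	}
	return true
}

func main() {
	rt := ServingRuntimeSpec{SupportedArchiveTypes: []string{"tar.gz"}}
	fmt.Println(shouldExtract(rt, "tar.gz")) // false: bypass extraction
	fmt.Println(shouldExtract(rt, "zip"))    // true: extract
}
```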
pullman/helpers.go
Outdated
```go
destFilePath := filepath.Join(dest, zipFile.Name)

// Zip slip vulnerability check
if !strings.HasPrefix(destFilePath, filepath.Clean(dest)+string(os.PathSeparator)) {
```
nit: could move `filepath.Clean(dest)+string(os.PathSeparator)` before the loop
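The hoisted version of that check might look like the sketch below. `checkZipSlip` is an illustrative stand-in for the loop in `pullman/helpers.go`, with plain entry names substituting for `zipFile.Name`; the cleaned prefix is computed once instead of per iteration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// checkZipSlip mirrors the zip-slip check from the diff with the reviewer's
// nit applied: the cleaned destination prefix is computed once, before the
// loop, rather than on every entry.
func checkZipSlip(dest string, entryNames []string) error {
	destPrefix := filepath.Clean(dest) + string(os.PathSeparator)
	for _, name := range entryNames {
		destFilePath := filepath.Join(dest, name)
		// Zip slip vulnerability check
		if !strings.HasPrefix(destFilePath, destPrefix) {
			return fmt.Errorf("archive entry escapes destination: %s", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkZipSlip("/models/out", []string{"ok.txt"}))      // nil: safe path
	fmt.Println(checkZipSlip("/models/out", []string{"../evil.txt"})) // error: path escapes dest
}
```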
```go
return fmt.Errorf("error writing resource to local file '%s': %w", filename, err)

contentType := resp.Header.Get("Content-type")
if strings.Contains(contentType, "application/x-tar") || strings.Contains(contentType, "application/x-gtar") ||
	strings.Contains(contentType, "application/x-gzip") || strings.Contains(contentType, "application/gzip") {
```
There is also `application/tar` and `application/tar+gzip` (though I guess `Contains` would catch the latter).
Also wondering whether we should detect based on file extensions in the URL even if `Content-Type` isn't specified?
Sure, I can add file extension checks as fallbacks.
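A content-type check with a URL-extension fallback could be sketched as below. `detectArchiveType` and its returned format names are illustrative, not pullman's actual helper; only the detection order (headers first, then the URL path's extension) reflects the discussion above.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// detectArchiveType checks the Content-Type header first, then falls back to
// the file extension of the URL path when the header is empty or unrecognized.
func detectArchiveType(contentType, rawURL string) string {
	switch {
	case strings.Contains(contentType, "application/x-tar"),
		strings.Contains(contentType, "application/x-gtar"):
		return "tar"
	case strings.Contains(contentType, "gzip"):
		return "gzip"
	case strings.Contains(contentType, "application/zip"):
		return "zip"
	}
	// Fallback: inspect the URL's file extension.
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	switch {
	case strings.HasSuffix(u.Path, ".tar.gz"), strings.HasSuffix(u.Path, ".tgz"):
		return "gzip"
	case path.Ext(u.Path) == ".gz":
		return "gzip"
	case path.Ext(u.Path) == ".tar":
		return "tar"
	case path.Ext(u.Path) == ".zip":
		return "zip"
	}
	return ""
}

func main() {
	fmt.Println(detectArchiveType("", "https://example.com/model.tar.gz"))     // gzip (extension fallback)
	fmt.Println(detectArchiveType("application/zip", "https://example.com/m")) // zip (header match)
}
```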
pullman/helpers.go
Outdated
```go
return nil
}

// Extract a tar/tgz/tar.gz archive file into the provided destination directory.
```
This won't work currently for a non-gzipped tar, and we should probably also support gzipped non-tar (a single file). I think we need some additional differentiation logic, e.g. some of the content types should be unambiguous; just `application/gzip` might need to unzip and then inspect (or maybe peek at the beginning of the stream).
Based on my testing, this does actually work with non-gzipped tars even with the gzip reader included. Never mind, I was mistaken here (my tar file was actually gzipped). Will adjust the logic here.
For gzip-only support, as you say, the content-type `application/gzip` can be used for both `.gz` and `.tar.gz`. Would checking for a lone `.gz` extension and then sending it through the gzip reader suffice for this? I feel like gzip-only files probably aren't nearly as common as the tar.gz use case.
Force-pushed from `0c8d91e` to `f0d00a5`.
Updated this PR to rely on magic bytes for checking the file format (as opposed to the Content-Type headers). Tested with actual .gz, .tar.gz, .tar, and .zip files.
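Magic-byte detection along those lines can be sketched as below. `sniffFormat` is an illustrative helper, not the PR's actual code: gzip starts with `1f 8b`, zip with `PK\x03\x04`, and a tar file carries the `ustar` magic at byte offset 257 (so at least 262 bytes are needed to identify it).

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffFormat identifies an archive format from its leading bytes instead of
// trusting the Content-Type header.
func sniffFormat(b []byte) string {
	switch {
	case len(b) >= 2 && b[0] == 0x1f && b[1] == 0x8b:
		return "gzip" // gzip magic: 1f 8b
	case len(b) >= 4 && bytes.Equal(b[:4], []byte("PK\x03\x04")):
		return "zip" // zip local file header signature
	case len(b) >= 262 && bytes.Equal(b[257:262], []byte("ustar")):
		return "tar" // ustar magic at offset 257
	}
	return "unknown"
}

func main() {
	fmt.Println(sniffFormat([]byte{0x1f, 0x8b, 0x08})) // gzip
	fmt.Println(sniffFormat([]byte("PK\x03\x04rest"))) // zip
}
```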
This enables archive extraction for the zip format and the gz/tar/tar.gz formats. If a user provides an http(s) link to one of these formats, pullman will now automatically extract the contents into the destination directory. Signed-off-by: Paul Van Eck <pvaneck@us.ibm.com>
Thanks @pvaneck.
Another thing I thought of - since we are changing the filename (and possibly changing to a directory) after unzipping / untarring, we may need to add a way for the new model file/dir location to be returned from the puller. This is because the subsequent adapter logic expects a local file/dir in the location specified by the original model path (which in this case could end in e.g. `somefile.tar.gz`), and it won't find that.
Not sure if you had a chance to try this out with one of the built-in model servers (e.g. Triton), but unless I'm mistaken, I don't think it will work, which may limit the usefulness of this feature.
```go
		}
		os.Remove(filename)
	}
}
```
Could make sense here to split into two - check for gz or zip first and if one of them unzip and update the filename and extension vars. Then separately check extension var for tar after (which will be either original or unzipped).
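The suggested two-stage flow might look like the sketch below. Only the filename/extension bookkeeping is shown; `planExtraction` and the step names are illustrative stand-ins for the actual extraction helpers in `pullman/helpers.go`.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// planExtraction applies the two-stage check: first handle zip/gz
// (decompressing and updating the filename), then separately check the
// resulting extension, which may be the original or the decompressed one,
// for tar.
func planExtraction(filename string) []string {
	var steps []string
	switch filepath.Ext(filename) {
	case ".zip":
		steps = append(steps, "unzip")
		filename = strings.TrimSuffix(filename, ".zip")
	case ".tgz":
		steps = append(steps, "gunzip")
		filename = strings.TrimSuffix(filename, ".tgz") + ".tar"
	case ".gz":
		steps = append(steps, "gunzip")
		filename = strings.TrimSuffix(filename, ".gz")
	}
	// Second stage: check the (possibly updated) extension for tar.
	if filepath.Ext(filename) == ".tar" {
		steps = append(steps, "untar")
	}
	return steps
}

func main() {
	fmt.Println(planExtraction("model.tar.gz")) // [gunzip untar]
	fmt.Println(planExtraction("model.tar"))    // [untar]
	fmt.Println(planExtraction("model.gz"))     // [gunzip]
}
```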
🤔 Thanks for pointing that out. I think you are right. Let me do some more thorough end-to-end tests and get back to you.
/hold
This enables archive extraction/decompression for the zip format and the gz/tar/tar.gz formats.
If a user provides an http(s) link to one of these formats, pullman will now automatically
extract the contents into the destination directory.
Unit tests were added to test the extraction functions where archives are generated during the test.
Closes: #16