Updates Parse Runner Structure #70
Conversation
```go
func (h *handler) parsePortfolio(ctx context.Context, taskID task.ID, req *task.ParsePortfolioRequest) error {
	// Load the portfolio from blob storage, place it in /mnt/raw_portfolios, where
	// the `process_portfolios.R` script expects it to be.
	for _, assetID := range req.AssetIDs {
		srcURI := blob.Join(h.blob.Scheme(), h.sourcePortfolioContainer, assetID)
```
If we're not using the `sourcePortfolioContainer` anymore, can you remove the flag and associated config + terraform?

But also, I like the previous structure better because it means the runner is only ever pulling from a single bucket, whereas moving it to an opaque blob means the caller can set it to whatever, which introduces a (small) new attack vector. What's the motivation for the opaque `BlobURI`s?
We could pursue the approach you describe here, but it'd require fully reworking the blob storage architecture, and doing so in a way that isn't clearly better.

Why?

To know how to access or delete a blob, we need the full URI: the scheme, the container, and the blob ID. Since blobs appear in more contexts than just the source portfolio, we can't infer or extract that information in code (without duplicating the logic of what a blob is and how it is handled). So we can either represent a blob in the DB as a single full-URI string, or chunk it into parts (perhaps with enum values for the portfolio container, for example), so that, as you suggest, we'd be more certain we're always pulling from the same container.

That doesn't provide additional security, however; it just adds structure. That structure would make migrations harder, encourage introspection and more manual munging, and be more error-prone wherever blobs are read, deleted, or retrieved.
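For concreteness, here's a minimal sketch of the opaque-URI approach; the names (`BlobURI`, `Join`, `Parse`) are illustrative assumptions, not necessarily the repo's actual API:

```go
package blob

import (
	"fmt"
	"strings"
)

// BlobURI is an opaque, fully qualified blob reference, e.g.
// "gs://source-portfolios/abc123". Stored in the DB as a single string,
// it carries everything needed to read or delete the blob later.
type BlobURI string

// Join assembles a BlobURI from its scheme, container, and blob ID.
func Join(scheme, container, blobID string) BlobURI {
	return BlobURI(fmt.Sprintf("%s://%s/%s", scheme, container, blobID))
}

// Parse splits a BlobURI back into its parts, for the rare code paths
// that need to introspect it.
func Parse(uri BlobURI) (scheme, container, blobID string, err error) {
	scheme, rest, ok := strings.Cut(string(uri), "://")
	if !ok {
		return "", "", "", fmt.Errorf("blob: malformed URI %q", uri)
	}
	container, blobID, ok = strings.Cut(rest, "/")
	if !ok {
		return "", "", "", fmt.Errorf("blob: malformed URI %q", uri)
	}
	return scheme, container, blobID, nil
}
```

The point of the single-string representation is that callers never take the URI apart in normal operation; they hand the whole thing to the blob layer, which keeps the "what is a blob" logic in one place.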
```go
		AssetIDs: req.AssetIDs,
		Outputs:  out,
		TaskID:   taskID,
		Request:  req,
```
Why duplicate the whole request in the response? Wouldn't it be better to just have a standard API interface between srv + runner like anything else, where we input the stuff we need and output the stuff we need?
Round-tripped data in an async request paradigm has huge benefits:
- In this case, it removes the requirement that we write down details of the task in our database, since the info is available when the task completes. That removes the need for (a) a DB write and (b) error handling for the case where DB state mismatches the async task state (as it could after a transaction failure, for example).
- It doesn't require the async processing to know what will be important on the return trip (this turned out to be hugely important for a variety of things we built into async CM processing).
- No code changes are needed to round-trip more data (though you still need a binary deploy).
- With payloads as small as they are right now, there isn't a real cost (if that changes in the future, I agree we can reevaluate).

In short, this is the easier-to-maintain, more future-proof way of doing it. As you say, premature optimization is at the heart of many issues.
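To make the trade-off concrete, here's a minimal sketch of the round-trip pattern, with hypothetical type and field names modeled on the diff above (the real definitions in `task` may differ):

```go
package task

// ParsePortfolioRequest describes the work the runner should do.
type ParsePortfolioRequest struct {
	AssetIDs []string
}

// ParsePortfolioResponse echoes the full request back alongside the
// outputs, so the completion handler has the task's inputs in hand
// without a DB read.
type ParsePortfolioResponse struct {
	TaskID  string
	Request *ParsePortfolioRequest // round-tripped verbatim
	Outputs []string               // locations of the parsed artifacts
}

// onComplete never consults the DB: everything it needs arrived with
// the response, so there's no request state to reconcile against
// async task state.
func onComplete(resp *ParsePortfolioResponse) error {
	for _, id := range resp.Request.AssetIDs {
		_ = id
		// ... record outputs, clean up the source blob for id, etc.
	}
	return nil
}
```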