Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates Parse Runner Structure #70

Merged
merged 3 commits into from
Dec 6, 2023
Merged

Updates Parse Runner Structure #70

merged 3 commits into from
Dec 6, 2023

Conversation

gbdubs
Copy link
Contributor

@gbdubs gbdubs commented Nov 20, 2023

No description provided.

@gbdubs gbdubs requested a review from bcspragu November 20, 2023 22:12
func (h *handler) parsePortfolio(ctx context.Context, taskID task.ID, req *task.ParsePortfolioRequest) error {
// Load the portfolio from blob storage, place it in /mnt/raw_portfolios, where
// the `process_portfolios.R` script expects it to be.
for _, assetID := range req.AssetIDs {
srcURI := blob.Join(h.blob.Scheme(), h.sourcePortfolioContainer, assetID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not using the sourcePortfolioContainer any more, can you remove the flag and associated config + terraform?

But also, I like the previous structure better because it means the runner is only ever pulling from a single bucket, whereas moving it to an opaque blob means the caller can set it to whatever, which introduces a (small) new attack vector. What's the motivation for the opaque BlobURIs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could pursue the approach you describe here, but it'd require us fully reworking the blob storage architechure, and doing so in a way that isn't clearly better.

Why?

In order to know how to access or delete a blob we need to know the full URI, including the scheme, the container, and the blob ID. Since blobs are present in more context than just the source portfolio, we couldn't infer or extract that information into the code (without just duplicating the logic of what a blob is and how it is handled). Thus, we either can represent that in the DB as a full string in the DB, or chunk it into parts (perhaps represented by enum values for the portfolio container, ex), so that as you suggest here we would be more certain that we're always pulling from the same container.

That doesn't provide additional security, however, it just adds structure. That structure would make migrations harder, encourage introspection/more manual munging, and be more error prone in situations where they are read, deleted or retrieved.

AssetIDs: req.AssetIDs,
Outputs: out,
TaskID: taskID,
Request: req,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why duplicate the whole request in the response? Wouldn't it be better to just have a standard API interface between srv + runner like anything else, where we input the stuff we need and output the stuff we need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round-tripped data in an async request paradigm has huge benefits:

  • In this case, it removes the requirement we write down details of the task in our database, since the info is available on the task completing. This removes the need for (a) a DB write and (b) error handling in case the DB state mismatches with the async task state (as it could do in the case of a transaction failure, ex).
  • Doesn't require the async processing to know what is going to be important on the return trip (this turned out to be hugely important for a variety of things we built into async CM processing)
  • No code changes needed to round trip more data (though you still need a binary deploy)
  • With payloads as small as they are right now, there isn't a real cost (if that changes in the future I agree we can reevaluate)

In short this is the easier to maintain, more future proof way of doing it. As you say, premature optimization is at the heart of many issues.

cmd/runner/main.go Outdated Show resolved Hide resolved
cmd/runner/taskrunner/taskrunner.go Outdated Show resolved Hide resolved
cmd/runner/taskrunner/taskrunner.go Outdated Show resolved Hide resolved
@gbdubs gbdubs enabled auto-merge (squash) December 6, 2023 15:49
@gbdubs gbdubs merged commit 9e88ed4 into main Dec 6, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants