-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SAT-29018] Fix/corrupted RA blocks content streaming #6064
[SAT-29018] Fix/corrupted RA blocks content streaming #6064
Conversation
58375ba
to
d5f13af
Compare
) | ||
return server, remote | ||
|
||
yield _generate_server_and_remote |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved from test_acs
to facilitate creating an ACS + server.
I need an ACS here because it associated RA has priority on content-streaming.
with pytest.raises(ClientPayloadError, match="Response payload is not completed"): | ||
download_file(get_url) | ||
|
||
# Assert again with curl just to be sure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason:
This is testing the close-connection on digest validation.
The implementation on this PR cause the second request (from curl) gets a 404 instead, because the only (and corrupted) RA is temporarily ignored.
Here I wanted to test different clients, but testing only curl is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we added another call to curl right away, we would expect a different error, because the "only" RA is corrupted. right?
Would it be easy to add that to this test?
random.seed(seed) | ||
contents = random.randbytes(size) | ||
else: | ||
contents = os.urandom(size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason:
Facilitate creating two different files with same content to exercise corruption scenarios.
content_artifact.remoteartifact_set.select_related("remote") | ||
.order_by_acs() | ||
.exclude(failed_at__gte=timezone.now() - timedelta(minutes=5)) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I though it would be nice to start simple, but maybe this threshold could depend on the content size? Or be configurable via settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Configurable is probably better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mdellweg On the first approach that was a constant, but the "proper" value is subjective.
pulpcore/content/handler.py
Outdated
"- Forcing the connection to close.\n" | ||
"- Marking this Remote Server to be ignored for the next 5 minutes.\n\n" | ||
"If the Remote is permanently corrupted, we advice the admin to " | ||
"manually prune affected RemoteArtifacts/Remotes from Pulp." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not really a way to delete RemoteArtifacts directly without messing with the database, and I'm not sure we want to suggest in any way that the user should go digging in there. Deleting remotes is probably better, and we should probably suggest resyncing the repo as it might have been fixed at the source.
Typo "checksum" on line 1153
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not really a way to delete RemoteArtifacts directly without messing with the database, and I'm not sure we want to suggest in any way that the user should go digging in there.
Fair, I was thinking in the last case, but that is probably not a good general idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I mean, it CAN be done, occasionally if someone comes to us asking how to do it we'll help them out, but it's not a road we want to suggest going down
1b376db
to
03d1a11
Compare
pulpcore/app/settings.py
Outdated
# The time a RemoteArtifact will be ignored after failure. | ||
# In on-demand, if a fetching content from a remote failed due to corrupt data, | ||
# the corresponding RemoteArtifact will be ignored for that time (seconds). | ||
FAILED_REMOTE_ARTIFACT_PROTECTION_TIME = 5 * 60 # 5 minutes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds more like a cooldown to me, not a protection time.
Have we considered making this a constant before adding a new setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cooldown sounds good.
And yes, it was a constant first (marked you above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's not ping pong it.
with pytest.raises(ClientPayloadError, match="Response payload is not completed"): | ||
download_file(get_url) | ||
|
||
# Assert again with curl just to be sure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we added another call to curl right away, we would expect a different error, because the "only" RA is corrupted. right?
Would it be easy to add that to this test?
Yes, client will get a 404 because the RA will be ignored for the second attempt within the cooldown interval.
Yes, I can do that. |
dabb51e
to
4b4b195
Compare
pulpcore/app/settings.py
Outdated
@@ -298,7 +298,7 @@ | |||
# The time a RemoteArtifact will be ignored after failure. | |||
# In on-demand, if a fetching content from a remote failed due to corrupt data, | |||
# the corresponding RemoteArtifact will be ignored for that time (seconds). | |||
FAILED_REMOTE_ARTIFACT_PROTECTION_TIME = 5 * 60 # 5 minutes | |||
FAILED_REMOTE_ARTIFACT_COOLDOWN_TIME = 5 * 60 # 5 minutes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to find a snappier name here: What about REMOTE_ARTIFACT_SKIP_COOLDOWN
?
Also I think we have a bit of a containment issue here. RemoteArtifact
is a pure implementation detail. Users and administrators will never see them. So referencing them here may be confusing. (This is me probably overthinking this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree RemoteArtifact is an implementation detail to some degree, but there is also a concept of a remote source for an artifact which is something meaningful to users. They know there are possible multiple sources for the same content (alternate content sources, for example), and that's what we need to express here. So yes, maybe RemoteArtifact doesnt express that very well for the uninitiated.
Personally I don't think REMOTE_ARTIFACT_SKIP_COOLDOWN
much better, it reads very vaguely if we were not inside the context of this problem. But that's also true for the current one, so I'll be ok with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about:
REMOTE_CONTENT_FETCH_FAILURE_COOLDOWN
FAILED_REMOTE_CONTENT_FETCH_COOLDOWN
REMOTE_CONTENT_FETCH_COOLDOWN
.ON_DEMAND_CONTENT_FETCH_COOLDOWN
ON_DEMAND_REMOTE_COOLDOWN
IHO the "content/artifact fetch" here improves the scoping of that.
Those should generally read as "cooldown for when content fetching from a remote fails".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REMOTE_CONTENT_FETCH_FAILURE_COOLDOWN
is still a mouth full, but probably the best of all the suggestions we came up with. (Being a setting means we cannot easily change it later.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All my concerns have been addressed. Time for squashing?
On a request for on-demand content in the content app, a corrupted Remote that contains the wrong binary (for that content) prevented other Remotes from being attempted on future requests. Now the last failed Remotes are temporarily ignored and others may be picked. Closes pulp#5725
215397e
to
0dff938
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - tests look good to exercise the new codepaths, great discussion.
Backport to 3.39: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 0a5ac4a on top of patchback/backports/3.39/0a5ac4aaf0f28b1055ee421a2ec35c65726039b0/pr-6064 Backporting merged PR #6064 into main
🤖 @patchback |
Backport to 3.22: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 0a5ac4a on top of patchback/backports/3.22/0a5ac4aaf0f28b1055ee421a2ec35c65726039b0/pr-6064 Backporting merged PR #6064 into main
🤖 @patchback |
Backport to 3.49: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 0a5ac4a on top of patchback/backports/3.49/0a5ac4aaf0f28b1055ee421a2ec35c65726039b0/pr-6064 Backporting merged PR #6064 into main
🤖 @patchback |
Backport to 3.28: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 0a5ac4a on top of patchback/backports/3.28/0a5ac4aaf0f28b1055ee421a2ec35c65726039b0/pr-6064 Backporting merged PR #6064 into main
🤖 @patchback |
Backport to 3.68: 💚 backport PR created✅ Backport PR branch: Backported as #6160 🤖 @patchback |
Backport to 3.63: 💔 cherry-picking failed — conflicts found❌ Failed to cleanly apply 0a5ac4a on top of patchback/backports/3.63/0a5ac4aaf0f28b1055ee421a2ec35c65726039b0/pr-6064 Backporting merged PR #6064 into main
🤖 @patchback |
On a request for on-demand content in the content app, a corrupted Remote that contains the wrong binary (for that content) prevented other Remotes from being attempted on future requests. Now the last failed Remotes are temporarily ignored and others may be picked. [BACKPORT NOTES] * removed the failed_at migration. * on a different commit, a unique migration to this branch will be used as a replacement Original issue pulp#5725 Backported from: pulp#6064
depends on #6026
closes #5725