
8338534: GenShen: Handle alloc failure differently when immediate garbage is pending #479

Open
wants to merge 29 commits into master

Conversation

kdnilsen
Contributor

@kdnilsen kdnilsen commented Aug 21, 2024

Several changes are implemented here:

  1. Re-order the phases that execute immediately after final mark so that concurrent cleanup runs sooner (but still after concurrent weak-reference processing).
  2. After immediate garbage has been reclaimed by concurrent cleanup, notify waiting allocators.
  3. If an allocation failure occurs while immediate-garbage recycling is pending, stall the allocation but do not cancel the concurrent GC.

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8338534: GenShen: Handle alloc failure differently when immediate garbage is pending (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/shenandoah.git pull/479/head:pull/479
$ git checkout pull/479

Update a local copy of the PR:
$ git checkout pull/479
$ git pull https://git.openjdk.org/shenandoah.git pull/479/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 479

View PR using the GUI difftool:
$ git pr show -t 479

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/shenandoah/pull/479.diff

Webrev

Link to Webrev Comment

When we round up, we introduce the risk that the new size exceeds the
maximum LAB size, resulting in an assertion error (a sketch of this concern
follows these commit notes).

This may allow us to reclaim immediate garbage more quickly.

We just reported freeset status after rebuilding the freeset in final mark.
There will be heavy contention on the heap lock at the start of evacuation
as many mutator and worker threads prep their GCLAB and PLAB buffers for
evacuation. We avoid some lock contention by not reporting freeset status
here.

If we notify waiting mutators after immediate garbage is reclaimed, do not
clear the alloc-failure flag. Otherwise, this hides the fact that an alloc
failure occurred during this GC.
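
A minimal sketch of the rounding concern in the first commit note above. It is
illustrative only: align_up and MIN2 are standard HotSpot utilities, but
requested_words, alignment_words, and max_lab_size_words are placeholder names,
and the clamp is one possible way to avoid the assertion, not necessarily the
change made in this PR.

  // Rounding a request up to an alignment boundary can push it past the maximum LAB size.
  size_t new_size = align_up(requested_words, alignment_words);
  // Clamp back to the maximum so the downstream assert on LAB size cannot fire.
  new_size = MIN2(new_size, max_lab_size_words);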
@bridgekeeper

bridgekeeper bot commented Aug 21, 2024

👋 Welcome back kdnilsen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Aug 21, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@kdnilsen
Contributor Author

kdnilsen commented Aug 21, 2024

BTW, I can't figure out what jcheck whitespace is complaining about. I think it is reporting the wrong line number. jcheck on my local copy does not report a whitespace problem.

@ysramakrishna
Member

BTW, I can't figure out what jcheck whitespace is complaining about. I think it is reporting the wrong line number. jcheck on my local copy does not report a whitespace problem.

Yes, for some reason git jcheck -s doesn't find it.

There are indeed two whitespace characters in that file (matching the error message), on an otherwise blank line. The line number is incorrect, though, as you stated: it should be line 2467.

diff --git a/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp b/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp
index cbcd22c053f..342c023d180 100644
--- a/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp
+++ b/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp
@@ -2464,7 +2464,7 @@ void ShenandoahHeap::rebuild_free_set(bool concurrent) {
   _free_set->prepare_to_rebuild(young_cset_regions, old_cset_regions, first_old_region, last_old_region, old_region_count);
   size_t anticipated_immediate_garbage = (old_cset_regions + young_cset_regions) * ShenandoahHeapRegion::region_size_words();
   control_thread()->anticipate_immediate_garbage(anticipated_immediate_garbage);
-  
+
   // If there are no old regions, first_old_region will be greater than last_old_region
   assert((first_old_region > last_old_region) ||
          ((last_old_region + 1 - first_old_region >= old_region_count) &&

@ysramakrishna
Member

% find . -type f -name "*pp" | xargs grep -n " $" 
./src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:2467:  

@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 30, 2024
@mlbridge

mlbridge bot commented Aug 30, 2024

Webrevs

Member

@ysramakrishna ysramakrishna left a comment


Several changes are implemented here:

  1. Re-order the phases that execute immediately after final mark so that concurrent cleanup runs sooner (but still after concurrent weak-reference processing).
  2. After immediate garbage has been reclaimed by concurrent cleanup, notify waiting allocators.
  3. If an allocation failure occurs while immediate-garbage recycling is pending, stall the allocation but do not cancel the concurrent GC.

As you suggested offline, I agree that it might make sense to do (1) separately, and then do (2+3).

Left a few comments mainly on (2+3), but I'm still missing the solution to the problem described in the ticket. I'll chat with you offline to get clarity.

@@ -71,6 +73,8 @@ class ShenandoahController: public ConcurrentGCThread {
// until another cycle runs and clears the alloc failure gc flag.
void handle_alloc_failure(ShenandoahAllocRequest& req, bool block);

void anticipate_immediate_garbage(size_t anticipated_immediate_garbage_words);
Member


Please add a one-line documentation comment on the role of the field (and note that this method sets it). Why not simply call it set_foo(value) for a field _foo? I realize you want readers to get the most recently written value, hence the atomic store & load.
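
A hedged sketch of what this comment asks for, using the field name
_anticipated_immediate_garbage that appears in the hunks below; the comment
wording and the setter/getter pair are illustrative, not the PR's code.

  // Words of immediate garbage anticipated by the most recent freeset rebuild;
  // reset once concurrent cleanup has recycled the trash regions.
  volatile size_t _anticipated_immediate_garbage;

  // Sets _anticipated_immediate_garbage. Atomic store/load so that allocating
  // threads always observe the most recently written value.
  void set_anticipated_immediate_garbage(size_t words) {
    Atomic::store(&_anticipated_immediate_garbage, words);
  }

  size_t anticipated_immediate_garbage() {
    return Atomic::load(&_anticipated_immediate_garbage);
  }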

Comment on lines +101 to +104
  if (clear_alloc_failure) {
    _alloc_failure_gc.unset();
    _humongous_alloc_failure_gc.unset();
  }
Member


For good hygiene, I'd move the variable value changes into the monitor which is held when waiting or notifying. I realize this doesn't matter for correctness, but makes debugging easier.

Further, if you protect the updates and reads of the variables with the lock, you don't need to do the extra atomic ops.

You'd need to examine all sets/gets and waits/notifies to make sure this works, but I am guessing it will, and it'll also improve performance.

However, that can be done in a separate effort if you prefer; I'm happy to file a separate ticket for that investigation/change.
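
A rough sketch of the monitor-protected pattern this comment describes; the
monitor name _alloc_failure_waiters_lock and the MonitorLocker usage are
assumptions for illustration, not the PR's code.

  void ShenandoahController::notify_alloc_failure_waiters(bool clear_alloc_failure) {
    MonitorLocker ml(&_alloc_failure_waiters_lock);
    if (clear_alloc_failure) {
      // Flag updates happen while the monitor is held, so a waiter that re-checks
      // the flags under the same monitor can never observe a half-updated state,
      // and no separate atomic operations are required.
      _alloc_failure_gc.unset();
      _humongous_alloc_failure_gc.unset();
    }
    ml.notify_all();
  }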

Member


I realize now that this idiom is quite pervasive in Shenandoah code, so just fixing this instance of it doesn't accomplish much at this time. I am not convinced it's a good idiom. I'll investigate this separately. I vaguely recall a discussion along these lines in an older PR.

I'll file a separate ticket for this; you can ignore this remark for the purposes of this PR.

ShenandoahHeap::heap()->free_set()->recycle_trash();
ShenandoahHeap* heap = ShenandoahHeap::heap();
if (heap->free_set()->recycle_trash()) {
heap->control_thread()->notify_alloc_failure_waiters(false);
Member


Can you motivate this notification? As far as I can tell, all waiters will react to the notification by waking up, finding that the variables are still set, and consequently go back to wait.

I am sure I am missing something here. Or did you intend, but omit, a change that lets waiters retry their allocation after waking up, and go back to sleep if they don't succeed?

A documentation comment would definitely help cross the t's and dot the i's for the reader.
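
For context, a simplified sketch of the wait loop the comment is reasoning
about (names assumed, not quoted from the PR): because
notify_alloc_failure_waiters(false) leaves the flags set, the loop condition is
still true after the wakeup and the thread simply waits again unless it retries
its allocation somewhere in between.

  MonitorLocker ml(&_alloc_failure_waiters_lock);
  while (is_alloc_failure_gc()) {
    ml.wait();   // woken by notify_alloc_failure_waiters(), but the flag is still set
  }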

@@ -2462,6 +2462,9 @@ void ShenandoahHeap::rebuild_free_set(bool concurrent) {
size_t young_cset_regions, old_cset_regions;
size_t first_old_region, last_old_region, old_region_count;
_free_set->prepare_to_rebuild(young_cset_regions, old_cset_regions, first_old_region, last_old_region, old_region_count);
size_t anticipated_immediate_garbage = (old_cset_regions + young_cset_regions) * ShenandoahHeapRegion::region_size_words();
control_thread()->anticipate_immediate_garbage(anticipated_immediate_garbage);

Member


This is the line that has the two whitespace characters; see the jcheck whitespace error above.

}

void ShenandoahFreeSet::recycle_trash() {
bool ShenandoahFreeSet::recycle_trash() {
bool result = false;
Member


I'd take the opportunity to do some counting verification here.

int n_trash_regions = 0;

Read on ...

Comment on lines +1271 to +1272
_heap->control_thread()->anticipate_immediate_garbage((size_t) 0);
return result;
Member


...

  clear_anticipated_immediate_garbage(n_trash_regions * HeapRegionSize);
  return n_trash_regions > 0;

with

void ...::clear_anticipated_immediate_garbage(size_t aig) {
  assert(_anticipated_immediate_garbage == aig, "Mismatch?");
  _anticipated_immediate_garbage = 0;
}

Is this an intended invariant? I think it is, but don't understand enough of the details to be certain.
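
Assembling the two fragments above into one hedged sketch (the loop structure,
the clear_anticipated_immediate_garbage name, and the use of
region_size_words() to match the word count recorded at final mark are
assumptions for illustration, not the PR's code):

  bool ShenandoahFreeSet::recycle_trash() {
    size_t n_trash_regions = 0;
    for (size_t i = 0; i < _heap->num_regions(); i++) {
      if (try_recycle_trashed(_heap->get_region(i))) {
        n_trash_regions++;
      }
    }
    // Report how much was recycled so the controller can verify and reset its estimate.
    size_t recycled_words = n_trash_regions * ShenandoahHeapRegion::region_size_words();
    _heap->control_thread()->clear_anticipated_immediate_garbage(recycled_words);
    return n_trash_regions > 0;
  }

  void ShenandoahController::clear_anticipated_immediate_garbage(size_t recycled_words) {
    // The assert captures the invariant in question: everything anticipated at
    // final mark should have been recycled here.
    assert(Atomic::load(&_anticipated_immediate_garbage) == recycled_words, "Mismatch?");
    Atomic::store(&_anticipated_immediate_garbage, (size_t) 0);
  }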

@@ -53,6 +53,10 @@ size_t ShenandoahController::get_gc_id() {
return Atomic::load(&_gc_id);
}

void ShenandoahController::anticipate_immediate_garbage(size_t anticipated_immediate_garbage) {
Member


Suggested rename: see further above, set_<field>.

Comment on lines +1236 to 1243
bool ShenandoahFreeSet::try_recycle_trashed(ShenandoahHeapRegion* r) {
  bool result = false;
  if (r->is_trash()) {
    r->recycle();
    result = true;
  }
  return true;
}
Member


If I understood your intent, I think this has a bug because it always returns true here. I believe you just wanted:

    if (r->is_trash()) {
      r->recycle();
      return true;
    }
    return false;

@@ -753,6 +753,9 @@ void ShenandoahGeneration::prepare_regions_and_collection_set(bool concurrent) {
// We are preparing for evacuation. At this time, we ignore cset region tallies.
size_t first_old, last_old, num_old;
heap->free_set()->prepare_to_rebuild(young_cset_regions, old_cset_regions, first_old, last_old, num_old);
size_t anticipated_immediate_garbage = (old_cset_regions + young_cset_regions) * ShenandoahHeapRegion::region_size_words();
Member


This makes it sound like old_cset_regions & young_cset_regions hold all regions that will be part of immediate garbage -- which indeed is the case. The name prepare_to_rebuild() gives little clue that this has happened by the time the call returns, and the API spec of the method does not explicitly say so. One needs to read the code of find_regions_with_alloc_capacity() to realize that this is what is happening.

In summary: firstly, some of these methods need tighter header-file documentation of the post-conditions that callers rely on; secondly, given the extremely fat APIs (lots of reference variables modified by these methods), some amount of refactoring is needed in the longer term. The refactoring should be a separate effort, but in the shorter term the API/spec documentation of prepare_to_rebuild and find_regions_with_alloc_capacity could be improved.
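
One possible shape for the header documentation being requested; the wording,
and the _trashed_ parameter names (borrowed from the naming suggestion in the
next comment), are illustrative only, not proposed verbatim in the PR:

  // Prepare to rebuild the free set. On return:
  //  - young_trashed_regions / old_trashed_regions count the regions that are already
  //    trash (immediate garbage) and will be recycled by the next concurrent cleanup;
  //  - first_old_region and last_old_region bracket the old regions found by
  //    find_regions_with_alloc_capacity(), and old_region_count is the number of such
  //    regions (first_old_region > last_old_region when there are none).
  void prepare_to_rebuild(size_t& young_trashed_regions, size_t& old_trashed_regions,
                          size_t& first_old_region, size_t& last_old_region,
                          size_t& old_region_count);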

Contributor


The names old_cset_regions and young_cset_regions are confusing as well. These regions are already trash before the collection set is chosen during final mark (and so will not themselves be part of the collection set). Suggest calling them old_trashed_regions and young_trashed_regions here.

@@ -474,6 +474,9 @@ void ShenandoahOldGeneration::prepare_regions_and_collection_set(bool concurrent
size_t cset_young_regions, cset_old_regions;
size_t first_old, last_old, num_old;
heap->free_set()->prepare_to_rebuild(cset_young_regions, cset_old_regions, first_old, last_old, num_old);
size_t anticipated_immediate_garbage = (cset_young_regions + cset_old_regions) * ShenandoahHeapRegion::region_size_words();
heap->control_thread()->anticipate_immediate_garbage(anticipated_immediate_garbage);
Member


Suggest set_<foo> for a method that changes the value of field <foo>.

@@ -65,11 +69,12 @@ void ShenandoahController::handle_alloc_failure(ShenandoahAllocRequest& req, boo
req.type_string(),
byte_size_in_proper_unit(req.size() * HeapWordSize), proper_unit_for_byte_size(req.size() * HeapWordSize));

// Now that alloc failure GC is scheduled, we can abort everything else
heap->cancel_gc(GCCause::_allocation_failure);
if (Atomic::load(&_anticipated_immediate_garbage) < req.size()) {
Contributor


To make sure I understand... here we are saying that if final mark anticipates this much immediate garbage (computed when it rebuilt the freeset after choosing the collection set), then we aren't going to cancel the GC if this particular request could be satisfied. Instead we will block as though the gc has already been cancelled. This thread will be notified when concurrent cleanup completes.
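
A condensed sketch of the control flow described above, paraphrasing the diff
hunk rather than quoting the PR's code:

  if (Atomic::load(&_anticipated_immediate_garbage) < req.size()) {
    // The pending immediate garbage cannot satisfy this request: cancel the cycle
    // and fall back to the usual allocation-failure GC.
    heap->cancel_gc(GCCause::_allocation_failure);
  }
  // Either way the allocating thread blocks; when concurrent cleanup recycles the
  // trash regions it calls notify_alloc_failure_waiters(false), which wakes the
  // thread without clearing the alloc-failure flags.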

_alloc_failure_gc.unset();
_humongous_alloc_failure_gc.unset();
void ShenandoahController::notify_alloc_failure_waiters(bool clear_alloc_failure) {
if (clear_alloc_failure) {
Contributor


Why would we not clear the alloc failure? This seems like it would confuse the control thread. Isn't this going to have the control thread attempt to notify alloc failure waiters again when the cycle is finished?

@bridgekeeper

bridgekeeper bot commented Sep 28, 2024

@kdnilsen This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@ysramakrishna
Member

@kdnilsen : Should this become a draft for now?
