Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a leak of open Blob files #13106

Closed
wants to merge 4 commits into from

Conversation

pdillinger
Copy link
Contributor

@pdillinger pdillinger commented Oct 31, 2024

Summary: An earlier change (b34cef5) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement).

Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache.

Fixes #13066

Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix.

Test Plan: added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests:

db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge
db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase
db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache
db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection

Summary: ...

Fixes facebook#13066

Important follow-up (FIXME): The added check discovered some apparent
cases of leaked (into table_cache) SST file readers that would stick
around until DB::Close(). Need to enable that check, diagnose, and fix.

Test Plan: added a check that is called during DB::Close in ASAN builds
(to minimize paying the cost in all unit tests). Without the fix, the
check failed in at least these tests:

```
db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge
db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase
db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache
db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection
```
Copy link
Contributor

@ltamasi ltamasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix @pdillinger !

cfd->current()->AddLiveFiles(&live_table_files, &live_blob_files);
}

std::set<uint64_t> live_files;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting live_table_files and live_blob_files and just looking in the appropriate vector based on file type would be probably a bit more efficient (but no big deal since this is test/debug code)

if (vs) {
assert(ioptions);
assert(!ioptions->cf_paths.empty());
assert(shared_meta);
assert(bc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like cfd_ and thus bc can be nullptr. Should this be a check instead (if (bc) { bc->Evict(...); }) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

// contains entries for live files, to check for effective leaks of open
// files. This can only be called when purging of obsolete files has
// "settled," such as during parts of DB Close().
void TEST_VerifyNoObsoleteFilesCached(bool db_mutex_already_held) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might have to move this declaration to the #ifdef NDEBUG block above and #ifdef the definition+call too to fix those linker errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't see that db_impl_debug.cc is already in a giant ifdef

@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

continue;
}
// Sneakily add both SST and blob files to the same list
cfd->current()->AddLiveFiles(&live_files, &live_files);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one more question here: unlike the previous logic, this only adds the live files from the current Version; is it possible that when this method is called, we still have e.g. iterators or compactions holding on to earlier Versions? (Obsolete files would be the ones that are not needed by any Versions that are in use, not just the latest.)

Copy link
Contributor Author

@pdillinger pdillinger Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can only be called when purging of obsolete files has
"settled," such as during parts of DB Close().

You can't close with open iterators, but I should be more specific about multiple live versions outside of Close()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to T206372901

@facebook-github-bot
Copy link
Contributor

@pdillinger merged this pull request in a28cc4a.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 1, 2024
Summary:
An earlier change (facebook@b34cef5) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement).

Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache.

Fixes facebook#13066

Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix.

Pull Request resolved: facebook#13106

Test Plan:
added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests:

```
db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge
db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase
db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache
db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection
```

Reviewed By: ltamasi

Differential Revision: D65296123

Pulled By: pdillinger

fbshipit-source-id: 2276d76482beb2c75c9010bc1bec070bb23a24c0
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 1, 2024
Summary:
An earlier change (facebook@b34cef5) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement).

Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache.

Fixes facebook#13066

Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix.

Pull Request resolved: facebook#13106

Test Plan:
added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests:

```
db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge
db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase
db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache
db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection
```

Reviewed By: ltamasi

Differential Revision: D65296123

Pulled By: pdillinger

fbshipit-source-id: 2276d76482beb2c75c9010bc1bec070bb23a24c0
pdillinger added a commit that referenced this pull request Nov 1, 2024
Summary:
An earlier change (b34cef5) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement).

Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache.

Fixes #13066

Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix.

Pull Request resolved: #13106

Test Plan:
added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests:

```
db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge
db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase
db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache
db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection
```

Reviewed By: ltamasi

Differential Revision: D65296123

Pulled By: pdillinger

fbshipit-source-id: 2276d76482beb2c75c9010bc1bec070bb23a24c0
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 5, 2024
Summary: Follow-up to facebook#13106 which revealed that some SST file readers
(in addition to blob files) were being essentially leaked in TableCache
(until DB::Close() time). Patched sources of leaks:
* Flush that is not committed (builder.cc)
* Various obsolete SST files picked up by directory scan but not caught
  by SubcompactionState::Cleanup() cleaning up from some failed
  compactions. Dozens of unit tests fail without the "backstop"
  TableCache::Evict() call in PurgeObsoleteFiles().

We also needed to adjust the check for leaks as follows:
* Ok if DB::Open never finished (see comment)
* Ok if deletions are disabled (see comment)
* Allow "quarantined" files to be in table_cache because (presumably)
  they might become live again.
* Get live files from all live Versions.

Suggested follow-up:
* Potentially delete more obsolete files sooner with a FIXME in
  db_impl_files.cc. This could potentially be high value because it
  seems to gate deletion of any/all newer obsolete files on all
  older compactions finishing.
* Try to catch obsolete files in more places using the
  VersionSet::obsolete_files_ pipeline rather than relying on them
  being picked up with directory scan, or deleting them outside of
  normal mechanisms.

Test Plan: updated check used in most all unit tests in ASAN build
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 5, 2024
Summary: Follow-up to facebook#13106 which revealed that some SST file readers
(in addition to blob files) were being essentially leaked in TableCache
(until DB::Close() time). Patched sources of leaks:
* Flush that is not committed (builder.cc)
* Various obsolete SST files picked up by directory scan but not caught
  by SubcompactionState::Cleanup() cleaning up from some failed
  compactions. Dozens of unit tests fail without the "backstop"
  TableCache::Evict() call in PurgeObsoleteFiles().

We also needed to adjust the check for leaks as follows:
* Ok if DB::Open never finished (see comment)
* Ok if deletions are disabled (see comment)
* Allow "quarantined" files to be in table_cache because (presumably)
  they might become live again.
* Get live files from all live Versions.

Suggested follow-up:
* Potentially delete more obsolete files sooner with a FIXME in
  db_impl_files.cc. This could potentially be high value because it
  seems to gate deletion of any/all newer obsolete files on all
  older compactions finishing.
* Try to catch obsolete files in more places using the
  VersionSet::obsolete_files_ pipeline rather than relying on them
  being picked up with directory scan, or deleting them outside of
  normal mechanisms.

Test Plan: updated check used in most all unit tests in ASAN build
facebook-github-bot pushed a commit that referenced this pull request Nov 8, 2024
Summary:
Follow-up to #13106 which revealed that some SST file readers (in addition to blob files) were being essentially leaked in TableCache (until DB::Close() time). Patched sources of leaks:
* Flush that is not committed (builder.cc)
* Various obsolete SST files picked up by directory scan but not caught by SubcompactionState::Cleanup() cleaning up from some failed compactions. Dozens of unit tests fail without the "backstop" TableCache::Evict() call in PurgeObsoleteFiles().

We also needed to adjust the check for leaks as follows:
* Ok if DB::Open never finished (see comment)
* Ok if deletions are disabled (see comment)
* Allow "quarantined" files to be in table_cache because (presumably) they might become live again.
* Get live files from all live Versions.

Suggested follow-up:
* Potentially delete more obsolete files sooner with a FIXME in db_impl_files.cc. This could potentially be high value because it seems to gate deletion of any/all newer obsolete files on all older compactions finishing.
* Try to catch obsolete files in more places using the VersionSet::obsolete_files_ pipeline rather than relying on them being picked up with directory scan, or deleting them outside of normal mechanisms.

Pull Request resolved: #13117

Test Plan: updated check used in most all unit tests in ASAN build

Reviewed By: hx235

Differential Revision: D65502988

Pulled By: pdillinger

fbshipit-source-id: aa0795a8a09d9ec578d25183fe43e2a35849209c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strange leak of open file handles in Java using BlobDB
3 participants