Releases: facebook/rocksdb
RocksDB Release v5.18.3
Rocksdb Change Log
5.18.3 (2/11/2019)
Bug Fixes
- Fix possible LSM corruption when both range deletions and subcompactions are used. The symptom of this corruption is L1+ files overlapping in the user key space.
5.18.2 (01/31/2019)
Public API Change
- Change time resolution in FileOperationInfo.
- Deletion of blob files also goes through SstFileManager.
5.18.0 (11/30/2018)
New Features
- Introduced `JemallocNodumpAllocator` memory allocator. When it is in use, the block cache is excluded from core dumps.
- Introduced `PerfContextByLevel` as part of `PerfContext`, which allows storing perf context at each level. Also replaced `__thread` with the `thread_local` keyword for perf_context. Added per-level perf context for bloom filter and `Get` query.
- With level_compaction_dynamic_level_bytes = true, level multiplier may be adjusted automatically when Level 0 to 1 compaction is lagged behind.
- Introduced DB option `atomic_flush`. If true, RocksDB supports flushing multiple column families and atomically committing the result to MANIFEST. Useful when WAL is disabled.
- Added `num_deletions` and `num_merge_operands` members to `TableProperties`.
- Added "rocksdb.min-obsolete-sst-number-to-keep" DB property that reports the lower bound on SST file numbers that are being kept from deletion, even if the SSTs are obsolete.
- Add xxhash64 checksum support
- Introduced `MemoryAllocator`, which lets the user specify a custom memory allocator for block based table.
- Improved `DeleteRange` to prevent read performance degradation. The feature is no longer marked as experimental.
- Enabled checkpoint on readonly db (DBImplReadOnly).
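To illustrate what a user-supplied allocator can look like, here is a minimal sketch. The `sketch::MemoryAllocator` interface below is a local stand-in written for this example and is an assumption, not a copy of the RocksDB header; the real class may declare different or additional methods. `CountingAllocator` and `live()` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdlib>

namespace sketch {

// Local stand-in for an allocator interface (assumption for this sketch).
class MemoryAllocator {
 public:
  virtual ~MemoryAllocator() = default;
  virtual const char* Name() const = 0;
  virtual void* Allocate(std::size_t size) = 0;
  virtual void Deallocate(void* p) = 0;
};

// Forwards to malloc/free and counts outstanding allocations.
// Not thread-safe; a real allocator shared by a cache would need to be.
class CountingAllocator : public MemoryAllocator {
 public:
  const char* Name() const override { return "CountingAllocator"; }

  void* Allocate(std::size_t size) override {
    void* p = std::malloc(size);
    if (p != nullptr) ++live_;
    return p;
  }

  void Deallocate(void* p) override {
    if (p != nullptr) {
      --live_;
      std::free(p);
    }
  }

  // Number of allocations that have not been released yet.
  std::size_t live() const { return live_; }

 private:
  std::size_t live_ = 0;
};

}  // namespace sketch
```

A counter like `live()` is only there to make the allocator's effect observable; how an allocator is attached to the block based table is not shown here.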
Public API Change
- `DBOptions::use_direct_reads` now affects reads issued by `BackupEngine` on the database's SSTs.
- `NO_ITERATORS` is divided into two counters, `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`. Both of them are only increasing now, just as other counters.
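Because both counters only increase, code that relied on a single up-and-down iterator count can derive it from the difference. A minimal sketch, assuming the two ticker values have already been read into a plain struct (`IteratorTickers` and `LiveIterators` are hypothetical names, not RocksDB API):

```cpp
#include <cstdint>

// Hypothetical holder for two ticker values read from a statistics object.
// The fields mirror NO_ITERATOR_CREATED and NO_ITERATOR_DELETE.
struct IteratorTickers {
  std::uint64_t created = 0;  // only ever increases
  std::uint64_t deleted = 0;  // only ever increases
};

// Iterators currently alive: created ones that have not been deleted yet.
// The guard avoids unsigned wrap-around if the two values were sampled at
// slightly different moments.
inline std::uint64_t LiveIterators(const IteratorTickers& t) {
  return t.created >= t.deleted ? t.created - t.deleted : 0;
}
```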
Bug Fixes
- Fix corner case where a write group leader blocked due to write stall blocks other writers in queue with WriteOptions::no_slowdown set.
- Fix in-memory range tombstone truncation to avoid erroneously covering newer keys at a lower level, and include range tombstones in compacted files whose largest key is the range tombstone's start key.
- Properly set the stop key for a truncated manual CompactRange
- Fix slow flush/compaction when DB contains many snapshots. The problem became noticeable to us in DBs with 100,000+ snapshots, though it will affect others at different thresholds.
- Fix the bug that WriteBatchWithIndex's SeekForPrev() doesn't see the entries with the same key.
- Fix the bug where user comparator was sometimes fed with InternalKey instead of the user key. The bug manifests during GenerateBottommostFiles.
- Fix a bug in WritePrepared txns where if the number of old snapshots goes beyond the snapshot cache size (128 default) the rest will not be checked when evicting a commit entry from the commit cache.
- Fixed Get correctness bug in the presence of range tombstones where merge operands covered by a range tombstone always result in NotFound.
- Start populating `NO_FILE_CLOSES` ticker statistic, which was always zero previously.
- The default value of NewBloomFilterPolicy()'s argument use_block_based_builder is changed to false. Note that this new default may cause large temp memory usage when building very large SST files.
- Fix a deadlock caused by compaction and file ingestion waiting for each other in the event of write stalls.
- Make DB ignore dropped column families while committing results of atomic flush.
RocksDB Release v5.17.2
Rocksdb Change Log
5.17.2 (10/24/2018)
Bug Fixes
- Fix the bug that WriteBatchWithIndex's SeekForPrev() doesn't see the entries with the same key.
5.17.1 (10/16/2018)
Bug Fixes
- Fix slow flush/compaction when DB contains many snapshots. The problem became noticeable to us in DBs with 100,000+ snapshots, though it will affect others at different thresholds.
- Properly set the stop key for a truncated manual CompactRange
- Fix corner case where a write group leader blocked due to write stall blocks other writers in queue with WriteOptions::no_slowdown set.
New Features
- Introduced CacheAllocator, which lets the user specify custom allocator for memory in block cache.
5.17.0 (10/05/2018)
Public API Change
- `OnTableFileCreated` will now be called for empty files generated during compaction. In that case, `TableFileCreationInfo::file_path` will be "(nil)" and `TableFileCreationInfo::file_size` will be zero.
- Add `FlushOptions::allow_write_stall`, which controls whether Flush calls start working immediately, even if it causes user writes to stall, or will wait until flush can be performed without causing write stall (similar to `CompactRangeOptions::allow_write_stall`). Note that the default value is false, meaning we add delay to Flush calls until stalling can be avoided when possible. This is a behavior change compared to previous RocksDB versions, where Flush calls didn't check if they might cause stall or not.
- Applications using PessimisticTransactionDB are expected to rollback/commit recovered transactions before starting new ones. This assumption is used to skip concurrency control during recovery.
RocksDB Release v5.16.6
Rocksdb Change Log
5.16.6 (10/24/2018)
Bug Fixes
- Fix the bug that WriteBatchWithIndex's SeekForPrev() doesn't see the entries with the same key.
5.16.5 (10/16/2018)
Bug Fixes
- Fix slow flush/compaction when DB contains many snapshots. The problem became noticeable to us in DBs with 100,000+ snapshots, though it will affect others at different thresholds.
- Properly set the stop key for a truncated manual CompactRange
5.16.4 (10/10/2018)
Bug Fixes
- Fix corner case where a write group leader blocked due to write stall blocks other writers in queue with WriteOptions::no_slowdown set.
5.16.3 (10/1/2018)
Bug Fixes
- Fix crash caused when `CompactFiles` is run with `CompactionOptions::compression == CompressionType::kDisableCompressionOption`. Now that setting causes the compression type to be chosen according to the column family-wide compression options.
5.16.2 (9/21/2018)
Bug Fixes
- Fix bug in partition filters with format_version=4.
5.16.1 (9/17/2018)
Bug Fixes
- Remove trace_analyzer_tool from rocksdb_lib target in TARGETS file.
- Fix RocksDB Java build and tests.
- Remove sync point in Block destructor.
5.16.0 (8/21/2018)
Public API Change
- `OnTableFileCreated` will now be called for empty files generated during compaction. In that case, `TableFileCreationInfo::file_path` will be "(nil)" and `TableFileCreationInfo::file_size` will be zero.
- Add `FlushOptions::allow_write_stall`, which controls whether Flush calls start working immediately, even if it causes user writes to stall, or will wait until flush can be performed without causing write stall (similar to `CompactRangeOptions::allow_write_stall`). Note that the default value is false, meaning we add delay to Flush calls until stalling can be avoided when possible. This is a behavior change compared to previous RocksDB versions, where Flush calls didn't check if they might cause stall or not.
- The merge operands are passed to `MergeOperator::ShouldMerge` in the reversed order relative to how they were merged (passed to FullMerge or FullMergeV2) for performance reasons.
- GetAllKeyVersions() now takes an extra argument, `max_num_ikeys`.
New Features
- Changes the format of index blocks by delta encoding the index values, which are the block handles. This saves the encoding of BlockHandle::offset of the non-head index entries in each restart interval. The feature is backward compatible but not forward compatible. It is disabled by default unless format_version 4 or above is used.
- Add a new tool: trace_analyzer. Trace_analyzer analyzes the trace file generated by using the trace_replay API. It can convert the binary format trace file to a human-readable text file, output statistics of the analyzed query types such as access statistics and size statistics, combine the dumped whole key space file into the analysis, and support query correlation analysis. Currently supported query types are: Get, Put, Delete, SingleDelete, DeleteRange, Merge, Iterator (Seek, SeekForPrev only).
- Add hash index support to data blocks, which helps reduce the CPU utilization of point-lookup operations. This feature is backward compatible with data blocks created without the hash index. It is disabled by default unless BlockBasedTableOptions::data_block_index_type is set to kDataBlockBinaryAndHash.
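The idea behind delta encoding block handles in index blocks can be sketched as follows. This is a simplified model, not the on-disk format: it assumes data blocks are laid out back to back, each followed by a fixed-size trailer, so a non-head entry's offset can be recomputed from the previous handle and only a size difference needs to be stored. `kTrailerSize`, `EncodedInterval`, `EncodeInterval`, and `DecodeInterval` are hypothetical names and values chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified block handle: where a data block starts and how long it is.
struct BlockHandle {
  std::uint64_t offset;
  std::uint64_t size;
};

// Assumed number of bytes written after every block (value chosen for the
// sketch).
constexpr std::uint64_t kTrailerSize = 5;

// One restart interval: the head entry keeps its full handle, later entries
// keep only the size difference from the previous entry.
struct EncodedInterval {
  BlockHandle head;
  std::vector<std::int64_t> size_deltas;
};

// Precondition: handles is non-empty and each block starts right after the
// previous block plus its trailer.
inline EncodedInterval EncodeInterval(const std::vector<BlockHandle>& handles) {
  EncodedInterval out{handles.front(), {}};
  for (std::size_t i = 1; i < handles.size(); ++i) {
    out.size_deltas.push_back(static_cast<std::int64_t>(handles[i].size) -
                              static_cast<std::int64_t>(handles[i - 1].size));
  }
  return out;
}

// Rebuilds every handle; offsets of non-head entries are never stored.
inline std::vector<BlockHandle> DecodeInterval(const EncodedInterval& in) {
  std::vector<BlockHandle> handles{in.head};
  for (std::int64_t delta : in.size_deltas) {
    const BlockHandle prev = handles.back();  // copy: push_back may reallocate
    const std::uint64_t size = static_cast<std::uint64_t>(
        static_cast<std::int64_t>(prev.size) + delta);
    handles.push_back(BlockHandle{prev.offset + prev.size + kTrailerSize, size});
  }
  return handles;
}
```

Since neighboring blocks tend to have similar sizes, the stored differences stay small, which is where the space saving in this model comes from.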
Bug Fixes
- Fix a bug in misreporting the estimated partition index size in properties block.
- Avoid creating empty SSTs and subsequently deleting them in certain cases during compaction.
RocksDB v5.15.10
Rocksdb Change Log
5.15.10 (9/13/2018)
Bug Fixes
- Fix RocksDB Java build and tests.
5.15.9 (9/4/2018)
Bug Fixes
- Fix compilation errors on OS X clang due to '-Wsuggest-override'.
5.15.8 (8/31/2018)
Bug Fixes
- Further avoid creating empty SSTs and subsequently deleting them during compaction.
5.15.7 (8/24/2018)
Bug Fixes
- Avoid creating empty SSTs and subsequently deleting them in certain cases during compaction.
5.15.6 (8/21/2018)
Public API Change
- The merge operands are passed to `MergeOperator::ShouldMerge` in the reversed order relative to how they were merged (passed to FullMerge or FullMergeV2) for performance reasons.
5.15.5 (8/16/2018)
Bug Fixes
- Fix VerifyChecksum() API not preserving options
5.15.4 (8/11/2018)
Bug Fixes
- Fix a bug caused by not generating OnTableFileCreated() notification for a 0-byte SST.
5.15.3 (8/10/2018)
Bug Fixes
- Fix a bug in misreporting the estimated partition index size in properties block.
5.15.2 (8/9/2018)
Bug Fixes
- Return correct usable_size for BlockContents.
5.15.1 (8/1/2018)
Bug Fixes
- Prevent dereferencing invalid STL iterators when there are range tombstones in ingested files.
5.15.0 (7/17/2018)
Public API Change
- Remove managed iterator. ReadOptions.managed is not effective anymore.
- For bottommost_compression, a compatible CompressionOptions is added via `bottommost_compression_opts`. To remain backward compatible, a new boolean `enabled` is added to CompressionOptions. compression_opts is always used, regardless of the value of `enabled`. bottommost_compression_opts is used only when the user sets `enabled=true`; otherwise, compression_opts is used for bottommost_compression by default.
- With LRUCache, when high_pri_pool_ratio > 0, the midpoint insertion strategy is enabled: low-pri items are put at the tail of the low-pri list (the midpoint) when they are first inserted into the cache. This makes cache entries that never get hit age out faster, improving cache efficiency when a large background scan is present.
- For users of `Statistics` objects created via `CreateDBStatistics()`, the format of the string returned by its `ToString()` method has changed.
- The "rocksdb.num.entries" table property no longer counts range deletion tombstones as entries.
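The fallback rule for the bottommost level's compression settings can be written as a small selection function. This is a sketch of the rule as described in the entry above, using a cut-down stand-in struct; the real CompressionOptions has more fields, and `EffectiveBottommostOptions` and the `level` default are hypothetical.

```cpp
// Cut-down stand-in for the compression settings; only the fields needed to
// show the rule are included.
struct CompressionOptions {
  int level = -1;        // placeholder default for this sketch
  bool enabled = false;  // only consulted for bottommost_compression_opts
};

// Settings used when compressing the bottommost level:
// bottommost_compression_opts if the user enabled it, else compression_opts.
inline const CompressionOptions& EffectiveBottommostOptions(
    const CompressionOptions& compression_opts,
    const CompressionOptions& bottommost_compression_opts) {
  return bottommost_compression_opts.enabled ? bottommost_compression_opts
                                             : compression_opts;
}
```

Leaving `enabled` at false therefore reproduces the behavior of versions that had no separate bottommost settings.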
New Features
- Changes the format of index blocks by storing the keys in their raw form rather than converting them to InternalKey. This saves 8 bytes per index key. The feature is backward compatible but not forward compatible. It is disabled by default unless format_version 3 or above is used.
- Avoid memcpy when reading mmap files with OpenReadOnly and max_open_files==-1.
- Support dynamically changing `ColumnFamilyOptions::ttl` via `SetOptions()`.
- Add a new table property, "rocksdb.num.range-deletions", which counts the number of range deletion tombstones in the table.
- Improve the performance of iterators doing long range scans by using readahead, when using direct IO.
- pin_top_level_index_and_filter (default true) in BlockBasedTableOptions can be used in combination with cache_index_and_filter_blocks to prefetch and pin the top-level index of partitioned index and filter blocks in cache. It has no impact when cache_index_and_filter_blocks is false.
Bug Fixes
- Fix deadlock with enable_pipelined_write=true and max_successive_merges > 0
- Check conflict at output level in CompactFiles.
- Fix corruption in non-iterator reads when mmap is used for file reads
- Fix bug with prefix search in partition filters where a shared prefix would be ignored from the later partitions. The bug could report an existent key as missing. The bug could be triggered if prefix_extractor is set and partition filters are enabled.
- Change default value of `bytes_max_delete_chunk` to 0 in NewSstFileManager() as it doesn't work well with checkpoints.
- Fix a bug caused by not copying the block trailer with compressed SST file, direct IO, prefetcher and no compressed block cache.
- Fix writes getting stuck indefinitely when enable_pipelined_write=true. The issue has existed since pipelined write was introduced in 5.5.0.
RocksDB 5.14.3
5.14.3 (8/21/2018)
Public API Change
- The merge operands are passed to `MergeOperator::ShouldMerge` in the reversed order relative to how they were merged (passed to FullMerge or FullMergeV2) for performance reasons.
Bug Fixes
- Fixes DBImpl::FindObsoleteFiles() calling GetChildren() on the same path
RocksDB release v5.14.2
5.14.2 (7/3/2018)
Bug Fixes
- Change default value of `bytes_max_delete_chunk` to 0 in NewSstFileManager() as it doesn't work well with checkpoints.
- Set DEBUG_LEVEL=0 for RocksJava Mac Release build.
5.14.1 (6/20/2018)
Bug Fixes
- Fix block-based table reader pinning blocks throughout its lifetime, causing memory usage increase.
- Fix bug with prefix search in partition filters where a shared prefix would be ignored from the later partitions. The bug could report an existent key as missing. The bug could be triggered if prefix_extractor is set and partition filters are enabled.
5.14.0 (5/16/2018)
Public API Change
- Add a BlockBasedTableOption to align uncompressed data blocks on the smaller of block size or page size boundary, to reduce flash reads by avoiding reads spanning 4K pages.
- The background thread naming convention changed (on supporting platforms) to "rocksdb:<thread pool priority><thread number>", e.g., "rocksdb:low0".
- Add a new ticker stat rocksdb.number.multiget.keys.found to count number of keys successfully read in MultiGet calls
- Touch-up to write-related counters in PerfContext. New counters added: write_scheduling_flushes_compactions_time, write_thread_wait_nanos. Counters whose behavior was fixed or modified: write_memtable_time, write_pre_and_post_process_time, write_delay_time.
- Posix Env's NewRandomRWFile() will fail if the file doesn't exist.
- Now, `DBOptions::use_direct_io_for_flush_and_compaction` only applies to background writes, and `DBOptions::use_direct_reads` applies to both user reads and background reads. This conforms with Linux's `open(2)` manpage, which advises against simultaneously reading a file in buffered and direct modes, due to possibly undefined behavior and degraded performance.
- Iterator::Valid() always returns false if !status().ok(). So, now when doing a Seek() followed by some Next()s, there's no need to check status() after every operation.
- Iterator::Seek()/SeekForPrev()/SeekToFirst()/SeekToLast() always resets status().
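The practical effect of the two Iterator changes is that a scan loop can test Valid() alone and check the status once after the loop. The sketch below demonstrates that pattern against a small in-memory model rather than a real database; `ModelIterator`, `status_ok()`, `kNoFailure`, and `ScanAll` are hypothetical names, and the injected failure position stands in for an I/O error.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sentinel meaning "never inject a failure".
constexpr std::size_t kNoFailure = static_cast<std::size_t>(-1);

// In-memory model that follows the contract described above: Valid() is false
// whenever the status is not OK, and a Seek-style call resets the status.
class ModelIterator {
 public:
  ModelIterator(std::vector<std::string> keys, std::size_t fail_at)
      : keys_(std::move(keys)), fail_at_(fail_at) {}

  void SeekToFirst() {
    ok_ = true;  // seeking always resets the status
    pos_ = 0;
    CheckFailure();
  }

  void Next() {
    if (Valid()) {
      ++pos_;
      CheckFailure();
    }
  }

  bool Valid() const { return ok_ && pos_ < keys_.size(); }
  bool status_ok() const { return ok_; }
  const std::string& key() const { return keys_[pos_]; }

 private:
  // Simulates an error when the iterator lands on position fail_at_.
  void CheckFailure() {
    if (pos_ == fail_at_) ok_ = false;
  }

  std::vector<std::string> keys_;
  std::size_t fail_at_;
  std::size_t pos_ = 0;
  bool ok_ = true;
};

// Collects keys until Valid() turns false, then reports whether the scan
// ended cleanly (true) or because of an error (false).
inline bool ScanAll(ModelIterator& it, std::vector<std::string>* out) {
  for (it.SeekToFirst(); it.Valid(); it.Next()) {
    out->push_back(it.key());
  }
  return it.status_ok();  // single status check after the loop
}
```

The final status check is still needed: a false Valid() alone does not distinguish reaching the end of the data from stopping on an error.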
New Features
- Introduce TTL for level compaction so that all files older than ttl go through the compaction process to get rid of old data.
- TransactionDBOptions::write_policy can be configured to enable WritePrepared 2PC transactions. Read more about them in the wiki.
- Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage.
- Add `Env::LowerThreadPoolCPUPriority(Priority)` method, which lowers the CPU priority of background (esp. compaction) threads to minimize interference with foreground tasks.
- Fsync parent directory after deleting a file in delete scheduler.
- In level-based compaction, if bottom-pri thread pool was set up via `Env::SetBackgroundThreads()`, compactions to the bottom level will be delegated to that thread pool.
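The TTL feature listed above amounts to selecting files by age. A minimal sketch of that selection, assuming each file carries a creation timestamp in the same unit as the TTL and that a TTL of 0 means the feature is off (both are assumptions of this example; `FileInfo` and `FilesPastTtl` are hypothetical names, and the real compaction picker is more involved):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-file metadata used by this sketch.
struct FileInfo {
  std::uint64_t number;         // file number
  std::uint64_t creation_time;  // same time unit as ttl
};

// Returns the numbers of files whose age exceeds ttl and which would
// therefore be scheduled for compaction in this model.
inline std::vector<std::uint64_t> FilesPastTtl(
    const std::vector<FileInfo>& files, std::uint64_t now, std::uint64_t ttl) {
  std::vector<std::uint64_t> expired;
  if (ttl == 0) return expired;  // assumption: 0 disables TTL
  for (const FileInfo& f : files) {
    // Skip files with a creation time in the future to avoid unsigned wrap.
    if (f.creation_time < now && now - f.creation_time > ttl) {
      expired.push_back(f.number);
    }
  }
  return expired;
}
```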
Bug Fixes
- Fsync after writing global seq number to the ingestion file in ExternalSstFileIngestionJob.
- Fix WAL corruption caused by race condition between user write thread and FlushWAL when two_write_queue is not set.
- Fix `BackupableDBOptions::max_valid_backups_to_open` to not delete backup files when refcount cannot be accurately determined.
- Fix memory leak when pin_l0_filter_and_index_blocks_in_cache is used with partitioned filters
- Disable rollback of merge operands in WritePrepared transactions to work around an issue in MyRocks. It can be enabled back by setting TransactionDBOptions::rollback_merge_operands to true.
- Fix bug with prefix search in partition filters where a shared prefix would be ignored from the later partitions. The bug could report an existent key as missing. The bug could be triggered if prefix_extractor is set and partition filters are enabled.
Java API Changes
- Add `BlockBasedTableConfig.setBlockCache` to allow sharing a block cache across DB instances.
- Added SstFileManager to the Java API to allow managing SST files across DB instances.
RocksDB 5.13.4
Bug Fixes
- Fix regression bug of Prev() with ReadOptions.iterate_upper_bound.
RocksDB v5.12.5
Bug Fixes
- Fix regression bug of Prev() with ReadOptions.iterate_upper_bound.
RocksDB v5.13.3
5.13.3 (6/6/2018)
Bug Fixes
- Fix assertion when reading bloom filter of SST files containing range deletions but no data