Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix swapdb bugs when multiple clients involved. #738

Merged
merged 1 commit into from
Jan 10, 2024

Conversation

karthyuom
Copy link
Contributor

@karthyuom karthyuom commented Nov 13, 2023

This PR fixes couple of bugs related to swapdb. Following are the list of bugs:

Bug 1: Incorrect swapdb results in multiple clients (or cli) from different terminals

Terminal 1 (cli)

192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000> set k0 v0
OK
192.168.0.20:9000> DBSIZE
(integer) 1
192.168.0.20:9000> keys *
1) "k0"
192.168.0.20:9000> SWAPDB 0 1
OK
192.168.0.20:9000> keys *
(empty array)
192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000>

Terminal 2 (cli)

192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000> select 1
OK
192.168.0.20:9000[1]> DBSIZE
(integer) 0
--> after swapdb from terminal 1
192.168.0.20:9000[1]> DBSIZE
(integer) 0
192.168.0.20:9000[1]> keys *
(empty array)
192.168.0.20:9000[1]>

Expected result: Terminal 2 should correctly display the result for DB0 after the swapdb.

Bug 2: Swapdb result is not recovered after restart in flash mode

192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000> set k0 v0
OK
192.168.0.20:9000> DBSIZE
(integer) 1
192.168.0.20:9000> keys *
1) "k0"
192.168.0.20:9000> SWAPDB 0 1
OK
192.168.0.20:9000> keys *
(empty array)
192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000> select 1
OK
192.168.0.20:9000[1]> DBSIZE
(integer) 1
192.168.0.20:9000[1]> keys *
1) "k0"
--> after server restart
192.168.0.20:9000[1]> keys *
Error: Server closed the connection
192.168.0.20:9000[1]> keys *
(empty array)
192.168.0.20:9000[1]> DBSIZE
(integer) 0
192.168.0.20:9000[1]>

Expected result: Even after restarting the server, it should recover the swapdb results correctly.

Bug 3: Watched key/swapdb unit tests failure

As part of this PR, I have fixed a known bug as below (line 1731):

int dbSwapDatabases(int id1, int id2) {
//...
    std::swap(g_pserver->db[id1]->ready_keys, g_pserver->db[id2]->ready_keys);
    std::swap(g_pserver->db[id1]->watched_keys, g_pserver->db[id2]->watched_keys);
//...

After this fix, following unit tests failed:

  test {SWAPDB is able to touch the watched keys that exist} {
        r flushall
        r select 0
        r set x 30
        r watch x ;# make sure x (set to 30) doesn't change (SWAPDB will "delete" it)
        r swapdb 0 1
        r multi
        r ping
        r exec
    } {}

    test {SWAPDB is able to touch the watched keys that do not exist} {
        r flushall
        r select 1
        r set x 30
        r select 0
        r watch x ;# make sure the key x (currently missing) doesn't change (SWAPDB will create it)
        r swapdb 0 1
        r multi
        r ping
        r exec
    } {}

I have successfully verified all the above cases, and following are the test results after the fix:

Test bug 1 fix:

Terminal 1 (cli)

192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000> set k0 v0
OK
192.168.0.20:9000> DBSIZE
(integer) 1
192.168.0.20:9000> keys *
1) "k0"
192.168.0.20:9000> SWAPDB 0 1
OK
192.168.0.20:9000> keys *
(empty array)
192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000>

Terminal 2 (cli)

192.168.0.20:9000[1]> DBSIZE
(integer) 0
--> after swapdb from terminal 1
192.168.0.20:9000[1]> DBSIZE
(integer) 1
192.168.0.20:9000[1]> keys *
1) "k0"
192.168.0.20:9000[1]>

Test bug 2 fix:

192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000> set k0 v0
OK
192.168.0.20:9000> DBSIZE
(integer) 1
192.168.0.20:9000> keys *
1) "k0"
192.168.0.20:9000> SWAPDB 0 1
OK
192.168.0.20:9000> keys *
(empty array)
192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000> select 1
OK
192.168.0.20:9000[1]> DBSIZE
(integer) 1
192.168.0.20:9000[1]> keys *
1) "k0"
--> after server restart
192.168.0.20:9000[1]> DBSIZE
Error: Server closed the connection
192.168.0.20:9000[1]> DBSIZE
(integer) 1
192.168.0.20:9000[1]> keys *
1) "k0"
192.168.0.20:9000[1]> select 0
OK
192.168.0.20:9000> DBSIZE
(integer) 0
192.168.0.20:9000> keys *
(empty array)
192.168.0.20:9000>

Test bug 3 fix:

./runtest unit test passed

Additional information

The bug 2 was expected to be fixed in the previous PR #668. However, in the PR, the line https://github.com/Snapchat/KeyDB/blob/main/src/db.cpp#L1728 was still kept though it was expected to be removed. But, now as part of this PR we still don't need to remove that line but with a new solution of storage_id.

Also, for the bug 3, when the same unit test was executed manually, server crashed with the following stack race:

12075:12075:C 09 Nov 2023 15:53:52.140 # oO0OoO0OoO0Oo KeyDB is starting oO0OoO0OoO0Oo
12075:12075:C 09 Nov 2023 15:53:52.140 # KeyDB version=6.2.6, bits=64, commit=aab4cbfd, modified=1, pid=12075, just started
12075:12075:C 09 Nov 2023 15:53:52.140 # Configuration loaded
12075:12075:C 09 Nov 2023 15:53:52.141 * Initializing FLASH storage provider (this may take a long time)
12075:12075:C 09 Nov 2023 15:53:52.141 * Increased maximum number of open files to 10042 (it was originally set to 1024).
12075:12075:M 09 Nov 2023 15:53:52.567 # You requested maxclients of 10010 requiring at least 10298 max file descriptors.
12075:12075:M 09 Nov 2023 15:53:52.567 # Server can't set maximum open files to 10298 because of OS error: Operation not permitted.
12075:12075:M 09 Nov 2023 15:53:52.567 # Current maximum open files is 10042. maxclients has been reduced to 10010 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
12075:12075:M 09 Nov 2023 15:53:52.567 * monotonic clock: POSIX clock_gettime

                  _
               _-(+)-_
            _-- /   \ --_
         _--   /     \   --_            KeyDB Enterprise 6.2.6 (aab4cbfd/1) 64 bit
     __--     /       \     --__
    (+) _    /         \    _ (+)       Running in standalone mode
     |   -- /           \ --   |        Port: 9000
     |     /--_   _   _--\     |        PID: 12075
     |    /     -(+)-     \    |
     |   /        |        \   |        https://docs.keydb.dev
     |  /         |         \  |
     | /          |          \ |
    (+)_ -- -- -- | -- -- -- _(+)
        --_       |       _--
            --_   |   _--
                -(+)-

12075:12075:M 09 Nov 2023 15:53:52.569 # WARNING: The TCP backlog setting of 10000 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 1024.
12075:12075:M 09 Nov 2023 15:53:52.569 # Server initialized
12075:12075:M 09 Nov 2023 15:53:52.570 * Loading the RDB even though we have a storage provider because the database is empty
12075:12111:M 09 Nov 2023 15:53:52.571 * Thread 0 alive.
12075:12112:M 09 Nov 2023 15:53:52.571 * Thread 1 alive.
12075:12113:M 09 Nov 2023 15:53:52.571 * Thread 2 alive.
12075:12111:M 09 Nov 2023 15:54:12.605   audit: receive "flushall" from trusted network, client: 192.168.0.20:35964(:0), db:0


=== KEYDB BUG REPORT START: Cut & paste starting from here ===
12075:12111:M 09 Nov 2023 15:55:05.052 # === ASSERTION FAILED CLIENT CONTEXT ===
12075:12111:M 09 Nov 2023 15:55:05.052 # client->flags = 1075838976
12075:12111:M 09 Nov 2023 15:55:05.052 # client->conn = fd=45
12075:12111:M 09 Nov 2023 15:55:05.052 # client->argc = 1
12075:12111:M 09 Nov 2023 15:55:05.052 # client->argv[0] = "exec" (refcount: 1)
12075:12111:M 09 Nov 2023 15:55:05.052 # === ASSERTION FAILED ===
12075:12111:M 09 Nov 2023 15:55:05.052 # ==> multi.cpp:348 'clients != NULL' is not true

------ STACK TRACE ------

Backtrace:
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(unwatchAllKeys(client*)+0x10e) [0x556af104c0fe]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(execCommand(client*)+0x99) [0x556af104c499]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(call(client*, int)+0xb1) [0x556af10d50a1]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(processCommand(client*, int)+0x88f) [0x556af10d614f]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(processCommandAndResetClient(client*, int)+0x66) [0x556af10c3c16]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(processInputBuffer(client*, bool, int)+0x1f4) [0x556af10c9f44]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(processClients()+0xd9) [0x556af10ca0d9]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(+0x29d7b2) [0x556af10e67b2]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(beforeSleep(aeEventLoop*)+0x170) [0x556af101f460]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(aeProcessEvents+0x10b) [0x556af1019d3b]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(aeMain+0x47) [0x556af1021287]
/home/paas/karthick/git/keydb-pro-6/src/keydb-server 192.168.0.20:9000(workerThreadMain(void*)+0x64) [0x556af10d7b44]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f4e22c5c6db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f4e2298561f]

------ INFO OUTPUT ------
# Server
redis_version:6.2.6
patch_version:6.2.6.1
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c7935a3e6a202cc2
redis_mode:standalone
os:Linux
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:0.0.0
process_id:12075
process_supervised:no
run_id:952d4c19d4e1ed585f5eaad6514da3e95a50aeee
tcp_port:9000
server_time_usec:1699563305054020
uptime_in_seconds:73
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:5064488
executable:redis-server
config_file:redis.conf
instance_id:00000000-0000-0000-0000-000000000000

# Clients
connected_clients:1
cluster_connections:0
maxclients:10010
client_recent_max_input_buffer:16
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
current_client_thread:0
rx_controlled_clients:0
total_rx_controlled:0
repl_backlog_controlled_clients:0
total_repl_backlog_controlled_clients:0
proxy_header_error:0
thread_0_clients:1
thread_1_clients:0
thread_2_clients:0

# Memory
used_memory:4085624
used_memory_human:3.90M
used_memory_rss:22196224
used_memory_rss_human:21.17M
used_memory_peak:4148608
used_memory_peak_human:3.96M
used_memory_peak_perc:98.48%
used_memory_overhead:218400
used_memory_startup:197824
used_memory_dataset:3867224
used_memory_dataset_perc:99.47%
allocator_allocated:5338992
allocator_active:6377472
allocator_resident:18214912
total_system_memory:540644896768
total_system_memory_human:503.51G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:4294967296
maxmemory_human:4.00G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:1.19
allocator_frag_bytes:1038480
allocator_rss_ratio:2.86
allocator_rss_bytes:11837440
rss_overhead_ratio:1.22
rss_overhead_bytes:3981312
mem_fragmentation_ratio:5.49
mem_fragmentation_bytes:18153320
mem_not_counted_for_evict:134217728
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:20496
mem_aof_buffer:0
mem_allocator:jemalloc-5.2.1
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
storage_provider:flash
flash_memory:0

# Persistence
loading:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:3
rdb_bgsave_in_progress:0
rdb_last_save_time:1699563232
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:1
total_commands_processed:8
instantaneous_ops_per_sec:0
total_net_input_bytes:197
total_net_output_bytes:21761
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
total_forks:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:0
dump_payload_sanitizations:0
total_reads_processed:10
total_writes_processed:9
instantaneous_lock_contention:1
avg_lock_contention:0.000000
storage_provider_read_hits:0
storage_provider_read_misses:0

# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:0eaa52e43d36bab13981b3d1d5454300f5207830
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:134217728
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
repl_backlog_current_buffer_size:0

# CPU
used_cpu_sys:0.052349
used_cpu_user:0.383899
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
server_threads:3
long_lock_waits:0
used_cpu_sys_main_thread:0.013328
used_cpu_user_main_thread:0.195479

# Modules

# Commandstats
cmdstat_swapdb:calls=1,usec=94,usec_per_call=94.00,rejected_calls=0,failed_calls=0
cmdstat_set:calls=1,usec=13,usec_per_call=13.00,rejected_calls=0,failed_calls=0
cmdstat_dbsize:calls=1,usec=4,usec_per_call=4.00,rejected_calls=0,failed_calls=0
cmdstat_multi:calls=1,usec=2,usec_per_call=2.00,rejected_calls=0,failed_calls=0
cmdstat_command:calls=1,usec=810,usec_per_call=810.00,rejected_calls=0,failed_calls=0
cmdstat_watch:calls=1,usec=6,usec_per_call=6.00,rejected_calls=0,failed_calls=0
cmdstat_select:calls=1,usec=2,usec_per_call=2.00,rejected_calls=0,failed_calls=0
cmdstat_flushall:calls=1,usec=153,usec_per_call=153.00,rejected_calls=0,failed_calls=0

# Errorstats

# Cluster
cluster_enabled:0

# Keyspace
db1:keys=1,expires=0,avg_ttl=0,cached_keys=1

# KeyDB
variant:enterprise
license_status:OK
mvcc_depth:0

------ CLIENT LIST OUTPUT ------
id=5 addr=192.168.0.20:35964 laddr=192.168.0.20:9000 fd=45 name= age=67 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=4 obl=5 oll=0 omem=0 tot-mem=61452 events=r cmd=exec user=default redir=-1 peer=192.168.0.20:35964 network=bms

------ CURRENT CLIENT INFO ------
id=5 addr=192.168.0.20:35964 laddr=192.168.0.20:9000 fd=45 name= age=67 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=4 obl=5 oll=0 omem=0 tot-mem=61452 events=r cmd=exec user=default redir=-1 peer=192.168.0.20:35964 network=bms
argv[0]: 'exec'

------ MODULES INFO OUTPUT ------

------ FAST MEMORY TEST ------
12075:12111:M 09 Nov 2023 15:55:05.054 # main thread terminated
12075:12111:M 09 Nov 2023 15:55:05.054 # Bio thread for job type #0 terminated
12075:12111:M 09 Nov 2023 15:55:05.054 # Bio thread for job type #1 terminated
12075:12111:M 09 Nov 2023 15:55:05.055 # Bio thread for job type #2 terminated

Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.

=== KEYDB BUG REPORT END. Make sure to include from START to END. ===

       Please report the crash by opening an issue on github:

           https://github.com/JohnSully/KeyDB/issues

  Suspect RAM error? Use keydb-server --test-memory to verify it.

Aborted (core dumped)

@paulmchen @hengku @CrazyTennisFan

@JohnSully JohnSully merged commit 2ad14de into Snapchat:main Jan 10, 2024
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants