Allow to use Khepri database to store metadata instead of Mnesia #7206

Merged
merged 2 commits into from
Sep 29, 2023

Commits on Sep 29, 2023

  1. Allow to use Khepri database to store metadata instead of Mnesia

    [Why]
    
    Mnesia is a very powerful and convenient tool for Erlang applications:
    it is a persistent disc-based database, it handles replication across
    multiple Erlang nodes and it is available out-of-the-box from the
    Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its
    metadata:
    
    * virtual hosts' properties
    * internal users
    * queue, exchange and binding declarations (not queue data)
    * runtime parameters and policies
    * ...
    
    Unfortunately, Mnesia makes it difficult to handle network partitions
    and, as a consequence, the merge conflicts between Erlang nodes once a
    network partition is resolved. RabbitMQ provides several partition
    handling strategies but they are not bullet-proof. Users still hit
    situations where it is painful to repair a cluster following a network
    partition.
    
    [How]
    
    @kjnilsson created Ra [1], a Raft consensus library that RabbitMQ
    already uses successfully to implement quorum queues and streams for
    instance. Those queues do not suffer from network partitions.
    
    We created Khepri [2], a new persistent and replicated database engine
    based on Ra and we want to use it in place of Mnesia in RabbitMQ to
    solve the problems with network partitions.
    
    This patch integrates Khepri as an experimental feature. When enabled,
    RabbitMQ will store all its metadata in Khepri instead of Mnesia.
    
    This change comes with behavior changes. While Khepri remains disabled,
    you should see no changes to the behavior of RabbitMQ. If there are
    changes, it is a bug. After Khepri is enabled, there are significant
    changes of behavior that you should be aware of.
    
    Because Khepri is based on the Raft consensus algorithm, when there is
    a network partition, only the cluster members that are in a partition
    with at least `(Number of nodes in the cluster ÷ 2) + 1` nodes (with
    the division rounded down) can "make progress". In other words, only
    those nodes may write to the Khepri database and read from the
    database and expect a consistent result.
    
    For instance in a cluster of 5 RabbitMQ nodes:
    * If there are two partitions, one with 3 nodes, one with 2 nodes, only
      the group of 3 nodes will be able to write to the database.
    * If there are three partitions, two with 2 nodes, one with 1 node, none
      of the groups can write to the database.
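
    The quorum arithmetic above can be sketched as follows. This is only
    an illustration: the module `quorum_sketch` and its functions are
    hypothetical names and are not part of RabbitMQ, Ra or Khepri.

    ```erlang
    -module(quorum_sketch).
    -export([quorum/1, can_make_progress/2]).

    %% Hypothetical helper: smallest number of nodes that forms a majority
    %% in a cluster of ClusterSize nodes (integer division rounds down).
    quorum(ClusterSize) when is_integer(ClusterSize), ClusterSize > 0 ->
        ClusterSize div 2 + 1.

    %% Hypothetical helper: true if a partition holding PartitionSize nodes
    %% still contains a majority of the ClusterSize cluster members.
    can_make_progress(PartitionSize, ClusterSize) ->
        PartitionSize >= quorum(ClusterSize).
    ```

    With a cluster of 5 nodes, `quorum_sketch:quorum(5)` is 3, so a
    partition of 3 nodes satisfies `can_make_progress/2` while partitions
    of 2 nodes or 1 node do not.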
    
    Because the Khepri database will be used for all kinds of metadata, it
    means that RabbitMQ nodes that can't write to the database will be
    unable to perform some operations. A list of operations and what to
    expect is documented in the associated pull request and the RabbitMQ
    website.
    
    This requirement from Raft also affects the startup of RabbitMQ nodes in
    a cluster. Indeed, at least a quorum number of nodes must be started at
    once to allow nodes to become ready.
    
    To enable Khepri, you need to enable the `khepri_db` feature flag:
    
        rabbitmqctl enable_feature_flag khepri_db
    
    When the `khepri_db` feature flag is enabled, the migration code
    performs the following two tasks:
    1. It synchronizes the Khepri cluster membership from the Mnesia
       cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from
       the `khepri_mnesia_migration` application [3].
    2. It copies data from relevant Mnesia tables to Khepri, doing some
       conversion if necessary on the way. Again, it uses
       `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do
       it.
    
    This can be performed on a running standalone RabbitMQ node or cluster.
    Data will be migrated from Mnesia to Khepri without any service
    interruption. Note that during the migration, the performance may
    decrease and the memory footprint may go up.
    
    Because this feature flag is considered experimental, it is not enabled
    by default even on a brand new RabbitMQ deployment.
    
    More about the implementation details below:
    
    In the past months, all accesses to Mnesia were isolated in a collection
    of `rabbit_db*` modules. This is where the integration of Khepri mostly
    takes place: we use a function called `rabbit_khepri:handle_fallback/1`
    which selects the database and performs the query or the transaction.
    Here is an example from `rabbit_db_vhost`:
    
    * Up until RabbitMQ 3.12.x:
    
            get(VHostName) when is_binary(VHostName) ->
                get_in_mnesia(VHostName).
    
    * Starting with RabbitMQ 3.13.0:
    
            get(VHostName) when is_binary(VHostName) ->
                rabbit_khepri:handle_fallback(
                  #{mnesia => fun() -> get_in_mnesia(VHostName) end,
                    khepri => fun() -> get_in_khepri(VHostName) end}).
    
    This `rabbit_khepri:handle_fallback/1` function relies on two things:
    1. the fact that the `khepri_db` feature flag is enabled, in which case
       it always executes the Khepri-based variant.
    2. the ability or not to read and write to Mnesia tables otherwise.
    
    Before the feature flag is enabled, or during the migration, the
    function will try to execute the Mnesia-based variant. If it succeeds,
    then it returns the result. If it fails because one or more Mnesia
    tables can't be used, it restarts from scratch: it means the feature
    flag is being enabled and depending on the outcome, either the
    Mnesia-based variant will succeed (the feature flag couldn't be enabled)
    or the feature flag will be marked as enabled and it will call the
    Khepri-based variant. The meat of this function really lives in the
    `khepri_mnesia_migration` application [3] and
    `rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows
    about the feature flag.
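
    To make the retry logic described above concrete, here is a
    simplified, self-contained sketch. It is not the actual code from
    `rabbit_khepri` or `khepri_mnesia_migration`: the module name, the
    `IsEnabled` fun standing in for the feature flag, and the
    `mnesia_table_unavailable` exception are all assumptions made for the
    sake of the illustration.

    ```erlang
    -module(fallback_sketch).
    -export([handle_fallback/2, demo/0]).

    %% Hypothetical stand-in for the real function. IsEnabled is a fun
    %% returning true once the (simulated) feature flag is enabled.
    handle_fallback(#{mnesia := MnesiaFun, khepri := KhepriFun} = Funs,
                    IsEnabled) ->
        case IsEnabled() of
            true ->
                %% Flag enabled: always run the Khepri-based variant.
                KhepriFun();
            false ->
                try
                    MnesiaFun()
                catch
                    %% Assumed signal that a Mnesia table cannot be used
                    %% because a migration is in progress.
                    throw:mnesia_table_unavailable ->
                        timer:sleep(10),
                        %% Restart from scratch and re-check the flag.
                        handle_fallback(Funs, IsEnabled)
                end
        end.

    %% Simulates a migration that completes while the Mnesia variant runs:
    %% the first attempt fails, the retry goes through Khepri.
    demo() ->
        put(flag, false),
        Funs = #{mnesia => fun() ->
                               put(flag, true),
                               throw(mnesia_table_unavailable)
                           end,
                 khepri => fun() -> from_khepri end},
        handle_fallback(Funs, fun() -> get(flag) end).
    ```

    In this sketch, `fallback_sketch:demo()` first tries the Mnesia
    variant, sees the simulated failure, then retries and ends up
    returning the result of the Khepri variant.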
    
    However, some calls to the database do not depend on the existence of
    Mnesia tables, such as functions where we need to learn about the
    members of a cluster. For those, we can't rely on exceptions from
    Mnesia. Therefore, we just look at the state of the feature flag to
    determine which database to use. There are two situations though:
    
    * Sometimes, we need the feature flag state query to block because the
      function interested in it can't return a valid answer during the
      migration. Here is an example:
    
            case rabbit_khepri:is_enabled(RemoteNode) of
                true  -> can_join_using_khepri(RemoteNode);
                false -> can_join_using_mnesia(RemoteNode)
            end
    
    * Sometimes, we need the feature flag state query to NOT block (for
      instance because it would cause a deadlock). Here is an example:
    
            case rabbit_khepri:get_feature_state() of
                enabled -> members_using_khepri();
                _       -> members_using_mnesia()
            end
    
    Direct accesses to Mnesia still exist. They are limited to code that is
    specific to Mnesia such as classic queue mirroring or network partition
    handling strategies.
    
    Now, to discover the Mnesia tables to migrate and how to migrate them,
    we use an Erlang module attribute called
    `rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia
    tables and an associated converter module. Here is an example in the
    `rabbitmq_recent_history_exchange` plugin:
    
        -rabbit_mnesia_tables_to_khepri_db(
           [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).
    
    The converter module, `rabbit_db_rh_exchange_m2k_converter` in this
    example, is in fact a "sub" converter module called by
    `rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri`
    converter module to learn more about these modules.
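
    As an illustration of how such a module attribute can be read back at
    runtime, here is a minimal sketch based on the standard
    `Module:module_info(attributes)` call. The module `attr_sketch`, the
    table name and the converter name are hypothetical, and this is not
    necessarily how RabbitMQ itself collects the attribute.

    ```erlang
    -module(attr_sketch).
    -export([tables_to_migrate/1, demo/0]).

    %% Hypothetical declaration, mirroring the shape used by the plugin.
    -rabbit_mnesia_tables_to_khepri_db(
       [{example_table, example_m2k_converter}]).

    %% Returns the {Table, ConverterModule} pairs declared by Module, or an
    %% empty list if the module does not carry the attribute.
    tables_to_migrate(Module) ->
        Attrs = Module:module_info(attributes),
        lists:append(
          [Value || {rabbit_mnesia_tables_to_khepri_db, Value} <- Attrs]).

    demo() ->
        tables_to_migrate(?MODULE).
    ```

    Here `attr_sketch:demo()` is expected to return the list of pairs
    declared in the attribute of the module itself.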
    
    [1] https://github.com/rabbitmq/ra
    [2] https://github.com/rabbitmq/khepri
    [3] https://github.com/rabbitmq/khepri_mnesia_migration
    
    See #7206.
    
    Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com>
    Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com>
    Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
    3 people committed Sep 29, 2023
    Commit 5f0981c
  2. Partially revert commit 3253fe4

    Khepri needs ra, and unless khepri is a native bazel dep, we still
    need to declare ra in the classic fashion
    HoloRin authored and dumbbell committed Sep 29, 2023
    Commit 0bbb188