
Backup stage Internals

Akira Kurogane edited this page Oct 21, 2019 · 1 revision

Relevant code: cmd/pbm/main.go SendCmd(), pbm/backup.go Run(), NodeSuits(), pbm/cmd.go ListenCmd()

  1. When pbm-agent processes start, they begin to listen/watch on admin.pbmCmd in the replicaset that holds the PBM control collections (the configsvr replicaset in a cluster, otherwise the non-sharded replicaset itself).
  2. When pbm backup is executed the pbm CLI inserts a document with {"cmd": "backup", "backup": { .... } } into admin.pbmCmd.
  3. All pbm-agent processes react to the appearance of the new command document in admin.pbmCmd.
    1. The first step is for each agent to check whether its node is valid for the backup (as of v1.0 that means PRIMARY or SECONDARY status and replication lag < 21s).
    2. The second step is to call AcquireLock(), which tries to be the first to write into admin.pbmOp for the replicaset the agent is in.
    3. pbm-agent processes that didn't acquire the lock log "Backup has been scheduled on another replset node" and ??? go back to listening/watching admin.pbmCmd for the next command ???
  4. The pbm-agent that took the lock runs through the pbm/backup.go Run() function.
    1. Upserts the backup metadata document in admin.pbmBackup.
    2. Runs the dump, then updates admin.pbmBackup to record that the dump is complete for that replicaset.
    3. Waits as required by the replicaset's role. A non-sharded replicaset does not wait. A shard replicaset waits until it sees that the whole cluster dump is done. A configsvr replicaset checks the admin.pbmBackup document periodically (once per second) until it sees that all shards have StatusDumpDone (or the later step StatusDone). When all replicasets have finished the dump, the parent-level "status" field is set to "done".
    4. After all the dumps are done, oplog slices are made by function ???