Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema State Version Migrations #287

Open
3 tasks
CMCDragonkai opened this issue Nov 7, 2021 · 7 comments
Open
3 tasks

Schema State Version Migrations #287

CMCDragonkai opened this issue Nov 7, 2021 · 7 comments
Labels
development Standard development r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management

Comments

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Nov 7, 2021

Specification

The state schema requires migrations once we go beyond state version of 1. Note that the schema domain is under development in https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213 and will be available when that is merged.

The version code is being moved to its own domain schema/Schema.ts. The purpose of this domain is to check the state version and upgrade the schema if it is below the current state version. Of course right now there's nothing to upgrade so it will just be checking and storing the version. And throwing an exception if the stateVersionPath is out of bounds.

This issue is about the specification of the Schema.upgradeVersion method.

This is the proposed structure:

  1. Under schema/migrations, create migration files named by the version they are upgrading to. So things like 2.ts for version 2, 3.ts for version 3.
  2. The migration file should be like:
    export default async function (nodePath: string): Promise<void> { }
  3. The migration default function can perform side-effects against nodePath. It is not the real nodePath, but a node state that has been copied by the real node path. This means it includes all files in the node state.
  4. The nodePath is actually pointing to a temporary directory like ~/.polykey/migrations.tmp.
  5. Any exception means the migration did not succeed, this will be caught by Schema.upgradeVersion and reported as ErrorSchemaVersionMigrate.
  6. Once a migration is file is created, import it into schema/migrations/index.ts. Make sure you do something like:
    import version2 from './2';
    const migrations: Map<StateVersion, Migration> = new Map;
    migrations.set(2 as StateVersion, version2);
  7. The upgradeVersion should be updated to cycle over all migrations and execute them. And it has to also copy over the changed node state. We can attempt to do this atomically (somewhat) with 2 renames. Renaming .polykey to .polykey.tmp and then renaming the .polykey.tmp/migrations.tmp to .polykey and if successful delete the .polykey (no-need to restart the PolykeyAgent program, as everything should work as normal). This however hits a problem involving the usage of Status in which there is a lock covering it. Also on Windows, it doesn't appear that there are atomic guarantees. At any case the reason to do this is to keep the original state of .polykey intact in .polykey.tmp when this is occurring. And then then the new upgraded schema is in .polykey/migrations.tmp, that has to be moved to .polykey.

Now here's a prototype of upgradeVersion.

  /**
   * This is only called when the version is older.
   * So it is expected that the stateVersion here is the oldest one.
   * So we only want to upgrade.
   * We can say that if we failed to upgrade
   * We can catch and indicate this
   */
  protected async upgradeVersion(stateVersion: StateVersion): Promise<void> {
    return await this.lock.write(async () => {
      if (!migrations.has(stateVersion + 1 as StateVersion)) {
        throw new schemaErrors.ErrorSchemaVersionTooOld;
      }
      try {
        await utils.mkdirExists(this.fs, this.migrationPath);
        await this.fs.promises.cp(
          this.nodePath,
          this.migrationPath,
          {
            force: true,
            recursive: true,
            filter: (p) => !utils.pathIncludes(p, this.migrationPath)
          }
        );
        for (let i = stateVersion; i < config.stateVersion; i++) {
          const migration = migrations.get(i + 1 as StateVersion);
          if (migration == null) {
            throw new schemaErrors.ErrorSchemaMigrationMissing;
          }
          try {
            await migration(this.migrationPath);
          } catch (e) {
            throw new schemaErrors.ErrorSchemaMigrationFail;
          }
        }
        // TODO:
        // migrationPath is ready to copied as nodePath
        // use 2 fs.rename should be atomic
        this.writeVersion(config.stateVersion as StateVersion);
      } finally {
        await this.fs.promises.rm(this.migrationPath, {
          force: true,
          recursive: true
        });
      }
    });
  }

Regarding the Status. This is the only problem because a lock is held on it. We are able to copy files all over, because nothing should be operating while a migration is occuring, and this happens on PolykeyAgent.start. However the Status is started earlier than Schema, which means Status could cause problems when moving .polykey to .polykey.tmp. On Linux, the lock is held on the file descriptor and hence related to the inode. It's not related to the name on the filesystem. Which means even if I hold a flock on the file, I can move it. On Windows, this may not work.

Another solution may be to consider the Status separately from the rest of the system. That would mean .polykey/status should be separate from .polykey/state.

.polykey/status - this doesn't migrated
.polykey/state
.polykey/migrations.tmp

So then the atomic renaming option works against a subdirectory of state that contains keys, db... etc. This would also mean that status is kept static, and this works if there are other concurrent commands to do pk agent start or pk bootstrap as they would be blocked from interacting with the state as the status is locked.

Another alternative is not to do the renaming at all, but instead just use fs.promises.cp again to copy back over the .polykey. But this doesn't give us the nice backup.

So it seems that we would need to change our state structure:

.polykey/
.polykey/status - used by Status
.polykey/state
.polykey/version - used by Schema
.polykey/state/vaults
.polykey/state/keys
.polykey/state/db
.polykey/state/...
.polykey/state.bak <- instead of state.tmp (can be backed up during migrations, and deleted after doing diagnostics)
.polykey/state.tmp <- instead of migrations.tmp
.polykey/client - clientPath

It is essential that migrations do not corrupt the node state. And if there is a corruption, then it must not be silent. Silent corruptions are deadly! If we end up deleting the old .polykey.tmp there's no backup of the current polykey state. We may consider not to delete at all if there's a problem, and instead keep it around and tell the user to delete it if they are comfortable with everything. So maybe instead there needs to be a full diagnostic at the end of the migration before we automatically delete the .polykey.tmp.

In the context of backups, backups should be done from 1 PK node to another PK node. But if a fleet of PK nodes are all upgraded and there's a silent corruption, this can cause all backups to be dead. Therefore it's a good idea for us to build in some form of archival backup by "exporting" all PK secrets. The PK secret state is held in leveldb. An "exported" flat file might just be a big JSON file, or a big CSV file or a number of files that are archived up then encrypted.

Additional context

Tasks

  1. - Spec out how the new state move would work with the other domains, it's a good thing that the paths for all domains are dependency injectable
  2. - Spec out how to ensure we know that the migrations are successful and we can proceed to delete the old pk state
  3. - Ensure that users have an alternative way of backing up PK state without relying on Keynode synchronisation
  4. Prepare the pkgs.nix and ensure that node16 works
  5. Update the Schema.upgradeVersion to incorporate the new changes
  6. Update Schema to use statePath, it will be assumed that it can write to the parent directory of statePath. Which means statePath is not allowed to be / "root".
@CMCDragonkai CMCDragonkai added the development Standard development label Nov 7, 2021
@CMCDragonkai
Copy link
Member Author

Note the usage of fs.cp and the filter option. This uses a new utility utils.pathIncludes that I got from https://stackoverflow.com/a/45242825/582917 but tuned it to mean "does path 1 include path 2". This only considers resolved paths of course, and it will resolve paths automatically.

The filter then ignores the migration directory. But if we move towards to having a state subdirectory that schema can do the migration over, then this is less needed, as won't have to filter anything out.

@CMCDragonkai
Copy link
Member Author

@CMCDragonkai
Copy link
Member Author

Note the error when renaming a directory into an existing directory with stuff:

> fs.renameSync('./blah', './blah2')
Uncaught Error: ENOTEMPTY: directory not empty, rename './blah' -> './blah2'
    at Object.renameSync (fs.js:794:3) {
  errno: -39,
  syscall: 'rename',
  code: 'ENOTEMPTY',
  path: './blah',
  dest: './blah2'

@CMCDragonkai
Copy link
Member Author

Note that I'm not using /tmp or mkdtemp because I cannot trust that /tmp is secure and is on the same filesystem thus losing some atomicity guarantees.

In this case it's just cleaner everything is done within the same node path.

@CMCDragonkai
Copy link
Member Author

Backup and Restore specced out here: #288. It seems both of these issues should be worked out by 1 implementation.

@CMCDragonkai
Copy link
Member Author

One issue with using .polykey/state is that what manages that specific path.

One can imagine that Schema's job is to create that statePath. Because the PolykeyAgent uses Schema to build statePath. Which it then passes .polykey/state/keys into the KeyManager.

It just seems a bit weird for statePath to be managed by Schema. But I'm trying this out in MR https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213 for now.

Note that means fresh in Schema means it actually removes all of the state path not just the version in this case. But that's what it means to have fresh be used.

Copy link
Contributor

With Polykey being mostly complete and usable now. We need to look into completing this issue before making any changes to the DB schema what would break state between versions. I'm upping the priority on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management
Development

No branches or pull requests

2 participants