
Fine Grained Concurrency Control Standardisation #294

Closed
CMCDragonkai opened this issue Nov 23, 2021 · 50 comments · Fixed by #374 or #419
Labels: development (Standard development), epic (Big issue with multiple subissues), r&d:polykey:core activity 1 (Secret Vault Sharing and Secret History Management), r&d:polykey:core activity 3 (Peer to Peer Federated Hierarchy)



CMCDragonkai commented Nov 23, 2021

Specification

Locking in js-polykey has gone through a lot of iterations. The most recent iteration is in js-db's usage of locks, where fine-grained locks provided by async-mutex are used, along with the RWLock abstraction in src/utils.ts.

For most domains such as ACL, the locks are too coarse-grained, forcing one to lock the entire domain. Many of these locks should be replaced with the usage of DB transactions, as is done in js-encryptedfs.

In some places, fine-grained locks can replace the existing coarse-grained locking or even replace the usage of condition variables. Currently src/network/Connection.ts and its derived classes make use of a _composed boolean which does double duty: it indicates when composition is done, and it prevents repeated concurrent calls of compose. The first duty is fine, but the second should be done with a fine-grained lock shared between all the calls that should be blocked while a composition operation is occurring. This affects all the methods that currently check _composed and throw exceptions when it is not true.
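
As a rough sketch of what that replacement could look like (assuming async-mutex's Mutex; the class shape and composition body here are hypothetical stand-ins for Connection and its derived classes):

import { Mutex } from 'async-mutex';

class Connection {
  protected _composed: boolean = false;
  protected composeLock: Mutex = new Mutex();

  public async compose(): Promise<void> {
    // Concurrent compose calls queue on the lock instead of racing on `_composed`
    await this.composeLock.runExclusive(async () => {
      if (this._composed) return;
      // ... perform the composition work
      this._composed = true;
    });
  }
}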

Transactions have been introduced to js-db. With this we can replace a lot of the existing locking with the use of the new db transactions. The general changes that need to be implemented are as follows.

  1. Update any transact wrappers by removing them or using withF, withG locking directly.
  2. In all cases where there is a conflict, just throw it up the stack. We will expect to handle them within the handlers, or look deeper into it later.
  3. The ErrorDBTransactionConflict error should never be seen by the user. We should catch and override it with a more descriptive error for the context.
  4. Transactions should be started within the handlers and passed all the way down to where they are needed. The idea is to make each handler atomic.
  5. Concurrency testing should be introduced, but only after SI transactions have been implemented.
  6. All usage of DB should be updated to use the new API. This means removing sublevels and utilising LevelPath and KeyPath instead.
  7. All usage of db streams should be replaced with the db iterator.
  8. All instances of db.put, db.get and db.del should be using transactions via tran.put/get/del (see the sketch after this list).
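
As a sketch of task 8 (the key paths here are hypothetical, and a db instance with the withTransactionF convenience used later in this issue is assumed):

const permKey = ['acl', 'perms', 'node1'];
await db.withTransactionF(async (tran) => {
  const perm = await tran.get<string>(permKey); // was: db.get(permKey)
  await tran.put(permKey, 'updated-perm');      // was: db.put(permKey, ...)
  await tran.del(['acl', 'refs', 'node1']);     // was: db.del(...)
});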

This applies to all domains that make use of DB, or domains that depend on others that make use of DB. The goal here is to make every operation, starting from the handlers, atomic.

There are limitations to this, however. Since a transaction can fail if there are overlapping edits between transactions, we can't really include changes to the db that will commonly or always conflict. Examples of this are counters or commonly updated fields. So far this has been seen in:

  1. NotificationsManager. Makes use of a counter, so any transactions that include adding or removing a notification WILL conflict. Reads also update metadata, so concurrently reading the same message WILL conflict.
  2. More to follow?

In some cases we will need to make use of locking along with a transaction. A good example of this is in the NotificationsManager, where we are locking the counter update. When this is the case we need to take extra care with the locking: unless the lock wraps the whole transaction, it is still possible to conflict on the transaction, so we can't compose operations that rely on this locking with larger transactions.

An example of this problem:

start tran1
start tran2

start lock
	tran1 update counter
end lock

start lock
	tran2 update counter
end lock

end tran1
end tran2 // conflict!

To avoid this, tran2 has to start after tran1 ends. We need to force serialisation of the transactions here. As a result, we can't compose with a transaction outside of the lock.

This means that some operations or domains can't be composed with larger transactions. It has yet to be seen whether this will cause an issue, since more testing is required to confirm any problem. I suppose this means we can't mix pessimistic and optimistic transactions. So far it seems it will be a problem with the following domains:

  1. Vaults domain - Lifecycle and editing of vaults relies heavily on locks.
  2. NotificationsManager - Locking is needed for the counter and preventing other conflicts.

Note that this has nothing to do with IPC locking as in #290.

Additional Context

Tasks

  1. Update any transact wrappers by removing them or using withF, withG locking directly.
  2. In all cases where there is a conflict, just throw it up the stack. We will expect to handle them within the handlers, or look deeper into it later.
  3. The ErrorDBTransactionConflict error should never be seen by the user. We should catch and override it with a more descriptive error for the context.
  4. Transactions should be started within the handlers and passed all the way down to where they are needed. The idea is to make each handler atomic.
  5. Concurrency testing should be introduced, but only after SI transactions have been implemented.
  6. All usage of DB should be updated to use the new API. This means removing sublevels and utilising LevelPath and KeyPath instead.
  7. All usage of db streams should be replaced with the db iterator.
  8. All instances of db.put, db.get and db.del should be using transactions via tran.put/get/del.
@CMCDragonkai

The development of the RWLock system here https://gist.github.com/CMCDragonkai/4de5c1526fc58dac259e321db8cf5331 may be usable by a number of domains to increase their concurrency.

@CMCDragonkai

The RWLock system has been improved with the ability to do read-preferring or write-preferring.

The async-init has been upgraded to use locking for its start, stop, destroy methods. This makes it a lot better now; they are using a write-preferring RWLock.

There was some discussion about locking abstractions: the usage of a generic with, since a lot of domains are using a sort of transact or withTransaction method, and this can be made into a generic withF or withG utility function that also avoids the need to nest a bunch of transact callbacks.

I'm just trying to find where I wrote that.

CMCDragonkai self-assigned this Jan 17, 2022
@CMCDragonkai

The commentary about a general with is here: #310 (comment)

It came out of discussion about general resource management which includes locking.

@CMCDragonkai

Attempting to implement the withF combinator. One of the problems is avoiding lots of overloaded type signatures. I originally thought I'd have to implement it similar to Promise.all. But it turns out tuple types are spreadable in TS 4.0.

However I also need to map & index into the tuple type before spreading. This is a bit more complicated; I had to ask an SO question about this: https://stackoverflow.com/questions/70736753/how-to-map-index-into-tuple-types-that-is-generically-spread-in-typescript


CMCDragonkai commented Jan 17, 2022

Initial attempt; however the types are not correct, because the f callback still takes a spread of the resources. We need a magic type constructor to convert Resources into the second element of the return type of each resource.

type Resource<T = void> = () => Promise<[
  release: () => Promise<void>,
  resource: T
]>;

async function withF<
  Resources extends Array<Resource<unknown>>,
  Result
>(
  resources: Resources,
  f: (...resources: [...Resources]) => Promise<Result>
): Promise<Result> {
  const releases = [];
  const items = [];
  try {
    for (const resource of resources) {
      const [release, item] = await resource();
      releases.push(release);
      items.push(item);
    }
    return await f(...items);
  } finally {
    releases.reverse();
    for (const release of releases) {
      await release();
    }
  }
}


CMCDragonkai commented Jan 17, 2022

It appears one of the problems is that mapped types don't seem to work correctly for tuple types.

For example:

type Api = {
  createItem: (name: string) => Promise<any>;
  getItem: (id: number) => Promise<any>;
  updateItem: (item: any) => Promise<any>;
  deleteItem: (id: number) => Promise<void>;
};

// type Api = [() => number, () => string];

type NewApi = {
  [K in keyof Api]: ReturnType<Api[K]>
};

The above type-checks; however if you switch to using the second Api, where it's a tuple of functions, then it doesn't type-check. I thought mapped types already worked fine with tuples, however this does not appear to be the case. And if it's not possible to map into a tuple type like this, then we can't really create the signature we want for withF.

@CMCDragonkai

Actually maybe the Resource can return a record type instead, and then we could actually map into it.

However we will need to also index into the Promise type to get rid of the promise wrapper.
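
A quick sketch of that idea (the record of acquirers is hypothetical): mapping over a record works, and a conditional type with infer can index through the Promise wrapper and the tuple at the same time:

// A record of resource acquirers rather than a tuple
type Acquires = {
  conn: () => Promise<[() => Promise<void>, string]>;
  count: () => Promise<[() => Promise<void>, number]>;
};

// Map into the record, unwrapping the Promise and indexing into the tuple
type Resources<T> = {
  [K in keyof T]: T[K] extends () => Promise<[unknown, infer R]> ? R : never;
};

type Acquired = Resources<Acquires>; // { conn: string; count: number }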

@CMCDragonkai

Based on this answer: https://stackoverflow.com/a/60713409/582917, there appears to be a way to do this.

We will need to dispense with ReturnType; it just doesn't work here. However, ReturnType relies on conditional type expressions and the usage of the infer keyword, which we can use as well.

Here's an example:

type Call<R> = (...args: any[]) => R;

type FunctionReturns<T extends Record<number, Call<any>>> = {
  [K in keyof T] : T[K] extends Call<infer R> ? R: never
}

type FunctionReturns2<T extends readonly Call<any>[]> = {
  [K in keyof T] : T[K] extends Call<infer R> ? R: never
}

function getReturns<
  T extends (readonly [Call<any>] | readonly Call<any>[])
>(fs: T): FunctionReturns2<T> {
  // Escape hatch!
  return fs.map(f => f()) as any;
}

getReturns([() => 123, () => 'abc', () => [1,2] as const])

// force inference as a tuple, and not as an array
const fs = [() => 123, () => 'abc'] as const;

getReturns(fs);

The getReturns here can return either a tuple type or an array type depending on what the input type is. This is what the as const does: it ensures that TS infers the array as a tuple; by default TS assumes it is an array.

Notice I have 2 variants. The FunctionReturns2 constrains T to be a tuple type.

It is essential to understand that tuples are "readonly arrays". They have to be written like readonly X[] or readonly [X].

@CMCDragonkai

Note that the equivalent FunctionReturns using what ReturnType does would look like this:

type ReturnType<T extends (...args: any) => any> = T extends (...args: any) => infer R ? R : any

However it's better for us to have a separate Call type just so that we can refer to it in the getReturns.

@CMCDragonkai

Some solutions are incoming...

type ResourceAcquire<T = void> = () => Promise<[() => Promise<void>, T?]>;

type Resources<T extends readonly ResourceAcquire<any>[]> = {
  [K in keyof T] : T[K] extends ResourceAcquire<infer R> ? R: never
}

async function withF<
  ResourceAcquires extends readonly [ResourceAcquire<unknown>] | readonly ResourceAcquire<unknown>[],
  Result
>(
  resourceAcquires: ResourceAcquires,
  f: (...resources: Resources<ResourceAcquires> & unknown[]) => Promise<Result>
): Promise<Result> {
  const releases: Array<() => Promise<void>> = [];
  const resources: Array<unknown> = [];
  try {
    for (const resourceAcquire of resourceAcquires) {
      const [release, resource] = await resourceAcquire();
      releases.push(release);
      resources.push(resource);
    }
    return await f(...resources as unknown as Resources<ResourceAcquires>);
  } finally {
    releases.reverse();
    for (const release of releases) {
      await release();
    }
  }
}

async function x() {
  let count: number = 0;
  const h: ResourceAcquire<number> = async () => {
    ++count;
    return [async () => { --count; }, count];
  };
  await withF(
    [
      h,
      async () => {
        return [async () => { }];
      }
    ],
    async (c, cs) => {
      console.log(c, cs);
      return c;
    }
  );
}

x();

One of the strange things is when we make the resource array constant. Functions whose parameters don't have readonly applied will fail if given a readonly array. If we say the parameter is readonly, we're guaranteeing that we won't modify the array, so a strictly readonly parameter is technically more flexible. With a few more tweaks on the above, things should work.

@CMCDragonkai

Slight extension:

type ResourceAcquire<T = void> = () => Promise<readonly [() => Promise<void>, T?]>;

type Resources<T extends readonly ResourceAcquire<any>[]> = {
  [K in keyof T] : T[K] extends ResourceAcquire<infer R> ? R: never
}

async function withF<
  ResourceAcquires extends readonly [ResourceAcquire<any>] | readonly ResourceAcquire<any>[],
  Result
>(
  resourceAcquires: ResourceAcquires,
  f: (...resources: Resources<ResourceAcquires> & any[]) => Promise<Result>
): Promise<Result> {
  const releases: Array<() => Promise<void>> = [];
  const resources: Array<any> = [];
  try {
    for (const resourceAcquire of resourceAcquires) {
      const [release, resource] = await resourceAcquire();
      releases.push(release);
      resources.push(resource);
    }
    return await f(...resources as unknown as Resources<ResourceAcquires>);
  } finally {
    releases.reverse();
    for (const release of releases) {
      await release();
    }
  }
}

async function x() {
  let count: number = 0;

  await withF(
    [
      async () => {
        return [async () => { }];
      }
    ] as const,
    async (c) => {
      console.log(c);
      return c;
    }
  );

  const h: ResourceAcquire<number> = async () => {
    ++count;
    return [async () => { --count; }, count];
  };

  await withF(
    [
      h,
      async () => {
        return [async () => { } ];
      }
    ] as const,
    async (c, cs) => {
      console.log(c, cs);
      return c;
    }
  );

  const arr = [
    h
  ] as const;

  await withF(arr, async (c) => {
    console.log(c);
  });

  const y = async () => {
    return [async () => {}] as const;
  }

  const arr2 = [
    y
  ] as const;

  await withF(arr2, async (c) => {
    console.log(c);
  });

}

x();

Notice the usage of as const; that is required if there's no explicit typing of the resources as a tuple.

@CMCDragonkai

Changed to any instead of unknown:

type ResourceAcquire<T = void> = () => Promise<readonly [() => Promise<void>, T?]>;

type Resources<T extends readonly ResourceAcquire<any>[]> = {
  [K in keyof T] : T[K] extends ResourceAcquire<infer R> ? R: never
}

async function withF<
  ResourceAcquires extends readonly [ResourceAcquire<unknown>] | readonly ResourceAcquire<unknown>[],
  Result
>(
  resourceAcquires: ResourceAcquires,
  f: (...resources: Resources<ResourceAcquires> & any[]) => Promise<Result>
): Promise<Result> {
  const releases: Array<() => Promise<void>> = [];
  const resources: Array<unknown> = [];
  try {
    for (const resourceAcquire of resourceAcquires) {
      const [release, resource] = await resourceAcquire();
      releases.push(release);
      resources.push(resource);
    }
    return await f(...resources as unknown as Resources<ResourceAcquires>);
  } finally {
    releases.reverse();
    for (const release of releases) {
      await release();
    }
  }
}


CMCDragonkai commented Jan 18, 2022

Attempting to do the withG variant, and discovered this interesting behaviour:

async function* g1 () {
  const x = yield 1;
  yield x;
  return 3;
}

async function* g2 () {
  try {
    return yield* g1();
  } finally {
    console.log('DONE!');
  }
}

async function main () {
  const g = g2();

  console.log('first', await g.next());
  console.log('second', await g.next(2));
  console.log('third', await g.next());

  for await (const v of g2()) {
    console.log(v);
  }

}

main();

The result is:

first { value: 1, done: false }
second { value: 2, done: false }
DONE!
third { value: 3, done: true }
1
undefined
DONE!

The finally block runs before the value is returned. This isn't really a huge issue, since by that time all asynchronous work has been done.

Furthermore, the return value is not processed by for await. It only looks at the values being yielded.

@CMCDragonkai

The withF and withG are now integrated in #266. Example code follows:

type ResourceAcquire<Resource = void> = () => Promise<
  readonly [ResourceRelease, Resource?]
>;

type ResourceRelease = () => Promise<void>;

type Resources<T extends readonly ResourceAcquire<any>[]> = {
  [K in keyof T]: T[K] extends ResourceAcquire<infer R> ? R : never;
};

/**
 * Make sure to explicitly declare or cast `acquires` as a tuple using `[ResourceAcquire...]` or `as const`
 */
async function withF<
  ResourceAcquires extends
    | readonly [ResourceAcquire<unknown>]
    | readonly ResourceAcquire<unknown>[],
  T,
>(
  acquires: ResourceAcquires,
  f: (resources: Resources<ResourceAcquires>) => Promise<T>,
): Promise<T> {
  const releases: Array<ResourceRelease> = [];
  const resources: Array<unknown> = [];
  try {
    for (const acquire of acquires) {
      const [release, resource] = await acquire();
      releases.push(release);
      resources.push(resource);
    }
    return await f(resources as unknown as Resources<ResourceAcquires>);
  } finally {
    releases.reverse();
    for (const release of releases) {
      await release();
    }
  }
}

/**
 * Make sure to explicitly declare or cast `acquires` as a tuple using `[ResourceAcquire...]` or `as const`
 */
async function* withG<
  ResourceAcquires extends
    | readonly [ResourceAcquire<unknown>]
    | readonly ResourceAcquire<unknown>[],
  T = unknown,
  TReturn = any,
  TNext = unknown,
>(
  acquires: ResourceAcquires,
  g: (
    resources: Resources<ResourceAcquires>,
  ) => AsyncGenerator<T, TReturn, TNext>,
): AsyncGenerator<T, TReturn, TNext> {
  const releases: Array<ResourceRelease> = [];
  const resources: Array<unknown> = [];
  try {
    for (const acquire of acquires) {
      const [release, resource] = await acquire();
      releases.push(release);
      resources.push(resource);
    }
    return yield* g(resources as unknown as Resources<ResourceAcquires>);
  } finally {
    releases.reverse();
    for (const release of releases) {
      await release();
    }
  }
}

export { withF, withG };

export type { ResourceAcquire, ResourceRelease };

It's important to start planning the usage of this for other domains like nodes and wherever there's a transaction callback, but the first one I'll be trying it out on is the Vaults domain.


CMCDragonkai commented Jan 20, 2022

Partially resolved by these commits, which have been cherry-picked into master:

  • 2862e01 - using write-preferring RWLock for Schema and make it available to other implementations
  • e2e1cc8 - withF and withG are now available to be used

@tegefaulkes once you rebase on master, you can also start using the RWLock and withF and withG in #310.

@CMCDragonkai

The node connections are now using withF and withG. #266 will introduce them into the vaults domain.

@CMCDragonkai

While working on #326, I noticed that the KeyManager doesn't have any locking. This is because locking concepts were developed afterwards. It should have locking built into many of the methods like renewKeyPair. We don't want race conditions between renewKeyPair, resetKeyPair and resetRootCert.
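
A hedged sketch of what that could look like, reusing the write-preferring RWLock described above (the method signatures and bodies here are hypothetical placeholders):

class KeyManager {
  protected lock: RWLock = new RWLock();

  public async renewKeyPair(password: string): Promise<void> {
    const release = await this.lock.acquireWrite();
    try {
      // ... generate and persist the new root key pair and certificate
    } finally {
      await release();
    }
  }

  public async resetKeyPair(password: string): Promise<void> {
    const release = await this.lock.acquireWrite();
    try {
      // ... serialised against renewKeyPair and resetRootCert
    } finally {
      await release();
    }
  }
}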

@CMCDragonkai

Some things to note:

  1. The NodeConnectionManager.acquireConnection does not need to be async.
  2. The withG isn't being used in NodeConnectionManager due to some issue with try catch, but this should be investigated
  3. In NodeGraph I'm prototyping with this:
      protected lock: RWLock = new RWLock();
    
      public acquireLockRead(lazy: boolean = false): ResourceAcquire<RWLock> {
        return async () => {
          let release: ResourceRelease;
          if (lazy && this.lock.isLocked()) {
            release = async () => {};
          } else {
            const r = await this.lock.acquireRead();
            release = async () => r();
          }
          return [release, this.lock];
        };
      }
    
      public acquireLockWrite(lazy: boolean = false): ResourceAcquire<RWLock> {
        return async () => {
          let release: ResourceRelease;
          if (lazy && this.lock.isLocked()) {
            release = async () => {};
          } else {
            const r = await this.lock.acquireWrite();
            release = async () => r();
          }
          return [release, this.lock];
        };
      }
    
  4. Take note of the lazy boolean. This is because some of the public methods of the class may be called by other public methods of the same class, and we want to avoid the class deadlocking itself. This is because neither async-mutex nor RWLock is a re-entrant mutex: https://en.wikipedia.org/wiki/Reentrant_mutex. Right now "ownership tracking" is only done directly with the lock being a protected property; however it's possible to acquire this lock under special cases.
  5. This may mean we replace _transaction with withF([this.acquireLockWrite(true)], ...) all the time. If that is the case, we may want lazy to be true by default...
  6. In RWLock I've added 2 additional methods, isLockedReader and isLockedWriter; perhaps these should be used by the lazy acquisition?

Further prototyping is required to realise the application of this to all domains that are using _transaction. Especially with ACL and GestaltGraph.

@CMCDragonkai

Right now the _transaction method is actually buggy. This is because it checks if the lock is already locked. This means that if a class wanted to expose 2 public methods which require mutual exclusion, using _transaction won't work, because if the lock is already locked, the second invocation just passes through.

So the usage of it was to allow re-entrancy; however it's not able to track that these are 2 independent calls.

Consider this difference:

class A {
  // Both `pubA` and `pubB` are locking
  pubA () { this.pubB() }
  pubB () { ... }
}

If the call stack was:

A.pubA() -> A.pubB()

We would expect that pubB should be re-entrant, so it doesn't cause a deadlock.

However if the call stack was instead:

A.pubA()
A.pubB()

Where they are called simultaneously & concurrently, we now want pubA and pubB to block each other.

So _transaction isn't enough, and neither is the lazy boolean. We need a more automatic way to track the ownership and call stack.

@CMCDragonkai

The quick solution is to extract the common functionality out of pubA and pubB into a protected method which does not lock. This way pubA calls the protected method, and both public methods can block each other as normal. This means re-entrancy isn't required, because it's built into the structure of the code. It just means we have to be careful not to have public methods call other public methods when these public methods are state-mutating procedures.
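
A sketch of that structure (assuming a Lock whose lock() returns a ResourceAcquire, as in js-async-locks):

class A {
  protected lock: Lock = new Lock();

  public async pubA(): Promise<void> {
    await withF([this.lock.lock()], async () => {
      await this.doWork(); // no re-entrancy needed
    });
  }

  public async pubB(): Promise<void> {
    await withF([this.lock.lock()], async () => {
      await this.doWork();
    });
  }

  // Common state-mutating functionality, extracted and never locking;
  // concurrent pubA() and pubB() calls still block each other
  protected async doWork(): Promise<void> {
    // ... mutate state
  }
}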

CMCDragonkai mentioned this issue Feb 21, 2022
@CMCDragonkai

It seems the DBTransaction was the only way to maintain context in a call stack/graph.

Basically methods of INodeManager take a tran: DBTransaction as the first parameter. This allows one to build up a transactional context for methods to run in.

Then if pubA calls pubB, they pass the same transactional context. This was made the first parameter, since they all require it. However unlike INodeManager, if we were doing this we would need to make it an optional parameter at the very end, so that normal calls to pubA would still work.

This context is different from the instance context, since it only exists during the call stack.

That would be the only way to track the lock ownership on a call-basis.

If we were to do this for RWLock, then it could be done.

However NodeGraph also uses the DB. So for such a lock, it would make sense to tie our lock context to the DB locking as well.

It is possible to create an acquisition function for the DB transaction:

  public acquireTransaction: ResourceAcquire<DBTransaction> = async () => {
    const tran = new Transaction({ db: this.db });
    return [
      async (e?: Error) => {
        if (e == null) {
          try {
            await tran.commit();
          } catch (e) {
            await tran.rollback();
            throw e;
          }
          await tran.finalize();
        } else {
          await tran.rollback();
        }
      },
      tran
    ];
  };

To do so, a little change must be done on withF and withG:

type ResourceRelease = (e: Error | undefined) => Promise<void>;

It's backwards compatible, but basically means the resource releaser will now have a reference to the error in case it was thrown during the acquisition or the handler call.
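
A simplified sketch of how withF could thread the error through to the releasers (the resource typing is elided here; the actual change in js-resources may differ):

async function withF<T>(
  acquires: Array<ResourceAcquire<unknown>>,
  f: (resources: Array<unknown>) => Promise<T>,
): Promise<T> {
  const releases: Array<ResourceRelease> = [];
  const resources: Array<unknown> = [];
  let error: Error | undefined;
  try {
    for (const acquire of acquires) {
      const [release, resource] = await acquire();
      releases.push(release);
      resources.push(resource);
    }
    return await f(resources);
  } catch (e) {
    error = e as Error;
    throw e;
  } finally {
    releases.reverse();
    for (const release of releases) {
      // Releasers see the error, e.g. so a transaction can rollback
      await release(error);
    }
  }
}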

@CMCDragonkai

Applying the concept of withF and withG to DBTransaction means that different domains should share a transaction context in order to maintain atomicity.

Consider a "transaction" that takes place between ACL and NodeGraph. An example usage might look like this:

withF([
  acquireTransaction()
], async ([tran]) => {
  await nodeGraph.doSomething(tran);
  await acl.doSomething(tran);
});

The result is that any mutation here is atomic between nodeGraph and acl. Compare this to GestaltGraph, where the GG just locks both itself and ACL to perform mutation. There's no atomicity, so it's possible for mutations to ACL to persist while mutations to GG are lost.

At the same time it's a sledgehammer because it locks the entire GG and ACL together. There's no fine-grained concurrency control here.

The usage of the transaction allows one to be more fine-grained, since the interactions operate entirely on the transaction context. Note that transactions are somewhat limited due to iteration, but that may be resolved by MatrixAI/js-db#4.

It is the combination of smaller locks and more sophisticated locks like RWLock, along with the transaction snapshot system, that enables fine-grained concurrency control.

CMCDragonkai changed the title from "Fine Grained Locking Standardisation" to "Fine Grained Concurrency Control Standardisation" Mar 16, 2022

CMCDragonkai commented Mar 16, 2022

What if nodeGraph.doSomething wants to interact with ACL? Like in GestaltGraph. What would this look like?

We want to use withF or withG everywhere where this pattern might take place, so it's turtles all the way down.

Methods should take a transaction parameter, most likely optionally, so that the method can still work within its own transaction if it doesn't need to be shared with any other higher-level methods.

class NodeGraph {
  public async doSomething(tran?: DBTransaction): Promise<void> {
    if (tran == null) {
      return await withF([
        acquireTransaction()
      ], async ([tran]) => {
        return this.doSomething(tran);
      });
    }
    await this.acl.doSomething(tran);
    // continue with mutations
  }
}

@CMCDragonkai

Also, the passing of the tran object around is quite flexible but kind of verbose; it would be nice if we had a monad context that operations can run within. The only issue is that it would require an object like iNodeMgrTran.dirCreate(), and that cannot extend to other objects, like a transaction between domains. It only works because, say, ACL and other domains would use DBTransaction together. A shared monadic context would need to combine a transactional "proxy" object that allows calls to all the domains' regular methods.

See https://www.measurethat.net/Benchmarks/Show/6274/4/access-to-proxy-vs-object; most important would be to avoid nesting proxies as per https://gomakethings.com/better-proxy-performance-in-vanilla-js/. Accessing a proxy vs accessing an object isn't that bad; it's just an extra pointer dereference.

I could imagine something like:

// Each domain uses their own custom wrapper withTransactionF that creates a transaction object with associated locking properties
this.withTransactionF(async (tran) => {
  // tran is DBTransaction shared between domains
  // associated locks is setup ahead of time with withTransactionF
  // other resources required would be designed ahead of time
  this.doSomething(tran);
  acl.doSomething(tran);
  gg.doSomething(tran);
});

This would be a "proxy" based API:

withTransactionF(this, acl, gg, async (thisProxy, aclProxy, ggProxy) => {

  // proxy
  thisProxy.doSomething();

  // can't do this
  // these have to be part of the proxy
  // acl.doSomething();
  // gg.doSomething();

  aclProxy.doSomething();
  ggProxy.doSomething();

});

@CMCDragonkai

The proxy is just used so that it will maintain the transaction context across calls to the underlying domain.

The callback is still needed to know when a context must be started and ended. It naturally "scopes" the context. The proxy objects mean that you don't need to do explicit passing of tran around.

This possibly avoids having to have a tran parameter on every method; instead the tran property can be referenced via this.

The proxy can actually be a "decorator" of sorts. But it's important to maintain type-safety, so that the proxy can still be used whenever some function requires the same domain type.

The creation of the proxy is therefore the same as any other resource, and fits in the resource API.

withF([acl.transactionProxy(), gg.transactionProxy()], async ([aclProxy, ggProxy]) => {
  // now use the proxies
});

This abstracts above the transaction() calls which return a DBTransaction, and instead creates a proxy of an object that contains the transaction context.

One can abstract DBTransaction to a generic "software transaction", ultimately based on some sort of state. It can take a "transaction" interface that anything that has state can provide. Of course one has to fix the isolation level of that transaction here.

But you'd need to figure out how to incorporate locks as well in their relevant domains too.

Just a crazy idea, but the benefit is just to avoid having to pass the tran around.

For future investigation.


CMCDragonkai commented Apr 11, 2022

I feel like there's a way to extend the ES6 promise for one situation (or better a decorator around a promise, so that subsequent promises remain the same), and this would enable you to do something like:

async method(tran?) {
  tran = makeTran(tran, ino);
  await doAsyncStuff();
  doSyncStuff();
}

And after await method() finishes, at the very end the tran is committed. It would require makeTran to return an augmented promise: one that takes the then method and adds something to be completed at the end using finally.

Unfortunately I'm not sure how to do it. I tried adding a finally into the augmented promise, but finally activates as soon as the promise resolves, regardless of thens. You need to intercept at a specific then, but you don't know when this is, because you don't know how many thens there may be.

@tegefaulkes

Can't you already await a promise multiple times to get the same result? Wouldn't this be like caching the returned promise rather than augmenting the promise itself?


CMCDragonkai commented Apr 12, 2022

Not sure what you mean, but I've abandoned that approach. I'm moving ahead with how it is done right now in EFS. I just need to abstract the lock collection structure from both DB and PK into async-locks with MatrixAI/js-async-locks#5. I realised I forgot to filter out duplicate lock keys, which may cause deadlocks in a single call like locks.lock(1, 1). But that's going to be solved in MatrixAI/js-async-locks#6.


CMCDragonkai commented Apr 22, 2022

During our meeting, we discussed that there can be a slight overhead with using the DB transaction with locks.

Consider ACL.sameNodePerm (#294 (comment)). Suppose it takes tran?: DBTransaction as the last parameter.

Now technically this doesn't require a lock, since a check can be made through 2 reads, even if those reads may be inconsistent. But this is a good example to illustrate the problem.

Now if you don't pass a tran, the method knows what keys to lock when starting the transaction: namely, the 2 node ids need read locks so they can be checked.

But if you do pass tran, then the system won't attempt to lock those keys.

So the caller of sameNodePerm will need to know what it needs to lock before it calls ACL.sameNodePerm if it is intending to pass the tran. This knowledge has to be bubbled up by the programmer to the very beginning of wherever that transaction context starts, which could be several layers up.
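
A sketch of that burden (the identifiers here are hypothetical; locks is assumed to be a keyed lock collection whose lock() returns a ResourceAcquire):

// Without a tran, sameNodePerm can acquire its own locks internally
await acl.sameNodePerm(nodeId1, nodeId2);

// With a tran, the caller must know to acquire the read locks up front
await withF(
  [locks.lock(nodeId1, nodeId2), acquireTransaction()] as const,
  async ([, tran]) => {
    return acl.sameNodePerm(nodeId1, nodeId2, tran);
  },
);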

It's going to be quite complex to always keep track of what keys need to be locked.

In SQL, this problem doesn't exist in this way, because locks are within the transaction context, not outside.

Then if one were to call sameNodePerm within a SQL transaction, one would just lock those as part of a SELECT ... FOR SHARE. Note that SQL normally isn't even composable, so these problems tend to require the programmer to structure things differently, and not through just function calls but through query builders.

Now this would mean that you are attempting to acquire a lock while within the transaction. Is that necessarily a bad thing?

I reckon sameNodePerm could try to acquire those locks even if the outside were to pass in a transaction.

However right now, nested acquisitions would result in a deadlock.

withF([this.lock('A')], async () => {
  await withF([this.lock('A')], async () => {
    // DEADLOCK
  });
});

I believe this problem was mentioned already in #294 (comment), and the solution proposed was solution 2, where locks are also tracked as they are passed down the call graph with the tran, while #294 (comment) tells us when mutual exclusion would actually be necessary and when you can just use the transaction by itself without further locking.

So there are some further things to do:

  1. Bring locks back into DBTransaction, or at least provide an augmented object that combines the LockBox with the DBTransaction into a composed resource, such that subsequent acquisition of certain locks within the same tran context will just "flow through" and won't result in a nested deadlock (a sketch of such a composed resource follows after this list). Note that this will however bring back the possibility of deadlocks due to the lack of an enforced lock hierarchy, as the locking order is now determined by the call graph hierarchy. The only way around this is to design your code so that this doesn't happen, or to introduce deadlock tracking here with our new timeout mechanism: #294 (comment)
    • Combined LockBox and DBTransaction concept to track nested deadlocks
    • Lock hierarchy across call graph cannot be enforced, so deadlocks can now occur due to mis-ordered lock hierarchy
    • Will require deadlock detection by using timeouts and retries
  2. The constant propagation of the tran parameter is kind of annoying; future investigation into proxies & decorators could help (will require some typescript magic though): #294 (comment)
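
A rough sketch of the composed resource from point 1, reusing the acquireTransaction shape from earlier (the LockBox API here is an assumption, and a full version would also have to track held keys on the transaction so that nested acquisitions flow through):

function transactionWithLocks(
  keys: Array<string>,
): ResourceAcquire<DBTransaction> {
  return async () => {
    // Locks are acquired before the transaction begins
    const [lockRelease] = await locks.lock(...keys)();
    const [tranRelease, tran] = await acquireTransaction();
    return [
      async (e?: Error) => {
        try {
          await tranRelease(e); // commit or rollback
        } finally {
          await lockRelease();
        }
      },
      tran,
    ];
  };
}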

@CMCDragonkai

Note that the combined LockBox and DBTransaction is already in EncryptedFS, only the LockBox isn't really tracked along with the DBTransaction; it just isn't returned as a resource.


CMCDragonkai commented Apr 22, 2022

Some further reading:

  1. https://adit.io/posts/2013-05-15-Locks,-Actors,-And-STM-In-Pictures.html#software-transactional-memory
  2. https://rhishikesh.com/posts/mvcc_stm/
  3. https://convincedcoder.com/2018/09/01/Optimistic-pessimistic-locking-sql/

This tells me that our LevelDB transactions are quite similar to STM. In our case, the state is leveldb, and the transactions are all just software and in-memory in a single process.

One major difference is that the transactions in STM are meant to be "optimistic", and they are designed to take a snapshot of the state before the transaction starts, and then when they commit they check if their initial state is still the same before committing; if it is not, the whole transactional operation is retried from the beginning, but with the new value as the initial state. The user code, which is provided as a function that takes the initial state, is then re-run, but it can decide to abort the transaction if the initial state is unsatisfactory.

This is a form of optimistic concurrency control.

Our DBTransaction does not do this. Firstly, it does not keep track of the initial state of the database. This is related to snapshot isolation (I'm not sure if the definition of snapshot isolation differs across contexts). We do have a snapshot, but our snapshot doesn't keep track of the initial state of the DB; only if we were to use tran.iterator() would this occur. We can expose this to the end user if they really want to do this. But comparing equality against the transactional snapshot provided by the iterator seems quite inefficient, as many STM examples focus on a single value that can be atomically checked, not the entire database.

And during commits, it does not check if the initial state or snapshot state is consistent with the underlying database.

There is a way to make our DBTransaction similar to STM transactions.

One major change is to only compare the values that have been touched. This means any get, put or del or iterator operation would "touch" keys in the underlying database. Their original value gets recorded with a hidden iterator() that is created at the beginning of the transaction. This may also mean that we store the read value into the snapshot and end up with repeatable-read isolation level. Then during a commit operation, we compare the touched values from our iterator with what's in the DB (to do this we have to create another iterator on the DB at the point of committing), and if there are any discrepancies, reject the commit.

The problem is that the DBTransaction doesn't know what produced the transaction, so it doesn't know what to "re-run". At this point it could throw an exception and expect the user of the transaction to deal with it, like:

while (true) {
  const [tranRelease, tran] = await tranAcquire();
  await tran.put('x', 'y');
  try {
    await tranRelease();
    break;
  } catch (e) {
    if (e instanceof DBCommitWriteConflict) {
      // We have to restart the transaction entirely, because `x` was a conflict
      continue;
    }
    throw e;
  }
}

This means convenience methods would have to be used to deal with this situation in a nicer way, and they can just rerun the callback that is passed in.

Seems like we have quite a lot of flexibility to manipulate our DBTransaction, but depending on the way we do it, it will impact how we organise the code that uses it. It's a big cross-cutting concern.

@CMCDragonkai

Further reading here https://www.sobyte.net/post/2022-01/rocksdb-tx/ shows that rocksdb actually supports both optimistic transactions and pessimistic transactions.

The optimistic transaction follows what I wrote above. It also just throws an exception and expects the user to figure out what they want to do, which could mean a retry, or a loop goto with continue. The DB doesn't control how this is to be done.

In the pessimistic case, it bundles a lock manager into the transaction, and this lock manager also detects deadlocks, since by allowing locks to form within the transaction, deadlocks can occur. It also addresses re-entrant locking by allowing locks to be acquired if already acquired in the same transaction. But for locks that are deadlocked by other transactions due to different lock-ordering, the only solution is to use a timeout mechanism.

Once timed out, it can also do some diagnostics to find out why the transaction failed. In this case, you can track what locks you were waiting on, and these point to particular transactions with transaction IDs. You can then check what those transactions were waiting for, and if they were waiting for a lock held by your transaction ID, then you have a circular deadlock.
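
A toy illustration of that circular-wait check (hypothetical structures, not rocksdb's actual implementation):

type TranId = number;

// Maps each transaction to the transaction holding the lock it is waiting on
const waitingFor = new Map<TranId, TranId>();

function hasDeadlock(start: TranId): boolean {
  const seen = new Set<TranId>();
  let current: TranId | undefined = start;
  while (current != null) {
    if (seen.has(current)) return true; // a cycle in the wait-for graph
    seen.add(current);
    current = waitingFor.get(current);
  }
  return false; // reached a transaction that isn't waiting on anything
}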

@CMCDragonkai

This could mean that task 1 in #294 (comment) is just about bringing together a proper pessimistic DB transaction.

However, in many places in PK, it can probably do well with just optimistic DB transactions. More prototyping is needed here.


CMCDragonkai commented Apr 22, 2022

The js-resources has been updated so that ResourceAcquire functions will take an array of already allocated resources. This enables withF and withG to pass allocated resources in order to the resource acquisition functions. Basically this means resource acquisition may be parameterised based on a prior resource acquisition.

MatrixAI/js-resources#1

Note that I tried to do it a bit more elegantly using a POJO to form a resource DAG (https://github.com/microsoft/p-graph), but this was difficult to do, didn't allow users to form a strict order, and may require more machinery like memoization. So for now this is usable.

However I couldn't get the types correct so explicit types are needed for the resource types.

An example of this:

    withF([
      iNodeMgr.inoAllocation(),
      ([rootIno]: [INodeIndex]) => iNodeMgr!.transaction(rootIno)()
    ], async ([rootIno, tran]) => {
      // ... use rootIno and tran
    });

The reason this was done is because iNodeMgr.transaction() returns the ResourceAcquire<DBTransaction>. But the acquisition of DBTransaction itself did not take parameters, as this was done by transaction().

@CMCDragonkai

I'm going to need to implement pessimistic transaction support in DBTransaction, as well as a deadlock detector, to ensure that we don't have an unscalable approach of expecting every call to know what locks it needs to acquire before calling other domains' methods. Doing so should allow most of our domain methods to be composable. The alternative is optimistic transactions, which should function similar to software transactional memory, and should enable even more composability.

@CMCDragonkai

So most of the updates to the PK codebase should be doable, but the final DBTransaction update would require refactoring the locking-before-transaction to be done within the transaction.

As for deadlock detection, we would consider that a programmer error when discovered, so when a deadlock is discovered, an exception is thrown as normal. Users can retry of course. The PK application should not crash in this instance, however.

@CMCDragonkai

This will be fixed when merging #366 before #326 rebases on top.

@CMCDragonkai

Pessimistic shouldn't be needed. Working on optimistic instead. It's a lot more flexible.


CMCDragonkai commented May 4, 2022

Snapshot Isolation based OCC transactions should be backwards compatible with the existing usage of DBTransaction in EFS.

The main difference is the possibility of ErrorDBTransactionConflict exception, and the need to handle write-skews.

In EFS, it will continue to use locks because an OCC transaction can be converted to a PCC transaction as long as locks are used.

In EFS, its API demands PCC behaviour, therefore it will continue to use locks, even if it updates to the new SI transactions. Thus the SI implementation will be backwards compatible. Also, its usage of locking should mean that write-skews cannot occur.

However if upgrading to the SI implementation results in EFS throwing ErrorDBTransactionConflict, then this is an indication of a concurrency bug in EFS because its usage of locks should prevent any write conflicts. Thus it should be fixed such that ErrorDBTransactionConflict cannot ever be thrown in EFS.

One happy change is that reads will now be entirely consistent across get and iterator. This means there's no need to create an iterator up front to get consistent reads; one can iterate on a level path and perform get on other key paths without worry. This behaviour is also backwards compatible of course, since there's no place where we are expecting inconsistent reads in EFS; we have always used the multi-iterator pattern to maintain consistent reads.

The situation is different in PK: as soon as it upgrades to SI transactions, it will basically drop all usages of locking with respect to the DB. However it may preserve locking in situations where write-skew may occur. Write skew occurs when one transaction writes and another transaction reads, and some consistency rule between them is broken.

A variety of solutions are possible: materialize the conflict, use read locks... etc.
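
For concreteness, a classic write-skew sketch under SI (hypothetical key paths; a withTransactionF convenience is assumed). Both transactions read both flags, see that the "at least one is true" invariant holds, and disable different flags. Their write sets are disjoint, so SI commits both and the invariant breaks:

await Promise.all([
  db.withTransactionF(async (tran) => {
    const a = await tran.get<boolean>(['oncall', 'alice']);
    const b = await tran.get<boolean>(['oncall', 'bob']);
    if (a && b) await tran.put(['oncall', 'alice'], false);
  }),
  db.withTransactionF(async (tran) => {
    const a = await tran.get<boolean>(['oncall', 'alice']);
    const b = await tran.get<boolean>(['oncall', 'bob']);
    if (a && b) await tran.put(['oncall', 'bob'], false);
  }),
]);
// No ErrorDBTransactionConflict is thrown since the write sets are disjoint,
// yet both flags may now be false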

In the long term, we may upgrade the transaction from SI to SSI (serializable snapshot isolation) which was a recent invention from 2008, and this will even prevent write-skews, and thus all DB locks can be completely dropped.

Now when a conflict occurs, we may decide to auto-retry. However this is complicated by other side-effects that may occur. An auto-retry can only be done if at least one of these is true:

  • Side-effects are idempotent
  • Side-effects are noops
  • Side-effects are compensated

Auto-retries should not be done when the action should be changed due to a change in state. What "should" means depends on the usability of the program.

So there's an "roadmap" for the transaction development:

  1. First introduce SI transactions which are backwards compatible
  2. Secondly, bring back opt-in key-locking into SI transactions to enable PCC transactional behaviour; this means the PCC locks currently used can be dropped, but deadlock detection becomes important
  3. Thirdly introduce SSI, so that write-skew handling code can be dropped

The second and third phases do not block our testnet deployment. They just improve our software model, reduce future bugs, and reduce entropy in our code.

@CMCDragonkai

@tegefaulkes

With the new DB integration, there are 2 places where locking will be necessary to ensure serial counter updates:

  1. notifications
    • message count - this is a cardinal number, it is incremented on receive and decremented on delete
  2. sigchain
    • sequence number - this is an ordinal number, this is incremented on each new claim entered into the sigchain

To deal with these, we need to add locking before starting the transaction, and only do this as part of the public methods. If these 2 properties may be read via a public method, then we should be using an RWLockWriter; if these 2 properties are only written to from public methods, then a Lock suffices.

The DB has no knowledge of these locks at the moment, so they occur outside the transaction, as properties on the Sigchain and NotificationsManager.

When acquiring locks for these transactions, do it in this order:

withF([this.lock.write(), this.db.transaction()], async ([, tran]) => {
  // use tran
});

It is important to acquire the lock prior to the transaction, to avoid holding transaction resources while waiting for the lock.

Note that when the DB gains the ability to lock things internal to the transaction (that is, PCC control), this can be replaced with just:

withF([this.db.transaction()], async ([tran]) => {
  await tran.lock('counter');
  await tran.get('counter');
});

The API hasn't been fully worked out for PCC locking; it's possible locking can be integrated directly into the get, put, and del operations. And we still have to figure out re-entrant locking and deadlock detection. So for now PK will just use locks as discussed above, without expecting the DB to support locking.

Even when PCC is used, ErrorDBTransactionConflict can still occur; that's expected if a write-set conflict occurs.

Still to do is to identify where write-skews may occur.

@CMCDragonkai

Currently the iterator still doesn't support keyAsBuffer: false and valueAsBuffer: false; this is held up in the SI PR: MatrixAI/js-db#18. I may be able to extract that out and cherry-pick it to master to release a minor version. Until then, you have to continue using dbUtils.deserialize. @tegefaulkes

@tegefaulkes

Expanded the spec with details from the PR.


CMCDragonkai commented Jul 1, 2022

The current staging of js-db MatrixAI/js-db#38 has the new DBTransaction and the usage of rocksdb. It still has some build issues to solve before it is ready. That update will result in @matrixai/db at 5.0.0. This also brings in new updates to @matrixai/async-locks and @matrixai/logger. These changes are somewhat breaking, so they should be done together: first by applying them to EFS, and then to PK, similar to what we did before with the DB update.

For EFS, the update should focus on dealing with:

  • The default consistency model is SI now, so we get ErrorDBTransactionConflict. We need to run tests to see if this occurs. The EFS must use the PCC locking feature (either its own locks or DBTransaction's own locks) to prevent any transaction conflicts from occurring. It must abstract over it.

For PK, the update should focus on dealing with:

  • PK now bubbles up ErrorDBTransactionConflict to the CLI; the user must then retry their work if it is conflicting with another call.
  • Some of those calls should then apply PCC locking using the DBTransaction lock to serialise operations
  • Replace all locks controlling DB transaction and counter racing issues with a combination of PCC locking via tran.lock and getForUpdate (a sketch follows after this list)
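
A sketch of that last point for the notifications counter (assuming tran.lock and getForUpdate behave as described; the key paths are hypothetical):

await db.withTransactionF(async (tran) => {
  // PCC: serialise counter updates inside the transaction itself
  await tran.lock('notifications/count');
  const count = (await tran.getForUpdate<number>(['notifications', 'count'])) ?? 0;
  await tran.put(['notifications', 'count'], count + 1);
});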

We should also add benchmarks to identify slowness. I think rocksdb is a bit slower on normal gets/puts, but the iterator and transaction implementation should be a lot faster, since it's using the native C++ without additional JS abstraction.

The PR to js-polykey should also solve #244.

CMCDragonkai added the r&d:polykey:core activity 1 (Secret Vault Sharing and Secret History Management) and r&d:polykey:core activity 3 (Peer to Peer Federated Hierarchy) labels Jul 24, 2022
@CMCDragonkai

Because lots of methods will now be transactional with an optional transaction, they will all need a variant of something like this:

  public async pushTask(task, tran?: DBTransaction): Promise<void> {
    if (tran == null) {
      return this.db.withTransactionF(
        (tran) => this.pushTask.apply(this, [...arguments, tran])
      );
    }
    // Do the work
  }

Ideally we could abstract even the calling itself more... but arguments.callee is not available under strict mode.
