Sharing data between a pool of worker threads? #170
-
To increase performance I have implemented a pool of worker threads to compare face embeddings. I send the data to the worker threads from the main thread.
Problem: just realized that ...
Possible solutions: just came across the claim that worker threads are the only way to increase performance of Human and TensorFlow.js, but how practical is that? This is how the array looks:
Replies: 16 comments 11 replies
-
good question - and you found the right problems :)
i'm assuming we're talking about the browser environment, as nodejs workers are not worth using just yet. in nodejs i'd go with a process pool instead, but then there is really no sharing and options 3 or 4 are the only options.

option 1
assuming the array is small, there is no issue with copying it each time you call a worker. but i'm assuming it's intended to grow significantly over time, so that's a no-go.

option 2
in theory there shouldn't be a race condition when using SharedArrayBuffer, which means you need to make sure that the array is always valid in the main thread - never reinitialize/empty/reduce it once it's initialized or you'll get out-of-bounds on index access. but appending to it should be safe, and each appended record will be picked up by a worker the next time it runs.

option 3
another idea is to copy the array once to each worker upon startup and then leave it to each worker to maintain on its own, avoiding any further copying or sharing of the main array completely. when you change the array in the main thread, send a message to the workers with that single change so each worker can apply the same change inside its own thread. basically, you end up with n copies of the array, each maintained separately, without any bulk copy operations or sharing.

option 4
if the array is expected to get really big over time and memory size becomes a concern, you could use any decent database (e.g. a browser-based one).

just a few ideas...
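For option 2, here is a minimal sketch of an append-only descriptor store backed by a `SharedArrayBuffer`. The names, sizes, and layout are my own assumptions, not human's API; the key idea is that the record count is published via `Atomics` only after the descriptor data is written, so workers reading the count never see a partially-written record.

```javascript
// Sketch: append-only descriptor store in a SharedArrayBuffer (option 2).
// DESC_LEN matches human's 1024-element descriptors; MAX_RECORDS is arbitrary.
const DESC_LEN = 1024;
const MAX_RECORDS = 100;

// first 4 bytes hold the record count, the rest holds float32 descriptors
const sab = new SharedArrayBuffer(4 + MAX_RECORDS * DESC_LEN * 4);
const count = new Int32Array(sab, 0, 1);
const data = new Float32Array(sab, 4, MAX_RECORDS * DESC_LEN);

function appendDescriptor(desc) {
  const idx = Atomics.load(count, 0);
  if (idx >= MAX_RECORDS) throw new Error('store full');
  data.set(desc, idx * DESC_LEN); // write the record first...
  Atomics.add(count, 0, 1);       // ...then publish it, so readers never see a torn record
  return idx;
}
```

Each worker would receive `sab` once via `postMessage` (shared, not copied) and build its own `Int32Array`/`Float32Array` views over it, only ever reading indexes below `Atomics.load(count, 0)`.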
-
when i played with this i used multi-processing, as i run parallel detections.
there are a few options (perhaps more, these are the ones i can think of):
i use ...
once the descriptor is determined, just copy the methods
-
why? i don't know how to deal with object (in your case, array of objects) transfer to and from a buffer (chrome web workers support the concept of transferable objects, but nodejs doesn't have that) other than to manually serialize the object to a string, place the string into the buffer, and frame it with the string length. then on the receiving side, read the string length from the buffer and deserialize back into an object.

and to avoid out-of-memory when serializing and deserializing a large array, it should be done per-record, not per entire array. and yes, deserialization should also be done per-record, using each record to find a match.

so what's the required size of the buffer? or complicate even more by implementing sort-of paging so you fit n records into a single page.

i'd much rather go with option (3) to start with and have each worker maintain its copy of the array by receiving per-record update messages from the main thread
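A hedged sketch of the per-record length-framing described above, using a length prefix ahead of each record (function names and the JSON-string encoding are my own choices for illustration):

```javascript
// Sketch: length-prefixed per-record serialization into a byte buffer.
// Each record is JSON-encoded and prefixed with its byte length (uint32,
// little-endian), so the receiver can walk the buffer record by record
// instead of deserializing the whole array at once.
const enc = new TextEncoder();
const dec = new TextDecoder();

function writeRecord(bytes, offset, record) {
  const payload = enc.encode(JSON.stringify(record));
  new DataView(bytes.buffer, bytes.byteOffset).setUint32(offset, payload.length, true);
  bytes.set(payload, offset + 4);
  return offset + 4 + payload.length; // offset where the next record starts
}

function readRecord(bytes, offset) {
  const len = new DataView(bytes.buffer, bytes.byteOffset).getUint32(offset, true);
  const record = JSON.parse(dec.decode(bytes.subarray(offset + 4, offset + 4 + len)));
  return { record, next: offset + 4 + len };
}

// usage: write two records, read them back one at a time
const buf = new Uint8Array(1024);
let end = writeRecord(buf, 0, { name: 'alice', embedding: [0.1, 0.2] });
end = writeRecord(buf, end, { name: 'bob', embedding: [0.3, 0.4] });
const first = readRecord(buf, 0);
const second = readRecord(buf, first.next);
```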
-
@uzair004 question - is there interest to implement ...
-
I was recently playing with AssemblyScript, got it working cleanly, and got a functional loader for both Chrome and NodeJS. Just occurred to me that porting ... You can take a look at https://github.com/vladmandic/wasm-assemblyscript. Passing non-trivial structures is weird in WASM, but I got it working. Also memory buffers :) Also, the generated WASM file is tiny - so tiny (<10kb) it could be base64-encoded and embedded in the JS itself, so there are no external network or file requests at all. Which means you could have a worker thread with zero dependencies (Human or otherwise).
-
I played quite a lot with WASM and have it fully working, but in the end I found a way to accelerate the built-in JS methods by 10x, plus an optional ... but... it's a breaking change, as input params and output structure change in the new version of ...
and the entire similarity/match implementation is fully separate, so you can import it directly. just in case you're interested in the wasm implementation: https://github.com/vladmandic/human-match - there are some additional notes there on how to reduce descriptor dimensionality without loss of functionality, so if you're dealing with a very large database and memory becomes an issue, that is also fully solvable - the database can be compressed 8x without huge impact.
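For reference, here is roughly what a brute-force descriptor match does. This is a simplified sketch under my own assumptions, not human's actual match implementation (which also normalizes distance into a similarity score and accepts options):

```javascript
// Sketch: brute-force nearest-descriptor search by squared L2 distance.
// `desc` is the probe descriptor, `arr` is an array of stored descriptors.
function matchDescriptor(desc, arr) {
  let best = -1;
  let bestDist = Infinity;
  for (let i = 0; i < arr.length; i++) {
    let sum = 0;
    for (let j = 0; j < desc.length; j++) {
      const d = desc[j] - arr[i][j];
      sum += d * d; // accumulate squared L2 distance
    }
    if (sum < bestDist) {
      bestDist = sum;
      best = i;
    }
  }
  return { index: best, distance: Math.sqrt(bestDist) };
}
```

Since the loop only reads the descriptor array, this is exactly the kind of work that can run over a shared or per-worker copy of the data without locking.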
-
yes, it only takes an array of descriptors and options (options are also different) and returns index and similarity.

```js
const arr = annotatedArray.map((rec) => rec.embedding);
```

and when you have an index and want to get the label, just look up the old array:

```js
const name = annotatedArray[res.index].name;
```

maybe you'll end up with a manual worker thread pool implementation - create n worker threads and send messages yourself. it's pretty simple - you can take a look at my worker process implementation as a reference, the concept is the same.

you can ignore the wasm implementation completely, i noted that git repository as it documents how to perform descriptor dimension reduction.
-
when testing, i use something like this:

```js
const t0 = process.hrtime.bigint();
const res = human.match(desc, arr);
const t1 = process.hrtime.bigint();
console.log('match time:', t1 - t0);
```

this is only available in nodejs and gets time in nanoseconds, so it's very precise. that is useful to see where time is spent inside long calls, for example where did ...

that warning comes from inside the tensorflow library itself. i don't care about internal messages like that (imo, internal messages should not be printed by a library unless explicitly enabled). you can suppress them:

```sh
export TF_CPP_MIN_LOG_LEVEL=2
```

(0 is info, 1 is warning, 2 is error, 3 is fatal)

or you can set the env variable from within your app before you load it:

```js
process.env.TF_CPP_MIN_LOG_LEVEL = '2';
const tf = require('@tensorflow/tfjs-node');
```

(or before you load ...) btw, you should see how chatty ...
-
take a look at https://github.com/vladmandic/human-match/tree/main/multithread - i think i got multi-threading working nicely, with a shared buffer array and without any libraries or dependencies :)
-
First of all, memory utilization is much much better since there is only one copy of the data. Memory is also fixed - since each descriptor has 1024 elements, that's 4KB per descriptor, fixed size.

Performance-wise, calculating a match is the same as each thread having its own data, but appending additional data or creating additional workers is now near-free with shared memory. Overall, tons of benefits.

The key to making this possible was splitting the face database array of objects into a separate array of descriptors and an array of labels.
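As a sketch, splitting an annotated array into a shared descriptor buffer plus a plain labels array might look like this (field names follow the earlier `annotatedArray` example; the sizes and sample data are illustrative only):

```javascript
// Sketch: split array of { name, embedding } records into a shared
// Float32Array of descriptors plus an ordinary array of labels.
const DESC_LEN = 4; // human uses 1024 elements; kept small here for illustration
const db = [
  { name: 'alice', embedding: [0.5, 0.25, 0.75, 1.0] },
  { name: 'bob', embedding: [1.0, 0.5, 0.25, 0.125] },
];

const sab = new SharedArrayBuffer(db.length * DESC_LEN * 4); // 4 bytes per float32
const descriptors = new Float32Array(sab);
db.forEach((rec, i) => descriptors.set(rec.embedding, i * DESC_LEN));
const labels = db.map((rec) => rec.name);

// workers receive `sab` via postMessage (shared, zero-copy) and match against
// `descriptors`; the returned index is resolved via labels[index] in the main thread
```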
-
FYI, the example has been cleaned up, documented, and committed to the main branch under ...
-
If ... Plus, this is only done on worker start, never repeated. The only meaningful transfer was the descriptor itself to compare with, for each ...
-
Hi Vladimir, thanks for making this great project!

Regarding this copy of the image data buffer passed to the worker:

human/demo/multithread/index.js Line 167 in 0ea905e

The second argument is a single-item array containing a copy of the buffer, while the first argument contains the original buffer without copying. I don't understand what the 2nd argument is used for here. The documented function signature of postMessage says that the second argument is the targetOrigin:
-
the first parameter says what to send, and the second parameter says what to transfer - the targetOrigin signature applies to window.postMessage, not worker.postMessage. normally i love the MDN site but here it's wrong - see the actual specification: https://html.spec.whatwg.org/multipage/web-messaging.html#posting-messages
-
Just curious, what's the purpose of ...
-
JS engines don't care if the names are the same or not, just if the structure matches - which it does, so it's transferable.

if working with a single web worker, then ... here it's present because the same buffer data is transferred to multiple web workers in parallel, and the same data buffer can only be transferred once.

and i find it's faster to create a copy of the buffer (using slice) in the main thread and then transfer it, than to not use transferable data and let the JS engine perform serialization to achieve a deep clone
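A small sketch of that slice-then-transfer pattern (the worker and its `postMessage` call are commented out, since they need a browser/worker context; variable names are mine):

```javascript
// Sketch: copy a buffer with slice() so the copy can be transferred
// to a worker while the original stays usable in the main thread.
const original = new ArrayBuffer(8);
const view = new Uint8Array(original);
view[0] = 42;

const copy = original.slice(0); // cheap byte-level copy
// worker.postMessage({ image: copy }, [copy]); // transfer detaches `copy`, not `original`

// the copy is fully independent: mutating it leaves the original untouched
new Uint8Array(copy)[0] = 7;
```

Repeating the `slice()` once per worker is what allows the same image data to be handed to several workers in parallel, since each transferable buffer can only be transferred once.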