
Zero-copy support #359

Open
NewbiZ opened this issue Sep 10, 2023 · 4 comments

Comments

@NewbiZ

NewbiZ commented Sep 10, 2023

I want to use the concurrent queue in a low-latency network application. Overall, I need to enqueue network packets on one thread and decode them on another thread - in practice there would be multiple producers/consumers, but that does not matter here. The queuing is necessary because at high packet rates, decoding synchronously would take too long and incoming packets would start to drop.

In pseudo code, that means my flow is roughly:

// Producer
Packet p;
while (true) {
    read(&p); // Read (copy) packet to a temporary stack variable
    q.enqueue(p); // Copy to the queue
} 

// Consumer
Packet p;
while (true) {
    q.dequeue(p); // Copy from the queue to a stack variable
    decode(&p); // Decode from the stack variable
} 

My actual packets are copied all over the place here.

  1. From kernel to userspace with read()
  2. From producer stack to queue
  3. From queue to consumer stack

Now, I am more of a C programmer than C++, so the subtleties of move operations are a bit magical to me. Is there some way to ensure that p will not be copied upon enqueuing and dequeuing?

Also, as far as I understand, the possible move happens only on the producer side, not on the consumer side.

I would propose an explicit API to handle that case, which splits enqueue and dequeue in two halves. The first half grabs the slot from the queue, and the second half commits the operation.

That would look like:

// Producer
while (true) {
    Packet* p = q.enqueue_stage(); // Grab the slot
    read(p); // Read (copy) the packet directly in the slot
    q.enqueue_commit(p); // Finish the enqueue
} 

// Consumer
while (true) {
    Packet* p = q.dequeue_stage(); // Grab the slot
    decode(p); // Decode the packet directly from the slot
    q.dequeue_commit(); // Finish the dequeue
}

This would allow true (producer + consumer) zero copy behavior, without relying on compiler optimizations.

@cameron314
Owner

cameron314 commented Sep 10, 2023

Changing the API would require completely redesigning the entire implementation :-) It's also almost impossible to make a two-stage approach thread-safe for the MPMC case (while being efficient).

It sounds like you only have one producer, and one consumer. A single-producer/single-consumer (SPSC) queue, like my ReaderWriterQueue, would be more efficient in that case.

In any case, objects are moved both on enqueue and dequeue if the type supports it (using the move constructor and move assignment operator, respectively).

Instead of copying entire buffers, consider using pointers to buffers. With two queues, you can have one containing pointers to buffers to be processed, and one containing pointers to free buffers (essentially acting as a pool to avoid dynamically allocating buffers on the hot path).

@NewbiZ
Author

NewbiZ commented Sep 10, 2023

It sounds like you only have one producer, and one consumer.

No, I just laid out a simplified example. In the real world, my network stream is split across a dozen UDP multicast groups, and I have one producer thread per multicast group.

It's also almost impossible to make a two-stage approach thread-safe for the MPMC case

Is it really? I have not read the actual code, but I would expect, e.g., the producer enqueue to look something like:

enqueue(e) {
    newtail = tail.load() + 1;
    block[newtail] = e;
    tail.store(newtail);
}

Which could be changed to something like:

std::pair<T&, size_t> enqueue_stage() {
    newtail = tail.load() + 1;
    return {block[newtail], newtail};
}

void enqueue_commit(size_t newtail) {
    tail.store(newtail);
}

// Producer
auto [p, nt] = q.enqueue_stage();
read(&p);
q.enqueue_commit(nt);

Basically, I would expect that the enqueuing and dequeuing decomposition would naturally follow the flow of the existing functions. It would just be a matter of returning the intermediate state to the caller during _stage() so that _commit() can pick up where it left off. Note that you have the implicit guarantee that each _stage() will strictly be followed by its corresponding _commit(), as we're not trying to have asynchronous producers/consumers here.

If the split by returning intermediate states is annoying to implement, I guess it could also be implemented by enqueue/dequeue accepting a function pointer or lambda that operates on the slot instead of an element to copy.

Something along the lines of:

// Producer
while (true) {
    q.enqueue([] (Packet& p) { read(&p); } );
}

// Consumer
Message m;
while (true) {
    q.dequeue([&m] (Packet& p) { m = decode(p); } ); // Decode packet to message with zero copies, directly from the queue
}

@cameron314
Owner

It's not that simple, as there's quite a bit of state beyond just the tail index. This style of API is also easy to accidentally misuse, leading to bugs in the calling application.

If you want to use the queue as backing storage for a C-style API (e.g. recv), then yes, a lambda-based approach similar to std::string::resize_and_overwrite() could work. However, I still suggest that managing a pool of backing memory elsewhere, and using the queue merely to push pointers to that memory would be more efficient.

@NewbiZ
Author

NewbiZ commented Sep 11, 2023

It's not that simple, as there's quite a bit of state beyond just the tail index. This style of API is also easy to accidentally misuse, leading to bugs in the calling application.

I see, indeed. And I agree with you that two-stage operations are more error-prone. I still think it's worth it, though: packet processing is a common task for concurrent queues.

Here is what 2 stage/zero copy looks like in DPDK's ring buffer for instance:
http://doc.dpdk.org/guides/prog_guide/ring_lib.html#ring-peek-zero-copy-api

I still suggest that managing a pool of backing memory elsewhere

The problem I have with that is that creating an efficient, thread-safe memory pool is not a trivial task. The queue's buffer is naturally contiguous, already in the cache, avoids further dereferences, etc.
