nsq: DRAINING mode #1302

jehiah · 2020-11-23T19:42:52Z

To facilitate running nsqd in environments where the host isn't long-running, and to simplify cluster management operations, a new "draining" mode will be introduced to nsqd.

A nsqd instance in "draining" mode will:

Not accept any new messages
Allow consumers to receive all remaining messages
Exit after all topics and channels are empty.
Indicate draining status via /info endpoint
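As a rough illustration of the four behaviors listed above, here is a minimal in-memory sketch. The `node` type, `ErrDraining`, and the `"draining"` info key are all hypothetical names chosen for this example; none of them come from nsqd itself.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDraining is a hypothetical error value; the proposal does not name one.
var ErrDraining = errors.New("nsqd is draining")

// node is a toy stand-in for an nsqd instance, not the real nsqd type.
type node struct {
	draining bool
	queue    []string
}

// StartDrain switches the node into draining mode.
func (n *node) StartDrain() { n.draining = true }

// Publish rejects new messages once a drain has started.
func (n *node) Publish(msg string) error {
	if n.draining {
		return ErrDraining
	}
	n.queue = append(n.queue, msg)
	return nil
}

// Consume hands out the oldest remaining message, if there is one.
func (n *node) Consume() (string, bool) {
	if len(n.queue) == 0 {
		return "", false
	}
	msg := n.queue[0]
	n.queue = n.queue[1:]
	return msg, true
}

// ShouldExit reports whether the drain has finished and the process may exit.
func (n *node) ShouldExit() bool { return n.draining && len(n.queue) == 0 }

// Info mimics a status payload; the "draining" key is an assumed field name.
func (n *node) Info() map[string]bool { return map[string]bool{"draining": n.draining} }

func main() {
	n := &node{}
	_ = n.Publish("a")
	n.StartDrain()
	fmt.Println("publish during drain:", n.Publish("b"))
	msg, _ := n.Consume()
	fmt.Println("consumed:", msg)
	fmt.Println("exit now:", n.ShouldExit(), "info:", n.Info())
}
```

The sketch tracks a single queue; a real nsqd would need the same check across every topic and channel.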

Clients that use an HA approach of pooling multiple nsqds for publishing messages (e.g. nsqio/go-nsq#311) are expected to transparently tolerate a host in draining mode.
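One way a pooling client might tolerate a draining host is to fall through to the next nsqd when a publish is rejected. The sketch below assumes a small `publisher` interface (go-nsq's `Producer` has a `Publish` method of this shape); `fakeNode` and `publishAny` are hypothetical names for illustration and are not the approach taken in nsqio/go-nsq#311.

```go
package main

import (
	"errors"
	"fmt"
)

// publisher is the minimal publish surface this sketch relies on.
type publisher interface {
	Publish(topic string, body []byte) error
}

// fakeNode is a hypothetical test double, not part of go-nsq.
type fakeNode struct {
	draining bool
	accepted int
}

// Publish fails while the fake node is draining, otherwise counts the message.
func (f *fakeNode) Publish(topic string, body []byte) error {
	if f.draining {
		return errors.New("E_PUB_FAILED")
	}
	f.accepted++
	return nil
}

// publishAny tries each host in order and returns on the first success.
func publishAny(pool []publisher, topic string, body []byte) error {
	lastErr := errors.New("empty pool")
	for _, p := range pool {
		err := p.Publish(topic, body)
		if err == nil {
			return nil
		}
		lastErr = err
	}
	return fmt.Errorf("no nsqd accepted the message: %w", lastErr)
}

func main() {
	a := &fakeNode{draining: true}
	b := &fakeNode{}
	err := publishAny([]publisher{a, b}, "events", []byte("hello"))
	fmt.Println(err, a.accepted, b.accepted)
}
```

A production client would likely also rotate the starting host and back off from hosts that keep failing; both are left out here for brevity.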

Implementation Plan

A new --sigterm=drain CLI flag will enable this new behavior. Existing functionality will be preserved with the argument --sigterm=clean-shutdown.
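A minimal sketch of how such a flag could be parsed and validated with Go's standard `flag` package. The function name `parseSigtermMode` is hypothetical, and treating `clean-shutdown` as the default is an assumption made here so that existing behavior is preserved when the flag is omitted.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseSigtermMode reads the proposed --sigterm flag from args.
// Defaulting to "clean-shutdown" is an assumption of this sketch.
func parseSigtermMode(args []string) (string, error) {
	fs := flag.NewFlagSet("nsqd", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	mode := fs.String("sigterm", "clean-shutdown", "action on SIGTERM: drain or clean-shutdown")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *mode {
	case "drain", "clean-shutdown":
		return *mode, nil
	default:
		return "", fmt.Errorf("invalid --sigterm value %q", *mode)
	}
}

func main() {
	mode, err := parseSigtermMode([]string{"--sigterm=drain"})
	fmt.Println(mode, err)
}
```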

A new PUT /config/drain endpoint can also initiate a drain, and a PUT /config/shutdown would initiate a clean shutdown.

While draining, new messages will be rejected with an error: E_PUB_FAILED over the TCP protocol and HTTP 503 over the HTTP protocol.
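The HTTP side of this rejection could look roughly like the handler below. This is a sketch under assumptions: the package-level `draining` flag, `pubHandler`, and `statusFor` are hypothetical names, and the real nsqd handler does far more than write "OK".

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// draining is 1 once a drain has started; a hypothetical package-level flag.
var draining int32

// pubHandler accepts a publish unless the node is draining.
func pubHandler(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt32(&draining) == 1 {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprint(w, "OK")
}

// statusFor sets the drain flag, issues one in-memory POST, and returns
// the response status code.
func statusFor(drain bool) int {
	var v int32
	if drain {
		v = 1
	}
	atomic.StoreInt32(&draining, v)
	req := httptest.NewRequest(http.MethodPost, "/pub?topic=events", strings.NewReader("hello"))
	rec := httptest.NewRecorder()
	pubHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(false), statusFor(true))
}
```

Reading the flag atomically keeps the check cheap on the publish path while still letting a signal handler or admin endpoint flip it from another goroutine.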

An attempt to create new topics and channels (via subscribe) will be rejected if nsqd is in drain mode.

Once initiated, a drain operation can only run to completion; it can't be canceled. TBD: PUT /config/shutdown may be able to override the drain and close all connections and exit nsqd.

Open Questions

Should each topic (and/or channel) be closed as it is drained, or should they only be closed after all are drained? If this functionality is per-topic, should the HTTP API expose that same behavior, and is there a need to expand the lookupd protocol to initiate a tombstone before exiting to avoid race conditions with clients?
Should draining close a topic/channel or should it still be configured on a nsqd instance after restart? (i.e. is this similar to POST /topic/delete)

cc: #1254
Closes #1022


mreiferson · 2020-11-25T03:50:27Z

SGTM, I suspect this one is going to be a bit tricky.

> Should each topic (and/or channel) be closed as it is drained, or should they only be closed after all are drained? If this functionality is per-topic, should the HTTP API expose that same behavior, and is there a need to expand the lookupd protocol to initiate a tombstone before exiting to avoid race conditions with clients?

My gut tells me that, as a first pass, trying an implementation that waits for all topics/channels to be empty and then exits will likely avoid the "premature client reconnect after close" problem.

> Should draining close a topic/channel or should it still be configured on a nsqd instance after restart? (i.e. is this similar to POST /topic/delete)

A little confused by this — my understanding is that this proposal isn't intended to modify the existence of topics/channels, so my answer would be topics/channels should remain present if an nsqd pointed at the same --data-path starts up again after draining.

> TBD: PUT /config/shutdown may be able to override the drain and close all connections and exit nsqd.

IMO yes, we must provide a mechanism to force a (clean) shutdown. Maybe even offer a timeout?
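The "wait until everything is empty, but give up after a timeout" idea can be sketched with a context deadline. Everything here is an assumption for illustration: `waitDrained`, `drainOrForce`, the `depth` callback (standing in for the total depth across all topics and channels), and the 1ms poll interval.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitDrained polls depth until it reaches zero or ctx is done.
func waitDrained(ctx context.Context, depth func() int64, poll time.Duration) error {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		if depth() == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// drainOrForce waits up to timeout for the drain to finish. A non-nil error
// means the deadline passed and the caller should force a clean shutdown.
func drainOrForce(depth func() int64, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitDrained(ctx, depth, time.Millisecond)
}

func main() {
	remaining := int64(3)
	shrinking := func() int64 {
		if remaining > 0 {
			remaining--
		}
		return remaining
	}
	stuck := func() int64 { return 1 }
	fmt.Println(drainOrForce(shrinking, time.Second))
	fmt.Println(drainOrForce(stuck, 20*time.Millisecond))
}
```

An explicit force request (for example, the shutdown endpoint discussed above) could cancel the same context instead of waiting for the deadline.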

Minor:

Feel like we can come up with something slightly better than --sigterm, how about --term-mode?
/config/{drain,shutdown} don't really feel like "configurations"

mreiferson · 2020-11-25T03:51:13Z

Also love that this and #1300 are labeled chore 😂

jehiah · 2020-11-25T05:05:22Z

I should comment that i don't yet have a perfectly clear idea of the implementation for this; it will be a chore!

> A little confused by this - my understanding is that this proposal isn't intended to modify the existence of topics/channels, so my answer would be topics/channels should remain present if an nsqd pointed at the same --data-path starts up again after draining.

Perhaps there is a case for both? My intention is to target a use case where a nsqd is going away (i.e. removed from rotation), and by the time it's done draining, there is nothing left on that nsqd instance. From that context i'm leaning towards "delete" functionality where topics disappear as they are drained.

If you are trying to remove a nsqd instance from rotation where that nsqd instance had 10 different topics, but just one or two with notable backlogs, it would be desirable to have the topics that drain quickly deleted. Deleting promotes a better cluster hygiene where you don't have a nsqd instance which is no longer getting messages on a topic still getting consumer connections where it causes RDY to be spread thin. (i.e. think a topic that takes a day to drain in some odd circumstance.)
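To make the "delete topics as they drain" idea concrete, here is a small sketch that sweeps a depth map and removes any topic that has reached zero. The `sweepDrained` function and the map-of-depths representation are hypothetical; real nsqd state also involves per-channel depth and in-flight messages.

```go
package main

import (
	"fmt"
	"sort"
)

// sweepDrained deletes every topic whose depth is zero and returns the
// deleted names in sorted order. Topics with a backlog are left in place.
func sweepDrained(depths map[string]int64) []string {
	var deleted []string
	for name, depth := range depths {
		if depth == 0 {
			delete(depths, name) // deleting during range is safe in Go
			deleted = append(deleted, name)
		}
	}
	sort.Strings(deleted)
	return deleted
}

func main() {
	depths := map[string]int64{"clicks": 0, "orders": 0, "backlog": 1200}
	fmt.Println(sweepDrained(depths), len(depths))
}
```

Running the sweep periodically during a drain would let quickly drained topics stop attracting consumer connections while a slow topic finishes.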

I've used the word "drain" because i think it's best, but what i really mean is the process of removing a nsqd from rotation.

> Feel like we can come up with something slightly better than --sigterm, how about --term-mode?

I had --term-mode in a prototype of this feature but felt it wasn't obvious enough that this was about signal handling. --sigterm-mode ?

> /config/{drain,shutdown} don't really feel like "configurations"

agreed. ideas? /state/{drain,shutdown} were some other naming ideas i had.

mreiferson · 2020-11-25T05:15:59Z

Got it. In that case, the simplest way might be to proactively send a tombstone to nsqlookupd (to avoid new clients discovering that node) but not close (which may force connected clients to reconnect)? However, one can imagine scenarios where you need clients to reconnect in order to fully drain 😜.

--sigterm-mode 👍

> agreed. ideas? /state/{drain,shutdown} were some other naming ideas i had.

🤷 might make sense at the top-level?

jehiah · 2020-11-25T12:36:17Z

> simplest way might be to proactively send a tombstone to nsqlookupd (to avoid new clients discovering that node) but not close (which may force connected clients to reconnect)? However, one can imagine scenarios where you need clients to reconnect in order to fully drain 😜.

I think we are on the same page; you wouldn't tombstone until the actual removal, so i don't think that affects clients draining. Currently the TCP protocol for lookupd doesn't support tombstone, but that would be easy to resolve if needed. It might also not be critical if nsqd rejects the creation of new topics when it's draining. That would inhibit new subscriptions after a topic is deleted.

> might make sense at the top-level?

👍

I think i have enough feedback here to start on an implementation; then we can move to discussing the tradeoffs of a concrete implementation.

jehiah added feature chore labels Nov 23, 2020

jehiah self-assigned this Nov 23, 2020

jehiah changed the title ~~nsq: DRAINING mode [RFC]~~ nsq: DRAINING mode Nov 25, 2020

jehiah linked a pull request Nov 25, 2020 that will close this issue

nsqd: support draining messages / removing nsqd from rotation #1305

Open
