Incorrect sequence of messages while reading dataChan after closing #1257

Gilthoniel · 2024-07-26T06:54:42Z

Expected behavior

Closing a producer should never produce a sequence of messages that is inconsistent with the order of the Send / SendAsync calls.

Actual behavior

When a producer starts a reconnect loop and is asked to close at the same time, one or more pending batches can be dropped while later ones are still published.

Steps to reproduce

I can't provide a reliable way to reproduce this: it is rare and depends on an unlucky sequence of events.

Here are the logs that led me to this discovery:

{"ts": "2024-07-23T20:50:42.023Z", "msg": "Closing producer", "producerID": 36}
{"ts": "2024-07-23T20:50:42.023Z", "msg": "Connected producer", "producerID": 36, "epoch": 10}
{"ts": "2024-07-23T20:50:42.038Z", "msg": "Failing 1 messages on closing producer", "producerID": 36}

I think the producer is in its reconnect loop, and during that time send requests and a close request accumulate. After the reconnect succeeds, the close is eventually processed, but it only closes the channel, so the remaining send requests can still be written to the connection in parallel with the producer closing (the two go through different channels in the client).

In the logs above, we can assume that one send request went through and was later dropped by the close, but at least one subsequent message was written to the connection and successfully published to the broker.

From what I can see, closing does not actually prevent further messages from going through, because closing the producer is done on a different channel:

	go func() {
		for {
			select {
			// ...
			case req := <-c.incomingRequestsCh:
				// ...
				c.internalSendRequest(req)
			}
		}
	}()

	for {
		select {
		// ...
		case cmd := <-c.incomingCmdCh:
			c.internalReceivedCommand(cmd.cmd, cmd.headersAndPayload)
		case data := <-c.writeRequestsCh:
			// ...
			c.internalWriteData(data)
		}
	}

System configuration

Pulsar version: v3.0.5
Pulsar Go client: v12.1

Gilthoniel · 2024-07-26T13:34:56Z

If I'm correct, this could be fixed simply by draining the channel after closing it, so that no further requests are processed.
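
To make the idea concrete, here is a minimal sketch of a non-blocking drain. It is not the client's actual code: `writeRequest` and `drainPending` are hypothetical names, and the buffered channel only stands in for something like `writeRequestsCh`.

```go
package main

import "fmt"

// writeRequest is a hypothetical stand-in for a pending batch.
type writeRequest struct {
	seq int
}

// drainPending removes every request currently buffered in ch without
// blocking and returns them in order, so the caller can fail them
// instead of letting them be written to the connection.
func drainPending(ch chan writeRequest) []writeRequest {
	var dropped []writeRequest
	for {
		select {
		case req := <-ch:
			dropped = append(dropped, req)
		default:
			// Nothing left in the buffer.
			return dropped
		}
	}
}

func main() {
	ch := make(chan writeRequest, 8)
	for i := 1; i <= 3; i++ {
		ch <- writeRequest{seq: i}
	}

	// On close, drain whatever is still queued.
	dropped := drainPending(ch)
	fmt.Println("dropped:", len(dropped), "remaining:", len(ch))
}
```

A drain like this only covers requests already buffered when it runs; a sender that enqueues afterwards would still need to observe the closed state.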

gunli · 2024-07-29T11:36:54Z

@Gilthoniel Good catch. Could you please check whether #1249 fixes this?

Gilthoniel · 2024-08-05T05:12:39Z

@gunli Yes, I think that's enough to avoid this situation. Please note that using a context as in #1249 is not really what a context is intended for; this should be handled via a channel.
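
For illustration, this is roughly what I mean by handling it via a channel. It is only a sketch under my own assumptions, not the code in #1249 or in the client: `producer`, `newProducer`, `trySend` and the `done` channel are all hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// producer is a hypothetical, simplified stand-in for the real type.
type producer struct {
	requests  chan int      // queued send requests (sequence numbers)
	done      chan struct{} // closed exactly once when the producer closes
	closeOnce sync.Once
}

func newProducer(capacity int) *producer {
	return &producer{
		requests: make(chan int, capacity),
		done:     make(chan struct{}),
	}
}

// Close signals shutdown. It is safe to call more than once.
func (p *producer) Close() {
	p.closeOnce.Do(func() { close(p.done) })
}

// trySend queues seq unless the producer is closed.
func (p *producer) trySend(seq int) bool {
	// Check done first so a closed producer always rejects,
	// even when the requests buffer still has room.
	select {
	case <-p.done:
		return false
	default:
	}
	select {
	case <-p.done:
		return false
	case p.requests <- seq:
		return true
	}
}

func main() {
	p := newProducer(8)
	fmt.Println(p.trySend(1)) // queued before Close
	p.Close()
	fmt.Println(p.trySend(2)) // rejected after Close
}
```

The same `done` channel could be added as a case in the loops that read the request channels, so they stop processing once it is closed.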

Gilthoniel changed the title ~~Incorrect sequence of messages due to a race between reconnect and close in producer~~ Incorrect sequence of messages while reading dataChan after closing Jul 26, 2024

Gilthoniel closed this as completed Aug 5, 2024
