
[Bug] Producers getting timed out intermittently #23920

soumya-ghosh opened this issue Feb 3, 2025 · 4 comments
Labels
type/bug

Comments


soumya-ghosh commented Feb 3, 2025

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

Setup summary:

  • Pulsar 4.0.1, 3 brokers, 5 bookies, 3 ZKs, JVM - 17.0.13
  • Produce and Consume through Pulsar Java client (4.0.1) - Clients are using Java 11

We also tried with version 4.0.2.

Pulsar deployment is VM-based; we are running on EC2 machines.

As far as configuration is concerned, we had the following settings enabled:

  • Extensible load balancer and TransferShedder
  • Bundle split algorithm - flow_or_qps_equally_divide
  • Tiered storage to S3

We later updated the load balancer configuration to the Pulsar defaults, i.e. the modular load balancer, ThresholdShedder, and range_equally_divide as the bundle split strategy. The reason for this change was that we observed errors in broker logs related to topic shedding.

Then we also disabled tiered storage.

The reported issue was observed in all of these configurations.

Minimal reproduce step

This issue is a continuation of an Apache Pulsar Slack thread: https://apache-pulsar.slack.com/archives/C5Z4T36F7/p1738158630009249
The back-and-forth discussion with @lhotari can be found there.

We are observing an intermittent issue where the Pulsar producer client is unable to produce messages to a partitioned topic; sends fail with org.apache.pulsar.client.api.PulsarClientException$TimeoutException after 30 seconds.

// Pulsar client
this.pulsarClient =
          PulsarClient.builder()
              .serviceUrl(connectionUrl)
              .memoryLimit(0, SizeUnit.MEGA_BYTES)
              .build();

// Producer configurations
ProducerBuilder<T> builder =
        pulsarClient
            .newProducer(schema)
            .producerName(producerName)
            .topic(topicName)
            .compressionType(CompressionType.LZ4)
            .enableBatching(true)
            .batchingMaxPublishDelay(30, TimeUnit.MILLISECONDS)
            .blockIfQueueFull(true);
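For completeness, below is a self-contained sketch of how a producer with this configuration is used; the service URL, topic, producer name, and payload are illustrative placeholders, not our actual values. The 30-second failure matches the client's default send timeout, which can also be set explicitly on the builder.

// Usage sketch (illustrative values only)
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.CompressionType;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SizeUnit;

public class SparseProducerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
            .serviceUrl("pulsar://broker.example.com:6650")      // placeholder URL
            .memoryLimit(0, SizeUnit.MEGA_BYTES)
            .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
            .producerName("example-producer")                     // placeholder name
            .topic("persistent://public/default/example-topic")   // placeholder topic
            .compressionType(CompressionType.LZ4)
            .enableBatching(true)
            .batchingMaxPublishDelay(30, TimeUnit.MILLISECONDS)
            .blockIfQueueFull(true)
            .sendTimeout(30, TimeUnit.SECONDS)                     // 30s is the default; source of the TimeoutException
            .create();

        // Sparse traffic pattern: a single small keyed message, sent asynchronously
        producer.newMessage()
            .key("1")
            .value("small payload")
            .sendAsync()
            .whenComplete((msgId, ex) -> {
                if (ex != null) {
                    // Intermittently fails with PulsarClientException$TimeoutException after 30 seconds
                    System.err.println("send failed: " + ex);
                } else {
                    System.out.println("persisted as " + msgId);
                }
            });

        producer.flush();
        producer.close();
        client.close();
    }
}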

Pulsar brokers, bookies, and client instances are running in the same AWS VPC, so the chances of a network outage are quite low. Brokers are configured to use advertisedAddress; advertisedListeners is not configured.

In the client application, the Pulsar client and producer objects are initialized once, and messages are then produced at sporadic intervals.
There are no other producers or consumers running on the cluster, and there are more than sufficient resources available to serve the produce requests.

Pulsar topics are partitioned.
We have also applied publish rate and dispatch rate policies of 1 MBps (per partition) on the topic in question.
However, the throughput of messages being produced is far below the applied quota, for example, 1-5 messages every 2-5 minutes.
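For reference, topic-level rate policies of this shape can be applied through the Java admin client; the sketch below is only an illustration: the admin URL and topic name are placeholders, and it assumes the topic-level setPublishRate/setDispatchRate methods of the admin API rather than our exact tooling. On a partitioned topic these limits end up applying per partition, as noted above.

// Illustrative sketch of applying ~1 MBps publish/dispatch rate topic policies
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.DispatchRate;
import org.apache.pulsar.common.policies.data.PublishRate;

public class ApplyTopicRatePolicies {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
            .serviceHttpUrl("http://broker.example.com:8080")        // placeholder admin URL
            .build();

        String topic = "persistent://public/default/example-topic";  // placeholder topic

        // Publish rate: no per-message limit (-1), roughly 1 MB per second in bytes
        admin.topics().setPublishRate(topic, new PublishRate(-1, 1024 * 1024));

        // Dispatch rate: no per-message limit (-1), roughly 1 MB per second in bytes
        admin.topics().setDispatchRate(topic, DispatchRate.builder()
            .dispatchThrottlingRateInMsg(-1)
            .dispatchThrottlingRateInByte(1024 * 1024)
            .ratePeriodInSecond(1)
            .build());

        admin.close();
    }
}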

Image

We tried removing the publish and dispatch rate policies from the topic and the issue went away, although we could not conclusively say that removing those policies is what resolved it.
We then re-applied those policies and did not observe the issue for a day; when the issue occurred again, we unloaded the namespace bundles so that they were assigned to different brokers.

We have done this several times, so unloading namespace bundles seems to be a quick fix for the issue.

We also enabled debug logs on the client and the server side.

We observed "Setting auto read to false" in the broker logs (see below):

2025-02-02T16:49:45,932+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-8] Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf Got request to create producer 
2025-02-02T16:49:45,932+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-8] [Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf] Added producer -- count: 1
2025-02-02T16:49:45,932+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [public/default/default_shivam_stream] Put schema finished
2025-02-02T16:49:45,933+0000 [broker-topic-workers-OrderedExecutor-5-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.22.2.15:39103] Created new producer: Producer{topic=PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-8}, client=[id: 0x8eacecff, L:/10.10.17.40:6650 - R:/10.22.2.15:39103] [SR:10.22.2.15, state:Connected], producerName=Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf, producerId=8}, role: null
2025-02-02T16:49:45,933+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [public/default/default_shivam_stream] Schema is already exists
2025-02-02T16:49:45,933+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-9] Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf Got request to create producer 
2025-02-02T16:49:45,933+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-9] [Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf] Added producer -- count: 1
2025-02-02T16:49:45,933+0000 [broker-topic-workers-OrderedExecutor-5-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.22.2.15:39103] Created new producer: Producer{topic=PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-9}, client=[id: 0x8eacecff, L:/10.10.17.40:6650 - R:/10.22.2.15:39103] [SR:10.22.2.15, state:Connected], producerName=Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf, producerId=9}, role: null
2025-02-02T16:49:45,933+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [public/default/default_shivam_stream] Put schema finished
2025-02-02T16:49:46,175+0000 [pulsar-io-3-12] DEBUG org.apache.pulsar.broker.service.ServerCnx - [/10.22.2.15:39103] Received send message request. producer: 9:0 Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf:0 size: 133, partition key is: 1, ordering key is null, uncompressedSize is 45
2025-02-02T16:49:46,175+0000 [pulsar-io-3-12] DEBUG org.apache.pulsar.broker.service.ServerCnxThrottleTracker - [[id: 0x8eacecff, L:/10.10.17.40:6650 - R:/10.22.2.15:39103] [SR:10.22.2.15, state:Connected]] Setting auto read to false
2025-02-02T16:49:46,180+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.Producer - [PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-9}] [Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf] [9] triggered send callback. cnx /10.22.2.15:39103, sequenceId 0
2025-02-02T16:49:46,180+0000 [pulsar-io-3-12] DEBUG org.apache.pulsar.broker.service.Producer - [PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-9}] [Producer352f00f6-a898-4ed4-9469-8eb62c2ed0bf] [9] Persisted message. cnx [id: 0x8eacecff, L:/10.10.17.40:6650 - R:/10.22.2.15:39103] [SR:10.22.2.15, state:Connected], sequenceId 0
2025-02-02T16:49:47,082+0000 [pulsar-io-3-13] DEBUG org.apache.pulsar.broker.service.ServerCnx - New connection from /127.0.0.1:57030
2025-02-02T16:49:47,083+0000 [pulsar-io-3-13] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed connection from /127.0.0.1:57030
2025-02-02T16:49:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-6] No backlog. Update old position info is null
2025-02-02T16:49:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-6] Time backlog quota = [-1]. Checking if exceeded.
2025-02-02T16:49:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.BrokerService - quota not exceeded for [persistent://public/default/default_shivam_stream-partition-6]
2025-02-02T16:49:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-7] No backlog. Update old position info is null
2025-02-02T16:49:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-7] Time backlog quota = [-1]. Checking if exceeded.
2025-02-02T16:49:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.BrokerService - quota not exceeded for [persistent://public/default/default_shivam_stream-partition-7]
--
2025-02-02T16:52:39,369+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-7] Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf Got request to create producer 
2025-02-02T16:52:39,369+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-7] [Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf] Added producer -- count: 1
2025-02-02T16:52:39,369+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [public/default/default_shivam_stream] Put schema finished
2025-02-02T16:52:39,369+0000 [broker-topic-workers-OrderedExecutor-4-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.22.2.15:54474] Created new producer: Producer{topic=PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-7}, client=[id: 0x83526df1, L:/10.10.17.40:6650 - R:/10.22.2.15:54474] [SR:10.22.2.15, state:Connected], producerName=Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf, producerId=7}, role: null
2025-02-02T16:52:39,369+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [public/default/default_shivam_stream] Schema is already exists
2025-02-02T16:52:39,369+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-4] Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf Got request to create producer 
2025-02-02T16:52:39,369+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-4] [Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf] Added producer -- count: 1
2025-02-02T16:52:39,369+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [public/default/default_shivam_stream] Put schema finished
2025-02-02T16:52:39,369+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.22.2.15:54474] Created new producer: Producer{topic=PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-4}, client=[id: 0x83526df1, L:/10.10.17.40:6650 - R:/10.22.2.15:54474] [SR:10.22.2.15, state:Connected], producerName=Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf, producerId=4}, role: null
2025-02-02T16:52:39,596+0000 [pulsar-io-3-7] DEBUG org.apache.pulsar.broker.service.ServerCnx - [/10.22.2.15:54474] Received send message request. producer: 9:0 Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf:0 size: 133, partition key is: 1, ordering key is null, uncompressedSize is 45
2025-02-02T16:52:39,596+0000 [pulsar-io-3-7] DEBUG org.apache.pulsar.broker.service.ServerCnxThrottleTracker - [[id: 0x83526df1, L:/10.10.17.40:6650 - R:/10.22.2.15:54474] [SR:10.22.2.15, state:Connected]] Setting auto read to false
2025-02-02T16:52:39,601+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] DEBUG org.apache.pulsar.broker.service.Producer - [PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-9}] [Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf] [9] triggered send callback. cnx /10.22.2.15:54474, sequenceId 0
2025-02-02T16:52:39,601+0000 [pulsar-io-3-7] DEBUG org.apache.pulsar.broker.service.Producer - [PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-9}] [Producer3e526aac-10f4-4a5e-8c51-2dbda8545bbf] [9] Persisted message. cnx [id: 0x83526df1, L:/10.10.17.40:6650 - R:/10.22.2.15:54474] [SR:10.22.2.15, state:Connected], sequenceId 0
2025-02-02T16:52:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-6] No backlog. Update old position info is null
2025-02-02T16:52:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-6] Time backlog quota = [-1]. Checking if exceeded.
2025-02-02T16:52:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.BrokerService - quota not exceeded for [persistent://public/default/default_shivam_stream-partition-6]
2025-02-02T16:52:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-7] No backlog. Update old position info is null
2025-02-02T16:52:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-7] Time backlog quota = [-1]. Checking if exceeded.
2025-02-02T16:52:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.BrokerService - quota not exceeded for [persistent://public/default/default_shivam_stream-partition-7]
2025-02-02T16:52:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-8] No backlog. Update old position info is null
2025-02-02T16:52:51,001+0000 [pulsar-backlog-quota-checker-OrderedScheduler-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://public/default/default_shivam_stream-partition-8] Time backlog quota = [-1]. Checking if exceeded.
--
2025-02-02T16:54:29,252+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-4] Producerbc584de6-13d3-46c3-9d7f-9f5c4f7619b2 Got request to create producer 
2025-02-02T16:54:29,252+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/default_shivam_stream-partition-4] [Producerbc584de6-13d3-46c3-9d7f-9f5c4f7619b2] Added producer -- count: 1
2025-02-02T16:54:29,252+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [public/default/default_shivam_stream] Put schema finished
2025-02-02T16:54:29,252+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.22.2.15:55876] Created new producer: Producer{topic=PersistentTopic{topic=persistent://public/default/default_shivam_stream-partition-4}, client=[id: 0x4b78fe29, L:/10.10.17.40:6650 - R:/10.22.2.15:55876] [SR:10.22.2.15, state:Connected], producerName=Producerbc584de6-13d3-46c3-9d7f-9f5c4f7619b2, producerId=4}, role: null
2025-02-02T16:54:29,253+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [public/default/default_shivam_stream] Schema is already exists
20...

We also checked pulsar.pulsar_publish_rate_limit_times for the topics exhibiting this issue and found that they had non-zero values for this metric.

Image

Another observation when we encounter this issue is that no messages are delivered to subscriptions. We tried consuming data from the topic using Pulsar's CLI client from the earliest position, but no records were delivered.
We enabled unblockStuckSubscriptionEnabled=true in broker.conf but still observed this issue.

Currently we are removing the topic policies (publish rate and dispatch rate) and will observe whether the issue recurs.

As per our tests, these policies work fine when a sufficient volume of records is being produced to and consumed from the Pulsar topics.
But when the producer load is sparse, we encounter the issue where the producer is unable to produce messages.

Let me know if any other details are required around this.

What did you expect to see?

Records getting produced successfully

What did you see instead?

Records not being produced or consumed at sporadic intervals.
The issue happens intermittently and sometimes resolves on its own; we have not been able to pin down the scenarios that deterministically lead to it.

The common pattern is a sparse producer and non-zero values of pulsar.pulsar_publish_rate_limit_times for the duration of the issue.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
soumya-ghosh added the type/bug label on Feb 3, 2025

lhotari (Member) commented Feb 3, 2025

Another observation when we encounter this issue is that no messages are delivered to subscriptions. We tried consuming data from topic using Pulsar's CLI client from earliest position but no records were delivered.

This detail hints that there could be a deadlock on the broker side when the problem occurs.
In the Pulsar Docker containers, jstack -l 1 is how you can capture a thread dump with lock information.
If you'd like to capture multiple thread dumps and a heap dump, there's a script that can automate this when there's kubectl access to the cluster: https://github.com/lhotari/pulsar-contributor-toolbox/blob/master/scripts/collect_jvm_diagnostics_from_pod.sh
Notice: the heap dump could contain sensitive information, so it should never be shared publicly. Usually the only possibly sensitive information in thread dumps is the ZooKeeper client thread names, which contain an IP address.

Would you be able to capture a thread dump with jstack -l 1 to ensure that there isn't a deadlock?

In Pulsar 4.0.1 there's a known deadlock bug that has been fixed in 4.0.2 with #23854; it applies to Key_Shared subscriptions.

In this case, please share (and possibly sanitize before sharing) the topic stats.

Commands for creating stats and stats-internal for a partitioned topic:

pulsar-admin topics partitioned-stats --per-partition persistent://public/default/picks-teams-stream > topic_stats_picks-teams-stream_$(date +%s).json
pulsar-admin topics partitioned-stats-internal persistent://public/default/picks-teams-stream > topic_stats_internal_picks-teams-stream_$(date +%s).json

You might want to sanitize IP addresses in the stats etc. before sharing.
Then use the GitHub CLI tool gh to create a secret gist at gist.github.com:

gh gist create topic_stats_picks-teams-stream_*.json topic_stats_internal_picks-teams-stream_*.json

After creating a gist, you can check the content and then share the link in this issue.

soumya-ghosh (Author) commented

Looking back at the Datadog metrics, when pulsar.pulsar_publish_rate_limit_times went back down to 0, we observed that the producer returned to normal and records were delivered to consumers.

Currently we are using the Failover subscription mode. I have captured some thread dumps and a heap dump from one Pulsar broker and will try to share them.

Regarding the stats of the partitioned topic while consumers were not receiving records, we captured them using pulsar-admin topics partitioned-stats persistent://public/default/topic_name (basically without the --per-partition flag) - https://gist.github.com/soumya-ghosh/e97e19f716e8b917eae9cb55f91b9aa9
We had reported this issue on Slack - https://apache-pulsar.slack.com/archives/C5Z4T36F7/p1737060570361949

Non-zero backlog metric when a console consumer is started from the Pulsar CLI.
Image

Messages are delivered after 10 minutes and backlog goes to 0.
Image

Will try to capture the partition-level stats and internal stats when the issue occurs again.

We suspect the producer and consumer issues are interrelated, as we observed them happening at the same time.


lhotari (Member) commented Feb 3, 2025

I filed issue #23921. A heap dump could be used to check whether that issue applies. This can be done by searching for org.apache.pulsar.broker.service.ServerCnx$PendingBytesPerThreadTracker instances in the heap dump and checking the pendingBytes and limitExceeded field values.

soumya-ghosh (Author) commented

Some observations from a heap dump captured on a broker of the Pulsar cluster exhibiting this issue.

org.apache.pulsar.broker.service.ServerCnx$PendingBytesPerThreadTracker - pendingBytes and limitExceeded field values

Image

org.apache.pulsar.broker.service.ServerCnxThrottleTracker instance fields

Image

org.apache.pulsar.broker.service.PublishRateLimiterImpl --> unthrottlingQueue MpscUnboundedArrayQueue

Image

We observed a large negative tokens value for the PublishRateLimiterImpl objects that have a non-zero throttledProducersCount:

SELECT * FROM org.apache.pulsar.broker.service.PublishRateLimiterImpl WHERE (throttledProducersCount.value > 0) and tokenBucketOnByte.tokens < 0

Image

We will try to capture some more heap dumps once the issue occurs again and confirm the above observations.

Thanks @lhotari for assisting in heap dump analysis.
