
Chaos: Disconnect zbchaos command fails with runtime error #331

Open · shahamit opened this issue Mar 14, 2023 · 7 comments
Labels: Chaos Experiment (this issue describes a chaos experiment which should be created)
@shahamit commented Mar 14, 2023

Chaos Experiment

We tried the disconnect zbchaos command against a locally installed Zeebe cluster (v8.1.6). The command fails with a runtime error: invalid memory address or nil pointer dereference.

Please find screenshots attached of runs with different flags; all of them lead to the same error. Kindly share some insights. Thanks.
[Screenshot from 2023-03-14 16-08-38]
[Screenshot from 2023-03-14 14-53-57]
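
For reference, the invocations were roughly of this shape (the exact subcommands and flags are in the screenshots above; the lines below are illustrative only):

```sh
# Illustrative zbchaos invocations against the local cluster.
# The exact flags varied per run, as shown in the screenshots above.
zbchaos disconnect gateway
zbchaos disconnect brokers
```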

shahamit added the Chaos Experiment label Mar 14, 2023
@Zelldon (Member) commented Mar 14, 2023

Hey @shahamit, zbchaos doesn't support local installations. The expected setup is either a deployment via the Helm charts on Kubernetes or the internal setup in our SaaS.
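
For reference, a minimal sketch of such a Kubernetes setup via the public Camunda Helm chart (the release name and default values here are illustrative, not the exact configuration zbchaos is tested against):

```sh
# Minimal sketch of a Helm-based Zeebe deployment on Kubernetes.
# The release name "dev" and default chart values are illustrative assumptions.
helm repo add camunda https://helm.camunda.io
helm repo update
helm install dev camunda/camunda-platform
```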

Zelldon closed this as completed Mar 14, 2023
@Zelldon (Member) commented Mar 14, 2023

I will try to document this better

@shahamit (Author) commented Mar 15, 2023 via email

@Zelldon (Member) commented Mar 16, 2023

Can you rerun the same with verbosity enabled?
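
For example, something along these lines (the flag name is an assumption; `zbchaos --help` has the exact spelling):

```sh
# Rerun the same disconnect experiment with verbose output enabled.
# The --verbose flag name is assumed here; verify with `zbchaos --help`.
zbchaos disconnect gateway --verbose
```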

@shahamit (Author) commented

Sorry for the delayed response. It took us some time to get a distributed cluster up on AWS.

We ran this test against a cluster that was under load. The config is 2 gateways, 6 brokers, 6 partitions, and replication factor 2.
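
That shape corresponds roughly to these Helm chart values (a sketch; the key names assume the camunda/camunda-platform chart and should be checked against its values.yaml):

```sh
# Rough sketch of the cluster shape described above, as Helm chart values.
# Key names assume the camunda/camunda-platform chart; verify before use.
helm install dev camunda/camunda-platform \
  --set zeebe.clusterSize=6 \
  --set zeebe.partitionCount=6 \
  --set zeebe.replicationFactor=2 \
  --set zeebe-gateway.replicas=2
```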

The disconnect command does disconnect the gateway, but this leads to errors on the client and on the gateway. Part of the disconnect command's verbose output is also something I didn't follow: it says "Gateway deployment not fully available. Available replicas 2/3". Is this because a new gateway replica gets created by k8s when the first one gets disconnected?

Overall it seems the cluster stops functioning if one gateway node gets disconnected, which isn't good. Thoughts?
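
A quick way to inspect the replica counts behind that "Available replicas 2/3" message (the deployment name below is a guess based on the "dev" release visible in the logs; adjust to your install):

```sh
# Inspect the gateway deployment's desired vs. available replicas.
# "dev-zeebe-gateway" is a guessed name based on the release seen in the logs.
kubectl get deployment dev-zeebe-gateway
kubectl describe deployment dev-zeebe-gateway
```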

Disconnect command output
[Screenshot: ksnip_20230321-183849 (1)]

Benchmarking tool (client side) logs
[Screenshot: ksnip_20230321-183548 (1)]

Gateway Logs

```
io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task-benchmarkStarter1-completed from partition 5
java.net.ConnectException: Failed to connect channel for address dev-zeebe-4.dev-zeebe.default.svc:26501
        at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$bootstrapClient$36(NettyMessagingService.java:721) ~[zeebe-atomix-cluster-8.1.6.jar:8.1.6]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:674) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:693) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at java.lang.Thread.run(Unknown Source) ~[?:?]
```
The same warning and stack trace repeat, e.g. at 2023-03-21 13:03:39.177 [ActivateJobsHandler] [gateway-scheduler-zb-actors-3] WARN.

Zelldon reopened this Mar 23, 2023
Zelldon self-assigned this Mar 23, 2023
@shahamit (Author) commented

@Zelldon - there are a couple of blockers that we observed when executing the chaos tool against an under-load Zeebe cluster. One of them is this issue; the other is the gateway termination logged here.

There are more failures that we observed when executing the restart-gateway chaos experiment, but we thought we'd re-execute it once some analysis has been done on the ones already logged.

Should I move these issues to the zeebe repo to gain traction, since the problems are not with the experiment itself but with its outcome?

Thanks

@Zelldon (Member) commented Mar 31, 2023

Hey @shahamit

> Should I move these issues to the zeebe repo to gain traction, since the problems are not with the experiment itself but with its outcome?

Sounds reasonable to me, but let's first collect a bit more information in #336
