
Chaos: Disconnect zbchaos command fails with runtime error #331

Open · shahamit opened this issue Mar 14, 2023 · 7 comments
Labels: Chaos Experiment (this issue describes a chaos experiment which should be created)
@shahamit commented Mar 14, 2023

Chaos Experiment

We tried the disconnect zbchaos command against a locally installed Zeebe cluster (v8.1.6). The command fails with a runtime error: invalid memory address or nil pointer dereference.

Please find screenshots attached of runs with different flags; all of them lead to the same error. Kindly share some insights. Thanks.
[Screenshot from 2023-03-14 16-08-38]
[Screenshot from 2023-03-14 14-53-57]
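
For reference, the invocations were roughly of this shape (the exact subcommands and flags are in the screenshots above; the lines below are illustrative only):

```sh
# Illustrative zbchaos invocations against the local cluster.
# The exact flags varied per run, as shown in the screenshots above.
zbchaos disconnect gateway
zbchaos disconnect brokers
```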

shahamit added the Chaos Experiment label Mar 14, 2023
@Zelldon (Member) commented Mar 14, 2023

Hey @shahamit, zbchaos doesn't support local installations. The expected setup is either a deployment via the Helm charts on Kubernetes or the internal setup in our SaaS.
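
For reference, a minimal sketch of such a Kubernetes setup via the public Camunda Helm chart (the release name and default values here are illustrative, not the exact configuration zbchaos is tested against):

```sh
# Minimal sketch of a Helm-based Zeebe deployment on Kubernetes.
# The release name "dev" and default chart values are illustrative assumptions.
helm repo add camunda https://helm.camunda.io
helm repo update
helm install dev camunda/camunda-platform
```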

Zelldon closed this as completed Mar 14, 2023
@Zelldon (Member) commented Mar 14, 2023

I will try to document this better

@shahamit (Author) commented Mar 15, 2023 via email

@Zelldon (Member) commented Mar 16, 2023

Can you rerun the same with verbosity enabled?
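
For example, something along these lines (the flag name is an assumption; `zbchaos --help` has the exact spelling):

```sh
# Rerun the same disconnect experiment with verbose output enabled.
# The --verbose flag name is assumed here; verify with `zbchaos --help`.
zbchaos disconnect gateway --verbose
```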

@shahamit (Author) commented

Sorry for the delayed response. It took us some time to get a distributed cluster up on AWS.

We ran this test against a cluster that was under load. The config is 2 gateways, 6 brokers, 6 partitions, and replication factor 2.
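
That shape corresponds roughly to these Helm chart values (a sketch; the key names assume the camunda/camunda-platform chart and should be checked against its values.yaml):

```sh
# Rough sketch of the cluster shape described above, as Helm chart values.
# Key names assume the camunda/camunda-platform chart; verify before use.
helm install dev camunda/camunda-platform \
  --set zeebe.clusterSize=6 \
  --set zeebe.partitionCount=6 \
  --set zeebe.replicationFactor=2 \
  --set zeebe-gateway.replicas=2
```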

The disconnect command does disconnect the gateway, but this leads to errors on the client and on the gateway. Part of the disconnect command's verbose output is also something I didn't follow: it says "Gateway deployment not fully available. Available replicas 2/3". Is this because a new gateway replica gets created by k8s when the first one gets disconnected?

Overall it seems the cluster stops functioning if one gateway node gets disconnected, which isn't good. Thoughts?
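
A quick way to inspect the replica counts behind that "Available replicas 2/3" message (the deployment name below is a guess based on the "dev" release visible in the logs; adjust to your install):

```sh
# Inspect the gateway deployment's desired vs. available replicas.
# "dev-zeebe-gateway" is a guessed name based on the release seen in the logs.
kubectl get deployment dev-zeebe-gateway
kubectl describe deployment dev-zeebe-gateway
```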

Disconnect command output
[Screenshot: ksnip_20230321-183849 (1)]

Benchmarking tool (client side) logs
[Screenshot: ksnip_20230321-183548 (1)]

Gateway Logs

```
io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task-benchmarkStarter1-completed from partition 5
java.net.ConnectException: Failed to connect channel for address dev-zeebe-4.dev-zeebe.default.svc:26501
        at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$bootstrapClient$36(NettyMessagingService.java:721) ~[zeebe-atomix-cluster-8.1.6.jar:8.1.6]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:674) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:693) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at java.lang.Thread.run(Unknown Source) ~[?:?]
```
The same warning and stack trace repeat, e.g. at 2023-03-21 13:03:39.177 [ActivateJobsHandler] [gateway-scheduler-zb-actors-3] WARN.

Zelldon reopened this Mar 23, 2023
Zelldon self-assigned this Mar 23, 2023
@shahamit (Author) commented

@Zelldon - there are a couple of blockers that we observed when executing the chaos tool against an under-load Zeebe cluster. One of them is this issue; the other is the gateway termination logged here.

There are more failures that we observed when executing the restart-gateway chaos experiment, but we thought we'd re-execute it once some analysis has been done on the ones already logged.

Should I move these issues to the zeebe repo to gain traction, since the problems are not with the experiment itself but with its outcome?

Thanks

@Zelldon (Member) commented Mar 31, 2023

Hey @shahamit

> Should I move these issues to the zeebe repo to gain traction, since the problems are not with the experiment itself but with its outcome?

Sounds reasonable to me, but let's first collect a bit more information in #336
