
Increase consumer group test timeout #187

Draft: wants to merge 12 commits into master from orange-test-consumer-group-timeout
Conversation

@orange-kao (Author)

Test test/test_consumer_group.py::test_group failed. Increase the timeout to find out whether it is just slow or a real failure.

@wbarnha (Owner) commented Jul 24, 2024

In my opinion, it's slower than it used to be. If you re-run the tests, it eventually passes, but it's still strange. I suspect something internal to GitHub's CI has changed, but I need to review things further.

@wbarnha (Owner) commented Jul 24, 2024

I started to observe this behavior after merging #184 but I can't imagine anything in there being causally related. Maybe I should rethink how brokers are spun up for each CI test.

@orange-kao (Author)

The test failed again after increasing the timeout. This time I reverted 5e461a7e017130fb9115add8d64291d6966267e9 in this draft PR to see whether the test can pass or not.

@wbarnha (Owner) commented Aug 8, 2024

Hmm, seems like the revert didn't fix it. Thanks for checking, though. I'm baffled this is now an issue.

@orange-kao (Author)

I suspect the test failure might be related to the CPU time available in the GitHub workflow runner. So I started an Ubuntu 24.04 VM on GCP using VM type n2d-standard-4 (4 vCPU, 2 cores, 16 GB RAM); on the master branch, all tests pass. But if I limit the CPU to 0.1 core (using cgroups), the following tests fail.

test/test_admin_integration.py sEEEEEEEEE
test/test_consumer_group.py sEEEEEEE
test/test_consumer_integration.py sEEEEEEEEEsE
test/test_producer.py s.EEEEE.EEEEE
test/test_sasl_integration.py sEEEEEEEEE
test/test_ssl_integration.py sEEE

I am still investigating but I don't have enough evidence at the moment.

I also started testing on e2-micro (0.25~2 vCPU, 1 shared core, 1 GB RAM); I will update the results here later.
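
For reference, the CPU cap above was applied with cgroups; on a cgroup-v2 host running systemd, an equivalent limit could be imposed with something like the following (an illustrative sketch, not necessarily the exact command used):

sudo systemd-run --scope -p CPUQuota=10% tox -e py312   # CPUQuota=10% is roughly 0.1 core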

@orange-kao (Author)

n2d-standard-4 limited to 1 core:

test/test_ssl_integration.py sEEE  # RuntimeError: Child thread died already.

n2d-standard-4 limited to 0.5 core: passed twice

e2-micro 1st time

test/test_admin_integration.py ...xF.....
test/test_consumer_group.py ..F.....

e2-micro 2nd time

test/test_admin_integration.py .EEEEEEEEE
test/test_consumer_group.py .EEEEEEE
test/test_consumer_integration.py .EEEEEEEEEsE
test/test_producer.py ..EEEEE.EEEEE
test/test_sasl_integration.py .EEEEEEEEE
test/test_ssl_integration.py .EEE

Most of them raised RuntimeError: Child thread died already.

Result: Inconclusive.

I'm not familiar with tox, but is there a way to run a single test? I tried something like tox -e py312 -- kafka/scram.py but it always fails...
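
(For reference: if the tox env runs pytest and passes {posargs} through, which I believe this repo's tox.ini does, then arguments after -- are forwarded to pytest, and a single test is selected by its node id rather than by a source file, e.g.

tox -e py312 -- test/test_consumer_group.py::test_group

kafka/scram.py is a library module rather than a test file, which would explain why that invocation always fails.)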

@orange-kao (Author)

I am using this PR as a testing ground, but I have submitted #192, which I assume will fix part of the problem.

@orange-kao (Author)

This PR still fails because test/test_ssl_integration.py raised RuntimeError: Child thread died already. As far as I know this means the Java broker process has died. I will try to find out why...

@orange-kao (Author)

I haven't figured out why Java dies in test/test_ssl_integration.py. However, I noticed that there can be up to 7~9 Kafka subprocesses when running test/test_sasl_integration.py (SASL, not SSL). I used PDB and objgraph to observe this, and it looks like kafka_broker_factory will create 7 KafkaFixture instances, each with its own SpawnedService and Popen object.

I guess the GitHub runner may not have enough memory; I can reproduce the slowness on resource-constrained VMs when there are too many Kafka brokers running.
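
For reference, a minimal sketch of that objgraph inspection (run at a pdb breakpoint inside a test; objgraph is a third-party package, and the class names follow this repo's test fixtures as described above):

import objgraph

# Count the live broker-fixture objects at the breakpoint. Each SpawnedService
# wraps a subprocess.Popen that runs the Java broker.
print(objgraph.count('KafkaFixture'))      # observed: up to 7 at once
print(objgraph.count('SpawnedService'))
print(objgraph.count('Popen'))

# Trace what keeps the fixtures alive (writes a graph image; needs graphviz).
objgraph.show_backrefs(objgraph.by_type('KafkaFixture'), max_depth=4,
                       filename='kafka_fixture_refs.png')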

@wbarnha (Owner) commented Aug 13, 2024

> I haven't figured out why Java dies in test/test_ssl_integration.py. However, I noticed that there can be up to 7~9 Kafka subprocesses when running test/test_sasl_integration.py (SASL, not SSL). I used PDB and objgraph to observe this, and it looks like kafka_broker_factory will create 7 KafkaFixture instances, each with its own SpawnedService and Popen object.
>
> I guess the GitHub runner may not have enough memory; I can reproduce the slowness on resource-constrained VMs when there are too many Kafka brokers running.

I agree. As a solution, my goal has been to run one Kafka instance per test in order to conserve memory. I was planning on tinkering with concurrency groups (https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-concurrency) to improve this issue.

@orange-kao (Author)

Thank you. #186 and #194 have passed the tests and are ready for review.

I also noticed that the Kafka 0.8.2.2 / Python 3.12 test run always times out (after test/test_partitioner.py, before test/test_producer.py). I will try to find out why...

orange-kao force-pushed the orange-test-consumer-group-timeout branch from 7d8bfac to 1608037 on August 16, 2024 02:24

@orange-kao (Author)

^ I cannot reproduce the timeout in my environment (Kafka 0.8.2.2, Python 3.12, after test/test_partitioner.py, before test/test_producer.py). I have updated this branch to re-trigger the test and find out...

@wbarnha (Owner) commented Aug 16, 2024

Thank you for your meticulous investigation, I really do appreciate it. It's been troubling me why this has become an issue over the past month. Possibly Microsoft is scaling down runner resources as a cost-cutting measure?

@orange-kao (Author)

No worries. I am not sure. On paper, public repo runners have plenty of resources, but I am not sure what's behind the scenes, and I have never tried to benchmark it...

Note:
First attempt: timeout during test/test_producer.py::test_kafka_producer_proper_record_metadata[gzip]
Second attempt: test_kafka_producer_proper_record_metadata gzip/snappy/lz4/zstd and timeout
