Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault on GC runs in multithreaded context #1761

Open
4 of 7 tasks
progval opened this issue Jun 4, 2024 · 1 comment
Open
4 of 7 tasks

Segmentation fault on GC runs in multithreaded context #1761

progval opened this issue Jun 4, 2024 · 1 comment

Comments

@progval
Copy link

progval commented Jun 4, 2024

Description

I am getting the following crash in very specific conditions, while the Python GC is running.

I can reliably reproduce it in a pytest test suite of an (unfortunately) non-trivial code base.
Each test of the suite starts a producer, writes a few messages, then starts a multiprocessing.Manager,
then a consumer and consumes topics to the end.

The third case to run segfaults when I manually trigger the GC in a specific place (or the fourth,
when the GC eventually runs on its own).

==3552160== Invalid read of size 4
==3552160==    at 0x4A28B17: __pthread_clockjoin_ex (pthread_join_common.c:43)
==3552160==    by 0x4A2F96F: thrd_join@@GLIBC_2.34 (thrd_join.c:26)
==3552160==    by 0x774B3DB: rd_kafka_destroy_app (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552160==    by 0x75A86C3: Producer_dealloc (Producer.c:103)
==3552160==    by 0x5CCB14: UnknownInlinedFun (object.c:2390)
==3552160==    by 0x5CCB14: UnknownInlinedFun (object.h:538)
==3552160==    by 0x5CCB14: UnknownInlinedFun (dictobject.c:5566)
==3552160==    by 0x5CCB14: subtype_clear.lto_priv.0 (typeobject.c:1288)
==3552160==    by 0x50ABCD: UnknownInlinedFun (gcmodule.c:1013)
==3552160==    by 0x50ABCD: gc_collect_main.lto_priv.0 (gcmodule.c:1287)
==3552160==    by 0x6303E6: gc_collect_with_callback.lto_priv.0 (gcmodule.c:1400)
==3552160==    by 0x6756CF: UnknownInlinedFun (gcmodule.c:1538)
==3552160==    by 0x6756CF: gc_collect.lto_priv.0 (gcmodule.c.h:110)
==3552160==    by 0x53ACF6: cfunction_vectorcall_FASTCALL_KEYWORDS.lto_priv.0 (methodobject.c:443)
==3552160==    by 0x53AC2B: UnknownInlinedFun (pycore_call.h:92)
==3552160==    by 0x53AC2B: PyObject_Vectorcall (call.c:299)
==3552160==    by 0x52B93F: _PyEval_EvalFrameDefault (ceval.c:4772)
==3552160==    by 0x5855A3: UnknownInlinedFun (pycore_ceval.h:73)
==3552160==    by 0x5855A3: UnknownInlinedFun (ceval.c:6435)
==3552160==    by 0x5855A3: UnknownInlinedFun (call.c:393)
==3552160==    by 0x5855A3: _PyObject_VectorcallTstate.lto_priv.0 (pycore_call.h:92)
==3552160==  Address 0x1201f990 is not stack'd, malloc'd or (recently) free'd
==3552160== 
Fatal Python error: Segmentation fault

Current thread 0x0000000004b80880 (most recent call first):
  Garbage-collecting
  File "/home/dev/swh-environment/swh-dataset/build/__editable__.swh.dataset-1.5.2.dev7+g4fccea4.d20240523-py3-none-any/swh/dataset/journalprocessor.py", line 348 in progress_worker
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 257 in _process_worker
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 761 in _spawn_process
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 750 in _launch_processes
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 723 in _start_executor_manager_thread
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 785 in submit
  File "/home/dev/swh-environment/swh-dataset/build/__editable__.swh.dataset-1.5.2.dev7+g4fccea4.d20240523-py3-none-any/swh/dataset/journalprocessor.py", line 316 in run
  File "/home/dev/swh-environment/swh-dataset/swh/dataset/test/test_journal_processor.py", line 170 in test_parallel_journal_processor_origin_visit_status
  File "/home/dev/swh-environment/swh-dataset/swh/dataset/test/test_journal_processor.py", line 50 in newf
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/python.py", line 1772 in runtest
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 172 in pytest_runtest_call
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 240 in <lambda>
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 340 in from_call
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 239 in call_and_report
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 134 in runtestprotocol
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 115 in pytest_runtest_protocol
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 364 in pytest_runtestloop
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 339 in _main
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 285 in wrap_session
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/config/__init__.py", line 174 in main
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/config/__init__.py", line 197 in console_main
  File "/home/dev/.local/bin/pytest", line 8 in <module>

Extension modules: confluent_kafka.cimpl, msgpack._cmsgpack, psutil._psutil_linux, psutil._psutil_posix, psycopg_c.pq, psycopg_c._psycopg, markupsafe._speedups, _brotli, simplejson._speedups, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, tornado.speedups, plyvel._plyvel (total: 20)
==3552160== 
==3552160== Process terminating with default action of signal 11 (SIGSEGV)
==3552160==    at 0x4A28E2C: __pthread_kill_implementation (pthread_kill.c:44)
==3552160==    by 0x49D9FB1: raise (raise.c:26)
==3552160==    by 0x49DA04F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==3552160==    by 0x4A28B16: __pthread_clockjoin_ex (pthread_join_common.c:43)
==3552160== 
==3552160== HEAP SUMMARY:
==3552160==     in use at exit: 29,294,834 bytes in 30,069 blocks
==3552160==   total heap usage: 219,797 allocs, 189,728 frees, 522,042,917 bytes allocated
==3552160== 
==3552160== LEAK SUMMARY:
==3552160==    definitely lost: 4,547 bytes in 25 blocks
==3552160==    indirectly lost: 12,610 bytes in 30 blocks
==3552160==      possibly lost: 423,923 bytes in 214 blocks
==3552160==    still reachable: 28,853,754 bytes in 29,800 blocks
==3552160==         suppressed: 0 bytes in 0 blocks
==3552160== Rerun with --leak-check=full to see details of leaked memory
==3552160== 
==3552160== For lists of detected and suppressed errors, rerun with: -s
==3552160== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==3552159== 
==3552159== Process terminating with default action of signal 15 (SIGTERM)
==3552159==    at 0x4A23E96: __futex_abstimed_wait_common64 (futex-internal.c:57)
==3552159==    by 0x4A23E96: __futex_abstimed_wait_common (futex-internal.c:87)
==3552159==    by 0x4A2683B: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==3552159==    by 0x4A2683B: pthread_cond_timedwait@@GLIBC_2.3.2 (pthread_cond_wait.c:643)
==3552159==    by 0x4A2F5D8: cnd_timedwait@@GLIBC_2.34 (cnd_timedwait.c:27)
==3552159==    by 0x7782B10: rd_kafka_q_serve_rkmessages (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552159==    by 0x7743C3B: rd_kafka_consume_batch_queue (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552159==    by 0x75A5BB6: Consumer_consume (Consumer.c:1056)
==3552159==    by 0x550BBD: method_vectorcall_VARARGS_KEYWORDS.lto_priv.0 (descrobject.c:364)
==3552159==    by 0x53AC2B: UnknownInlinedFun (pycore_call.h:92)
==3552159==    by 0x53AC2B: PyObject_Vectorcall (call.c:299)
==3552159==    by 0x52B93F: _PyEval_EvalFrameDefault (ceval.c:4772)
==3552159==    by 0x5855A3: UnknownInlinedFun (pycore_ceval.h:73)
==3552159==    by 0x5855A3: UnknownInlinedFun (ceval.c:6435)
==3552159==    by 0x5855A3: UnknownInlinedFun (call.c:393)
==3552159==    by 0x5855A3: _PyObject_VectorcallTstate.lto_priv.0 (pycore_call.h:92)
==3552159==    by 0x585173: method_vectorcall.lto_priv.0 (classobject.c:59)
==3552159==    by 0x56A380: UnknownInlinedFun (call.c:257)
==3552159==    by 0x56A380: UnknownInlinedFun (call.c:328)
==3552159==    by 0x56A380: PyObject_Call (call.c:355)
==3552159== 
==3552159== HEAP SUMMARY:
==3552159==     in use at exit: 29,630,515 bytes in 30,288 blocks
==3552159==   total heap usage: 220,309 allocs, 190,021 frees, 522,280,354 bytes allocated
==3552159== 
==3552143== 
==3552143== HEAP SUMMARY:
==3552143==     in use at exit: 29,243,757 bytes in 29,955 blocks
==3552143==   total heap usage: 218,912 allocs, 188,957 frees, 522,173,350 bytes allocated
==3552143== 
==3552143== LEAK SUMMARY:
==3552143==    definitely lost: 576 bytes in 18 blocks
==3552143==    indirectly lost: 0 bytes in 0 blocks
==3552143==      possibly lost: 371,032 bytes in 200 blocks
==3552143==    still reachable: 28,872,149 bytes in 29,737 blocks
==3552143==         suppressed: 0 bytes in 0 blocks
==3552143== Rerun with --leak-check=full to see details of leaked memory
==3552143== 
==3552143== For lists of detected and suppressed errors, rerun with: -s
==3552143== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==3552159== LEAK SUMMARY:
==3552159==    definitely lost: 608 bytes in 19 blocks
==3552159==    indirectly lost: 0 bytes in 0 blocks
==3552159==      possibly lost: 676,795 bytes in 245 blocks
==3552159==    still reachable: 28,953,112 bytes in 30,024 blocks
==3552159==         suppressed: 0 bytes in 0 blocks
==3552159== Rerun with --leak-check=full to see details of leaked memory
==3552159== 
==3552159== For lists of detected and suppressed errors, rerun with: -s
==3552159== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

How to reproduce

I'm afraid I was not able to reproduce it with a simple script.

Clone https://gitlab.softwareheritage.org/swh/devel/swh-dataset.git , check out
f8b5c8665ac802f8b9dd15feb11dad2c2eadaee3, then applying this patch (segfaults
also happen without it, but it makes them more deterministic):

diff --git a/swh/dataset/journalprocessor.py b/swh/dataset/journalprocessor.py
index ab62cb1..87bdb32 100644
--- a/swh/dataset/journalprocessor.py
+++ b/swh/dataset/journalprocessor.py
@@ -24,6 +24,7 @@
     Set,
     Tuple,
 )
+import gc

 from confluent_kafka import Message, TopicPartition
 import tqdm
@@ -344,6 +345,7 @@ def progress_worker(self, queue=None) -> None:
         active_workers = self.processes
         offset_diff = sum((hi - lo) for lo, hi in self.offsets.values())
         desc = "  - Export"
+        gc.collect()
         with tqdm.tqdm(total=offset_diff, desc=desc, unit_scale=True) as pbar:
             while active_workers:
                 item = queue.get()
Installed Python packages
androguard==3.4.0a1
apksigcopier==1.1.1
appdirs==1.4.4
argcomplete==2.0.0
asn1crypto==1.5.1
asttokens==2.2.1
attrs==22.2.0
backcall==0.2.0
beautifulsoup4==4.11.2
beniget==0.4.1
binwalk==2.3.3
black==22.10.0
Brotli==1.0.9
cassandra-pylib==0.0.0
certifi==2022.9.24
cfgv==3.3.1
chardet==5.1.0
charset-normalizer==3.0.1
click==8.1.3
colorama==0.4.6
contourpy==1.0.7
cryptography==38.0.4
cssselect==1.2.0
cupshelpers==1.0
cycler==0.11.0
dbus-python==1.3.2
decorator==5.1.1
defusedxml==0.7.1
diffoscope==240
distlib==0.3.6
distro==1.8.0
dulwich==0.21.2
EditorConfig==0.12.3
elementpath==2.5.3
enjarify==1.0.3
executing==1.2.0
fastimport==0.9.14
filelock==3.9.0
fonttools==4.38.0
fs==2.4.16
gast==0.5.2
GDAL==3.6.2
gunicorn==20.1.0
html5lib==1.1
httpie==3.2.1
httplib2==0.20.4
identify==2.5.17
idna==3.3
importlib-metadata==4.12.0
iotop==0.6
ipython==8.5.0
jedi==0.18.2
jsbeautifier==1.14.4
jsondiff==2.0.0
kiwisolver==0.0.0
libarchive-c==2.9
libevdev==0.5
lxml==4.9.2
lz4==4.0.2+dfsg
markdown-it-py==2.1.0
matplotlib==3.6.3
matplotlib-inline==0.1.6
mdurl==0.1.2
mechanize==0.4.8
more-itertools==8.10.0
mpmath==0.0.0
msgpack==1.0.3
multidict==6.0.4
mypy-extensions==1.0.0
netifaces==0.11.0
networkx==2.8.8
nodeenv==0.13.4
numpy==1.24.2
olefile==0.46
oscrypto==1.3.0
packaging==23.0
parso==0.8.3
pathspec==0.11.1
pbr==5.10.0
pdfminer.six==20221105
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==2.6.0
ply==3.11
powerline-status==2.8.3
pre-commit==3.0.4
progressbar==2.5
prompt-toolkit==3.0.36
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.0.0
pycairo==1.20.1
pycryptodomex==3.11.0
pycups==2.0.1
pycurl==7.45.2
pydot==1.4.2
Pygments==2.14.0
PyGObject==3.42.2
pygraphviz==1.7
PyOpenGL==3.1.6
pyparsing==3.0.9
pypdf==3.4.1
pyperclip==1.8.2
pyprof2calltree==1.4.5
PyQt5==5.15.9
PyQt5-sip==12.11.1
PyQt6==6.4.2
PyQt6-sip==13.4.1
pyqtgraph==0.13.1
PySimpleSOAP==1.16.2
pysmbc==1.0.23
PySocks==1.7.1
pyte==0.8.0
python-apt==2.6.0
python-dateutil==2.8.2
python-debian==0.1.49
python-debianbts==4.0.1
python-magic==0.4.26
pythran==0.11.0
pytz==2022.7.1
pyudev==0.24.0
pyxattr==0.8.1
PyYAML==6.0
reportbug==12.0.0
requests==2.28.1
requests-toolbelt==0.10.1
rich==13.3.1
rpm==4.18.0
rubber==1.6.0
scipy==1.10.1
scour==0.38.2
six==1.16.0
soupsieve==2.3.2
stack-data==0.6.2
stevedore==4.0.2
sympy==1.11.1
tlsh==0.2.0
tqdm==4.64.1
traitlets==5.5.0
ufoLib2==0.14.0
unicodedata2==15.0.0
urllib3==1.26.12
virtualenv==20.17.1+ds
virtualenv-clone==0.3.0
virtualenvwrapper==4.8.4
wcwidth==0.2.5
webencodings==0.5.1
xkcdpass==1.19.3
xmlschema==1.10.0
xmltodict==0.13.0
zipp==1.0.0
zstd==1.5.2.5

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): ('2.4.0', 33816576) and ('2.4.0', 33816831)

  • Apache Kafka broker version: none

  • Client configuration:
    {'auto.offset.reset': 'earliest', 'bootstrap.servers': '127.0.0.1:45305', 'debug': 'all', 'enable.auto.commit': False, 'enable.partition.eof': True, 'error_cb': <function _error_cb at 0x7efc3959aa20>, 'group.id': 'test-consumer-gqizofzwretest_parallel_journal_processor', 'logger': <Logger swh.journal.client.rdkafka (DEBUG)>, 'message.max.bytes': '524288000', 'on_commit': <function _on_commit at 0x7efc3959bce0>}

  • Operating system: Debian 11

  • Provide client logs (with 'debug': '..' as necessary): crash_log.txt

  • Provide broker log excerpts

  • Critical issue

@KudoKhang
Copy link

Interested 🗿

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants