Sqlite can end up with valid utf8 sequences which describe invalid codepoints, which break synapse_port_db #3538

Open
matrixbot opened this issue Dec 16, 2023 · 0 comments

matrixbot commented Dec 16, 2023

This issue has been migrated from #3538.


Error ends up looking like:

2018-07-15 22:54:50,837 - synapse.metrics - 256 - INFO - Collecting gc 0
2018-07-15 22:54:50,936 - synapse_port_db - 562 - ERROR -
Traceback (most recent call last):
File "/usr/bin/synapse_port_db", line 552, in run
consumeErrors=True,
FirstError: FirstError[#16, [Failure instance: Traceback: <class 'psycopg2.DataError'>: invalid byte sequence for encoding "UTF8": 0xed 0xb3 0xb6

/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:434:errback
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:501:_startRunCallbacks
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:588:_runCallbacks
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:1184:gotResult
--- <exception caught here> ---
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:1126:_inlineCallbacks
/usr/lib/python2.7/dist-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
/usr/bin/synapse_port_db:269:handle_table
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:1126:_inlineCallbacks
/usr/lib/python2.7/dist-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
/usr/bin/synapse_port_db:428:handle_search_table
/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py:246:inContext
/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py:262:<lambda>
/usr/lib/python2.7/dist-packages/twisted/python/context.py:118:callWithContext
/usr/lib/python2.7/dist-packages/twisted/python/context.py:81:callWithContext
/usr/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py:298:_runWithConnection
/usr/bin/synapse_port_db:138:r
/usr/bin/synapse_port_db:415:insert
/usr/lib/python2.7/dist-packages/synapse/storage/_base.py:90:executemany
/usr/lib/python2.7/dist-packages/synapse/storage/_base.py:117:_do_execute
]]

in the event_search logic.

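It turns out that 0xed 0xb3 0xb6 follows the UTF-8 three-byte pattern, but it decodes to U+DCF6, a lone low surrogate. Surrogates are not valid Unicode scalar values, and strict UTF-8 forbids encoding them, so Postgres rejects the string when you try to insert it.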

Python 2, however, does not treat the sequence as invalid: its UTF-8 codec decodes it without complaint.
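To make the byte arithmetic concrete, here is a minimal sketch that decodes the three bytes by hand and checks whether the result falls in the surrogate range. The helper name decode_three_byte is made up for illustration and is not part of synapse_port_db; the strict-decode check at the end assumes Python 3, whose UTF-8 codec rejects encoded surrogates.

```python
raw = b"\xed\xb3\xb6"  # the bytes Postgres complained about


def decode_three_byte(seq):
    """Decode one three-byte UTF-8-style sequence by hand, with no validity checks.

    Hypothetical helper for illustration only.
    """
    b0, b1, b2 = bytearray(seq)
    # 1110xxxx 10yyyyyy 10zzzzzz -> xxxx yyyyyy zzzzzz
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


codepoint = decode_three_byte(raw)

# U+D800..U+DFFF are reserved for UTF-16 surrogates and are not scalar values.
is_surrogate = 0xD800 <= codepoint <= 0xDFFF

# Assumption: Python 3. Its strict UTF-8 decoder refuses encoded surrogates.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print("U+%04X surrogate=%s strict_decode_ok=%s"
      % (codepoint, is_surrogate, strict_decode_ok))
```

The hand decoding gives U+DCF6, which sits inside the surrogate block, matching what Postgres objects to.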

The eventual workaround was to use iconv_codecs to run the string through iconv, stripping invalid codepoints before handing it to Postgres, with something like:

row["value"].encode("iconv:utf8", "ignore").decode("utf8")

This seemed to work on Linux, but fails on macOS.
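Since the iconv route behaves differently across platforms, a pure-Python alternative may be worth considering. The sketch below is an assumption, not what was actually used here: it targets Python 3, where encoding to UTF-8 with the "ignore" error handler drops lone surrogates, and strip_surrogates is a hypothetical helper name. It has not been tested against synapse_port_db itself.

```python
def strip_surrogates(value):
    """Return value with any lone surrogate code points removed.

    Hypothetical helper; assumes Python 3, where str holds code points
    and the UTF-8 encoder treats surrogates as unencodable.
    """
    # "ignore" silently drops anything the encoder cannot represent,
    # which for UTF-8 means exactly the surrogate range U+D800..U+DFFF.
    return value.encode("utf-8", "ignore").decode("utf-8")


# A string containing the problematic U+DCF6 between ordinary characters.
dirty = "ab\udcf6cd"
clean = strip_surrogates(dirty)
print(repr(clean))
```

Ordinary text, including characters outside the Basic Multilingual Plane, should pass through unchanged, because on Python 3 those are stored as single code points rather than surrogate pairs.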

Thanks to @flux:matrix.org for reporting and debugging this!

The original cause of the bad data is matrix-org/synapse#3537

@matrixbot matrixbot changed the title from "Dummy issue" to "Sqlite can end up with valid utf8 sequences which describe invalid codepoints, which break synapse_port_db" Dec 21, 2023
@matrixbot matrixbot reopened this Dec 21, 2023