rtf_to_text() converts RTF cp1252 russian text bad #50
Comments
Hi, according to Wikipedia, Cyrillic RTF should be encoded in cp1251, not cp1252. If I change the RTF content to cp1251 it works fine. cp1252 is the Western European encoding. |
MS Word 2016 saves with 1251 (test-2016.zip), but MS Word 2021 and other versions released after 2016 save as 1252 (test-rus). |
If a file (whether it's RTF or any other encoding) lists the wrong encoding, you are going to get mojibake … I don't think there's anything striprtf can realistically do about buggy RTF files. |
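The effect described here can be reproduced without any RTF at all. The following is a minimal sketch using only Python's built-in codecs: it encodes the sample string from this issue as cp1251 and then decodes those bytes as cp1252, which is roughly what happens when a header declares the wrong code page. The variable names are illustrative only.

```python
# Sample text taken from the issue report.
original = "абвгдеёжзийклмнопрст"

# The bytes the document really contains (Cyrillic, cp1251).
raw = original.encode("cp1251")

# Decoding with the code page the header claims (cp1252) yields mojibake.
garbled = raw.decode("cp1252")

print(garbled)
```

The printed string should match the garbled output quoted in the report.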
I have created a small test case myself with Word 365, and indeed it saves the file with encoding 1252. I have no idea how Word finds out the right encoding in this case. Some online RTF viewers (https://products.groupdocs.app/de/viewer/rtf, https://jumpshare.com/viewer/rtf) are also able to display the content correctly. WordPad shows it correctly as well. The question is: how do they figure out the right encoding? |
Thanks. I did it like this: |
@svladimirs: Glad you got a workaround. When I run your code I get: LookupError: unknown encoding: ansi. How can this run? |
Maybe they do charset detection? |
I tried the chardet library and it told me with nearly 80% confidence that the encoding is ISO-8859-8 which is Hebrew.
|
@joshy, you're probably using Python < 3.6. See 7.2.4.1. Text Encodings. In that Stack Overflow link, the author used .encode('iso-8859-1').decode('cp1251'), but I tried to write universal code. https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=msvc-170 What do you think about def rtf_to_text(text, encoding="mbcs", errors="strict")? Would that work? |
@svladimirs As you can see, I am using Python 3.9. Regarding your proposals:
|
Well, mbcs won't work either... |
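The LookupError reported earlier in the thread is consistent with Python's mbcs codec, and its ansi alias, being available only on Windows builds. Even there, it decodes with the system's ANSI code page, which is cp1251 only on systems configured for Cyrillic. Below is a minimal sketch of checking for a codec before relying on it. Both `codec_available` and `pick_cyrillic_codec` are hypothetical helpers written for this illustration; they are not part of striprtf.

```python
import codecs


def codec_available(name):
    """Return True if this Python build knows a codec called `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def pick_cyrillic_codec(preferred="mbcs", fallback="cp1251"):
    # Hypothetical helper, not part of striprtf.
    # "mbcs" (alias "ansi") exists only on Windows; elsewhere use an explicit name.
    return preferred if codec_available(preferred) else fallback


print(codec_available("ansi"), codec_available("cp1251"))
```

Naming the code page explicitly (cp1251) keeps the behaviour the same across platforms, at the cost of assuming the text is Cyrillic.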
As a library, I can't do that; you as a user can. The reason is that for files where the specified encoding is correct, the library would then convert the text to a wrong encoding. |
striprtf 0.0.26
rtf_to_text() converts RTF files declared as cp1251 correctly (Russian text):
{\rtf1\ansi\ansicpg1251
{\rtf1\adeflang1025\ansi\ansicpg1251
But not files declared as cp1252:
{\rtf1\adeflang1025\ansi\ansicpg1252
абвгдеёжзийклмнопрст -> àáâãäå¸æçèéêëìíîïðñò
Passing encoding=... does not help.
This helps:
https://ru.stackoverflow.com/questions/1145225/Ошибка-обработки-файлов-rtf-на-python?ysclid=lqagyqz7x5798462943
or
rtf_to_text(rtf.read()).encode('cp1252').decode('ansi')
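Since 'ansi' depends on the platform and system locale, a portable variant of the same round trip can name the target code page explicitly. This is a minimal sketch, assuming the text was decoded as cp1252 while the underlying bytes were really cp1251. `repair_mojibake` is a hypothetical helper for illustration, not a striprtf function.

```python
def repair_mojibake(text, declared="cp1252", actual="cp1251"):
    """Undo a decode with `declared` when the bytes were really `actual`."""
    return text.encode(declared).decode(actual)


# With striprtf installed, the call from the report would become (assumed usage):
#   from striprtf.striprtf import rtf_to_text
#   text = repair_mojibake(rtf_to_text(rtf.read()))

print(repair_mojibake("àáâãäå¸æçèéêëìíîïðñò"))
```

Applied to text that was already decoded correctly, the encode step raises UnicodeEncodeError, because Cyrillic characters cannot be represented in cp1252. It is therefore only suitable for files known to carry the wrong code page.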
test-rus.zip