Encoding is variably UTF-8 or ISO-8859-1 #8

dhdaines · 2024-01-05T16:59:53Z

Hi! Great work on this dataset - unfortunately, reading the files with Python fails because the encodings vary in some of the records. Luckily they all seem to be either UTF-8 or ISO-8859-1 for the moment. You can fix them with this script:

#!/usr/bin/env python3

import chardet
import logging
from pathlib import Path

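for path in Path(".").glob("*.csv"):
    tmp = path.with_suffix(".csv.tmp")
    # Read raw bytes; write UTF-8 explicitly and keep line endings as-is.
    with open(path, "rb") as infh, open(tmp, "wt", encoding="utf-8", newline="") as outfh:
        for idx, spam in enumerate(infh, start=1):
            try:
                line = spam.decode("utf-8")
            except UnicodeDecodeError:
                # chardet may report no encoding; fall back to ISO-8859-1.
                encoding = chardet.detect(spam)["encoding"] or "ISO-8859-1"
                logging.warning(
                    "Line %d of %s is not UTF-8, probably %s, re-encoding it",
                    idx,
                    path,
                    encoding,
                )
                line = spam.decode(encoding)
            outfh.write(line)
    # replace() overwrites the original file on every platform.
    tmp.replace(path)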
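
If installing chardet is a hassle, a simpler fallback might be enough. This is only a sketch, and it assumes that every line that fails to decode as UTF-8 really is ISO-8859-1, which is what I have seen so far but not something I have verified for the whole dataset. The name `decode_line` is just a hypothetical helper, not part of the script above.

```python
def decode_line(raw: bytes) -> str:
    """Decode one raw line as UTF-8, falling back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so this
        # decode cannot raise. The result is only correct if the line
        # really was ISO-8859-1 (an assumption, see above).
        return raw.decode("iso-8859-1")
```

It could replace the try/except block in the loop with `line = decode_line(spam)`. The trade-off is that a line in some third encoding would be silently mis-decoded instead of being flagged by chardet.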

nonviolent-action-lab · 2024-01-05T18:44:31Z

Apologies for the hassle, and thank you for posting this workaround.

In a recent R update, the way encoding gets handled was changed. I think that's why this is happening now, but I haven't been able to dig into it yet. I hope I can get this sorted in the near future.

nonviolent-action-lab closed this as completed Jan 5, 2024

nonviolent-action-lab reopened this Jan 11, 2024
