Skip to content

Comments

fix: resolve Windows test failures from encoding and path issues#152

Open
saurabh12nxf wants to merge 2 commits intom-lab:mainfrom
saurabh12nxf:fix/windows-unicode-and-path-compat
Open

fix: resolve Windows test failures from encoding and path issues#152
saurabh12nxf wants to merge 2 commits intom-lab:mainfrom
saurabh12nxf:fix/windows-unicode-and-path-compat

Conversation

@saurabh12nxf
Copy link

Fixes #150

Problem

Running uv run pytest on Windows produces 11 test failures:

  • 10 failures from UnicodeDecodeError — downloads_by_country.sql contains emoji (⚠️) in comments. On Windows, read_text() defaults to cp1252 which can't decode these UTF-8 bytes.
  • 1 failure from path separator mismatch — str(Path("/custom/path")) returns \custom\path on Windows.

Changes

library/src/iqb/pipeline/pipeline.py

  • Added encoding="utf-8" to read_text() call that loads SQL query templates. This is production code — without this fix, the pipeline itself breaks on Windows.

library/tests/iqb/queries_test.py

  • Same encoding="utf-8" fix for all read_text() calls in tests.

library/tests/iqb/cache/cache_test.py

  • Changed str() comparison to Path() comparison for cross-platform compatibility. Added Path import.

Verification

Before: 279 passed, 11 failed
After: 290 passed, 0 failed

@bassosimone
Copy link
Collaborator

What gives me guarantees that this code runs correctly on Windows? Perhaps you could try to add a GitHub workflow that ensures the tests are working as intended? Also, can you comment on each line of code you changed and explain to me it is correct, please? Maybe I am missing something and I'd like to understand your reasoning.

@saurabh12nxf
Copy link
Author

Hi @bassosimone, thanks for the review! I've addressed both points:

  1. GitHub Workflow for Windows CI
    I pushed a second commit (9d66025) that adds a test-windows job to
    .github/workflows/ci.yml
    . It runs the library tests on windows-latest using the same setup pattern as the existing test-all job. Once the workflow is approved, it will prove the fix works in CI, not just locally.

  2. Line-by-line reasoning
    library/src/iqb/pipeline/pipeline.py

  • line 195
 #Before
template_text = query_file.read_text()
 #After
template_text = query_file.read_text(encoding="utf-8")

Root cause: The SQL files (e.g.
downloads_by_country.sql
line 42) contain the ⚠️ emoji in comments. These files are saved as UTF-8.

On Linux/macOS, read_text() defaults to UTF-8 - works fine. On Windows, Python defaults to cp1252 (via locale.getpreferredencoding()). The emoji's UTF-8 byte sequence contains 0x8F, which is unmappable in cp1252, so it raises:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2391

The fix is explicit encoding="utf-8" - which is also consistent with the very next line that already assumes UTF-8:

template_hash = hashlib.sha256(template_text.encode("utf-8")).hexdigest()

This is production code - without this fix,
_load_query_template()
(and therefore the entire pipeline) would crash on Windows.

Why it's safe on Linux: Passing encoding="utf-8" on a system that already defaults to UTF-8 is a no-op - identical behavior.

library/tests/iqb/queries_test.py

  • 12 lines (test code)
    All 12 changes are identical - same fix as above:
# Before
content = query_file.read_text()
# After
content = query_file.read_text(encoding="utf-8")

Why: These tests read the same SQL files that contain ⚠️ emojis. Same cp1252 crash on Windows, same fix. Applied to both
downloads_by_country.sql
and uploads_by_country.sql tests for consistency (both files contain the emoji).

library/tests/iqb/cache/cache_test.py

  • lines 4, 25 (test code)
# Before
assert str(cache.data_dir) == "/custom/path"

# After
from pathlib import Path
assert cache.data_dir == Path("/custom/path")

Why: cache.data_dir is a pathlib.Path. On Linux, str(Path("/custom/path")) returns "/custom/path" - matches the assertion. On Windows, it returns "\custom\path" (Windows normalizes forward slashes to backslashes), so the string comparison fails.

The fix compares Path objects directly. Path("/custom/path") == Path("/custom/path") is True on all platforms because both sides go through the same OS-level normalization. This tests the actual intent - "does the cache store the correct path?" - rather than "does the path's string representation look a certain way on this OS?

@bassosimone
Copy link
Collaborator

After listing our project for GSoC, we received a large amount of pull requests across several repositories. We are dealing with the backlog, but this would take time. We will get back to this pull request eventually. In the meanwhile, if you are a GSoC applicant, please read our updated GSoC policy: https://github.com/m-lab/gsoc/.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tests fail on Windows: UnicodeDecodeError when reading SQL query templates

2 participants