Feature/qualitative analysis notebook #64

Draft

tanhaow wants to merge 4 commits into develop from feature/qualitative-analysis-notebook
Conversation

@tanhaow tanhaow commented Apr 6, 2026

Associated Issue(s): resolves #61

Changes in this PR

Please find the PDF version of the notebook here.

Notes

The PDF export doesn't look very good. I tried both the document and slides export formats, and the document format looks slightly better. I still recommend viewing the notebook by running it in Marimo.

Reviewer Checklist

  • Review the notebook

@tanhaow tanhaow marked this pull request as draft April 6, 2026 13:34
@tanhaow tanhaow requested a review from laurejt April 7, 2026 14:04

@laurejt laurejt left a comment


I was unable to run the notebook because something is wrong with the eval data: the tr_id values do not appear to match the machine translations on the Drive (perhaps the eval data was generated from the wrong files?).

Once the data issue is fixed, please update the notebook so that a variable pointing to the Phase 1 data directory can be provided at the top. All subsequent paths can then assume the canonical organization (and names) dictated by TigerData and the project Drive (which will likely become the staler copy).
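A minimal sketch of the requested pattern, assuming a single configurable root with all other paths derived from it (the directory and file names below are illustrative assumptions, not the actual TigerData layout):

```python
from pathlib import Path

# The only value a user should need to edit: the Phase 1 data root.
DATA_DIR = Path("data/phase-1")

# All other paths derive from DATA_DIR, following the (assumed) canonical
# layout. Names here are placeholders, not the real directory names.
TRANSLATIONS_DIR = DATA_DIR / "translations"
EVALS_DIR = DATA_DIR / "evaluations"


def eval_path(segment: str) -> Path:
    """Build the path to a (hypothetical) eval file for a segment type."""
    return EVALS_DIR / f"{segment}_evals.csv"


print(eval_path("paragraph"))
```

With this structure, pointing the notebook at a different copy of the data (local checkout vs. TigerData mount) only requires changing `DATA_DIR`.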

Also, to make this notebook runnable: update the pyproject.toml so that it includes the new package dependencies, and tell ruff to ignore files in the notebooks directory (we did something similar for remarx).

- Add DATA_DIR variable pointing to Phase 1 data directory
- Update all data paths to canonical Phase 1 structure
- Add scipy and pandas to pyproject.toml dependencies
- Exclude notebook from ruff linting
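A sketch of what the pyproject.toml changes above might look like (dependency names come from the commit list; the exact table layout and exclude pattern are assumptions about this repo's config):

```toml
# Hypothetical fragment; merge into the existing [project] table.
[project]
dependencies = [
    "scipy",
    "pandas",
]

# Skip linting for notebook files, similar to what was done for remarx.
[tool.ruff]
extend-exclude = ["notebooks/*"]
```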

laurejt commented Apr 8, 2026


The tr_ids do not match for the paragraph data, which suggests that the wrong data was evaluated (i.e., not the translation files available on Google Drive / TigerData).
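A mismatch like this can be surfaced with a small diagnostic, e.g. a set difference over the tr_id column (the column name comes from the discussion above; the toy DataFrames here are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the real files: translations from Drive/TigerData
# and the eval data under review. The tr_id values are invented.
translations = pd.DataFrame({"tr_id": ["t1", "t2", "t3"], "text": ["a", "b", "c"]})
evals = pd.DataFrame({"tr_id": ["t2", "t4"], "score": [0.9, 0.4]})

# tr_ids present in the eval data but absent from the translations;
# any non-empty result signals that the wrong files were evaluated.
missing = set(evals["tr_id"]) - set(translations["tr_id"])
print(sorted(missing))
```

Running this against the real paragraph files would show exactly which eval rows have no matching translation.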


tanhaow commented Apr 8, 2026

> I was unable to run the notebook because there is something wrong with the eval data. They do not appear to have matching tr_id with the machine translations on the drive (perhaps the eval data was generated on the wrong files?)
>
> Once the data issue is fixed, please update the notebook so at the top a variable to the phase-1 data directory can be provided. Then all subsequent paths can assume the canonical organization (and names) dictated by TigerData and the project drive (which will likely become the staler copy).
>
> Also, to make this notebook runnable: update the pyproject.toml so that it includes the new package dependencies and tell ruff to ignore files in the notebooks directory (we did something similar for remarx)

> The tr_ids do not match for the paragraph data, which suggests that the wrong data was evaluated (i.e., not the translation files available on Google Drive / TigerData).

@laurejt Thanks for pointing this out. I found that the mismatch happened because I re-ran the full pipeline locally (generating new translations and evaluations together) but only uploaded the new eval data to Google Drive, not the updated translations. The local copies on my machine match each other. I can push the updated translation files to Google Drive to resolve this, unless there's a reason to keep the current versions there.


laurejt commented Apr 8, 2026

Why did you rerun all of the machine translations for the paragraph data? We should not be replacing what we originally created, especially if there weren't any issues with them. As a reminder, running Google TLLM costs money, so we shouldn't be rerunning it when it's unnecessary.
