Replace latexcodec with pylatexenc, using braces-all mode #4284

Merged
merged 5 commits into from
Jan 1, 2025

mbollmann
Member

Cleaner duplicate of #4279, going directly into master, to see whether this addresses the current issues with the BibTeX encoding (see #4280).

This PR tests switching to pylatexenc for LaTeX-encoding strings, which is recommended by the latexcodec documentation and is also faster in my testing. I used the "braces-all" mode of pylatexenc, which should hopefully address #4280.
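For readers unfamiliar with pylatexenc, the following is a minimal sketch of encoding a string in "braces-all" mode. It is not the code from this PR: the helper names `_encoder` and `to_bibtex_latex` are hypothetical, and `non_ascii_only=True` is an assumption made here so that plain ASCII text passes through unchanged. It requires the third-party package `pylatexenc`.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _encoder():
    """Build the encoder once (hypothetical helper, not from the PR)."""
    # Third-party dependency, imported lazily: pip install pylatexenc
    from pylatexenc.latexencode import UnicodeToLatexEncoder

    return UnicodeToLatexEncoder(
        # Assumption: leave ASCII characters untouched.
        non_ascii_only=True,
        # Wrap every replacement in braces, e.g. {\'e} rather than \'e.
        replacement_latex_protection="braces-all",
    )


def to_bibtex_latex(text: str) -> str:
    """Encode non-ASCII characters as brace-protected LaTeX macros."""
    return _encoder().unicode_to_latex(text)
```

In this mode every replacement is wrapped in braces, which is where forms such as `{\'\i}` reported later in this thread come from.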


codecov bot commented Jan 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.55%. Comparing base (d4b57ed) to head (716f75b).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4284      +/-   ##
==========================================
+ Coverage   92.54%   92.55%   +0.01%     
==========================================
  Files          32       32              
  Lines        2294     2298       +4     
==========================================
+ Hits         2123     2127       +4     
  Misses        171      171              
Files with missing lines Coverage Δ
python/acl_anthology/utils/latex.py 100.00% <100.00%> (ø)


github-actions bot commented Jan 1, 2025

@mbollmann
Member Author

@danielgildea @nschneid I believe this fixes the issues with the BibTeX encoding. Could you have a look at the preview branch and spot-check whether you see any issues with the way it encodes accents in BibTeX?

@nschneid
Contributor

nschneid commented Jan 1, 2025

How do I check the preview? BibTeX isn't generated for previews, right?

@mbollmann
Member Author

It is generated for the first three papers of each volume. Those are also compiled into anthology.bib.gz.

@danielgildea
Collaborator

Looks great, thank you!

@nschneid
Contributor

nschneid commented Jan 1, 2025

@mbollmann
Member Author

preview.aclanthology.org/fix-bibtex-encoding/2023.cl-2.3 Lu{\'\i}sa - extra backslash?

preview.aclanthology.org/fix-bibtex-encoding/2020.cl-3.3 is a good test case. Also has {\'\i}.

\i is a dotless i, but maybe we shouldn’t have this:

Older versions of LaTeX would not remove the dot on top of the i and j letters when adding a diacritic. To correct this, one had to use the dotless version of these letters, by typing \i and \j. For example:

\^{\i} should be used for i-circumflex î;
\"{\i} should be used for i-umlaut ï.

However, current versions of LaTeX do not need this anymore (and may, in fact, crash with an error).

https://en.wikibooks.org/wiki/LaTeX/Special_Characters#Escaped_codes

Interesting that pylatexenc produces this by default...

@nschneid
Contributor

nschneid commented Jan 1, 2025

Tried compiling a bibliography with {\'\i}, and it gave an error, whereas {\'i} worked.

@mbollmann
Member Author

I added conversion rules plus tests for í ì î ï to use the regular "i" instead of \i.

@nschneid
Contributor

nschneid commented Jan 1, 2025

Great—and what about capitalized equivalents?

@mbollmann
Member Author

Great—and what about capitalized equivalents?

Added to a test case now, which already passes.
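To make the intended output for accented "i" concrete, here is a standard-library-only sketch of the behaviour discussed above. It is an illustration, not the PR's implementation (which adds conversion rules to pylatexenc); the names `ACCENT_MACROS` and `encode_accents` are hypothetical, and only the four accents mentioned in this thread are handled.

```python
import unicodedata

# Combining marks mapped to LaTeX accent macros (assumption: only the
# four accents discussed in this thread).
ACCENT_MACROS = {
    "\u0301": "\\'",  # acute
    "\u0300": "\\`",  # grave
    "\u0302": "\\^",  # circumflex
    "\u0308": '\\"',  # diaeresis
}


def encode_accents(text: str) -> str:
    """Encode accented letters as {\\<accent><letter>}, keeping the regular
    base letter (so 'i' stays 'i' instead of becoming the dotless \\i)."""
    out = []
    # Normalize to precomposed form first so each accented letter is one character.
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in ACCENT_MACROS:
            base, mark = decomposed
            out.append("{" + ACCENT_MACROS[mark] + base + "}")
        else:
            # Anything else is passed through unchanged in this sketch.
            out.append(ch)
    return "".join(out)


print(encode_accents("Luísa"))  # Lu{\'i}sa
print(encode_accents("ÍÌÎÏ"))   # {\'I}{\`I}{\^I}{\"I}
```

Because the base letter is taken from the Unicode decomposition, lowercase and capitalized forms are handled by the same code path.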

@mbollmann mbollmann merged commit 75a6a5b into master Jan 1, 2025
13 checks passed
@mbollmann mbollmann deleted the fix-bibtex-encoding branch January 1, 2025 20:56