-
-
Notifications
You must be signed in to change notification settings - Fork 83
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
New: apply unicode normalization while resolving notes
The unicode standard allows for certain (visually) identical characters to be represented in different ways. For example the character ä may be represented as a single combined codepoint "Latin Small Letter A with Diaeresis" (U+00E4) or by the combination of "Latin Small Letter A" (U+0061) followed by "Combining Diaeresis" (U+0308). When encoded with UTF-8, these are represented as respectively the two bytes 0xC3 0xA4, and the three bytes 0x61 0xCC 0x88. A user linking to notes with these characters in their titles would expect these two variants to link to the same file, given they are visually identical and have the exact same semantic meaning. The unicode standard defines a method to deconstruct and normalize these forms, so that a byte comparison on the normalized forms of these variants ends up comparing the same thing. This is called Unicode Normalization, defined in Unicode® Standard Annex #15 (http://www.unicode.org/reports/tr15/). The W3C Working Group has written an excellent explanation of the problems regarding string matching, and how unicode normalization helps with this process: https://www.w3.org/TR/charmod-norm/#unicodeNormalization With this change, obsidian-export will perform unicode normalization (specifically the C (or NFC) normalization form) on all note titles while looking up link references, ensuring visually identical links are treated as being similar, even if they were encoded as different variants. A special thanks to Hans Raaf (@oderwat) for reporting and helping track down this issue. --- Closes #126
- Loading branch information
Showing
3 changed files
with
281 additions
and
11 deletions.
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters