Today I was scratching my head because TNT Search didn’t find pages that definitely contain my search words, until I realized that most of my text contains soft hyphens, which give the renderer hyphenation hints for our looong German words.
Now of course the search won’t find something like Text­datei or Textdatei (invisible U+00AD inside), and a user cannot know how I hyphenated my text.
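To illustrate the mismatch, here is a small standalone sketch (not taken from the plugin; the variable names are only for illustration). It shows that a word containing U+00AD is a different string from the plain word, so a substring lookup fails:

```php
<?php
// Both strings look identical when rendered, but $stored carries U+00AD.
$stored = "Text\u{00AD}datei"; // soft hyphen between "Text" and "datei"
$query  = 'Textdatei';         // what a visitor would type

var_dump($stored === $query);      // bool(false)
var_dump(strlen($stored));         // int(11): 9 ASCII bytes + 2 bytes for U+00AD in UTF-8
var_dump(strlen($query));          // int(9)
var_dump(strpos($stored, $query)); // bool(false): the plain word is not a substring
```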
It’s even worse with ligatures, which are heavily used in German Fraktursatz as well as in Arabic/Persian/Indian languages, to control how a word actually looks. This is mostly done using Unicode U+200C (zero-width non-joiner) and U+200D (zero-width joiner).
Here’s my proposal for better search:
Since we’re already "cleaning" the searched pages in getCleanContent() (in file user/plugins/tntsearch/classes/GravTNTSearch.php), we might as well remove these in-word Unicode control characters before looking for a match.
I have tried this with Grav v1.7.43, Admin v1.10.43 and TNT Search v3.4.0, and it works well, just by adding a removal step for these characters in getCleanContent(). We have to check the four most common use cases for each character, since article editors could use any variant in their Markdown text. Some lucky ones even have keyboards with these characters on them.
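A minimal sketch of such a removal step could look like this. It assumes that the four variants are the literal character, the named entity, the decimal character reference and the hexadecimal character reference. The function name stripInWordControls() is hypothetical and not part of the plugin; in practice the replacement would run on the content string inside getCleanContent().

```php
<?php
// Hypothetical helper, NOT part of the TNT Search plugin: removes soft hyphen,
// zero-width non-joiner and zero-width joiner in four assumed spellings each.
function stripInWordControls(string $content): string
{
    $pattern = '/\x{00AD}|\x{200C}|\x{200D}'        // literal characters
             . '|&(?:shy|zwnj|zwj);'                 // named entities
             . '|&#0*(?:173|8204|8205);'             // decimal references
             . '|&#[xX]0*(?i:AD|200C|200D);/u';      // hex references, any letter case

    // preg_replace() returns null on failure (e.g. invalid UTF-8);
    // fall back to the untouched content in that case.
    return preg_replace($pattern, '', $content) ?? $content;
}

echo stripInWordControls('Text&shy;datei'), "\n"; // Textdatei
```

A single regular expression with the `u` modifier is used here so that the multi-byte characters are matched as code points rather than as raw bytes.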
I think this change will improve the TNT Search Plugin a lot, since it can then find text even if it has been typographically enhanced on the website. Of course one couldn’t search for the replaced entities anymore (like &shy;), but that shouldn’t be a problem, I think.
Strictly speaking, user input from the search box should also have these removed, but I assume a website user would probably never enter a soft hyphen or ligature control in the search box. At least I wouldn’t enter Text&shy;datei, Brot&zwnj;zeit or Auf&zwnj;lage (or use the invisible keys) but would instead use a simple textdatei, brotzeit or auflage for searching.
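If the search box input were to be cleaned as well, a sketch might look like the following. The function name cleanSearchQuery() is hypothetical, and where it would be called in the plugin is left open. It only strips the literal characters, on the assumption that nobody types HTML entities into a search box.

```php
<?php
// Hypothetical helper, NOT part of the TNT Search plugin: strips literal
// U+00AD, U+200C and U+200D from a search query before it is matched.
function cleanSearchQuery(string $query): string
{
    $cleaned = preg_replace('/[\x{00AD}\x{200C}\x{200D}]/u', '', $query);

    // preg_replace() returns null on failure; keep the original query then.
    return $cleaned ?? $query;
}

echo cleanSearchQuery("Brot\u{200C}zeit"), "\n"; // Brotzeit
```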
If there are no objections, I could prepare a pull request.