Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implementing hashtags per UAX31 standard #40

Merged
merged 30 commits into from
Aug 24, 2023
Merged

Conversation

farooqkz
Copy link
Collaborator

@farooqkz farooqkz commented Aug 9, 2023

closes #8, closes #32

Old

Hello. This is a draft. I am stuck with Rust stuff :)

I need some hints. Like how to pass the sets to hashtag and advice on implementing From<PropertiesError> for CustomError. Furthermore, I've upgraded the rust toolchain to 1.71.0 because icu_properties requires it.

@Simon-Laux
Copy link
Member

icu_properties seems big, I would prefer if we could live without it. Like there is a difference between an emoji regex that checks for each emoji and one that checks for unicode ranges instead. Like if code point is in this range instead of comparing with all code points of that range - if possible that would be faster, less code and less binary size.

@farooqkz
Copy link
Collaborator Author

icu_properties seems big, I would prefer if we could live without it. Like there is a difference between an emoji regex that checks for each emoji and one that checks for unicode ranges instead. Like if code point is in this range instead of comparing with all code points of that range - if possible that would be faster, less code and less binary size.

It turns out we have to extract from here:

https://www.unicode.org/Public/14.0.0/ucdxml/

@Simon-Laux
Copy link
Member

allowing only a subset and expanding later is better that allowing more and restricting later. It does not have to be 100% perfect on the first try

@farooqkz
Copy link
Collaborator Author

icu_properties seems big, I would prefer if we could live without it. Like there is a difference between an emoji regex that checks for each emoji and one that checks for unicode ranges instead. Like if code point is in this range instead of comparing with all code points of that range - if possible that would be faster, less code and less binary size.

Are you sure it's big? The binary size is only about 100 KB larger.

@farooqkz
Copy link
Collaborator Author

Okay we've got two ways:

  • Create our own hashtag content character detector crate which I guess will be about 20-30 KB binary size and use it in this crate.
  • Use icu_properties and make it optional. With this option, the hashtag detection will be done exactly as UAX31 has specified. Without it, we consider any non whitespace character a content character but ~100KB less binary size.

@farooqkz
Copy link
Collaborator Author

Okay. My new estimate is that if we implement our own solution, it'll be only about 10KB additional binary size.

@farooqkz
Copy link
Collaborator Author

Okay. My new estimate is that if we implement our own solution, it'll be only about 10KB additional binary size.

Okay I was wrong. See my latest commit. It's about 36KB :)

Also @Simon-Laux, for some reason I can't tell, the tests are not passing :/

@farooqkz farooqkz marked this pull request as ready for review August 14, 2023 10:51
@Simon-Laux
Copy link
Member

Simon-Laux commented Aug 14, 2023

Also @Simon-Laux, for some reason I can't tell, the tests are not passing :/

if you read the clip error it complains being run from a branch on your fork. But also litemap seems to use an unstable feature.

src/parser/parse_from_text/hashtag_content_char_ranges.rs Outdated Show resolved Hide resolved
src/parser/parse_from_text/text_elements.rs Outdated Show resolved Hide resolved
src/parser/parse_from_text/text_elements.rs Outdated Show resolved Hide resolved
@farooqkz
Copy link
Collaborator Author

Okay. In these final commits:

  • I've moved the function for finding range to its own file alongside with the ranges themselves.
  • I've used RangeInclusive and its contains method instead of [u32; 2].
  • A new enum has been defined as the result of the find range function.

But still the tests for German umlauts fail. Even though the left and right of the assertions seem to be equal:

---- text_to_ast::text_only::german_umlaut_hashtag stdout ----
thread 'text_to_ast::text_only::german_umlaut_hashtag' panicked at 'assertion failed: `(left != right)`
  left: `[Tag("#bücher"), Text(" "), Tag("#Ängste")]`,
 right: `[Tag("#bücher"), Text(" "), Tag("#Ängste")]`', tests/text_to_ast/text_only.rs:114:5

Copy link
Member

@Simon-Laux Simon-Laux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still want a unit test that checks all ranges, and more example tests with different "real" use cases. But other than that the code looks good I'd say (didn't test nor benchmark it yet)

@farooqkz
Copy link
Collaborator Author

I still want a unit test that checks all ranges, and more example tests with different "real" use cases. But other than that the code looks good I'd say (didn't test nor benchmark it yet)

So you want test cases for the range function separately? I will add more real use case tests. But I also need hint from you why these tests fail :/

@Simon-Laux
Copy link
Member

So you want test cases for the range function separately? I will add more real use case tests. But I also need hint from you why these tests fail :/

yes, you can also test for the german umlaute there: ö, ä, ü and the sharp s: ß

@farooqkz
Copy link
Collaborator Author

@Simon-Laux Ready to merge :D

Simon-Laux and others added 2 commits August 23, 2023 15:37
add 2 more unit tests
add space to early exit
@Simon-Laux Simon-Laux merged commit c049958 into deltachat:master Aug 24, 2023
3 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hashtags accept only ASCII characters Better Hashtag parsing (allow emojis and other local characters)
2 participants