Exclude not working for apostrophes #27

Open
drummerroma opened this issue Oct 16, 2017 · 5 comments

@drummerroma

w = "L’Unesco a perdu tout crédit et Donald Trump en tire les conséquences"

tokeniser = WordsCounted::Tokeniser.new(w).tokenise(exclude: "’")

[ [ 0] "l", [ 1] "unesco", [ 2] "a", [ 3] "perdu", [ 4] "tout", [ 5] "crédit", [ 6] "et", [ 7] "donald", [ 8] "trump", [ 9] "en", [10] "tire", [11] "les", [12] "conséquences" ]

@abitdodgy
Owner

@drummerroma interesting. Are you sure it's the same character in the text and exclude clause?

@drummerroma
Author

drummerroma commented Oct 16, 2017

I copied and pasted it in the Rails console, so I am sure.
I have French (or Italian) text pasted from Word (or another word processor), and I would like phrases containing an apostrophe to be counted as a single token. For example:
l'appartamento
l'Unesco
l’abandon
Maybe there is a difference between "'" and "’"?
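
A quick way to check whether two look-alike characters are really the same is to print their code points. This is a minimal sketch in plain Ruby; the variable names are only illustrative.

```ruby
# Compare the two characters that look alike in the examples above.
straight = "'"        # U+0027, typed on most keyboards
curly    = "\u2019"   # U+2019, often inserted by word processors

puts format("U+%04X", straight.ord)  # prints "U+0027"
puts format("U+%04X", curly.ord)     # prints "U+2019"
puts straight == curly               # prints "false"
```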

@abitdodgy
Owner

@drummerroma yes, there is. These are different characters. The typographic apostrophe is ’ (U+2019), while the straight single quote is ' (U+0027). The trouble is that many people don't use the right characters. For example, they use straight double quotes " instead of “ and ”, the proper quotation marks. For this type of use case, I would exclude both, because you can never be certain what the originally typed text contained or whether an editor transformed it.
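
One way to avoid depending on which character the source text uses is to normalise it before tokenising. The sketch below uses plain Ruby only. `normalise_apostrophes` and `TOKEN_PATTERN` are hypothetical names, and the pattern is an assumption chosen for illustration, not necessarily what the gem uses.

```ruby
# Hypothetical helper: map the typographic apostrophe (U+2019) to the
# straight one (U+0027) so both spellings are treated alike.
def normalise_apostrophes(text)
  text.tr("\u2019", "'")
end

# Illustrative token pattern (an assumption): letters, hyphens and the
# straight apostrophe.
TOKEN_PATTERN = /[\p{Alpha}\-']+/

text = "L\u2019Unesco a perdu"

# Without normalisation the typographic apostrophe splits the word.
p text.downcase.scan(TOKEN_PATTERN)
# prints ["l", "unesco", "a", "perdu"]

# After normalisation the elided form stays in one token.
p normalise_apostrophes(text).downcase.scan(TOKEN_PATTERN)
# prints ["l'unesco", "a", "perdu"]
```

Normalising to the straight apostrophe changes the text that ends up in the tokens, which may or may not be acceptable depending on what the counts are used for.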

@drummerroma
Author

Yes, I agree. But the exclude instruction should work anyway.

@abitdodgy
Owner

I just confirmed that this doesn't exclude the apostrophe but does exclude the single quote. Strange. I'll have to investigate why.
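
Until the cause is known, a possible workaround is to match tokens with a pattern that accepts both apostrophe forms, which keeps the original character in the token. This is a sketch in plain Ruby under that assumption. `BOTH_APOSTROPHES` is a hypothetical name, and the thread does not confirm whether the gem's tokeniser behaves the same way.

```ruby
# Hypothetical pattern: letters, hyphens, and both apostrophe forms
# (U+0027 and U+2019).
BOTH_APOSTROPHES = /[\p{Alpha}\-'\u2019]+/

samples = ["l'appartamento", "l'Unesco", "l\u2019abandon"]

samples.each do |sample|
  tokens = sample.downcase.scan(BOTH_APOSTROPHES)
  puts tokens.length  # prints 1 for each sample
end
```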

@abitdodgy abitdodgy added the bug label Oct 17, 2017