Improved accuracy for small documents #100
I’ve been meaning to backport some of the nice additions from https://github.com/greyblake/whatlang-rs#how-is_reliable-calculated. I believe you’re right that n-grams are less suited to small input. However, an n-gram based approach has the nice benefit of supporting many more languages, including some that are in danger of going extinct. A different idea to investigate is #83.
That sounds potentially like a good hint, although if a letter is not present in the known trigrams/alphabet at all, the trigrams from the document containing it will already be ranked pretty poorly. It might be worth ranking them worse than other non-matching trigrams that at least use the right letters, though 🤔 I'm not sure how much of a difference that would make in practice. I wonder if one could "just" feed all known words of all known languages, straight from the dictionaries used for spell checking, to some neural network; then one could classify all the words in the document and average out the output probabilities for each language 🤔 The problem is I have no clue whether that would work at all (I think that way one wouldn't take advantage of the fact that words like "the" are used all the time in English, and it may introduce some bias toward languages with gazillions of words), and I know nothing about machine learning. It's interesting how CLD3 is able to segment a document into sections where different languages are being used; I haven't checked how well that works though, and I have no clue how it works either.
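For what it's worth, a minimal sketch of that extra-penalty idea in plain JavaScript (not franc's actual code; `profile` and `alphabet` are hypothetical per-language inputs):

```js
// Rank-based trigram distance with an extra penalty for trigrams that
// contain letters foreign to the language's alphabet.
const MAX_DIFFERENCE = 300

function trigrams(value) {
  // Simplified cleanup: keep letters and spaces only.
  const padded = ' ' + value.toLowerCase().replace(/[^\p{L} ]/gu, '') + ' '
  const result = []
  for (let i = 0; i < padded.length - 2; i++) result.push(padded.slice(i, i + 3))
  return result
}

function distance(docTrigrams, profile, alphabet) {
  let total = 0
  for (const gram of docTrigrams) {
    const rank = profile.indexOf(gram)
    if (rank !== -1) {
      total += rank // known trigram: penalize by its rank
    } else if ([...gram].every((ch) => ch === ' ' || alphabet.has(ch))) {
      total += MAX_DIFFERENCE // unseen trigram, but made of known letters
    } else {
      total += MAX_DIFFERENCE * 2 // uses letters this language doesn't have
    }
  }
  return total // lower is better
}
```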
I’m probably remembering the details wrong, but I think this is how Google is very fast, on tiny input, at detecting for example the "Turkish I" (or so).
Yep, indeed. That’s what I’m afraid of too.
I’d guess that they don’t just do paragraphs, as it’s possible to have one phrase of French words in an otherwise English sentence. Then I’d score each phrase. Then there should be some bias toward the repeatedly occurring languages: if a document is mostly French and English, it’s unlikely that one sentence is Scottish instead of English.
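A rough sketch of that idea, assuming some hypothetical per-phrase detector `detectAll` that returns ranked `[language, probability]` pairs:

```js
// Score each phrase independently, then re-score with a document-level
// prior so rare one-off guesses get pulled toward the dominant languages.
function detectPhrases(phrases, detectAll, biasWeight = 0.5) {
  const scored = phrases.map((phrase) => detectAll(phrase))

  // Prior: the share of phrases each language wins outright.
  const prior = {}
  for (const guesses of scored) {
    const [lang] = guesses[0]
    prior[lang] = (prior[lang] || 0) + 1 / scored.length
  }

  // Mix each phrase's own score with the document prior and pick the winner.
  return scored.map((guesses) =>
    guesses
      .map(([lang, p]) => [lang, (1 - biasWeight) * p + biasWeight * (prior[lang] || 0)])
      .sort((a, b) => b[1] - a[1])[0]
  )
}
```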
This would be a great addition. I'm doing an MBA and I use Obsidian to store my notes in English, Spanish, and French, but I only extract the language from HTML tags, and most of the time they are incorrect. So I would like to send a small sample of the document, around 140 characters (the summary), and get the correct language back.
In my opinion, the use of translated texts is what diminishes the accuracy of language detection. You don't need texts to be translations of another original text; translations will introduce words that a text originally composed in that language might otherwise not use. A certain bias is introduced with the use of translations, and that bias is the root of the problem.
Yes. But it is also exactly what makes franc, and its support for more languages than anything else, possible. See #100 (comment) and some of the linked discussions :)
Would love this!
I might give the following approach a shot in my spare time:
The idea being that the result might be more accurate because:
Also, the resulting program may be easier to improve: the network could be tweaked, and the underlying dataset could be swapped with a bigger one (I couldn't find a reasonably sized one though, and certainly not one that supports as many languages as the UDHR is translated into). I'll report back any findings 👍 Edit: this may be a good dataset: https://paperswithcode.com/dataset/wili-2018
I'm very excited about your findings! See wooorm/udhr and wooorm/trigrams for inspiration on 1 through 3.
The ML approach may be usable; some findings:
For now I've put the project on hold, but I'll get back to it in the future; it seems promising with the right dataset.
Alright, I've spent some more time on this. I got something, but it doesn't seem super interesting.
Comparing it against CLD3, franc, franc-all, and franc-min, I got the following list of accuracies divided by language:
- afr
- cld3: 0
- franc: 0.6203170028818443
- francAll: 0.5770893371757925
- francMin: 0
- lande: 0
- ara
- cld3: 0
- franc: 0
- francAll: 0
- francMin: 0
- lande: 0.9746333333333334
- aze
- cld3: 0
- franc: 0
- francAll: 0
- francMin: 0
- lande: 0.004921618665694495
- bel
- cld3: 0
- franc: 0.7479899101371591
- francAll: 0.7040832413684377
- francMin: 0.7818855431183982
- lande: 0.8424247201639603
- ben
- cld3: 0
- franc: 0.9493717664449371
- francAll: 0.9493717664449371
- francMin: 0.9493717664449371
- lande: 0.9358832224685883
- bul
- cld3: 0
- franc: 0.5330446396050023
- francAll: 0.5196891820794043
- francMin: 0.6722651665385082
- lande: 0.48059411550447206
- cat
- cld3: 0
- franc: 0.5487940630797774
- francAll: 0.3800865800865801
- francMin: 0
- lande: 0.2902906617192331
- ces
- cld3: 0
- franc: 0.32463333333333333
- francAll: 0.26616666666666666
- francMin: 0.44126666666666664
- lande: 0.732
- ckb
- cld3: 0
- franc: 0
- francAll: 0
- francMin: 0
- lande: 0.9240833333333334
- cmn
- cld3: 0
- franc: 0.5978666666666667
- francAll: 0.5978666666666667
- francMin: 0.5978666666666667
- lande: 0.9497
- dan
- cld3: 0
- franc: 0.4583
- francAll: 0.4345333333333333
- francMin: 0
- lande: 0.7901
- deu
- cld3: 0
- franc: 0.8122333333333334
- francAll: 0.7596
- francMin: 0.9238666666666666
- lande: 0.9237666666666666
- ell
- cld3: 0
- franc: 0.9859666666666667
- francAll: 0.9859666666666667
- francMin: 0.9859666666666667
- lande: 0.991
- eng
- cld3: 0
- franc: 0.5408333333333334
- francAll: 0.4644666666666667
- francMin: 0.7935
- lande: 0.9475333333333333
- est
- cld3: 0
- franc: 0
- francAll: 0
- francMin: 0
- lande: 0
- eus
- cld3: 0
- franc: 0
- francAll: 0.6401423257318454
- francMin: 0
- lande: 0.5725376031052887
- fin
- cld3: 0
- franc: 0.7455666666666667
- francAll: 0.46116666666666667
- francMin: 0
- lande: 0.9574333333333334
- fra
- cld3: 0
- franc: 0.8001333333333334
- francAll: 0.6492
- francMin: 0.8789
- lande: 0.9019333333333334
- hau
- cld3: 0
- franc: 0.8678492849284929
- francAll: 0.8256325632563256
- francMin: 0.9248716538320498
- lande: 0.9421983865053172
- heb
- cld3: 0
- franc: 0.9465333333333333
- francAll: 0.9465333333333333
- francMin: 0
- lande: 0.9901666666666666
- hin
- cld3: 0
- franc: 0.5303505843071786
- francAll: 0.5278130217028381
- francMin: 0.5303505843071786
- lande: 0.7825041736227045
- hrv
- cld3: 0
- franc: 0.2578211262421789
- francAll: 0.18862716231137283
- francMin: 0.32664703717335297
- lande: 0
- hun
- cld3: 0
- franc: 0.7023
- francAll: 0.6457666666666667
- francMin: 0.7784
- lande: 0.9340333333333334
- hye
- cld3: 0
- franc: 0.9631589261218891
- francAll: 0.9631589261218891
- francMin: 0
- lande: 0.9102488732118362
- ind
- cld3: 0
- franc: 0.3552997221119492
- francAll: 0.3319344411047468
- francMin: 0.39426076107298813
- lande: 0.8465944535813531
- isl
- cld3: 0
- franc: 0
- francAll: 0.7015968063872255
- francMin: 0
- lande: 0.8826193766313527
- ita
- cld3: 0
- franc: 0.5973333333333334
- francAll: 0.3994666666666667
- francMin: 0.7073333333333334
- lande: 0.883
- jpn
- cld3: 0
- franc: 0.9564666666666667
- francAll: 0.9564666666666667
- francMin: 0.9564666666666667
- lande: 0.955
- kat
- cld3: 0
- franc: 0.9072665479851109
- francAll: 0.9072665479851109
- francMin: 0
- lande: 0.9705453956950963
- kaz
- cld3: 0
- franc: 0.8507681053401609
- francAll: 0.7539624481833699
- francMin: 0.9290416971470373
- lande: 0
- kor
- cld3: 0
- franc: 0.8159621948017852
- francAll: 0.8159621948017852
- francMin: 0.8159621948017852
- lande: 0.9372538724074561
- lit
- cld3: 0
- franc: 0.5288
- francAll: 0.42156666666666665
- francMin: 0
- lande: 0.9143666666666667
- mar
- cld3: 0
- franc: 0.7049666666666666
- francAll: 0.6956
- francMin: 0.7049666666666666
- lande: 0.9022333333333333
- mkd
- cld3: 0
- franc: 0.5904666666666667
- francAll: 0.5827333333333333
- francMin: 0
- lande: 0.7509333333333333
- nld
- cld3: 0
- franc: 0.5688666666666666
- francAll: 0.5407333333333333
- francMin: 0.8422
- lande: 0.9222
- nob
- cld3: 0
- franc: 0.25186988009022915
- francAll: 0.23881040009497803
- francMin: 0
- lande: 0.32286596224623054
- pes
- cld3: 0
- franc: 0.45455898771864534
- francAll: 0.45199106810569406
- francMin: 0.9215481950130257
- lande: 0.955563825828061
- pol
- cld3: 0
- franc: 0.6814333333333333
- francAll: 0.6161333333333333
- francMin: 0.7497333333333334
- lande: 0.9480666666666666
- por
- cld3: 0
- franc: 0.5497333333333333
- francAll: 0.44143333333333334
- francMin: 0.7293666666666667
- lande: 0.8747666666666667
- ron
- cld3: 0
- franc: 0.5804217174289498
- francAll: 0.4924111235611694
- francMin: 0.6998404128892058
- lande: 0.8934501375165529
- run
- cld3: 0
- franc: 0.33597285067873306
- francAll: 0.290158371040724
- francMin: 0.4326923076923077
- lande: 0.016025641025641024
- rus
- cld3: 0
- franc: 0.45686666666666664
- francAll: 0.4451
- francMin: 0.48556666666666665
- lande: 0.8157
- slk
- cld3: 0
- franc: 0.2931611117518164
- francAll: 0.24414715719063546
- francMin: 0
- lande: 0.5831507323261447
- spa
- cld3: 0
- franc: 0.49323333333333336
- francAll: 0.26563333333333333
- francMin: 0.6714333333333333
- lande: 0.8227
- srp
- cld3: 0
- franc: 0.25716666666666665
- francAll: 0.1847
- francMin: 0.3091333333333333
- lande: 0.6424
- swe
- cld3: 0
- franc: 0.4616
- francAll: 0.42083333333333334
- francMin: 0.6951333333333334
- lande: 0.8498666666666667
- tgl
- cld3: 0
- franc: 0.40868407032498666
- francAll: 0.3979754928076718
- francMin: 0.630207778369739
- lande: 0.9145444858817262
- tur
- cld3: 0
- franc: 0.4473
- francAll: 0.26206666666666667
- francMin: 0.5488
- lande: 0.9459333333333333
- ukr
- cld3: 0
- franc: 0.5976333333333333
- francAll: 0.5812666666666667
- francMin: 0.6293333333333333
- lande: 0.7911666666666667
- vie
- cld3: 0
- franc: 0.7861802255148634
- francAll: 0.6792470412822663
- francMin: 0.8431180691454664
- lande: 0.9648681390364365

Looking at it, it seems to have ignored some smaller languages; I guess it was more useful for it to instead focus on the languages that have more sentences. That's something I should try to address somehow 🤔 In summary, I made some progress. It would be interesting to try this approach using the UDHRs as the dataset, but so far it doesn't seem to have turned out particularly well or to be particularly promising.
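For reference, the kind of harness that could produce numbers like these is simple; a sketch, assuming a labeled `sentences` array of `{ text, lang }` pairs and a `detectors` map from name to a `text -> ISO 639-3` function (both hypothetical):

```js
// Per-language accuracy for each detector over a labeled sentence set.
function compareAccuracy(sentences, detectors) {
  const results = {}
  for (const { text, lang } of sentences) {
    results[lang] ||= {}
    for (const [name, detect] of Object.entries(detectors)) {
      results[lang][name] ||= { correct: 0, total: 0 }
      results[lang][name].total += 1
      if (detect(text) === lang) results[lang][name].correct += 1
    }
  }
  // Collapse counts into accuracies.
  for (const lang of Object.keys(results)) {
    for (const name of Object.keys(results[lang])) {
      const { correct, total } = results[lang][name]
      results[lang][name] = correct / total
    }
  }
  return results
}
```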
Interesting stuff!
Some more findings, in case they could be useful:
Accuracy output:
Benchmark:
lol, famous last words. I thought I was already encoding weights with 2 bytes rather than 4, so I tried to add a way to encode them with just 1 byte, hoping that the accuracy wouldn't go down too much. Turns out I was still using 4 bytes, so now all the weights take 25% of the space. As a result the library is just a tiny bit less accurate, but it now weighs 78.7 kB, which makes it much more interesting imo.
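For the curious, a minimal sketch of that kind of 1-byte weight quantization (an assumption about the general technique, not the library's actual encoding):

```js
// Map float32 weights onto 0..255 with a per-tensor offset and scale.
function quantize(weights) {
  let min = Infinity
  let max = -Infinity
  for (const w of weights) {
    if (w < min) min = w
    if (w > max) max = w
  }
  const scale = (max - min) / 255 || 1 // avoid division by zero
  const data = new Uint8Array(weights.length)
  for (let i = 0; i < weights.length; i++) {
    data[i] = Math.round((weights[i] - min) / scale)
  }
  return { min, scale, data } // 1 byte per weight plus two floats
}

function dequantize({ min, scale, data }) {
  return Float32Array.from(data, (byte) => byte * scale + min)
}
```

The accuracy cost comes from the rounding: every weight in a tensor is snapped to one of 256 evenly spaced values between that tensor's min and max.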
Last (?) update:
Basically, after some tweaks it seems to have turned out pretty well. If anybody would like to spend the time needed to turn the UDHRs into a dataset, and if @wooorm is interested in that, I'd be happy to train dedicated models that do what franc/franc-all/franc-min can do. I'd guess that for roughly similar bundle sizes this approach could deliver much higher accuracies, if there's enough data to learn from in the UDHRs, so it'd be interesting to try. Updated comparison:
@fabiospampinato Wow! Nice job. I'm testing the accuracy of your library vs franc in a project, but at the moment there are no relevant differences. For me, the problem with franc is accuracy; the weight is not a problem in my case.
If the weight is not a problem, CLD3 should be pretty good. If even that is not good enough, depending on your use case maybe you'd have to customize my thing or something: one could make the model bigger, train it on more examples, support fewer languages, look at quadgrams too, etc. All of that should drive accuracy up.
Original language, I believe. I'm using this.
@porkopek I've just made it slightly more accurate by also taking into account the top 100 quadgrams. I seem to have hit some kind of ceiling though; I can't get to 90% accuracy by just making the network bigger 🤔 though I haven't tried increasing multiple numbers at the same time.
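A small sketch of how those top quadgrams could be picked from a training corpus (a hypothetical helper, not the library's code):

```js
// Count every character n-gram across the corpus and keep the most frequent.
function topNgrams(texts, n, limit) {
  const counts = new Map()
  for (const text of texts) {
    const value = ' ' + text.toLowerCase() + ' '
    for (let i = 0; i <= value.length - n; i++) {
      const gram = value.slice(i, i + n)
      counts.set(gram, (counts.get(gram) || 0) + 1)
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([gram]) => gram)
}

// For example: const quadgrams = topNgrams(corpus, 4, 100)
```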
The conversation currently is about how different projects, specifically around neural networks, work. This issue is about improving franc. Can questions about other specific tools be discussed elsewhere? Exploratory work is of course interesting, so perhaps discuss it in other places, and link to those here?
I'd like to play with patching franc, or making some alternative to it, that can detect the language of small documents much more accurately.
First of all, is this something that could be interesting to merge into franc itself?
Secondly, I'm almost clueless about language classification; could trying the following things make sense?
From a shallow reading of this paper on n-grams, it sounds to me like n-grams may be fundamentally ill-suited for short documents, because there just isn't enough data to reliably reconstruct the top 300 or whatever n-grams from them, maybe 🤔
CLD3 seems to feed unigrams, bigrams, and trigrams to some neural network, and that seems to work much better for smaller texts somehow; I'm not sure how or why, but maybe that's the way to go.
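Going by CLD3's public description (so an assumption about its internals, not its actual code), the network input could look something like hashed n-gram count vectors; a loose sketch:

```js
// Hash each n-gram into a fixed number of buckets and build a normalized
// count vector that a small feed-forward network could consume.
function ngramFeatures(text, n, buckets) {
  const vector = new Float32Array(buckets)
  const value = ' ' + text.toLowerCase() + ' '
  let total = 0
  for (let i = 0; i <= value.length - n; i++) {
    const gram = value.slice(i, i + n)
    let hash = 0
    for (const ch of gram) hash = (hash * 31 + ch.codePointAt(0)) >>> 0
    vector[hash % buckets] += 1
    total += 1
  }
  if (total > 0) for (let i = 0; i < buckets; i++) vector[i] /= total
  return vector
}

// Concatenate unigram, bigram, and trigram features as the network input:
// const input = [1, 2, 3].map((n) => ngramFeatures(text, n, 256))
```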
Any other ideas that I should try?