Auto detect language feature ? #6

Open
Utsaww opened this issue May 20, 2023 · 4 comments

Comments

@Utsaww

Utsaww commented May 20, 2023

It's a really good project that works out of the box, much appreciated!
I was wondering whether a language auto-detection feature is available. It would be really helpful.

@thammegowda
Owner

I agree, language ID would be a great feature.
Many available language ID models (e.g. https://fasttext.cc/docs/en/language-identification.html) recognize fewer than 200 languages, so they fall short of what the NLLB models cover. Are you familiar with any good language ID models that recognize all 200 languages in NLLB?

Pull requests will be greatly appreciated!!

@Utsaww
Author

Utsaww commented May 21, 2023

I am working with fastText language identification, which covers 176 languages. I guess this works for now. If you are good with that, I will work on it and create a pull request.

@thammegowda
Owner

@Utsaww yes, please! Sorry I missed replying to this message.

Suggestion on how to integrate:
here ...

src_lang = args.get('src_lang') or def_src_lang

if src_lang == '[auto]':
  src_lang = <lang_id_detection>(sources)
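
The suggestion above could be fleshed out roughly as follows. This is only a sketch: `resolve_src_lang`, the `detect` callback, and the choice to join all sources into one string before detecting are assumptions for illustration, not code from this project.

```python
from typing import Callable, Mapping, Optional, Sequence

AUTO = '[auto]'  # sentinel value taken from the suggestion above


def resolve_src_lang(args: Mapping[str, Optional[str]],
                     def_src_lang: str,
                     sources: Sequence[str],
                     detect: Callable[[str], str]) -> str:
    """Return the explicit source language, or detect it when '[auto]' is requested."""
    src_lang = args.get('src_lang') or def_src_lang
    if src_lang == AUTO:
        # Hypothetical policy: treat the batch as one document and detect once.
        src_lang = detect(' '.join(sources))
    return src_lang


# Usage with a stand-in detector (a real one would wrap a language ID model).
fake_detect = lambda text: 'eng_Latn'
print(resolve_src_lang({'src_lang': '[auto]'}, 'fra_Latn', ['Hello there'], fake_detect))
```

Passing the detector in as a callback keeps the language ID model swappable, which matters here because no single model is known to cover all NLLB languages.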

@omercandemir

omercandemir commented Aug 28, 2024

This is how I detect the language with fastText, but I haven't tried it here yet. I think I still need to match the language IDs.

    def language_detection_fasttext(self, text: str) -> str:
        """
        Detects the language of a text and returns its ISO language code. Supports 176 languages.
        Uses the fastText model for language detection:
        https://fasttext.cc/blog/2017/10/02/blog-post.html
        https://fasttext.cc/docs/en/language-identification.html
        """
        # Assumes `import os` at module level and an `http_get(url, path)` download helper in scope.
        if self._fasttext_lang_id is None:
            import fasttext
            # Silence useless warning: https://github.com/facebookresearch/fastText/issues/1067
            fasttext.FastText.eprint = lambda x: None
            model_path = os.path.join(self._cache_folder, 'lid.176.ftz')
            if not os.path.exists(model_path):
                http_get('https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz', model_path)
            self._fasttext_lang_id = fasttext.load_model(model_path)

        # predict() rejects newlines, so flatten the text to a single line first.
        cleaned = text.lower().replace("\r\n", " ").replace("\n", " ").strip()
        labels, _scores = self._fasttext_lang_id.predict(cleaned)
        # Labels look like '__label__en'; keep only the language code.
        return labels[0].split('__')[-1]
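
Matching the language IDs could look something like the sketch below, assuming the detector returns ISO codes such as `en` and the translator expects NLLB-style codes such as `eng_Latn`. The table is a small illustrative subset and `to_nllb_code` is a hypothetical helper, not part of fastText or this project.

```python
from typing import Optional

# Illustrative subset only: ISO codes as returned by the detector on the left,
# NLLB (FLORES-200 style) codes on the right. A real table would need many more rows.
ISO_TO_NLLB = {
    'en': 'eng_Latn',
    'fr': 'fra_Latn',
    'de': 'deu_Latn',
    'es': 'spa_Latn',
    'hi': 'hin_Deva',
    'tr': 'tur_Latn',
}


def to_nllb_code(iso_code: str, fallback: Optional[str] = None) -> Optional[str]:
    """Map a detected ISO code to an NLLB code, or return `fallback` if unmapped."""
    return ISO_TO_NLLB.get(iso_code.lower(), fallback)


print(to_nllb_code('en'))
print(to_nllb_code('xx', fallback='eng_Latn'))
```

Returning a fallback instead of raising lets the caller decide what to do with languages the detector knows but the table does not, for example by using the default source language.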
