
n_words: 0 with CJK texts #133

Open
yokyoku-taikan opened this issue Aug 29, 2023 · 1 comment

Comments

@yokyoku-taikan

Hello,

I am currently trying to use TopicsExplorer with a corpus of CJK texts (all of the .txt files are UTF-8 encoded).
After I click on "Train topic model" the program outputs the following error message:

Closing connection to database...
Fetching stopwords...
Cleaning corpus...
Connecting to database...
Insert stopwords into database...
Closing connection to database...
Successfully preprocessed data.
Connecting to database...
Insert token frequencies into database...
Closing connection to database...
Creating topic model...
n_documents: 16
vocab_size: 0
n_words: 0
n_topics: 10
n_iter: 100
all zero row in document-term matrix found
ERROR: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
Redirect to error page...

It seems to me as though TopicsExplorer is unable to recognise CJK tokens/words (cf. vocab_size: 0; n_words: 0). Is there a workaround for this problem?

Thank you in advance!

@severinsimmler
Collaborator

Hi @yokyoku-taikan,

Thank you for reporting this. I'm afraid the regular expression we use to tokenize the text is not well suited to CJK scripts. You could try adding a token_pattern keyword argument here, with your own regex that overrides the default pattern \p{L}+\p{Connector_Punctuation}?\p{L}+:

yield cophi.text.model.Document(content, title)
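As a rough, untested sketch of what such a replacement pattern could look like (assuming the constructor on that line accepts a token_pattern keyword, and using a hypothetical pattern that treats each Han ideograph as its own token while keeping Latin words whole):

```python
import re

# Hypothetical pattern: one token per Han ideograph, Latin words kept whole.
CJK_PATTERN = r"[\u4e00-\u9fff]|[A-Za-z]+"

def cjk_tokenize(text):
    """Split text into single CJK ideographs and whole Latin words."""
    return re.findall(CJK_PATTERN, text)

print(cjk_tokenize("我爱natural language processing"))
# ['我', '爱', 'natural', 'language', 'processing']
```

You would then pass something like token_pattern=CJK_PATTERN at the line above. Treating each ideograph as a token is crude, but it should at least give a non-zero vocabulary.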

That is, if a single regex even makes sense for CJK texts, since they are far more complex to tokenize than, say, English (at least as far as I know).

Alternatively, you could try overriding the tokenization of the Document class that is used for further processing, perhaps something like:

from cophi.text import utils
from cophi.text.model import Document as _Document

def custom_cjk_tokenization(text):
    # maybe use https://github.com/fxsjy/jieba
    ...

class Document(_Document):
    def __init__(
        self,
        text,
        title=None,
        lowercase=True,
        n=None,
        maximum=None,
    ):
        self.text = text
        self.title = title
        self.lowercase = lowercase
        if n is not None and n < 1:
            raise ValueError("Arg 'n' must be at least 1.")
        self.n = n
        self.maximum = maximum
        self.tokens = custom_cjk_tokenization(text)
        if self.lowercase:
            self.tokens = utils.lowercase_tokens(self.tokens)

Note that these changes require running the whole application locally rather than using the provided executables.
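One way to fill in the custom tokenizer stub above (name normalized to custom_cjk_tokenization) is sketched below. This is only an assumption about how you might wire it up: jieba's word segmentation is shown as one option, with a naive one-ideograph-per-token fallback when jieba is not installed.

```python
import re

try:
    import jieba  # https://github.com/fxsjy/jieba

    def custom_cjk_tokenization(text):
        # jieba.lcut returns a list of segmented words;
        # drop whitespace-only tokens.
        return [tok for tok in jieba.lcut(text) if tok.strip()]
except ImportError:
    def custom_cjk_tokenization(text):
        # Naive fallback: one token per Han ideograph,
        # Latin words kept whole.
        return re.findall(r"[\u4e00-\u9fff]|[A-Za-z]+", text)
```

Either branch preserves all characters of the input, so downstream token counts (vocab_size, n_words) should no longer be zero for CJK input.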
