
n_words: 0 with CJK texts #133

Open
yokyoku-taikan opened this issue Aug 29, 2023 · 1 comment

Comments

@yokyoku-taikan

Hello,

I am currently trying to use TopicsExplorer with a corpus of CJK texts (all of the .txt files are UTF-8 encoded).
After I click on "Train topic model" the program outputs the following error message:

Closing connection to database...
Fetching stopwords...
Cleaning corpus...
Connecting to database...
Insert stopwords into database...
Closing connection to database...
Successfully preprocessed data.
Connecting to database...
Insert token frequencies into database...
Closing connection to database...
Creating topic model...
n_documents: 16
vocab_size: 0
n_words: 0
n_topics: 10
n_iter: 100
all zero row in document-term matrix found
ERROR: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
Redirect to error page...

It seems to me as though TopicsExplorer is unable to recognise CJK tokens/words (cf. vocab_size: 0; n_words: 0). Is there a workaround for this problem?

Thank you in advance!

@severinsimmler
Collaborator

Hi @yokyoku-taikan,

Thank you for reporting this. I'm afraid the regular expression we use to tokenize the text is not well suited to CJK scripts. You could try adding a token_pattern keyword argument here, with your own regex that overrides the default pattern \p{L}+\p{Connector_Punctuation}?\p{L}+:

yield cophi.text.model.Document(content, title)
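As a rough, untested sketch of what such a replacement pattern could look like (assuming the constructor on that line accepts a token_pattern keyword, and using a hypothetical pattern that treats each Han ideograph as its own token while keeping Latin words whole):

```python
import re

# Hypothetical pattern: one token per Han ideograph, Latin words kept whole.
CJK_PATTERN = r"[\u4e00-\u9fff]|[A-Za-z]+"

def cjk_tokenize(text):
    """Split text into single CJK ideographs and whole Latin words."""
    return re.findall(CJK_PATTERN, text)

print(cjk_tokenize("我爱natural language processing"))
# ['我', '爱', 'natural', 'language', 'processing']
```

You would then pass something like token_pattern=CJK_PATTERN at the line above. Treating each ideograph as a token is crude, but it should at least give a non-zero vocabulary.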

That is, if a single regex even makes sense for CJK texts, since they are far more complex to tokenize than, say, English (at least as far as I know).

Alternatively, you could try overriding the tokenization of the Document class that is used for further processing, perhaps something like:

from cophi.text import utils
from cophi.text.model import Document as _Document

def custom_cjk_tokenization(text):
    # maybe use https://github.com/fxsjy/jieba
    ...

class Document(_Document):
    def __init__(
        self,
        text,
        title=None,
        lowercase=True,
        n=None,
        maximum=None,
    ):
        self.text = text
        self.title = title
        self.lowercase = lowercase
        if n is not None and n < 1:
            raise ValueError("Arg 'n' must be at least 1.")
        self.n = n
        self.maximum = maximum
        self.tokens = custom_cjk_tokenization(text)
        if self.lowercase:
            self.tokens = utils.lowercase_tokens(self.tokens)

Note that these changes require running the whole application locally rather than using the provided executables.
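One way to fill in the custom tokenizer stub above (name normalized to custom_cjk_tokenization) is sketched below. This is only an assumption about how you might wire it up: jieba's word segmentation is shown as one option, with a naive one-ideograph-per-token fallback when jieba is not installed.

```python
import re

try:
    import jieba  # https://github.com/fxsjy/jieba

    def custom_cjk_tokenization(text):
        # jieba.lcut returns a list of segmented words;
        # drop whitespace-only tokens.
        return [tok for tok in jieba.lcut(text) if tok.strip()]
except ImportError:
    def custom_cjk_tokenization(text):
        # Naive fallback: one token per Han ideograph,
        # Latin words kept whole.
        return re.findall(r"[\u4e00-\u9fff]|[A-Za-z]+", text)
```

Either branch preserves all characters of the input, so downstream token counts (vocab_size, n_words) should no longer be zero for CJK input.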
