I am currently trying to use TopicsExplorer with a corpus of CJK texts (all of the .txt files are UTF-8 encoded).
After I click on "Train topic model" the program outputs the following error message:
Closing connection to database...
Fetching stopwords...
Cleaning corpus...
Connecting to database...
Insert stopwords into database...
Closing connection to database...
Successfully preprocessed data.
Connecting to database...
Insert token frequencies into database...
Closing connection to database...
Creating topic model...
n_documents: 16
vocab_size: 0
n_words: 0
n_topics: 10
n_iter: 100
all zero row in document-term matrix found
ERROR: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
Redirect to error page...
It seems to me as though TopicsExplorer is unable to recognise CJK tokens/words (cf. vocab_size: 0; n_words: 0). Is there a workaround for this problem?
Thank you in advance!
Thank you for reporting this. I'm afraid the regular expression we use to tokenize the text is not ideal for CJK texts. You could try adding a token_pattern keyword argument here with your own regex, which overrides the default pattern \p{L}+\p{Connector_Punctuation}?\p{L}+.
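As a rough illustration of what such a custom pattern could look like, here is a minimal sketch using only Python's standard `re` module. It is not the TopicsExplorer code: the names `CJK_CHAR`, `TOKEN_PATTERN` and `tokenize` are hypothetical, the character ranges are an assumption covering only the most common Han and kana blocks, and treating every CJK character as its own token is just one simple strategy (a proper word segmenter may give better topics).

```python
import re

# Assumption: these ranges cover the Hiragana, Katakana, CJK Extension A and
# CJK Unified Ideographs blocks; rarer extension blocks are left out.
CJK_CHAR = r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]"

# Hypothetical pattern: one token per CJK character, otherwise a run of
# non-CJK letters (digits and underscores are never part of a token).
TOKEN_PATTERN = re.compile(rf"{CJK_CHAR}|(?:(?!{CJK_CHAR})[^\W\d_])+")


def tokenize(text):
    """Return the lowercased tokens matched by TOKEN_PATTERN (illustrative helper)."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(tokenize("主題模型 Topic model"))
```

If this approach suits your corpus, the string behind `TOKEN_PATTERN` is what you would try passing as the token_pattern argument. Note that it uses `re` syntax rather than the `\p{...}` properties of the default pattern, so it may need adapting to whichever regex engine the tokenizer uses.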