topic_allocation; visualisation not found #5

xiaowei-xw · 2020-01-30T10:54:07Z

Show the top 5 words by cluster, it helps to make the topic_dict below

top_words(mgp.cluster_word_distribution, top_index, 5)

topic_allocation is not found.
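For anyone landing here with the same missing names: a minimal sketch of what a top_words helper could look like, assuming mgp.cluster_word_distribution is a list of word-to-count dicts (one per cluster) and top_index is a list of cluster indices. The function body and the toy data are illustrative assumptions, not code taken from the package.

```python
def top_words(cluster_word_distribution, top_clusters, n_words):
    """Return {cluster index: [(word, count), ...]} for the n_words most frequent words."""
    result = {}
    for cluster in top_clusters:
        counts = cluster_word_distribution[cluster]
        # Sort words by descending count and keep the first n_words.
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        result[cluster] = ranked[:n_words]
    return result


# Toy stand-in for mgp.cluster_word_distribution (hypothetical data).
toy_distribution = [
    {"film": 5, "actor": 3, "plot": 1},
    {},  # an empty cluster
    {"goal": 4, "team": 2},
]
top_index = [0, 2]

summary = top_words(toy_distribution, top_index, 2)
for cluster, words in summary.items():
    print(f"Cluster {cluster}: {words}")
```

Returning a dict instead of only printing makes it easier to build a topic_dict from the result afterwards.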

ilya-palachev · 2020-06-11T01:29:55Z

It seems that visualization can be done (after the fitting is done) as follows:

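import pyLDAvis
from tqdm import tqdm

# mgp (the fitted model), docs (tokenised documents) and vocab are assumed to exist already.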
vocabulary = list(vocab)
doc_topic_dists = [mgp.score(doc) for doc in tqdm(docs)]
doc_lengths = [len(doc) for doc in tqdm(docs)]
term_counts_map = {}
for doc in tqdm(docs):
    for term in doc:
        term_counts_map[term] = term_counts_map.get(term, 0) + 1
term_counts = [term_counts_map[term] for term in tqdm(vocabulary)]

matrix = []
for cluster in mgp.cluster_word_distribution:
    total = sum([occurance for word, occurance in cluster.items()])
    row = [cluster.get(term, 0) / total for term in vocabulary]
    matrix.append(row)

vis_data = pyLDAvis.prepare(matrix, doc_topic_dists, doc_lengths, vocabulary, term_counts)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)

Felipehonorato1 · 2020-07-15T23:54:21Z

Hey, I just tried to use this code to plot my pyLDAvis graph, but I got this weird error:

100%|██████████| 10000/10000 [00:10<00:00, 968.73it/s]
100%|██████████| 10000/10000 [00:00<00:00, 1697274.20it/s]
100%|██████████| 10000/10000 [00:00<00:00, 508566.92it/s]
100%|██████████| 5849/5849 [00:00<00:00, 1455588.23it/s]
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-14-d15fc6d7c170> in <module>()
     11 for cluster in mgp.cluster_word_distribution:
     12     total = sum([occurance for word, occurance in cluster.items()])
---> 13     row = [cluster.get(term, 0) / total for term in vocabulary]
     14     matrix.append(row)
     15 

<ipython-input-14-d15fc6d7c170> in <listcomp>(.0)
     11 for cluster in mgp.cluster_word_distribution:
     12     total = sum([occurance for word, occurance in cluster.items()])
---> 13     row = [cluster.get(term, 0) / total for term in vocabulary]
     14     matrix.append(row)
     15 

ZeroDivisionError: division by zero

It appears to occur whenever the number of topics drops. Do you know of any solution for that?

ilya-palachev · 2020-07-16T02:00:15Z

Yes, sure, it happens when some topic becomes empty. I have a workaround for that. There are also some other issues, so my final code looks as follows:

import math

import pandas as pd
import pyLDAvis

# Assumes mgp, docs, vocab, K, alpha, beta, n_iters and now are already defined in the notebook.

def prepare_data(mgp):
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)

    doc_lengths = [len(doc) for doc in docs]
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)
    
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum([occurance for word, occurance in cluster.items()])
        assert not math.isnan(total)
        # assert total > 0
        if total == 0:
            row = [(1 / len(vocabulary))] * len(vocabulary)   # <--- The discussed workaround is here
        else:
            row = [cluster.get(term, 0) / total for term in vocabulary]
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    with open(f"gsdmm-pyldavis-{K}-{alpha}-{beta}-{n_iters}-{now}.html", "w") as f:
        pyLDAvis.save_html(vis_data, f)
    return vis_data

vis_data = prepare_visualization_data(mgp)

%matplotlib inline
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)
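To see the empty-topic workaround from the code above in isolation, here is a small self-contained sketch. The helper name topic_term_row and the toy vocabulary are made up for illustration; only the uniform fallback itself comes from the code above.

```python
def topic_term_row(cluster, vocabulary):
    """Turn one cluster's word counts into a probability row over the vocabulary."""
    total = sum(cluster.values())
    if total == 0:
        # Empty cluster: use a uniform distribution instead of dividing by zero.
        return [1 / len(vocabulary)] * len(vocabulary)
    return [cluster.get(term, 0) / total for term in vocabulary]


vocabulary = ["film", "actor", "goal", "team"]
clusters = [{"film": 3, "actor": 1}, {}]  # the second cluster is empty

matrix = [topic_term_row(cluster, vocabulary) for cluster in clusters]
print(matrix)
```

Both rows sum to 1, so an emptied cluster no longer raises ZeroDivisionError.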

Felipehonorato1 · 2020-07-16T02:01:11Z

Really appreciate your help dude. Nice job

ilya-palachev · 2020-07-16T02:02:38Z

It would be great if you make a fork with this method implemented.

Felipehonorato1 · 2020-07-16T02:05:44Z

How can I do that?
And also, what should 'now' be?

def prepare_visualization_data(mgp):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    with open(f"gsdmm-pyldavis-{K}-{alpha}-{beta}-{n_iters}-{now}.html", "w") as f:
        pyLDAvis.save_html(vis_data, f)
    return vis_data

ilya-palachev · 2020-07-16T02:20:09Z

I think the prepare_data function could be a method of the MovieGroupProcess class.

ilya-palachev · 2020-07-16T02:40:49Z

It is just for saving the HTML file. now can be any string constant, or you can choose a file name without this suffix; I use

from datetime import datetime
now = str(datetime.now()).replace(' ', '_')

so that each version is saved to a different file.
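One caveat: str(datetime.now()) contains colons (for example "2020-07-16 02:40:49.123456"), and colons are not allowed in Windows file names. A possible alternative, sketched here with a hypothetical helper name timestamp_suffix, is to format the timestamp explicitly:

```python
from datetime import datetime


def timestamp_suffix(moment=None):
    """Return a timestamp made only of digits, dashes and underscores."""
    moment = moment or datetime.now()
    return moment.strftime("%Y-%m-%d_%H-%M-%S")


# A fixed datetime is used here so the result is reproducible.
example = timestamp_suffix(datetime(2020, 7, 16, 2, 40, 49))
print(example)  # 2020-07-16_02-40-49

now = timestamp_suffix()  # drop-in value for the file name suffix
```

The format has one-second resolution, so two files saved within the same second would get the same name.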

Felipehonorato1 · 2020-07-16T02:54:59Z

Got it. Now I get an error about a complex number, I guess:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-6141749a10bc> in <module>()
     46     return abs(vis_data)
     47 
---> 48 vis_data = prepare_visualization_data(mgp)
     49 
     50 get_ipython().magic('matplotlib inline')

8 frames
/usr/lib/python3.6/json/encoder.py in default(self, o)
    178         """
    179         raise TypeError("Object of type '%s' is not JSON serializable" %
--> 180                         o.__class__.__name__)
    181 
    182     def encode(self, o):

TypeError: Object of type 'complex' is not JSON serializable

ilya-palachev · 2020-07-16T05:16:42Z

It would be easier to help you debug the issue with the full source code. Do you store it on GitHub, maybe in a notebook?

ernests · 2021-07-16T11:39:00Z

I spent some time solving the issues to get this working.
Here is the final, working code with comments:

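import math
from datetime import datetime

import pandas as pd
import pyLDAvis

# mgp (the fitted model), vocab and trigrams (the tokenised documents) are assumed to exist already.
now = str(datetime.now()).replace(' ', '_')

K = 40
alpha = 0.03
beta = 0.04
n_iters = 30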

def prepare_data(mgp, docs, K):
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)

    doc_lengths = [len(doc) for doc in docs]
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)
    
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum([occurance for word, occurance in cluster.items()])
        assert not math.isnan(total)
        # assert total > 0
        if total == 0:
            row = [(1 / len(vocabulary))] * len(vocabulary)   # <--- The discussed workaround is here
        else:
            row = [cluster.get(term, 1) / total for term in vocabulary] # <--- changed from 0 to 1
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp, 
                               docs, 
                               K, 
                               alpha, 
                               beta, 
                               n_iters, 
                               now, 
                               save = False):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp, docs, K), sort_topics=False, mds='mmds') # <--- mds is changed from default to mmds. 
    if save:
        with open(f"gsdmm-Clusters-{K}_Alpha-{alpha}_Beta-{beta}_Iterations-{n_iters}--------{now}.html", "w") as f:
            pyLDAvis.save_html(vis_data, f)
    return vis_data

vis_data = prepare_visualization_data(mgp, 
                                      trigrams, 
                                      K = K, 
                                      alpha = alpha,
                                      beta = beta,
                                      n_iters = n_iters,
                                      now = now)
%matplotlib inline

pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)
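A side note on the cluster.get(term, 1) change above: every term missing from a cluster then contributes 1 / total, so a row sums to more than 1 whenever the cluster does not contain the whole vocabulary. I have not checked how pyLDAvis treats such rows. If proper probability rows are wanted, one option is additive smoothing followed by normalisation. This is only a sketch; the helper name smoothed_topic_term_row and the pseudo_count parameter are my own assumptions, not part of gsdmm or pyLDAvis.

```python
def smoothed_topic_term_row(cluster, vocabulary, pseudo_count=1.0):
    """Add pseudo_count to every term's count, then normalise the row to sum to 1.

    pseudo_count must be positive so that empty clusters still give a valid row.
    """
    counts = [cluster.get(term, 0) + pseudo_count for term in vocabulary]
    total = sum(counts)
    return [count / total for count in counts]


vocabulary = ["film", "actor", "goal", "team"]

row = smoothed_topic_term_row({"film": 3, "actor": 1}, vocabulary)
empty_row = smoothed_topic_term_row({}, vocabulary)

print(row, sum(row))
print(empty_row)
```

With a positive pseudo_count no entry is ever zero, and an empty cluster falls out as a uniform row without a special case.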

This was referenced Apr 28, 2021

Viz Error with Empty Clusters (pyLDAvis, Int64Index) #10

Closed

TypeError: unhashable type: 'Int64Index' bmabey/pyLDAvis#202

Open
