topic_allocation; visualisation not found #5
It seems that visualization can be done (after the fitting is done) as follows:
|
Hey, I just tried to use this algorithm to plot my pyLDAvis graph, but I got this weird error:
It appears to occur whenever the number of topics drops. Do you know any solution for that? |
Yes, sure, it happens when some topic becomes empty. I have a workaround for that. There are also some other issues, so my final code looks as follows:
import math
import pandas as pd
import pyLDAvis

def prepare_data(mgp):
    # vocab, docs and K are globals defined elsewhere in the notebook.
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)
    doc_lengths = [len(doc) for doc in docs]
    # Count how often each term occurs across all documents.
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    # Replace NaN scores with 1/K and all-zero rows with a uniform distribution.
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    # Build the topic-term matrix, one row per cluster.
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum([occurrence for word, occurrence in cluster.items()])
        assert not math.isnan(total)
        # assert total > 0
        if total == 0:
            row = [(1 / len(vocabulary))] * len(vocabulary)  # <--- The discussed workaround is here
        else:
            row = [cluster.get(term, 0) / total for term in vocabulary]
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp):
    # K, alpha, beta, n_iters and now are also globals; they only label the output file.
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    with open(f"gsdmm-pyldavis-{K}-{alpha}-{beta}-{n_iters}-{now}.html", "w") as f:
        pyLDAvis.save_html(vis_data, f)
    return vis_data

vis_data = prepare_visualization_data(mgp)
%matplotlib inline
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data) |
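The empty-topic workaround can be checked in isolation, without gsdmm or pyLDAvis. This is only a minimal sketch, assuming each cluster is a plain dict of word counts (which is how `mgp.cluster_word_distribution` is iterated above); the helper name `topic_term_rows` and the toy data are made up for illustration:

```python
def topic_term_rows(cluster_word_distribution, vocabulary):
    """Turn per-cluster word counts into one probability row per cluster.

    An empty cluster (total count 0) gets a uniform row, so the
    division by zero never happens.
    """
    rows = []
    for cluster in cluster_word_distribution:
        total = sum(cluster.values())
        if total == 0:
            row = [1 / len(vocabulary)] * len(vocabulary)
        else:
            row = [cluster.get(term, 0) / total for term in vocabulary]
        rows.append(row)
    return rows


# Toy data (hypothetical): the second cluster ended up with no documents.
vocabulary = ["cat", "dog", "fish", "bird"]
clusters = [{"cat": 3, "dog": 1}, {}]
rows = topic_term_rows(clusters, vocabulary)
```

Both rows sum to 1 here, including the one for the empty cluster.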
Really appreciate your help dude. Nice job |
It would be great if you make a fork with this method implemented. |
How can I do that?
|
I think the prepare_data function could be a method of the MovieGroupProcess class.
|
It is just for saving the HTML file, so that each version is saved to a different file. |
Got it. Now I'm getting the error with the complex number, I guess.
|
It would be easier to help you debug the issue with the full source code. Do you store it on GitHub? Maybe in some notebook? |
Spent some time solving the issues and getting this working.
import math
from datetime import datetime
import pandas as pd
import pyLDAvis

now = str(datetime.now()).replace(' ', '_')
K = 40
alpha = 0.03
beta = 0.04
n_iters = 30  # no trailing comma, otherwise this becomes the tuple (30,)

def prepare_data(mgp, docs, K):
    # vocab is still a global defined elsewhere in the notebook.
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)
    doc_lengths = [len(doc) for doc in docs]
    # Count how often each term occurs across all documents.
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    # Replace NaN scores with 1/K and all-zero rows with a uniform distribution.
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    # Build the topic-term matrix, one row per cluster.
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum([occurrence for word, occurrence in cluster.items()])
        assert not math.isnan(total)
        # assert total > 0
        if total == 0:
            row = [(1 / len(vocabulary))] * len(vocabulary)  # <--- The discussed workaround is here
        else:
            row = [cluster.get(term, 1) / total for term in vocabulary]  # <--- changed from 0 to 1
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp, docs, K, alpha, beta, n_iters, now, save=False):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp, docs, K), sort_topics=False, mds='mmds')  # <--- mds is changed from default to mmds.
    if save:
        with open(f"gsdmm-Clusters-{K}_Alpha-{alpha}_Beta-{beta}_Iterations-{n_iters}--------{now}.html", "w") as f:
            pyLDAvis.save_html(vis_data, f)
    return vis_data

vis_data = prepare_visualization_data(mgp, trigrams, K=K, alpha=alpha, beta=beta, n_iters=n_iters, now=now)
%matplotlib inline
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data) |
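The NaN and all-zero handling for the document-topic scores can also be tried on its own. A minimal sketch with made-up scores; `clean_doc_topic_dists` is a hypothetical helper name, and the input is assumed to be a list of K floats per document, which is how the result of `mgp.score(doc)` is used above:

```python
import math


def clean_doc_topic_dists(doc_topic_dists, K):
    """Replace NaN scores with 1/K and all-zero rows with a uniform row."""
    cleaned = [[1 / K if math.isnan(v) else v for v in d] for d in doc_topic_dists]
    return [d if sum(d) > 0 else [1 / K] * K for d in cleaned]


K = 4
scores = [
    [0.7, 0.1, 0.1, 0.1],  # already a valid distribution
    [float("nan")] * K,    # scoring produced NaN for every topic
    [0.0] * K,             # no topic received any mass
]
cleaned = clean_doc_topic_dists(scores, K)

# Same kind of check as the assertion in prepare_data: no row sums below 0.999.
assert all(sum(row) >= 0.999 for row in cleaned)
```

Note that a row with only some NaN entries would not necessarily sum to 1 after this replacement; the check above only rejects rows whose sum is too small.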
Show the top 5 words by cluster, it helps to make the topic_dict below
top_words(mgp.cluster_word_distribution, top_index, 5)
topic_allocation not found
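`top_words` is not defined anywhere in this thread. A minimal sketch of what such a helper might look like, assuming `cluster_word_distribution` is a list of word-count dicts and `top_index` is an iterable of cluster indices; unlike the call above, this version returns the result instead of printing it, and the toy data is made up:

```python
def top_words(cluster_word_distribution, top_index, n_words):
    """Return the n_words most frequent (word, count) pairs per cluster."""
    result = {}
    for cluster in top_index:
        counts = cluster_word_distribution[cluster]
        ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        result[cluster] = ranked[:n_words]
    return result


# Toy data standing in for mgp.cluster_word_distribution.
distribution = [{"cat": 5, "dog": 2, "fish": 1}, {"car": 4, "bus": 3}]
top = top_words(distribution, top_index=[1, 0], n_words=2)
```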