text-EDA

This repo is for my text exploratory data analysis (EDA). It is not currently maintained or used, and it will need some love before it is handy again.

Things to include:

Word counts (This can be done without SpaCy.)
Part of speech counts (SpaCy)
~~Number of documents~~
Number of sentences (SpaCy or another library)
Top N entities (SpaCy)
Distribution of entity types (SpaCy)
Top N part-of-speech words (SpaCy)
Sentiment distribution from each document (NLTK)
Top N bigrams and trigrams (SpaCy or something else)
Top N Noun chunks (SpaCy)
Most salient key-words
Word Density - Average length of the words used in the headline
Punctuation Count
Upper-Case to Lower-Case Words ratio - number of upper-case words divided by the number of lower-case words in the text
Cluster labels, so that the features above can be analysed per cluster from an unsupervised clustering of the documents.
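
As a starting point, the features in the list that do not need SpaCy (word counts, word density, punctuation count, and the upper-case to lower-case ratio) can be sketched with the standard library alone. This is a minimal sketch, not the repo's implementation: the function name `basic_text_features`, the regex tokeniser, and the returned dictionary keys are all assumptions made for illustration.

```python
import re
import string
from collections import Counter


def basic_text_features(text):
    """Compute a few simple EDA features for one document (hypothetical helper)."""
    # Assumption: a "word" is a run of letters/digits, optionally followed by
    # an apostrophe suffix (e.g. "don't"). A real tokeniser may differ.
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

    # Word counts are case-insensitive.
    word_counts = Counter(w.lower() for w in words)

    # Punctuation count uses the ASCII punctuation set only.
    punctuation_count = sum(1 for ch in text if ch in string.punctuation)

    # Fully upper-case vs fully lower-case words; mixed-case words such as
    # "Hello" fall into neither bucket. Single letters like "I" count as upper.
    upper = sum(1 for w in words if w.isupper())
    lower = sum(1 for w in words if w.islower())

    return {
        "word_counts": word_counts,
        "n_words": len(words),
        # "Word density" here means average word length, as in the list above.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation_count": punctuation_count,
        # None when there are no lower-case words, to avoid dividing by zero.
        "upper_lower_ratio": upper / lower if lower else None,
    }
```

Calling `basic_text_features("NASA launched a rocket. WOW, it flew!")` returns one dictionary per document, which fits the row-per-document layout described in the to-do section.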

To-do

Update the EDA to include all the extracted features for each document at row level, comma-separated within a single column per feature. That way, when pulling the data into a visualisation tool like Power BI, you can simply word-cloud a column by category to get features split by category (e.g. all keyphrases by category).
Something to do with topic modelling.
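
The row-level layout from the first to-do item could look roughly like the sketch below: one row per document, with each multi-valued feature joined into a single comma-separated cell. The column names (`doc_id`, `category`, `keyphrases`) and both helper functions are hypothetical and only illustrate the shape of the output.

```python
import csv
import io


def features_to_rows(docs):
    """Flatten per-document features into one row per document (hypothetical helper)."""
    rows = []
    for doc in docs:
        rows.append({
            "doc_id": doc["doc_id"],
            "category": doc["category"],
            # Multi-valued features are joined into one comma-separated cell.
            "keyphrases": ", ".join(doc["keyphrases"]),
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text; cells containing commas are quoted by csv."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["doc_id", "category", "keyphrases"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because the `csv` module quotes any cell that contains a comma, the joined keyphrases stay in one column when the file is loaded into a visualisation tool.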
