This repo is for my text exploratory data analysis (EDA). It is currently not being maintained or used. It will need some love before it is handy again.
- Word counts (This can be done without SpaCy.)
- Part of speech counts (SpaCy)
- Number of documents
- Number of sentences (SpaCy or another library)
- Top N entities (SpaCy)
- Distribution of entity types (SpaCy)
- Top N part-of-speech words (SpaCy)
- Sentiment distribution from each document (NLTK)
- Top N bigrams and trigrams (SpaCy or something else)
- Top N Noun chunks (SpaCy)
- Most salient key-words
- Word Density - Average length of the words used in the headline
- Punctuation Count
- Upper-Case to Lower-Case Words ratio - number of upper-case words divided by the number of lower-case words in the text
- Cluster labels, so that the features above can be analysed per cluster from an unsupervised clustering of the documents.
- Update the EDA to include all the extracted features for each document at a row level. Just comma-separate them. That way, when pulling the data into a visualisation tool like Power BI, you can simply word cloud the column by category to get category split features (e.g. all keyphrases by category).
- Something to do with topic modelling.
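
A few of the items above (word counts, word density, punctuation count, the upper-case to lower-case ratio, and the row-level comma-separated output) need nothing beyond the standard library. The following is only a minimal sketch of how they might fit together, not the code in this repo. The names `basic_text_features`, `to_row` and `write_rows`, the column names, and the simple letters-only tokenisation are all assumptions made for illustration.

```python
import csv
import re
import string
import sys
from collections import Counter


def basic_text_features(text, top_n=3):
    """Compute a handful of library-free features for one document (hypothetical helper)."""
    # Assumption: a "word" is a run of ASCII letters, so "don't" becomes "don" and "t".
    words = re.findall(r"[A-Za-z]+", text)
    upper = sum(1 for w in words if w.isupper())
    lower = sum(1 for w in words if w.islower())
    counts = Counter(w.lower() for w in words)
    return {
        "word_count": len(words),
        # Word density: average length of the words in the text.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
        # None when there are no lower-case words, to avoid dividing by zero.
        "upper_lower_ratio": upper / lower if lower else None,
        # Multi-valued features are comma-separated inside a single column.
        "top_words": ", ".join(w for w, _ in counts.most_common(top_n)),
    }


def to_row(doc_id, category, text, top_n=3):
    """Build one row per document: identifiers first, then the extracted features."""
    row = {"doc_id": doc_id, "category": category}
    row.update(basic_text_features(text, top_n=top_n))
    return row


def write_rows(rows, out):
    """Write the rows as CSV so a visualisation tool can load them."""
    rows = list(rows)
    if not rows:
        return
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    docs = [
        ("doc1", "space", "NASA launches NEW rocket, again!"),
        ("doc2", "pets", "The cat and the hat."),
    ]
    write_rows((to_row(*d) for d in docs), sys.stdout)
```

The SpaCy- and NLTK-based features (entities, noun chunks, sentiment, and so on) could be added as extra columns in the same row dictionary, again comma-separated where a feature has several values per document.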