text-EDA

This repo is for my text exploratory data analysis (EDA). It is not currently maintained or used, and it will need some love before it is handy again.

Things to include:

  • Word counts (This can be done without SpaCy.)
  • Part of speech counts (SpaCy)
  • Number of documents
  • Number of sentences (SpaCy or another library)
  • Top N entities (SpaCy)
  • Distribution of entity types (SpaCy)
  • Top N part-of-speech words (SpaCy)
  • Sentiment distribution from each document (NLTK)
  • Top N bigrams and trigrams (SpaCy or something else)
  • Top N Noun chunks (SpaCy)
  • Most salient key-words
  • Word Density - Average length of the words used in the headline
  • Punctuation Count
  • Upper-case to lower-case word ratio - number of upper-case words divided by the number of lower-case words in the text
  • Cluster labels, so that documents can be analysed according to an unsupervised clustering.
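
Several of the items above (word counts, word density, punctuation count, the upper-case to lower-case ratio, and top bigrams and trigrams) need nothing beyond the standard library. The sketch below is one possible way to compute them, not code from this repo: the function names `basic_text_features` and `top_ngrams`, the regex tokeniser, and the returned keys are all assumptions made for illustration.

```python
import re
import string
from collections import Counter

# Assumption: a "word" is a run of ASCII letters/apostrophes. Swap in a
# SpaCy or NLTK tokeniser if you need something more robust.
WORD_RE = re.compile(r"[A-Za-z']+")


def basic_text_features(text, top_n=3):
    """Return simple per-document features (hypothetical key names)."""
    words = WORD_RE.findall(text)
    # Fully upper-case words (e.g. "NASA"); note single letters like "I" count too.
    upper = [w for w in words if w.isupper()]
    # Fully lower-case words; capitalised words like "The" fall in neither group.
    lower = [w for w in words if w.islower()]
    counts = Counter(w.lower() for w in words)
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "upper_lower_ratio": len(upper) / len(lower) if lower else 0.0,
        "top_words": counts.most_common(top_n),
    }


def top_ngrams(text, n=2, top_n=3):
    """Return the top_n most frequent n-grams as (tuple_of_words, count) pairs."""
    tokens = [w.lower() for w in WORD_RE.findall(text)]
    # zip over n shifted copies of the token list to form consecutive n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(top_n)
```

Calling `basic_text_features(doc)` on each document gives one dictionary per document; `top_ngrams(doc, n=3)` gives trigrams instead of bigrams. The SpaCy- and NLTK-based items (entities, part-of-speech counts, sentiment) are not covered by this sketch.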

To-do

  1. Update the EDA to include all the extracted features for each document at row level, comma-separated within a single column. That way, when pulling the data into a visualisation tool like Power BI, you can build a word cloud from that column, split by category, to get per-category features (e.g. all keyphrases by category).
  2. Something to do with topic modelling.
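
As a rough illustration of to-do item 1, the sketch below writes one row per document, with the extracted keyphrases joined into a single comma-separated column. The column names (`doc_id`, `category`, `keyphrases`) and the function name `write_feature_rows` are hypothetical, and the input is assumed to be a list of dictionaries that already contain the extracted features.

```python
import csv


def write_feature_rows(docs, out):
    """Write one CSV row per document to the file-like object `out`.

    Assumption: each doc is a dict with "doc_id", "category" and a list
    of strings under "keyphrases" (hypothetical field names).
    """
    writer = csv.writer(out)
    writer.writerow(["doc_id", "category", "keyphrases"])
    for doc in docs:
        # Join the list so the whole feature set sits in one cell;
        # csv.writer quotes the cell because it contains commas.
        writer.writerow([doc["doc_id"], doc["category"], ", ".join(doc["keyphrases"])])
```

To write to disk, open the target with `open(path, "w", newline="", encoding="utf-8")` and pass the file object as `out`. The resulting file can then be loaded into a tool such as Power BI and the `keyphrases` column filtered by `category`.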