This repo is for my text exploratory data analysis (EDA).

GiovanniStephens/text-EDA

text-EDA

This repo is for my text exploratory data analysis (EDA). It is currently not being maintained or used, and it will need some work before it is useful again.

Things to include:

  • Word counts (This can be done without SpaCy.)
  • Part of speech counts (SpaCy)
  • Number of documents
  • Number of sentences (SpaCy or another library)
  • Top N entities (SpaCy)
  • Distribution of entity types (SpaCy)
  • Top N words for each part of speech (SpaCy)
  • Sentiment distribution from each document (NLTK)
  • Top bigrams and trigrams (SpaCy or something else)
  • Top N Noun chunks (SpaCy)
  • Most salient keywords
  • Word density - average length of the words used in the headline
  • Punctuation count
  • Upper-case to lower-case word ratio - ratio of upper-case words to lower-case words in the text
  • Cluster labels, so that features can be analysed per cluster from an unsupervised clustering of the documents
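
Several of the features above (word counts, word density, punctuation count, the upper-case to lower-case ratio, and top bigrams or trigrams) need nothing beyond the Python standard library. The following is a minimal sketch, not code from this repo: `surface_features`, `top_ngrams`, and the regex tokeniser `WORD_RE` are hypothetical names, and the tokenisation is deliberately crude (SpaCy would do better).

```python
import re
import string
from collections import Counter

# Assumption: a "word" is a run of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def surface_features(text: str) -> dict:
    """Compute simple per-document features without any NLP library."""
    words = WORD_RE.findall(text)
    # Count fully upper-case words (ignoring single letters such as "I")
    # and fully lower-case words; mixed-case words count as neither.
    upper = sum(1 for w in words if w.isupper() and len(w) > 1)
    lower = sum(1 for w in words if w.islower())
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
        "upper_to_lower_ratio": upper / lower if lower else 0.0,
    }


def top_ngrams(text: str, n: int = 2, k: int = 3) -> list:
    """Return the k most common n-grams as (ngram, count) pairs.

    Note: n-grams are built over the whole text, so they can span
    sentence boundaries.
    """
    tokens = [w.lower() for w in WORD_RE.findall(text)]
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)


features = surface_features("NASA launches new rocket. The rocket is BIG!")
bigrams = top_ngrams("the cat sat on the cat mat", n=2, k=1)
```

Calling `top_ngrams(text, n=3)` gives trigrams in the same way. The entity, part-of-speech, and noun-chunk features still need SpaCy or a similar library.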

To-do

  1. Update the EDA to include all the extracted features for each document at row level, comma-separated within each column. That way, when pulling the data into a visualisation tool like Power BI, you can simply build a word cloud from a column, split by category, to get per-category features (e.g. all keyphrases by category).
  2. Something to do with topic modelling.
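
The row-level layout described in the first to-do item could look roughly like the sketch below. It is only an illustration under stated assumptions: the `documents` list, the column names, and the helpers `to_row` and `write_feature_table` are all hypothetical, and in practice the feature lists would come from the extraction steps rather than being hard-coded.

```python
import csv
import io

# Hypothetical per-document features (illustrative data only).
documents = [
    {"doc_id": 1, "category": "sports",
     "keyphrases": ["world cup", "final"], "entities": ["FIFA"]},
    {"doc_id": 2, "category": "tech",
     "keyphrases": ["open source"], "entities": []},
]

# Columns whose values are lists and should be comma-separated in one cell.
LIST_COLUMNS = ("keyphrases", "entities")


def to_row(doc: dict) -> dict:
    """Flatten one document so each list feature becomes a single string."""
    row = dict(doc)
    for col in LIST_COLUMNS:
        row[col] = ", ".join(doc.get(col, []))
    return row


def write_feature_table(docs, handle) -> None:
    """Write one CSV row per document to an open text handle."""
    fieldnames = ["doc_id", "category", *LIST_COLUMNS]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    for doc in docs:
        writer.writerow(to_row(doc))


# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
write_feature_table(documents, buffer)
csv_text = buffer.getvalue()
```

Because the `csv` module quotes any cell that contains a comma, the comma-separated feature lists stay within a single column when the file is loaded into another tool.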
