michael-eble/pdf-analysis-word-extraction-word-frequencies

PDF-Analysis-Extraction-of-Words-and-their-Word-Frequencies

PDF Analysis: Extracting words and their word frequencies from PDF files

Prepares text data for topic analysis of annual reports from German car manufacturers, e.g. Volkswagen, Porsche and Audi. Please note that the script only extracts words; it does not apply stemming. To add stemming, use nltk.stem.snowball.SnowballStemmer('german'), for example.
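
As a rough illustration of the word counting and the suggested stemming step, here is a minimal sketch. It is not the repository's script: the helper names `tokenize`, `word_frequencies` and `german_stemmer` are hypothetical, and the regular-expression tokenizer is an assumption chosen to keep the example free of downloads. Only `SnowballStemmer` comes from NLTK, and it is imported lazily so the counting part runs without it.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words (umlauts and ß included)."""
    # Assumption: a simple regex tokenizer; digits and punctuation are dropped.
    return re.findall(r"[a-zäöüß]+", text.lower())


def word_frequencies(text, stem=None):
    """Count word occurrences; `stem` is an optional callable applied to each word."""
    words = tokenize(text)
    if stem is not None:
        words = [stem(word) for word in words]
    return Counter(words)


def german_stemmer():
    """Return the stem function of NLTK's German Snowball stemmer (requires nltk)."""
    from nltk.stem.snowball import SnowballStemmer  # imported only when needed
    return SnowballStemmer("german").stem


if __name__ == "__main__":
    sample = "Der Umsatz stieg. Der Gewinn stieg auch."
    # Without stemming: raw word counts, most frequent first.
    print(word_frequencies(sample).most_common(3))
```

To count stems instead of surface forms, pass the stemmer in: `word_frequencies(sample, stem=german_stemmer())`. Keeping the stemmer optional makes it easy to compare the two frequency lists side by side.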

A very simple Python script that makes use of PyPDF2 and NLTK. It is provided "as is", meaning that it comes without any warranty, i.e., "use it at your own risk".
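
For orientation, reading the text of a PDF with PyPDF2 could look roughly like the sketch below. It assumes a recent PyPDF2 release that provides `PdfReader`, `reader.pages` and `page.extract_text()`; older releases use different names. The helpers `pages_to_text` and `extract_pdf_text` and the file name in the usage comment are hypothetical and are not taken from the repository's script.

```python
def pages_to_text(pages):
    """Join the extracted text of each page, treating empty pages as ''."""
    # extract_text() may return None or '' for pages without a text layer.
    return "\n".join((page.extract_text() or "") for page in pages)


def extract_pdf_text(path):
    """Read a PDF file and return the text of all its pages as one string."""
    from PyPDF2 import PdfReader  # assumption: PyPDF2 with the PdfReader API

    reader = PdfReader(path)
    return pages_to_text(reader.pages)


# Usage (hypothetical file name):
# text = extract_pdf_text("annual_report.pdf")
```

Text extraction quality depends heavily on how the PDF was produced, so scanned reports without a text layer will yield little or nothing here.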
