Vocabulary analyzer

Are you always using the same words in your scientific papers? This is a tool to run a text analysis on PDF files containing English text.

What is it?

It's a simple tool to extract keywords and part-of-speech distributions from a given PDF. It employs a spaCy English model to perform tokenization and named-entity extraction. It's pretty much a word counter that employs standard NLP pre-processing, plus the NER part performed by spaCy. Additionally, it produces a word cloud image.
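The "word counter" idea can be sketched in a few lines. This is only an illustration of the counting step, not the tool's actual code: it uses the standard library instead of spaCy, and the function name top_keywords, the regex tokenizer and the tiny stop-word list are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); spaCy ships a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "we", "on", "for"}


def top_keywords(text, n=20):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    # Lowercase and keep alphabetic runs only: a crude stand-in for real tokenization.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words and very short tokens before counting.
    keywords = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    return Counter(keywords).most_common(n)


sample = "We present a novel method. The method is novel and the results are novel."
print(top_keywords(sample, n=2))  # [('novel', 3), ('method', 2)]
```

With spaCy, the same counting would presumably run over tokens filtered by their part-of-speech tag or over the recognized entities, which is what gives the per-category distributions.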

How to install

This tool has been developed on Ubuntu 18.04 and macOS High Sierra, but has never been seriously tested. It requires Python 3 and virtualenv. With these two installed, simply clone the repo and run source install.sh.

Requirements

The file requirements.txt lists all the required Python packages.

Ubuntu 18.04

Ubuntu should come with Python 3 installed, so just run source install.sh and Bob's your uncle.

macOS

Follow these steps:

Open a Terminal and run xcode-select --install
Log out and back in.
Get Homebrew here: copy/paste the command they provide into a terminal.

At this stage, if you get an error that says

git: error: unable to locate xcodebuild, please make sure the path to the Xcode folder is set correctly!
git: error: You can set the path to the Xcode folder using /usr/bin/xcode-select -switch

follow what's been said here, and run the following in a terminal:

sudo xcode-select -switch /Library/Developer/CommandLineTools

Once Homebrew has been downloaded and installed, you can install Python 3 with:

brew install python

Once Python has been brewed (a.k.a. Python installation finished successfully), you should be able to run pip install virtualenv and finally source install.sh.

Windows

Sorry, I have no clue. I don't even care.

How to run

Once the virtual environment has been created and activated, simply run

python main.py --filepath /path/to/PDF_file

By default, at most 20 words are plotted (nmaxwords = 20). To change this setting, add the --nmaxwords flag, e.g.

python main.py --filepath /path/to/PDF_file --nmaxwords 25
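For reference, a command line like the one above could be parsed as follows. This is a minimal sketch, assuming main.py uses argparse; only the flag names --filepath and --nmaxwords and the default of 20 come from this README, while build_parser and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the two flags shown in this README."""
    parser = argparse.ArgumentParser(description="Analyze the vocabulary of a PDF file.")
    parser.add_argument("--filepath", required=True, help="Path to a searchable PDF file.")
    # The default of 20 matches the behaviour described above.
    parser.add_argument("--nmaxwords", type=int, default=20,
                        help="Maximum number of words to plot.")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--filepath", "/path/to/PDF_file", "--nmaxwords", "25"])
print(args.filepath, args.nmaxwords)
```

Leaving out --nmaxwords in this sketch gives args.nmaxwords equal to 20.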

Considerations

The tool is designed to run only on searchable PDFs, namely PDF files in which the text can be selected and copied. That's it!
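A scanned (image-only) PDF typically yields little or no text when extracted, so one way to fail early is to check the extracted string before analyzing it. The sketch below is a heuristic under stated assumptions: the text is assumed to have been extracted already by whatever PDF library the tool uses, and looks_searchable and its min_letters threshold are hypothetical names and values, not part of the tool.

```python
def looks_searchable(extracted_text, min_letters=50):
    """Heuristic: a searchable PDF should yield a reasonable number of letters."""
    # Count alphabetic characters only, so whitespace and page breaks do not count.
    letters = sum(1 for ch in extracted_text if ch.isalpha())
    return letters >= min_letters


print(looks_searchable(""))            # False: nothing was extracted
print(looks_searchable("word " * 20))  # True: 80 letters were extracted
```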

Results

Here are the sample results obtained by running the tool on my CV, taken from here.

Top 20 keywords

Top 20 nouns

Top 20 adjectives

Top 20 adverbs

Top 20 verbs

Top 20 entities

Top 20 entity types

Word cloud

Acknowledgements

Thanks to the people at spaCy for the NE part, and to the guys who made word cloud for the awesome word-cloud images that can be produced.

fabriziomiano/UzaiKeyFire

Folders and files

Latest commit

History

Repository files navigation

Vocabulary analyzer

What is it?

How to install

Requirements

Ubuntu 18.04

macOS

Windows

How to run

Considerations

Results

Top 20 keywords

Top 20 nouns

Top 20 adjectives

Top 20 adverbs

Top 20 verbs

Top 20 entities

Top 20 entity types

Word cloud

Acknowledgements

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages