NLP-Project-Git-Shoes

Git Shoes!

Project Description

This is a Natural Language Processing project that predicts the main programming language of GitHub repositories from the text of their README.md files. In particular, we chose to go through 100+ of the most-starred repositories related to shoes. The project was conducted at Codeup, Data Science programme, Noether cohort, March 2023.

Project Goals

Determine the programming language of a given GitHub repository.
The repository must contain the word 'shoes'.

Initial Thoughts

We expected Java to be the most common language, followed by Python. As for predictive words, we expected shoe brand names and shoe types to be amongst the most common.

The Plan

Acquire data.
Initial exploration.
Clean data.
Create questions.
Split data.
Explore data on train set.
Model using Classification methods.

Data Dictionary

Feature	Definition
repo	repository of README.md file
readme_contents	text content of README.md file
clean	readme_contents column cleaned by: lowercasing all words, Unicode "NFKD" normalization, encoding to "ASCII", decoding as "UTF-8"
stemmed	all text from the "clean" column stemmed
lemmatized	all text from the "clean" column lemmatized
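
The cleaning steps listed for the "clean" column can be sketched with the standard library alone. This is a minimal illustration, not the project's prepare.py: the function name `basic_clean` and the sample string are assumptions, and the stemming and lemmatizing steps (which typically need an NLP library) are not shown.

```python
import unicodedata


def basic_clean(text: str) -> str:
    """Lowercase, NFKD-normalize, and drop non-ASCII characters.

    Hypothetical helper for illustration; the project's own function
    may be named and structured differently.
    """
    lowered = text.lower()
    # NFKD splits accented characters into a base letter plus combining marks.
    normalized = unicodedata.normalize("NFKD", lowered)
    # Encoding to ASCII with "ignore" drops the combining marks and any other
    # non-ASCII characters; decoding turns the bytes back into a str.
    return normalized.encode("ascii", "ignore").decode("utf-8")


print(basic_clean("Zapatería Café SHOES"))  # zapateria cafe shoes
```

Note that characters with no ASCII equivalent (emoji, for example) are removed entirely rather than transliterated.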

Acquire

Data acquired from GitHub.
- Query for 'shoes' in search bar. Filter by 'Most Stars'.
198 README.md files acquired.
Each row represents a repository.
Columns represent the repository, language and README.md contents.

Prepare

Dropped nulls.
- 13 values dropped.
Cleaned text contents:
- lowercased
- unicode "NFKD"
- encode "ASCII", decode "UTF-8".
Created stemmed and lemmatized columns.
Split into train, validate and test sets (56/24/20).
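
A 56/24/20 split is what you get from holding out 20% for test and then dividing the remainder 70/30 into train and validate (0.8 × 0.7 = 0.56, 0.8 × 0.3 = 0.24). The sketch below shows that arithmetic with the standard library. The function name, the seed, and the two-stage approach are assumptions; the project may have used a library splitter, possibly stratified by language.

```python
import random


def train_validate_test_split(rows, seed=42):
    """Shuffle rows, hold out 20% for test, then split the rest 70/30.

    Hypothetical helper: no stratification is applied here.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * 0.20)
    test = shuffled[:n_test]
    remainder = shuffled[n_test:]

    n_validate = round(len(remainder) * 0.30)
    validate = remainder[:n_validate]
    train = remainder[n_validate:]
    return train, validate, test


# 100 placeholder rows make the proportions easy to read.
train, validate, test = train_validate_test_split(range(100))
print(len(train), len(validate), len(test))  # 56 24 20
```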

Explore

Search for and establish most common words using NLP techniques.
Look at TF-IDF scores.
Create bigrams. (This led to the discovery of two separate film scripts in two separate README.md files.)
Create bar plots and word clouds to visualize trends.
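
As a rough illustration of the exploration steps above, the following sketch counts words and bigrams and computes a textbook TF-IDF score using only the standard library. The three sample documents and all function names are made up for the example. Library implementations (scikit-learn's `TfidfVectorizer`, for instance) use a smoothed variant of the formula, so their scores will differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Split on whitespace after lowercasing."""
    return text.lower().split()


def bigrams(tokens):
    """Return adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf = count / document length, idf = log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Made-up README snippets, for illustration only.
docs = [
    "running shoes store app",
    "shoes store built with react",
    "sneaker price tracker",
]
word_counts = Counter(w for d in docs for w in tokenize(d))
bigram_counts = Counter(b for d in docs for b in bigrams(tokenize(d)))

print(word_counts.most_common(2))
print(bigram_counts.most_common(1))
print(tf_idf(docs)[0])
```

Words that appear in many documents (here "shoes" and "store") receive lower TF-IDF scores than words unique to one document, which is why TF-IDF is useful for finding terms that distinguish one language's READMEs from another's.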

Conclusion

In this project, we predicted the programming language of GitHub repositories from the words in their README.md files, restricted to repositories that matched a search for 'shoes'. After acquiring the data with web scraping techniques, exploring it, and building classification models, we arrived at a decision tree model that predicts the programming language correctly 43% of the time.

Baseline: 25%
Final Model: 43%
Performance Increase: 73% (relative to baseline)
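
For readers checking the figures, the sketch below shows one common way such numbers are derived. It assumes the baseline is the accuracy of always predicting the most common language, which this README does not state explicitly, and the labels are made up. With the rounded accuracies reported here the relative increase works out to about 72%; the reported 73% presumably reflects unrounded values.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label (assumed baseline)."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def relative_increase(baseline, model):
    """Improvement of model accuracy over baseline, as a fraction of baseline."""
    return (model - baseline) / baseline


# Made-up labels, for illustration only.
labels = ["JavaScript", "Python", "JavaScript", "Java"]
print(baseline_accuracy(labels))  # 0.5

# Rounded accuracies from this README.
print(round(relative_increase(0.25, 0.43), 2))  # 0.72
```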

Next Steps

Explore quadgrams.
Visualize the quadgrams via word clouds and bar plots.

Recommendations

We recommend that those interested in shoe marketing use the results of this project to bolster or enhance the jargon they use.

Steps To Reproduce

Clone this repository.
Use the functions in acquire.py to acquire the data.
Use the functions in prepare.py to clean and prep the data.
Use the same configurations for the models.
