GitHub - Pobl-Group/mould-analysis: A tool for assessing whether a job description contains an exact or partial match for key words associated with damp or mould.

Mould Analysis

A tool for assessing whether a job description contains an exact or partial match for key words associated with damp or mould.

Table of Contents

About the Project
Getting Started
- Prerequisites
- Installation
Usage
Contributing
License
Acknowledgments

About the Project

This tool was designed to find damp and mould related words within job descriptions. When provided with a job description, the algorithm returns a series of scores that indicate whether a word in the description is an exact or partial match for one of the keywords provided. The keywords that are searched for can be adjusted to suit your organisation’s needs. The tool also supports the removal of “stop words”: words that are stripped from job descriptions before scoring to prevent false positives.

Note: where an exact match is not found, the scores can only say that there might be a match. As with many algorithms, there is a trade-off between being precise (reducing false positives as much as possible) and being flexible enough to be useful (capturing instances where our confidence level might be slightly lower). To get the most out of the algorithm, we suggest experimenting with the key words and stop words and then examining the scores returned to check that the configuration works for your organisation.


Getting Started

Prerequisites

Python
A CSV dataset containing just two fields: job_id and description.

Installation

Clone the repo

git clone https://github.com/Pobl-Group/mould-analysis.git

Install packages
```
pip install -r requirements.txt
```


Usage

Data preparation

The best way to use this tool is to provide one row per job and to assemble a single description by combining all known descriptions for that job. For example, a job may involve multiple tasks, so you may wish to combine the overall job description with the descriptions of each task related to that job.
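As an illustration of the one-row-per-job layout, the sketch below merges several descriptions into one per job. The function name `combine_descriptions` and the sample rows are hypothetical and not part of this repository; only the `job_id` and `description` field names come from the prerequisites.

```python
from collections import defaultdict


def combine_descriptions(rows):
    """Merge every (job_id, description) pair into one description per job."""
    parts = defaultdict(list)
    for job_id, description in rows:
        text = description.strip()
        if text:  # skip empty descriptions
            parts[job_id].append(text)
    # Join the pieces with a space, keeping the order in which they were seen.
    return {job_id: " ".join(texts) for job_id, texts in parts.items()}


# Hypothetical sample data: one overall job description plus a task description.
rows = [
    ("J001", "Treat mould in bathroom"),
    ("J001", "Apply mould wash to ceiling"),
    ("J002", "Replace kitchen tap"),
]
combined = combine_descriptions(rows)
```

The resulting dictionary has one entry per `job_id`, which can then be written out as the two-column CSV the tool expects.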

It is worth noting that before producing the scores, the program will first standardise the descriptions by converting them to lower case and removing punctuation, numbers and extra spaces.
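A minimal sketch of this kind of standardisation is shown below. It is not the repository's own code: the function name `standardise` is hypothetical, and replacing punctuation with a space (rather than deleting it outright) is an assumption made here so that text such as "damp/mould" splits into two words.

```python
import re
import string

# Translation table mapping each ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def standardise(description):
    """Lower-case the text and strip punctuation, digits and extra spaces."""
    text = description.lower()
    text = text.translate(_PUNCT_TO_SPACE)     # punctuation -> space (assumption)
    text = re.sub(r"\d+", " ", text)           # drop numbers
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text


cleaned = standardise("Treat DAMP/Mould in Flat 12,  kitchen!!")
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes or other symbols in your data would need handling separately.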

Key words

You need to specify a set of lower case keywords that would suggest the job is related to damp or mould. Try to pick words that are only used when describing a job to remedy damp or mould. Besides the obvious inclusion of “damp” and “mould”, consider adding names of treatments or tasks associated with removing damp and mould.

Stop words

To reduce false positives, it is a good idea to remove certain words from the job descriptions before scoring. Some words are similar to a key word but have a different meaning. For example, “moulding” is unlikely to describe mould in a property but might be used to note that the mouldings around a window are cracked.
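The idea can be sketched as follows, assuming whole-word removal from an already standardised description. The function name `remove_stop_words` and the example stop words are illustrative only; the repository may configure and apply stop words differently.

```python
def remove_stop_words(description, stop_words):
    """Drop every whole word that appears in the stop-word list."""
    stop = set(stop_words)
    return " ".join(word for word in description.split() if word not in stop)


# "moulding" is close to the key word "mould" but means something else here.
cleaned = remove_stop_words(
    "replace cracked moulding around window",
    ["moulding", "mouldings"],
)
```

Because matching is on whole words, a genuine mention of “mould” elsewhere in the same description is left untouched.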

The Scores

Once the program has run, several scores will be produced.

one_to_one_ratio: this is a simple ratio score from the “thefuzz” library. It compares each word in the description with each of the key words and returns the highest score found. The higher the score, the better the match between a key word and a word in the description. A score of 100 represents an exact match.

set_ratio: this is the partial token set ratio from the “thefuzz” library. This function appears to have been deprecated in the latest version of thefuzz but we have left this in place for you to examine the output.

min_levenshtien_score: this is the minimum Levenshtein distance found when comparing each word in the job description with each of the keywords. The Levenshtein distance is essentially a measure of the number of single-character edits required to change one word into another. It is also known as the “edit distance”. Unlike the other scores, the lower the Levenshtein distance, the better.

simple_search: this is a simple comparison between each word in the description and each of the keywords. It will return a score of 100 when an exact match is found, otherwise it will return a score of 0.

best_score: this is the best score returned out of the one_to_one_ratio, set_ratio and simple_search.
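To make the word-by-word comparisons concrete, here is a standard-library sketch of three of the scores. It is not the repository's implementation: `similarity` uses `difflib.SequenceMatcher` as a stand-in for the thefuzz ratio, so its values may differ from the tool's output, and set_ratio is left out, so `best_score` here only considers two of its three inputs. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Edit distance: single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """0-100 similarity; a stand-in for the thefuzz ratio, not the same function."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def score_description(description, keywords):
    """Compare every word in the description with every keyword."""
    pairs = [(word, key) for word in description.split() for key in keywords]
    one_to_one = max((similarity(w, k) for w, k in pairs), default=0)
    simple = 100 if any(w == k for w, k in pairs) else 0
    min_distance = min((levenshtein(w, k) for w, k in pairs), default=None)
    return {
        "one_to_one_ratio": one_to_one,
        "simple_search": simple,
        "min_levenshtien_score": min_distance,
        "best_score": max(one_to_one, simple),  # set_ratio omitted in this sketch
    }


# "mold" is one edit away from the keyword "mould", so it is a partial match.
scores = score_description("treat mold in bathroom", ["damp", "mould"])
```

For the misspelt “mold”, `simple_search` is 0 and the minimum edit distance is 1, while the ratio-style score stays high, which is the kind of partial match the scores are meant to surface.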

Interpreting the scores

We suggest you examine the output and familiarise yourself with the scores returned. We have mostly used the one_to_one_ratio and simple_search scores. We have chosen to treat a one_to_one_ratio score of 85 or higher as a suitable match for a key word, but you may wish to adjust this threshold based on your needs.
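One way to apply such a threshold to the output is sketched below. The helper `is_damp_or_mould` and the sample score dictionaries are hypothetical; only the score names and the default threshold of 85 come from this README.

```python
def is_damp_or_mould(scores, threshold=85):
    """Flag a job when there is an exact match or a strong partial match."""
    exact = scores["simple_search"] == 100
    close = scores["one_to_one_ratio"] >= threshold
    return exact or close


# Hypothetical score rows, shaped like the tool's output columns.
likely = {"one_to_one_ratio": 89, "simple_search": 0}
unlikely = {"one_to_one_ratio": 60, "simple_search": 0}

flags = [is_damp_or_mould(likely), is_damp_or_mould(unlikely)]
```

Raising the threshold trades recall for precision, so it is worth reviewing a sample of jobs just above and just below the cut-off before settling on a value.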


Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request


License

Distributed under the GNU General Public License v3.0. See LICENSE.txt for more information.


Acknowledgments


