The Skills Space Method developed in “Skill-driven Recommendations for Job Transition Pathways” (Dawson, Williams & Rizoiu 2021) is a novel approach to measuring the distance between sets of skills. In the context of this report, sets of skills are associated only with occupations (job titles), but the approach generalises to other groupings such as companies or industries.
Dawson, Williams & Rizoiu (2021) provide the mathematical foundation for computing the Skill Set Similarity between two sets. However, a practical implementation of the functions and computations is not documented. A naive approach that mimics the exact mathematical outline (e.g. using for loops to sum, lazy functions) is not feasible on large datasets due to poor execution performance.
As such, this report outlines and explains a method for implementing the Skills Space Method using matrix operations. The vectorisation of this method enables faster execution through SIMD or parallel computing (depending on the platform) and memoisation of necessary, static values in matrix form.
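To illustrate the performance idea behind vectorisation (a generic sketch, not code from this package), compare a naive loop with the equivalent vectorised NumPy operation for an all-pairs dot-product:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 6))  # 4 skill groups, 6 skills

# Naive approach: explicit Python loops over every pair of rows.
naive = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        naive[i, j] = sum(X[i, k] * X[j, k] for k in range(6))

# Vectorised approach: one matrix multiplication computes all pairs at once,
# letting NumPy dispatch to optimised (SIMD/BLAS) kernels under the hood.
vectorised = X @ X.T
```

Both produce the same matrix, but the vectorised form avoids the Python-level loop overhead entirely.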
- Python > 3.11
- Packages listed in the platform-specific requirements file:
  - `nvidia_requirements.txt`: if you have an Nvidia GPU available and wish to utilise it for faster processing.
  - `mac_requirements.txt`: if you are using a Mac and wish to utilise its GPU for faster processing.
  - `requirements.txt`: if you don't have an Nvidia or Apple GPU available, or wish to use CPU processing only.
Important
To successfully install the packages in `nvidia_requirements.txt` (specifically `cupy`), you may need to install other dependencies such as the NVIDIA GPU driver and the CUDA Toolkit. To determine what you need to install beforehand, read through the CuPy Installation Guide.
Note
Currently this package is not available on the Python Package Index (PyPI), though there are plans to publish it there in the future. For now, this package has to be cloned manually into your project and installed from source.
First run the following commands to clone this repository on your machine:
git clone <github-link>
cd skill-similarity
Create a virtual environment to run this project:
python3 -m venv venv
source venv/bin/activate
Then install the necessary packages for your platform:
pip install -r <platform-option>_requirements.txt
If you wish to install `nvidia_requirements.txt`, read the callout in the Dependencies section.
Finally, run the `build.py` file:
python build.py
When you wish to stop using the virtual environment created for this project, run:
deactivate
If you are running this package from source, you must set the path of your local version of the repository as shown below:
import sys
sys.path.insert(1, "./skill_similarity")
Then import the `SkillPopulation`, `SkillGroup` and `SkillSimCalculatorV3` classes:
from skill_similarity.tests.get_random_skills.get_random_occ_to_skills import get_random_occ_to_skills # only required to generate random data, omit if you have your own dataset
from skill_similarity.entities import SkillPopulation, SkillGroup
from skill_similarity.sim_calculators import SkillSimCalculatorV3
Create some fake, random jobs and skills data:
job_skills = get_random_occ_to_skills(num_occupations=100, num_skills=70)
The `get_random_occ_to_skills` function returns a `dict[str, list[list[str]]]`, where each key is an occupation/job title (e.g. Junior Accountant, Senior Software Engineer) and each value is a list of lists of skill names. Each inner `list[str]` represents an instance of an occupation (e.g. a job posting) and its skills.
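For reference, a hand-written `job_skills` dictionary in the expected shape would look like this (the job titles and skill names here are made up for illustration):

```python
job_skills = {
    # Each key is an occupation; each value is a list of job-posting skill lists.
    "Junior Accountant": [
        ["bookkeeping", "excel", "accounts payable"],  # one job posting
        ["excel", "reconciliation"],                   # another posting
    ],
    "Senior Software Engineer": [
        ["python", "system design", "code review"],
    ],
}
```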
Now that there is job-skill data in the correct format, we must create a `SkillPopulation` from it. The `SkillPopulation` is necessary as it handles:
- Removing infrequent skills (skills that occur fewer times than the `threshold` argument, which defaults to 5).
- Creating a binary-presence job-skill matrix.
- Creating ordered lists of skill group names (in this case job titles) and skill names, which correspond to the rows and columns of the skill population matrix respectively.
- Saving and loading the `SkillPopulation` to disk efficiently.
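The first two responsibilities can be sketched in plain Python and NumPy. This is an illustration of the idea, not the package's actual implementation (the `threshold` of 2 is deliberately low for the toy data, and the package may organise matrix rows differently):

```python
import numpy as np

job_skills = {
    "Job A": [["python", "sql"], ["python", "excel"]],
    "Job B": [["sql"], ["sql", "excel"]],
}

# Count how often each skill occurs across all postings.
counts = {}
for postings in job_skills.values():
    for posting in postings:
        for skill in posting:
            counts[skill] = counts.get(skill, 0) + 1

# Drop infrequent skills (here: fewer than 2 occurrences).
threshold = 2
skill_names = sorted(s for s, c in counts.items() if c >= threshold)
group_names = list(job_skills)

# Binary presence matrix: rows = skill groups, columns = skills.
matrix = np.zeros((len(group_names), len(skill_names)), dtype=np.int8)
for i, group in enumerate(group_names):
    for posting in job_skills[group]:
        for skill in posting:
            if skill in skill_names:
                matrix[i, skill_names.index(skill)] = 1
```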
For a relatively small dataset (as in this example), this can be done simply by creating an instance of `SkillPopulation`:
skill_population = SkillPopulation(job_skills)
However, for large datasets `skill_population` will use a lot of memory that possibly won't be released for a long time. In this case you should use a context manager to instantiate the `SkillPopulation`, save it if required, and pass it to the skill calculator:
with SkillPopulation(skill_group_skills=job_skills) as skill_population:
    skill_population.save(file_path="../data", output_name="skill_population")
    skill_calc = SkillSimCalculatorV3(skill_population)
Afterwards, if you try to access the fields of `skill_population` outside the context manager, you will get `None`.
The `get_matrix_subset_by_sg` method takes a callback function that filters the `skill_population` matrix by skill group, and returns a `MatrixSubsetIndexes` object, which is essentially a wrapper around the indexes of the matrix rows selected by the filter.
def filter_by_job_name(job_name: str, skill_group: SkillGroup) -> bool:
    return skill_group.id.lower() == job_name.lower()

soft_dev_subset = skill_population.get_matrix_subset_by_sg(
    lambda skill_group: filter_by_job_name("Software Developer", skill_group)
)
data_sci_subset = skill_population.get_matrix_subset_by_sg(
    lambda skill_group: filter_by_job_name("Data Scientist", skill_group)
)
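Conceptually, the subset selection boils down to collecting the row indexes whose skill group satisfies the predicate. This is a simplified sketch of the idea; `MatrixSubsetIndexes` presumably wraps something like the list produced here:

```python
# One entry per matrix row: the skill group (job title) that row belongs to.
row_groups = ["Software Developer", "Data Scientist", "Software Developer"]

def matrix_subset_indexes(predicate):
    """Return the row indexes whose skill group passes the filter."""
    return [i for i, group in enumerate(row_groups) if predicate(group)]

soft_dev_rows = matrix_subset_indexes(lambda g: g.lower() == "software developer")
```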
As mentioned in the implementation report, the RCA and Skill Similarity matrices are precomputed as this can be done very efficiently using matrix operations and enables memoisation for even faster subsequent computation.
Precompute these matrices by running their respective methods:
skill_calc.calc_rca_matrix()
skill_calc.calc_skill_sim_matrix()
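For intuition, the Revealed Comparative Advantage (RCA) matrix can be computed in a fully vectorised way with NumPy broadcasting. This is a generic sketch of the standard RCA formula, not the package's internal code:

```python
import numpy as np

# Binary presence matrix: rows = skill groups, columns = skills.
X = np.array([
    [1, 1, 0],
    [0, 1, 1],
], dtype=np.float64)

# RCA(g, s) = (X[g, s] / row_sum[g]) / (col_sum[s] / total)
row_sums = X.sum(axis=1, keepdims=True)   # shape (2, 1), broadcasts over columns
col_sums = X.sum(axis=0, keepdims=True)   # shape (1, 3), broadcasts over rows
rca = (X / row_sums) / (col_sums / X.sum())
```

A useful sanity check: for a uniform matrix (every group has every skill), RCA is exactly 1 everywhere.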
The last step is to run the `skill_set_similarity` method, which takes two `MatrixSubsetIndexes` objects and returns the skill set similarity value as an `np.float64`:
sss = skill_calc.skill_set_similarity(soft_dev_subset, data_sci_subset)
print("Skill Set Similarity =", sss)
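For intuition only: given a pairwise skill-similarity matrix, a set-level score can be formed by averaging each skill's best match in the other set, symmetrised over both directions. This is an illustrative aggregation, not necessarily the exact formula used by `skill_set_similarity`:

```python
import numpy as np

# Hypothetical pairwise similarities between the skills of set A (rows)
# and the skills of set B (columns).
sim = np.array([
    [1.0, 0.2],
    [0.3, 0.8],
    [0.5, 0.4],
])

# Best match for each A-skill in B, and for each B-skill in A,
# then average the two directions to get a symmetric set-level score.
a_to_b = sim.max(axis=1).mean()  # (1.0 + 0.8 + 0.5) / 3
b_to_a = sim.max(axis=0).mean()  # (1.0 + 0.8) / 2
set_similarity = (a_to_b + b_to_a) / 2
```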
The project structure of VSSM is as follows:
- root: contains `main.py`, the requirements files and other configuration files.
- entities: contains classes or dataclasses that are commonly used throughout different modules, for example `SkillPopulation` and `Job`.
- sim_calculators: contains the different versions of the Skill Similarity calculators, as well as the abstract class they are all based on.
- tests: contains the unit tests for this project.
  - random: contains functions to generate random data for the tests.
- utils: contains various utility functions such as `gzip_directory` or `TwoDimDict`.
- Run all tests:
python -m unittest discover -s tests -p "*_test.py"
- Run a specific test:
python -m unittest tests.<test-module>.<test-class>.<test-method>
VSSM by Matthew Ghannoum is licensed under CC BY 4.0