Data
Sockit includes an empirical model of the associations between job titles and SOC codes, based on analysis of over 42 million U.S. job postings in the National Labor Exchange (NLx) Research Hub from the years 2019 and 2021. An efficient prefix tree structure matches cleaned titles to this model and returns the empirical frequencies of associated SOC codes, which can be converted to probabilities.
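The lookup described above can be sketched with a small word-level prefix tree. This is an illustrative sketch only, not Sockit's actual implementation: the `TitleTrie` class, the `to_probabilities` helper, the longest-matching-prefix rule, and the counts in the example are all assumptions made for demonstration.

```python
from collections import Counter


class TitleTrie:
    """Hypothetical word-level prefix tree mapping cleaned titles to SOC code counts."""

    def __init__(self):
        self.children = {}       # token -> child TitleTrie
        self.counts = Counter()  # SOC code -> observed frequency at this node

    def insert(self, title, soc, n=1):
        """Record n observations of `soc` for the given cleaned title."""
        node = self
        for token in title.split():
            node = node.children.setdefault(token, TitleTrie())
        node.counts[soc] += n

    def lookup(self, title):
        """Return the SOC counts stored at the longest matching prefix of the title."""
        node, best = self, Counter()
        for token in title.split():
            node = node.children.get(token)
            if node is None:
                break
            if node.counts:
                best = node.counts
        return best


def to_probabilities(counts):
    """Convert empirical frequencies to probabilities that sum to 1."""
    total = sum(counts.values())
    return {soc: n / total for soc, n in counts.items()} if total else {}


# Made-up counts for illustration; real frequencies come from the NLx-based model.
trie = TitleTrie()
trie.insert("registered nurse", "29-1141", 3)
trie.insert("registered nurse", "29-2061", 1)
probs = to_probabilities(trie.lookup("registered nurse icu"))
```

Here the unseen trailing token "icu" is ignored and the title falls back to the counts stored for "registered nurse". A title with no matching prefix yields an empty result.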
Sockit also includes a manually curated list of 775 skill keywords, available
in sockit/data/indices/skills.csv. These skill keywords have been counted in the
NLx Research Hub job descriptions to create a sparse job-skill association
matrix that is Term Frequency/Inverse Document Frequency (TF-IDF) transformed.
The jobs are then assigned probabilistic SOC codes using the method above,
retaining codes with probability >= 10%, creating a sparse SOC-job association
matrix. The matrix product of the SOC-job matrix and the TF-IDF weighted job-skill
matrix produces a dense SOC-skill association matrix. We normalize the rows of the
SOC-skill matrix, which can be interpreted as probability distributions over skills
for each SOC code. This matrix is available in sockit/data/soc_skill_matrix.txt.
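The construction of the SOC-skill matrix can be illustrated on toy data. The sketch below uses plain Python lists for readability, whereas the matrices described above are sparse. The function names, the toy values, and the specific TF-IDF variant (raw count times log(N / document frequency)) are assumptions, since the exact weighting is not specified here.

```python
import math

# Toy inputs (made up): 3 jobs x 3 skills keyword counts, 2 SOC codes x 3 jobs probabilities.
job_skill_counts = [[2, 0, 1],
                    [0, 3, 1],
                    [1, 0, 0]]
soc_job_probs = [[0.90, 0.05, 0.50],
                 [0.10, 0.95, 0.50]]


def tfidf(counts):
    """Weight raw counts by log(N / df), one assumed TF-IDF variant."""
    n_jobs, n_skills = len(counts), len(counts[0])
    df = [sum(1 for row in counts if row[k] > 0) for k in range(n_skills)]
    idf = [math.log(n_jobs / d) if d else 0.0 for d in df]
    return [[row[k] * idf[k] for k in range(n_skills)] for row in counts]


def threshold(probs, minimum=0.10):
    """Keep only SOC assignments with probability >= minimum."""
    return [[p if p >= minimum else 0.0 for p in row] for row in probs]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_rows(m):
    """Scale each row to sum to 1 so it reads as a distribution over skills."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total for v in row] if total else row[:])
    return out


soc_job = threshold(soc_job_probs)
soc_skill = normalize_rows(matmul(soc_job, tfidf(job_skill_counts)))
```

Each row of `soc_skill` sums to 1. In this toy example the first SOC code gets zero weight on the second skill, because the only job mentioning that skill fell below the 10% threshold for that code.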