Data

Sockit includes an empirical model of the associations between job titles and SOC codes, based on analysis of over 42 million U.S. job postings in the National Labor Exchange (NLx) Research Hub from the years 2019 and 2021. An efficient prefix tree structure matches cleaned titles to this model and returns the empirical frequencies of associated SOC codes, which can be converted to probabilities.
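The lookup described above can be illustrated with a minimal sketch. It is not Sockit's actual implementation: the class `TitleTrie`, the function `to_probabilities`, the word-level (rather than character-level) nodes, the longest-prefix matching rule, and the counts in the example are all assumptions made for illustration.

```python
from collections import Counter


class TitleTrie:
    """Toy word-level prefix tree mapping cleaned titles to SOC code counts.

    Hypothetical structure for illustration; Sockit's real trie may differ.
    """

    def __init__(self):
        self.children = {}       # next word -> child node
        self.counts = Counter()  # SOC code -> frequency for titles ending here

    def insert(self, title, soc, n=1):
        """Record n observations of `soc` for the given cleaned title."""
        node = self
        for word in title.split():
            node = node.children.setdefault(word, TitleTrie())
        node.counts[soc] += n

    def lookup(self, title):
        """Return the counts stored at the longest matching prefix of the title."""
        node, best = self, Counter()
        for word in title.split():
            node = node.children.get(word)
            if node is None:
                break
            if node.counts:
                best = node.counts
        return best


def to_probabilities(counts):
    """Convert raw SOC frequencies to probabilities that sum to 1."""
    total = sum(counts.values())
    return {soc: n / total for soc, n in counts.items()} if total else {}


# Made-up counts, for illustration only.
trie = TitleTrie()
trie.insert("registered nurse", "29-1141", 90)
trie.insert("registered nurse", "29-1171", 10)

probs = to_probabilities(trie.lookup("registered nurse icu"))
unknown = to_probabilities(trie.lookup("plumber"))
```

In this sketch, a title with extra trailing words falls back to the longest prefix that has recorded counts, and a title with no match returns an empty mapping.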

Sockit also includes a manually curated list of 775 skill keywords, available in sockit/data/indices/skills.csv. Occurrences of these keywords are counted in the NLx Research Hub job descriptions to create a sparse job-skill association matrix, which is then Term Frequency-Inverse Document Frequency (TF-IDF) transformed. Each job is assigned probabilistic SOC codes using the method above, retaining only codes with probability >= 10%, which yields a sparse SOC-job association matrix. The matrix product of the SOC-job matrix and the TF-IDF-weighted job-skill matrix produces a dense SOC-skill association matrix. We normalize the rows of the SOC-skill matrix so that each row can be interpreted as a probability distribution over skills for its SOC code. This matrix is available in sockit/data/soc_skill_matrix.txt.
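The pipeline above can be sketched on a tiny example in plain Python. This is an illustrative sketch, not Sockit's code: the function names, the particular IDF formula (`log(N / df) + 1`), the choice to keep thresholded probabilities without renormalizing them, and the toy numbers are all assumptions. A real implementation at this scale would likely use sparse matrix libraries.

```python
import math


def tfidf(counts):
    """TF-IDF weight a jobs x skills count matrix (assumed IDF: log(N/df) + 1)."""
    n_jobs, n_skills = len(counts), len(counts[0])
    # Document frequency: number of jobs mentioning each skill at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_skills)]
    idf = [math.log(n_jobs / d) + 1.0 if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_skills)] for row in counts]


def threshold(job_soc_probs, cutoff=0.10):
    """Zero out SOC probabilities below the cutoff (kept as-is otherwise)."""
    return [[p if p >= cutoff else 0.0 for p in row] for row in job_soc_probs]


def soc_skill_product(job_soc, job_skill):
    """Compute (SOC x jobs) @ (jobs x skills) from two job-indexed matrices."""
    n_soc, n_skills = len(job_soc[0]), len(job_skill[0])
    out = [[0.0] * n_skills for _ in range(n_soc)]
    for soc_row, skill_row in zip(job_soc, job_skill):
        for s, p in enumerate(soc_row):
            if p:
                for k, w in enumerate(skill_row):
                    out[s][k] += p * w
    return out


def normalize_rows(matrix):
    """Scale each row to sum to 1, leaving all-zero rows unchanged."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total for v in row] if total else row[:])
    return result


# Toy data: 3 jobs, 3 skills, 2 SOC codes (all values invented).
job_skill_counts = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]
job_soc_probs = [[0.95, 0.05], [0.0, 1.0], [0.5, 0.5]]

job_soc = threshold(job_soc_probs)  # the 0.05 entry is dropped
soc_skill = normalize_rows(soc_skill_product(job_soc, tfidf(job_skill_counts)))
```

Each row of `soc_skill` then sums to 1 and can be read as a distribution over the three toy skills for one SOC code.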
