- MUST READ: To protect our candidates, this repo does not contain any dataset that includes real candidate information. If you need test cases for any folder or file, please contact the author (zhiyuzha@usc.edu).
- This repo contains all files needed to construct an NLP-based bidirectional matching system between candidate profiles and job descriptions, with the goal of improving both the efficiency and the accuracy of the matching process.
- This system was first built during a summer 2022 internship and continues to be improved.
- If you have any questions or are interested in the project, please do not hesitate to contact the author at zhiyuzha@usc.edu.
- VERSION0:AUG 8, 2022;
- VERSION1:AUG 23, 2022; (CREATE THE BASIC FRAMEWORK FOR THE LAST STEP)
Matching systems are among the most popular artificial intelligence systems for companies in different industries across the world. As a world-leading recruiting company, we also want to introduce this kind of system to fill the gap and improve the experience of our clients. We aim to construct a bidirectional matching system between recruiters and potential candidates with machine learning techniques (especially advanced NLP techniques), improving the efficiency of recruitment activity and growing the market share of our start-up.
- Core Algorithms: To compare the similarity of two pieces of text-based content (a candidate profile and a job description), we clean the original dataset (separate words and remove useless words), vectorize the core features (transform text into numerical data) and project the vector onto a predefined recruiting matrix (a matrix that contains all the features we need to evaluate). Finally, we calculate the cosine similarity between the resulting vectors and select the candidates with the highest scores.
- Basic Workflow:
- Read the candidate profile and job description and transform them into a uniform JSON format
- Use NLP NER to identify and label the keywords found in the first step
- Project the keywords onto the predefined recruiting matrix and transform each profile/JD into a vector
- Calculate the cosine similarities between profiles and JDs and recommend based on the rankings
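The last two workflow steps can be illustrated with a minimal sketch. It assumes a tiny, hypothetical feature list (`FEATURES`) standing in for the real recruiting matrix, and simple keyword counts as vector entries; the names `project`, `cosine_similarity` and `rank_candidates` are illustrative and are not taken from the notebooks in this repo.

```python
import math

# Hypothetical recruiting matrix columns; the real feature list comes from the repo's datasets.
FEATURES = ["python", "sql", "communication", "teamwork", "data analyst"]


def project(keywords, features=FEATURES):
    """Turn a list of recognized keywords into a count vector over the features."""
    index = {name: i for i, name in enumerate(features)}
    vector = [0.0] * len(features)
    for word in keywords:
        i = index.get(word.lower())
        if i is not None:  # keywords outside the matrix are ignored
            vector[i] += 1.0
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(jd_keywords, profiles):
    """Score every profile against one JD and sort from best to worst match."""
    jd_vec = project(jd_keywords)
    scored = [(name, cosine_similarity(project(kw), jd_vec)) for name, kw in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up example data, not real candidate information.
jd = ["Python", "SQL", "communication"]
profiles = {
    "candidate_a": ["Python", "SQL", "teamwork"],
    "candidate_b": ["communication", "teamwork"],
}
ranking = rank_candidates(jd, profiles)
print(ranking)
```

Because the matching is bidirectional, the same function can presumably be reused in the other direction by scoring several JDs against a single profile.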
- This repo contains 1 PowerPoint file and 1 folder (containing all code and data files). All files and folders are introduced in this section.
- File(s): Matching System Detailed Explanation.pptx
- Content: A detailed introduction to the whole project and its workflow.
- We use the O*NET and zety online resources to construct a cleaned dataset that covers as many job titles in the market as possible.
- File(s): scrap job title.ipynb -> uncleaned_job_title.xlsx -> clean_jobtitle.ipynb -> title_final.xlsx
- Content: Get job titles from https://zety.com/blog/job-titles and clean the data. Results can be found in the corresponding Excel files.
- File(s): title_final.xlsx & XXX_job_title.xlsx (6 files) -> add_additional_job_titles.ipynb -> title_final.xlsx (overwrites the previous file of the same name)
- Content: Combine job titles scraped from https://www.onetonline.org with titles from the first step. Result can be found in the corresponding excel files.
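As a rough illustration of combining titles from two sources, the sketch below merges plain Python lists and removes duplicates that differ only in case or spacing. The normalization rules and the function names `normalize_title` and `merge_titles` are assumptions made for this example; the notebooks may clean the data differently.

```python
import re


def normalize_title(title):
    """Lowercase, trim, and collapse internal whitespace so near-duplicates compare equal."""
    return re.sub(r"\s+", " ", title).strip().lower()


def merge_titles(*sources):
    """Merge several title lists, keeping the first spelling seen for each normalized title."""
    merged = {}
    for source in sources:
        for title in source:
            key = normalize_title(title)
            if key and key not in merged:  # skip empty cells and repeated titles
                merged[key] = re.sub(r"\s+", " ", title).strip()
    return sorted(merged.values(), key=str.lower)


# Made-up sample rows standing in for the scraped spreadsheets.
zety_titles = ["Data Analyst", "software  engineer ", "Recruiter"]
onet_titles = ["data analyst", "Software Engineer", "Actuary", ""]
titles = merge_titles(zety_titles, onet_titles)
print(titles)
```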
- We use the act.org online resource, combined with our own dataset, to construct a cleaned dataset that covers as many college majors as possible.
- File(s): webscrap_major.ipynb -> student_major.csv & student_major2.csv -> create student major.ipynb -> major.xlsx
- Content: Get college majors from O*NET and clean the data. Results can be found in the corresponding Excel files.
- File(s): major.xlsx & more_majors.xlsx -> add_more_majors.ipynb -> temp_merged_major.xlsx
- Content: Combine our own major dataset with the dataset from the first step to keep expanding the dataset. Results can be found in the corresponding Excel files.
- We use O*NET online resources, combined with some advanced data processing techniques, to construct a cleaned JSON-format dataset that can be passed to spaCy Named Entity Recognition. It recognizes 143 groups and over 3000 individual hardskills.
- Folder(s): active_listening/math/reading_comprehension/science/speaking/writing_position
- Content: Gather hardskill-related information for different types of positions from the O*NET website. Results can be found in the XXX_skillset.xlsx file in each folder.
- Folder(s): Final_merge&analysis_hardskills
- Stream of the files: merge_skillset.ipynb -> final_skill_table.xlsx & large software company.csv -> clean_skillset.ipynb -> cleaned_skillset.xlsx -> create_hardskill_dataset.ipynb -> hardskills.json
- Content: Clean and expand the hardskill dataset. Try to cover all variants that may occur in a profile (e.g. Microsoft Powerpoint, Powerpoint and PPT all point to the same skill). The readable result can be found in cleaned_skillset.xlsx and the spaCy-usable result can be found in hardskills.json.
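The exact layout of hardskills.json is not documented here, so the following is only a sketch of how synonym groups could be turned into token patterns. It assumes the file follows the pattern format accepted by spaCy's `EntityRuler` (`label`, `pattern`, `id`); the label `HARDSKILL`, the group contents and the function name `build_patterns` are hypothetical.

```python
import json

# Hypothetical synonym groups; the real groups come from cleaned_skillset.xlsx.
SKILL_GROUPS = {
    "powerpoint": ["Microsoft Powerpoint", "Powerpoint", "PPT"],
    "excel": ["Microsoft Excel", "Excel"],
}


def build_patterns(groups, label="HARDSKILL"):
    """Create one case-insensitive token pattern per variant, tagged with its group id."""
    patterns = []
    for skill_id, variants in groups.items():
        for variant in variants:
            # One {"LOWER": ...} dict per whitespace-separated token of the variant.
            tokens = [{"LOWER": tok.lower()} for tok in variant.split()]
            patterns.append({"label": label, "pattern": tokens, "id": skill_id})
    return patterns


patterns = build_patterns(SKILL_GROUPS)
print(json.dumps(patterns, indent=2))
```

Sharing one `id` across all variants of a group is what lets "PPT" and "Microsoft Powerpoint" resolve to the same skill after recognition.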
- We use O*NET online resources, combined with some advanced data processing techniques, to construct a cleaned JSON-format dataset that can be used for detecting softskills in the original data. It recognizes 40 groups and over 2000 individual softskills.
- Folder(s): active_listening/math/reading_comprehension/science/speaking/writing_position
- Content: Gather softskill-related information (activities, work content and softskills for each position) for different types of positions from the O*NET website. Results can be found in the XXX_skillset.xlsx file in each folder.
- Folder(s): merge_activity, merge_softskills, merge_work_content
- Content: Merge and clean the activities, softskills and work contents for each type of job.
- Folder(s): softskills dataset
- Stream of the files: pre_softskills_matrix.xlsx -> USE GOOGLE GET SOFT SKILLS.ipynb (or USE NEUR DATASET TO GET SOFT SKILLS.ipynb (template)) -> final_skill_keyword.xlsx -> softskills.json
- Content: Expand the softskill keyword dataset with Google's pre-trained dataset so that it covers different expressions of the same softskill. (NEUR can also be used for this step.) The readable result can be found in final_skill_keyword.xlsx and the NLP-usable format can be found in softskills.json.
- This part may contain some sensitive information; please contact the author if you need a test case.
- We use advanced NLP and data processing skills to deal with noisy data in the original candidate profiles and transform each profile into a manageable dataset.
- Folder(s): parse from csv
- Content: Parse candidate profiles into uniform JSON-format information.
- File(s): temp_merged_major.xlsx / title_final.xlsx -> clean_candidate_profile.ipynb
- Content: Class that cleans and stores candidate profiles.
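To make the "uniform JSON format" idea concrete, here is a minimal sketch of the parsing step using only the standard library. The column names (`id`, `title`, `major`, `skills`), the semicolon separator and the function name `parse_profiles` are assumptions for illustration; the real profile schema is not published in this repo.

```python
import csv
import io
import json


def parse_profiles(csv_text):
    """Read CSV rows and return a list of cleaned, uniformly shaped profile dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    profiles = []
    for row in reader:
        profiles.append({
            "id": row["id"].strip(),
            "title": row["title"].strip().lower(),
            "major": row["major"].strip().lower(),
            # Split the free-text skills cell and drop empty fragments.
            "skills": [s.strip().lower() for s in row["skills"].split(";") if s.strip()],
        })
    return profiles


# Made-up sample row, not real candidate information.
sample = "id,title,major,skills\n1, Data Analyst ,Statistics,Python; SQL ;\n"
profiles = parse_profiles(sample)
print(json.dumps(profiles, indent=2))
```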
- We use advanced NLP and data processing skills to deal with noisy data in the original job descriptions and transform each JD into a manageable dataset.
- File(s): matching responsibility.xlsx (test case)/ temp_merged_major.xlsx/title_final.xlsx -> clean_dataset_before_input.ipynb
- Content: Class that cleans and stores job descriptions.
- We use all the information and datasets prepared above to transform each profile into a vector.
- File(s): hardskills.json & softskills.json -> project_to_recruiting_matrix.ipynb
- Content: Class that performs the transformation. (incomplete, still in progress)
- File(s): project_to_matrix_example.ipynb
- Content: A real example of using the class listed above.
- File(s): matching_system_workflow.ipynb
- Content: Detailed explanation of the matching system workflow. (The README.md file takes priority.)