Disclaimer: data files redacted as per campus data sharing agreement.
Calls 2 scripts:
data_joining.py
- Aligns C2V embeddings with the corresponding descriptions pulled from the course API.
semantic_model.py
- Trains models to tag course embeddings with semantics, i.e. infers topics for courses.
Inputs:
- course_vectors.npy from the model folder in the timestamped directory
- course_info.tsv from shared/course_api/outputs
- Uses course_subject and course_num from the API to create a foreign key to join on the idx2course.json identifier
Outputs:
- aligned_course_info.tsv
- aligned_course_vecs.tsv
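The join described above can be sketched roughly as follows. This is a minimal illustration, not the actual data_joining.py: the sample data, the "SUBJECT NUM" key format, and the course_description column name are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical stand-ins for idx2course.json and course_info.tsv.
idx2course = json.loads('{"0": "MATH 1A", "1": "CS 61A", "2": "HIST 7B"}')
course_info_tsv = (
    "course_subject\tcourse_num\tcourse_description\n"
    "CS\t61A\tIntroduction to programming.\n"
    "MATH\t1A\tDifferential calculus.\n"
)

# Build the foreign key from course_subject and course_num
# (the "SUBJECT NUM" format is an assumption).
info_by_key = {}
for row in csv.DictReader(io.StringIO(course_info_tsv), delimiter="\t"):
    key = f"{row['course_subject']} {row['course_num']}"
    info_by_key[key] = row

# Keep only vector indices that have a matching description, in index order,
# so that row i of the aligned vectors matches row i of the aligned info.
aligned = []
for idx in sorted(idx2course, key=int):
    key = idx2course[idx]
    if key in info_by_key:
        aligned.append((int(idx), key, info_by_key[key]["course_description"]))
```

Courses without a description (here the hypothetical "HIST 7B") are dropped, which keeps the two aligned files the same length.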
Generates bag-of-words (BOW) representations of the course descriptions, trains a translation model that maps embeddings to BOW vectors, and then predicts labels for each course. Model hyperparameters are currently hardcoded at the top of the script.
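As a rough illustration of the two ends of this pipeline, the sketch below builds BOW vectors from descriptions and turns a vector of per-word scores into labels. The translation model itself is omitted. The tokenizer, the sample descriptions, and the helper names (tokenize, bow_vector, top_k_labels) are assumptions for the example, not names from semantic_model.py.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical course descriptions.
descriptions = {
    "CS 61A": "Introduction to programming and computer science.",
    "MATH 1A": "Introduction to differential calculus.",
}

# The vocabulary fixes the meaning of each BOW dimension.
vocab = sorted({tok for text in descriptions.values() for tok in tokenize(text)})


def bow_vector(text):
    """Return raw term counts for `text`, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


def top_k_labels(scores, k):
    """Map a vector of per-word scores (e.g. model output) to its k best words."""
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]
```

In this framing, a trained model would produce the scores passed to top_k_labels from a course embedding, and the returned words would serve as the predicted labels.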
`python semantic_model.py -v ../data/aligned_course_vecs.tsv -i ../data/aligned_course_info.tsv`
-v vectorfile_path: location of word embeddings you would like to perform semantic analysis on
-i infofile_path: location of text-valued data corresponding to word embeddings
-t textcolumn: column in infofile to be preprocessed and vectorized
-b tf_bias: the bias constant for term-frequency
-r research_file: flag indicating whether the research file should be generated. The research file contains keyword groups, including the top topics / keywords found within the description and outside the description, as well as random baselines.
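For orientation, the options above could be declared with argparse along these lines. Only the short flags and their meanings come from the list above; which flags are required, the default values, and the dest names are assumptions for this sketch.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the semantic_model.py options")
    parser.add_argument("-v", dest="vectorfile_path", required=True,
                        help="location of the embeddings to analyze")
    parser.add_argument("-i", dest="infofile_path", required=True,
                        help="location of the text-valued data for the embeddings")
    # Default column name is hypothetical.
    parser.add_argument("-t", dest="textcolumn", default="course_description",
                        help="column in the info file to preprocess and vectorize")
    # Default bias value is hypothetical.
    parser.add_argument("-b", dest="tf_bias", type=float, default=0.5,
                        help="bias constant for term frequency")
    parser.add_argument("-r", dest="research_file", action="store_true",
                        help="generate the research file")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "-v", "../data/aligned_course_vecs.tsv",
    "-i", "../data/aligned_course_info.tsv",
    "-r",
])
```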
- Dumps pickle file (search_keywords.pkl) containing course_id, course_title, course_description, course_keywords, and course_alternative_names into the output directory
- Updating search_keywords.pkl requires restarting the backend.
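A minimal sketch of writing and reading such a pickle is shown below. The field names come from the list above; the list-of-dicts layout, the sample record, and the temporary output directory are assumptions, since the actual structure of search_keywords.pkl is not documented here.

```python
import os
import pickle
import tempfile

# One hypothetical record with the documented fields.
records = [
    {
        "course_id": "CS 61A",
        "course_title": "Structure and Interpretation of Computer Programs",
        "course_description": "Introduction to programming and computer science.",
        "course_keywords": ["programming", "abstraction"],
        "course_alternative_names": ["CS61A"],
    }
]

# A temporary directory stands in for the real output directory.
output_dir = tempfile.mkdtemp()
path = os.path.join(output_dir, "search_keywords.pkl")

with open(path, "wb") as f:
    pickle.dump(records, f)

# A consumer such as the backend would load the file like this.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

If the backend reads the file only once at startup, which would be consistent with the restart requirement, a new pickle is not picked up until the process is restarted.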
Performs a 5-fold cross validation grid search across the following hyperparameters:
- use_hidden_layer: train a multinomial regression model or a multilayer perceptron
- num_epochs: number of epochs to train models
- use_idf: True to use tf-idf scores, False to use only tf in the BOW representation
- tf_bias: term-frequency bias, controls specificity of words in BOW representation
- max_df: control for corpus specific words
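The search loop can be pictured with the sketch below, which enumerates every hyperparameter combination and averages a score over 5 folds. The candidate values in param_grid and the placeholder evaluate function are assumptions. The real script trains and scores a model at that point.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter.
param_grid = {
    "use_hidden_layer": [False, True],
    "num_epochs": [5, 10],
    "use_idf": [False, True],
    "tf_bias": [0.5, 1.0],
    "max_df": [0.5, 1.0],
}


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def evaluate(params, train_idx, test_idx):
    """Placeholder: train on train_idx, score on test_idx, return the score."""
    return 0.0


def grid_search(n_samples, score_fn=evaluate):
    """Return the parameter combination with the best mean fold score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [score_fn(params, tr, te) for tr, te in k_fold_indices(n_samples)]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

With two candidate values for each of the five hyperparameters, this grid has 32 combinations, each evaluated on 5 folds.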
Scores results using different metrics, including recall, precision, and several custom metrics, located at /results.
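As a point of reference for the standard metrics, precision and recall for one course's predicted keywords could be computed as in the sketch below. Treating both sides as sets of words and scoring against a reference word list are assumptions here. The custom metrics are not reproduced.

```python
def precision_recall(predicted, reference):
    """Set-based precision and recall of predicted keywords against reference words."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: 2 of 4 predictions are correct, 2 of 3 reference words are found.
p, r = precision_recall(
    ["calculus", "derivative", "limits", "poetry"],
    ["calculus", "limits", "integral"],
)
```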