- cd into the project directory
- source venv/bin/activate
- python update_data.py
- python term_counter.py
Make sure you are using the correct Python interpreter: the one inside the virtual environment.
In PyCharm, go to:
PyCharm -> Settings -> Project -> Python Interpreter -> Add -> Existing environment -> Interpreter
and set it to project_dir/venv/bin/python
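A quick way to confirm which interpreter is actually running is to ask Python itself. This is a minimal sketch, not part of the project; the helper name running_in_venv is made up for illustration.

```python
import sys


def running_in_venv() -> bool:
    # Hypothetical helper. Inside a venv, sys.prefix points at the
    # environment, while sys.base_prefix still points at the base install.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", running_in_venv())
```

If the interpreter path printed does not end in venv/bin/python, the wrong interpreter is selected.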
- Update update_data.py to grab grants from the database
- Update extraction_utils.build_corpus_words_only to accept a filename as a parameter instead of using the hardcoded papers.json
- Update term_counter.py to pass in the filename to extraction_utils.build_corpus_words_only
- Update term_counter.py to pass in the correct fields (the field names in grants.json that you want to extract text from)
- Change the output file names to something more descriptive that includes the datasource (e.g. "term_counts_grants.json")
- Rename do_manual() to something more descriptive (e.g. extract_terms_from_datasource)
- Call do_manual() once for papers and once for grants
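The refactoring items above could look roughly like the sketch below. It is only an illustration under assumptions: the real build_corpus_words_only signature and tokenization are not shown in these notes, the JSON file is assumed to be a list of objects, and field names such as "title" and "abstract" are placeholders.

```python
import json
import re
from pathlib import Path


def build_corpus_words_only(filename, fields):
    # Assumed shape: the file holds a JSON list of records (dicts).
    records = json.loads(Path(filename).read_text(encoding="utf-8"))
    corpus = []
    for record in records:
        # Join only the requested fields; missing or null fields become "".
        text = " ".join(str(record.get(field) or "") for field in fields)
        # Placeholder tokenizer: lowercase alphanumeric words only.
        corpus.append(re.findall(r"[a-z0-9]+", text.lower()))
    return corpus


def extract_terms_from_datasource(datasource, filename, fields):
    # One call per datasource; the output name carries the datasource.
    corpus = build_corpus_words_only(filename, fields)
    output_name = f"term_counts_{datasource}.json"
    return corpus, output_name
```

With this shape, the script would call extract_terms_from_datasource("papers", "papers.json", [...]) and extract_terms_from_datasource("grants", "grants.json", [...]), each with its own field list.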
- Look at the regex term list in term_counter.py and try to group variants of the same disease into a single pattern. For example:
  - "lassa fever" and "lassa hemorrhagic fever" are counted separately. A suggested improvement would be:
    - "(lassa fever|lassa hemorrhagic fever)"
  - Another example is "meningococcal disease" and "meningococcal"; perhaps "(meningococcal disease|meningococcal)" is better
  - Verify these combinations with Alice?
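To see what grouping does to the counts, here is a small sketch. The TERM_PATTERNS dict and count_terms function are hypothetical and do not reflect how term_counter.py actually stores or applies its term list. Python tries alternatives left to right, so the longer variant goes first.

```python
import re
from collections import Counter

# Hypothetical term list: one label per disease, one grouped pattern each.
TERM_PATTERNS = {
    "lassa fever": r"(lassa fever|lassa hemorrhagic fever)",
    "meningococcal disease": r"(meningococcal disease|meningococcal)",
}


def count_terms(text, patterns=TERM_PATTERNS):
    counts = Counter()
    for label, pattern in patterns.items():
        # Word boundaries keep the pattern from matching inside longer words.
        regex = re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)
        counts[label] = len(regex.findall(text))
    return counts


sample = (
    "Lassa fever is endemic in West Africa. Lassa hemorrhagic fever and "
    "meningococcal disease were studied; meningococcal vaccines exist."
)
print(count_terms(sample))
```

Both spellings of each disease now add to a single count instead of two separate ones.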
- Instead of CSV output, add HTML output. In the HTML, instead of printing the 2nd column in the huge file, just highlight the search terms in RED or something. Have fun with it.
- This is probably an oversimplification, but you could just write:
<html><body><table><tr><th>search term</th><th>result</th></tr>
then for every term that is a match: <tr><td>term</td><td>this is <span class="highlight">all</span> the text</td></tr>
finally: </table></body></html>
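The HTML idea above can be sketched in a few lines. This is an assumption-laden illustration: the function names highlight and build_html, the (pattern, text) row format, and the inline red style are all made up here and are not taken from term_counter.py.

```python
import html
import re


def highlight(text, pattern):
    # Escape the text for HTML and wrap each regex match in a highlight span.
    regex = re.compile(pattern, re.IGNORECASE)
    parts = []
    last = 0
    for match in regex.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(
            '<span class="highlight">' + html.escape(match.group(0)) + "</span>"
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def build_html(rows):
    # rows: iterable of (term_pattern, text) pairs, one table row per pair.
    out = [
        "<html><head><style>.highlight { color: red; }</style></head>",
        "<body><table><tr><th>search term</th><th>result</th></tr>",
    ]
    for pattern, text in rows:
        out.append(
            "<tr><td>" + html.escape(pattern) + "</td><td>"
            + highlight(text, pattern) + "</td></tr>"
        )
    out.append("</table></body></html>")
    return "\n".join(out)
```

Escaping before inserting the spans matters because abstracts can contain characters like < and &, which would otherwise break the table markup.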