Developer Info
Commit message scheme:

```
(#NR) (?major|minor) (?FIX) write short descriptive imperative message
```

- #NR: issue number
- (?major|minor): optional, the importance of the commit
- FIX: optional, if the commit fixes the issue
- body description: optional, describe sub-tasks in imperative voice
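A hypothetical example following this scheme (issue number and message are made up):

```
(#17) minor FIX handle empty abstracts in the reader
- skip documents without any text
- add a regression test
```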
Use git-flow with master, developer and feature branches.
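For example, with the git-flow extension (the feature name here is made up; set the development branch name to `developer` when prompted during `git flow init`):

```
git flow init
git flow feature start tokenizer-fix
# ... commit work on the feature branch ...
git flow feature finish tokenizer-fix
```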
- We use Python 3 because:
  - it will be better supported in the future
  - it uses UTF-8/Unicode by default
  - it is difficult to write software that works for both Python 2 & 3
- We store in a text file the list of PMIDs that were analyzed to get sentences for annotation (sentences with a high probability of including mutation mentions)
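For example (made-up ids), one PMID per line:

```
23119724
24077912
```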
- We store in the ann.json files `who` annotated what (either `ml:` or `user:` for manual annotation), and the `confidence`. When an automatic annotation had to be manually reviewed, the list of `who` will be `ml:..., user:...`. (As for how to filter annotations by confidence, we either do it ourselves or use a possible tagtog feature.)
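A hedged sketch (as a Python dict) of what such an entry could look like; the field names follow our reading of the ann.json format and are illustrative, not guaranteed to be exact:

```python
# Illustrative only: field names approximate the ann.json entity layout.
annotation = {
    "classId": "e_2",                          # entity class, e.g. mutation mention
    "offsets": [{"start": 23, "text": "p.V600E"}],
    "confidence": {
        "who": ["ml:nala", "user:jdoe"],       # predicted by ml, then confirmed by a user
        "prob": 0.87,                          # confidence value used for filtering
    },
}
```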
Testing procedure for the nala package: installation and unit testing in a clean Anaconda environment.

```
# create a clean environment containing only python and setuptools
conda create --no-default-packages -n cleanenv python setuptools
activate cleanenv
# install nala, download its corpora, and run the unit tests
python setup.py install
python -m nala.download_corpora
python setup.py test
# tear the environment down again
deactivate
conda env remove --name cleanenv
```
Directory layout for the bootstrapping iterations:

```
root
├── iteration_0
│   └── base --> idp4
├── iteration_1
│   ├── candidates
│   └── reviewed
├── iteration_2
│   ├── candidates
│   └── reviewed
├── iteration_3
│   ├── candidates
│   └── reviewed
└── stats.xls
```
- build the training base:

  ```
  base = read in 'base' of iteration 0
  for i in 1..(n-1):
      rev = read in 'reviewed' of iteration i
      base.append(rev)
  ```
- generate a binary model by training on base
- generate candidates:
  - use the DocSelector to get filtered PubMed ids
  - retrieve the HTML documents for those ids and import them into our dataset
  - run the tagger with the binary model on the retrieved articles
  - save the retrieved articles with their predictions into the candidates folder
- do manual annotation by:
  - using the threshold module to divide predicted labels into confirmed and preselected annotations (predicted: threshold)
  - importing the candidates into tagtog (an alternative could also be made available, e.g. an interactive command line)
  - manually reviewing the imported data
  - exporting from tagtog into anndoc format
  - saving the result into the reviewed folder
- do evaluation by:
  - defining dataset = current base (iteration 0) + reviewed iterations (iterations 1..n)
  - doing k-fold cross-validation on the defined dataset (a sketch follows this list):
    - divide the data into k sets
    - repeat k times: train on folds 1..k-1, test on fold k
    - save the performance (average of the k x k runs)
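A minimal sketch of the cross-validation step, assuming hypothetical `train()` and `evaluate()` helpers (not nala's actual API):

```python
import random

def cross_validate(dataset, k=5):
    """k-fold cross-validation over a list of annotated documents."""
    docs = list(dataset)
    random.shuffle(docs)
    folds = [docs[i::k] for i in range(k)]           # k roughly equal splits
    scores = []
    for i in range(k):
        test = folds[i]                              # hold out fold i
        train_docs = [d for j in range(k) if j != i for d in folds[j]]
        model = train(train_docs)                    # hypothetical trainer
        scores.append(evaluate(model, test))         # hypothetical scorer
    return sum(scores) / k                           # average over the k runs
```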
Document selection with the DocSelector to add new unknown documents:

1. run UniProtDocumentSelector
   * input: a given query (by default: human Swiss-Prot proteins)
   * output: pubmed ids of docs that are likely to contain mutations
2. run a series of online Filters, operating only on pubmed ids
   * input: list of pubmed_ids
   * output: smaller list of pubmed_ids
   * instances:
     * FilterByAlreadySeen: filters out all pubmed_ids used in iterations 1..N-1
3. run FromOnlinePubmedReader
   * input: the filtered pubmed ids from steps 1-2
   * downloads the abstract of each article
   * output: Dataset object
4. run a series of offline Filters, one after the other
   * input: Dataset object
   * output: Dataset object with fewer documents in it
   * instances:
     * KeywordsFilter: only keeps articles that have a given set of keywords in their title and/or abstract
     * natural language filter (see below)
   * returns the final Dataset object
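As a hedged sketch of what an offline filter could look like (the constructor signature and `get_text` accessor are illustrative assumptions, not nala's actual classes):

```python
class KeywordsFilter:
    """Keep only documents whose title or abstract contains at least
    one of the given keywords (illustrative sketch only)."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def filter(self, documents):
        for doc in documents:
            text = doc.get_text().lower()  # hypothetical title + abstract accessor
            if any(k in text for k in self.keywords):
                yield doc
```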
# TODO diagram
- Generation of PMIDs through some initial fetching of ids (in our case: UniProtDocumentSelector)
- Online Filters (running on the list of pubmed ids) - need a connection to the internet, thus named "Online" Filters
- Conversion of PMIDs to Documents by downloading each of them through the FromOnlinePubmedReader
- Offline Filters (running on the list of Documents) - need no connection, only the text, thus named "Offline" Filters
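A minimal sketch of how these stages chain together; the method names (`get_pubmed_ids`, `filter`, `read`) are illustrative assumptions, not necessarily nala's API:

```python
def select_documents(selector, online_filters, reader, offline_filters):
    pmids = selector.get_pubmed_ids()        # 1) initial fetching of PMIDs
    for f in online_filters:                 # 2) prune the id list online
        pmids = f.filter(pmids)
    dataset = reader.read(pmids)             # 3) download abstracts -> Dataset
    for f in offline_filters:                # 4) prune the Dataset offline
        dataset = f.filter(dataset)
    return dataset
```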
This DocumentFilter uses predictions from both Nala and tmVar in order to find new, unknown, natural-language mutation mentions using a customised set of regexes. The following diagram shows the data flow from filtered documents: