Extend pattern writer to test for validity of pubs & features FB & add to KB #5
Comments
the trouble with this is that you are doubling the amount of curation that needs doing - a curator has to go copy and paste the FBid from somewhere, which is time consuming, so if this can be avoided, it would be good. I would not recommend UTF-8 (which should hopefully make it a bit simpler !) The format we currently use is sgml for special characters - this is what is in all the lookup files that curators would use to identify the correct feature, so if it were possible to use the sgml format, that would match the lookup files the curators will be using, which might be helpful. Peeves now looks in the synonym_sgml column of the synonym table to get the current sgml symbol for a feature (it does have to convert superscripts and subscripts too). I've got a question:
what tables are you using to look this up - if you are using the synonym table, then you may already have the sql necessary to get the correct uniquename ? I can dig out the Peeves queries if that would help ? I think if you can find a way to use symbols without making them also add in the FBids, it'll be way faster for curators. |
Hi @dosumis, I've been thinking about this - bottom line is we should avoid curators having to fill in two columns for the same thing, but perhaps because image curation is going to have simpler genotypes than I am used to in phenotype curation: I tried looking for the specs and examples to try and make some more concrete suggestions for columns etc. - but I can't find the specs/examples. I got to the https://github.com/VirtualFlyBrain/curation/wiki/Curation-wiki--Home wiki, but the links on there are not working at the moment. I'm specifically interested in these (doc from wiki): Curation record types for adding new images:
The Expression and Split 'YAML spec' and 'Example' links just go to the wiki page; the 'Anatomy' 'YAML spec' link goes to https://github.com/VirtualFlyBrain/curation/blob/master/records/anatomy_spec.yaml but I get a page not found error (am logged in !) ta |
Thanks for the pointer. I think I may be missing something. We use synonym.synonym_sgml to pull names and synonyms for VFB. I always thought the name synonym_sgml was a bit odd: it contains sgml super and subscript markup but, as far as I can tell, doesn't always have sgml greeks. Instead it is the only column with unicode versions of names and synonyms. Using this:
SELECT DISTINCT f.uniquename AS fbid, f.name AS feature_name, s.name AS ascii_name,
  stype.name AS stype,
  fs.is_current, s.synonym_sgml AS unicode_name
FROM feature f
LEFT OUTER JOIN feature_synonym fs ON (f.feature_id=fs.feature_id)
JOIN synonym s ON (fs.synonym_id=s.synonym_id)
JOIN cvterm stype ON (s.type_id=stype.cvterm_id)
WHERE f.uniquename = 'FBal0097158'; =>
The synonym.synonym_sgml column (here I've called it unicode_name as that's what I use it for) has unicode versions of the official symbols and fullname, but no sgml greek. Note that in this case feature.name = 'Scer\GAL4[alphaTub84B.PL]', which also has no sgml greek. Matching on feature.name is more appealing as there's only one, so I can use it for a fast lookup. Is the greek always spelt out in these cases? |
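A minimal sketch of the "fast lookup on feature.name" idea, for illustration only. It uses an in-memory SQLite table as a stand-in for the Chado `feature` table (the real database is PostgreSQL, so the placeholder style would differ with a driver such as psycopg2). The helper names `build_demo_db` and `fbids_for_name` are made up for this example, and it returns a list because I'm assuming nothing about whether names are guaranteed unique:

```python
import sqlite3


def build_demo_db():
    """Create a tiny stand-in for the Chado feature table (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature ("
        "feature_id INTEGER PRIMARY KEY, uniquename TEXT, name TEXT)"
    )
    # The one example row mentioned in this thread.
    conn.execute(
        "INSERT INTO feature (uniquename, name) VALUES (?, ?)",
        ("FBal0097158", r"Scer\GAL4[alphaTub84B.PL]"),
    )
    return conn


def fbids_for_name(conn, name):
    """Return every uniquename (FBid) whose feature.name matches exactly."""
    rows = conn.execute(
        "SELECT uniquename FROM feature WHERE name = ?", (name,)
    ).fetchall()
    return [row[0] for row in rows]
```

A loader could treat zero hits as "unknown symbol" and more than one hit as "ambiguous, needs an FBid".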
Comments just crossed. Reading yours now. |
(ditto !) |
Re wiki: specs are in a branch and I've done some re-arranging. Latest versions are here - but still draft: https://github.com/VirtualFlyBrain/curation/tree/configs_and_test_recs/records/new_images Need to conclude this discussion before finalising. I'll wire them up to the wiki, and update that, when details settle down a bit more. |
thanks for the link ! |
OK,
This means that what Peeves is doing when presented with an sgml symbol to check for validity is first turning it into utf-8, using a subroutine that has a mapping of sgml greeks (&agr;, &bgr; etc.) to utf-8 and that turns [ ] style super and subscripts into the corresponding markup, and then using that utf-8 symbol to query the synonym.synonym_sgml column. This sounds like too much conversion back and forth to me for the image curation code, so I reckon that if we think that we do need to use symbols, then getting curators to spell out any greeks would be the way to go, because the answer to your question (is the greek always spelt out in these cases?) is yes: the greek is always spelt out in feature.name. I've had a look at the yaml specs and I've got a couple of ideas re symbol vs FBid (will put in next comment) |
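To make the conversion steps above concrete, here is a rough sketch (not the Peeves code) of mapping sgml greek entities either to utf-8 or to the spelt-out form used in feature.name. The table only covers four lowercase letters and the function names are invented for this example. Super/subscript handling is deliberately left out:

```python
import re

# Partial, illustrative mapping: entity name -> (utf-8 character, spelt-out form).
GREEK = {
    "agr": ("\u03b1", "alpha"),
    "bgr": ("\u03b2", "beta"),
    "ggr": ("\u03b3", "gamma"),
    "dgr": ("\u03b4", "delta"),
}

_ENTITY = re.compile(r"&([A-Za-z]+);")


def _convert(symbol, index):
    """Replace known greek entities; leave anything unrecognised untouched."""
    def repl(match):
        entry = GREEK.get(match.group(1))
        return entry[index] if entry else match.group(0)
    return _ENTITY.sub(repl, symbol)


def sgml_to_unicode(symbol):
    """'&agr;Tub84B' -> symbol with the utf-8 greek character."""
    return _convert(symbol, 0)


def sgml_to_spelled_out(symbol):
    """'&agr;Tub84B' -> symbol with the greek spelt out, as in feature.name."""
    return _convert(symbol, 1)
```

If curators spell out greeks themselves, only an exact match on feature.name is needed and neither conversion has to run in the image curation code.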
I've got a suggestion for the split_spec.yaml file. The code I made for Alex for expression curation uses the following information (1-3 submitted as columns, 4. submitted once when running script) to uniquely identify the line components involved.
e.g. R25D01 or GMR25D01
In terms of filling in columns 1-3, there are options in the code:
a. fill in combination symbol only
b. fill in all info
c. DO NOT fill in combination symbol, fill in DBD identifier, AD identifier, DBD type
For image curation, I think we'll probably want to use either a. or c. mostly. Using the symbols is probably more human readable, but for the GMR/VT lines, it's still possible to unambiguously map using symbols |
For ep_spec.yaml, I think maybe this description needs tweaking: Please use transposon rather than insertion. For promoter fusion transgenes, please use transposon rather than insertion. For enhancer trap lines, please use insertion. (i.e. it could be either FBtp or FBti ??) For this, perhaps we should see whether curators find it OK just filling in the FBid: given that in most cases it's going to be just a single driver per image (and sometimes an effector), it might not be too confusing to use ids for this rather than symbols ? |
I'm now allowing name-only for features. Will only change if we hit problems with special characters. |
Notes on design and implementation.
The spec originally allowed features to be referred to by name only, but this makes for quite a bit of overhead to deal with special characters, e.g. should greeks always be spelt out, should they use sgml, or should they use UTF-8? I'd rather follow a simpler approach if possible: whenever a feature must be referred to, the curation table has two columns, one for the name and one for the ID. Loading scripts check that the name given is present in the set of synonyms/names for the feature in question in FlyBase. A warning is thrown if this is not the case.
@gm119 - Do you think this is viable?
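As a rough sketch of the name/ID cross-check described above (not existing loader code): `check_feature_name` and the `synonyms_by_fbid` dict are hypothetical, and the dict stands in for whatever lookup of names/synonyms per FBid is pulled from FlyBase.

```python
import warnings


def check_feature_name(fbid, name, synonyms_by_fbid):
    """Warn (rather than fail) if `name` is not a known name/synonym for `fbid`.

    `synonyms_by_fbid` is assumed to map each FBid to a set of names and
    synonyms previously pulled from FlyBase.
    """
    known = synonyms_by_fbid.get(fbid)
    if known is None:
        warnings.warn(f"{fbid}: ID not found in lookup")
        return False
    if name not in known:
        warnings.warn(f"{fbid}: '{name}' is not a known name or synonym")
        return False
    return True
```

Returning a boolean as well as warning lets a loading script decide whether to carry on or skip the row.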
The spec for Pubs originally specified an FBrf or, if not available, a DOI. This could potentially be supplemented by a miniref for cross-checking.
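For the pub column, a loader might first check which kind of identifier it has been given. This is only a sketch under assumptions: the FBrf pattern (FBrf plus seven digits) and the loose DOI pattern are my guesses at a reasonable shape check, `classify_pub_ref` is a made-up name, and nothing here confirms that the reference actually exists in FlyBase.

```python
import re

# Assumed shapes, for illustration only.
FBRF_PATTERN = re.compile(r"FBrf\d{7}")
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def classify_pub_ref(value):
    """Return 'FBrf', 'DOI', or None if the value matches neither shape."""
    value = value.strip()
    if FBRF_PATTERN.fullmatch(value):
        return "FBrf"
    if DOI_PATTERN.fullmatch(value):
        return "DOI"
    return None
```

A miniref, if supplied, could then be compared against the record found for the FBrf or DOI as the cross-check.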
CC @Clare72 @gm119