EVA-3720 Create new job to QC duplicate RS Ids #465
base: master
}

if (!isSingleConnectedComponent(graph)) {
    duplicateCVEAccessions.add(cveAcc);
It might be worth adding a logging statement here, at least at debug level.
I agree. I think it makes more sense to have the warning log here instead of in appendToFile, so it happens as early as possible.
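To illustrate the suggestion, here is a minimal sketch of logging at the point of detection. It is not the PR's code: the class name, the `checkAndLog` helper, the adjacency-map representation of the graph, and the use of `java.util.logging` are all assumptions made so the example is self-contained (the project appears to use an SLF4J-style logger). The connectivity check is a plain breadth-first search over an undirected graph.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

class ConnectivityCheckSketch {
    private static final Logger logger = Logger.getLogger(ConnectivityCheckSketch.class.getName());

    // Hypothetical stand-in for isSingleConnectedComponent. Assumes an undirected
    // graph stored as a symmetric adjacency map. An empty graph counts as connected.
    static <T> boolean isSingleConnectedComponent(Map<T, Set<T>> graph) {
        if (graph.isEmpty()) {
            return true;
        }
        Set<T> visited = new HashSet<>();
        Deque<T> queue = new ArrayDeque<>();
        T start = graph.keySet().iterator().next();
        visited.add(start);
        queue.add(start);
        // Breadth-first search from an arbitrary start node.
        while (!queue.isEmpty()) {
            T node = queue.poll();
            for (T neighbour : graph.getOrDefault(node, Collections.<T>emptySet())) {
                if (visited.add(neighbour)) {
                    queue.add(neighbour);
                }
            }
        }
        // Connected only if the search reached every node in the graph.
        return visited.containsAll(graph.keySet());
    }

    // Hypothetical helper: warns as soon as the duplicate is detected,
    // before anything is written to the output file.
    static boolean checkAndLog(long cveAcc, Map<String, Set<String>> graph,
                               List<Long> duplicateCVEAccessions) {
        if (!isSingleConnectedComponent(graph)) {
            logger.warning("Duplicate RS accession detected: " + cveAcc);
            duplicateCVEAccessions.add(cveAcc);
            return true;
        }
        return false;
    }
}
```

Logging inside the detection branch keeps the message next to the decision that produced it, so the warning still appears even if the later file write fails.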
try {
    while (clusteredVariantIds.size() < chunkSize && (line = reader.readLine()) != null) {
        clusteredVariantIds.add(Long.parseLong(line.trim()));
In some cases the line will contain both the RS ID and the hash. You should take everything up to the first whitespace.
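A minimal sketch of that parsing rule, assuming the accession is always the first whitespace-delimited token. The class and method names are hypothetical, and the exact separator and hash format in the real input files are not shown in this thread.

```java
class RsIdLineParser {
    // Hypothetical helper: returns the leading accession from a line such as
    // "3000000001" or "3000000001<TAB>HASH", ignoring everything after the
    // first run of whitespace.
    static long parseRsId(String line) {
        String firstToken = line.trim().split("\\s+", 2)[0];
        return Long.parseLong(firstToken);
    }

    public static void main(String[] args) {
        System.out.println(parseRsId("3000000001"));
        System.out.println(parseRsId("3000000001\tABCDEF0123"));
    }
}
```

A blank line would still raise `NumberFormatException` here, so the reader loop may want to skip empty lines before parsing.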
    mongoTemplate.insert(Arrays.asList(ss11, ss12, ss21, ss22, ss31, ss32), SubmittedVariantEntity.class);
    mongoTemplate.insert(Arrays.asList(rs11, rs21, rs31), ClusteredVariantEntity.class);
}
To test for duplicates in the same assembly
public static void populateTestDataDuplicate(MongoTemplate mongoTemplate) {
    SubmittedVariantEntity ss11 = createSS("GCA_000000001.1", 60711, "hash" + 11, "study1", "chr1", 11L, 2L, 100L, "C", "T");
    SubmittedVariantEntity ss12 = createSS("GCA_000000001.1", 60711, "hash" + 12, "study2", "chr1", 12L, 2L, 101L, "A", "G");
    ClusteredVariantEntity rs11 = createRS(ss11);
    ClusteredVariantEntity rs12 = createRS(ss12);
    mongoTemplate.insert(Arrays.asList(ss11, ss12), SubmittedVariantEntity.class);
    mongoTemplate.insert(Arrays.asList(rs11, rs12), ClusteredVariantEntity.class);
}
Love a good graph algorithm... great work, left some suggestions on top of Tim's.
public Step duplicateRSAccQCStep(StepBuilderFactory stepBuilderFactory,
                                 SimpleCompletionPolicy chunkSizeCompletionPolicy) {
    TaskletStep step = stepBuilderFactory.get(DUPLICATE_RS_ACC_QC_STEP)
            .<List<Long>, List<Long>>chunk(1)
This should work with larger chunk sizes too, right?
private List<SubmittedVariantEntity> getAllSubmittedVariantEntitiesForCVEAccs(Set<Long> cveAccs) {
    Bson query = Filters.and(Filters.in(SVE_RS_FIELD, cveAccs));
    logger.info("Issuing find in EVA collection for a bunch of SVE containing the given CVE accs : {}", query);
    List<SubmittedVariantEntity> submittedVariantEntitiesList = mongoTemplate.getCollection(mongoTemplate.getCollectionName(SubmittedVariantEntity.class))
Maybe it's just my ignorance, but is there a reason we can't use simpler querying code like this?
Also, do we need to query both EVA and dbSNP collections? I think it might be needed to check for intersection on the full set of SS IDs.
}

private boolean listsIntersect(List<SubmittedVariantEntity> list1, List<SubmittedVariantEntity> list2) {
    return list1.stream()
I think this can be simplified a bit; see the first two answers here
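The linked answers are not preserved in this thread, so the following is only one well-known standard-library option and may not be what the reviewer had in mind. It assumes element equality via `equals` matches whatever comparison the original stream pipeline performs, which is not visible here. The class name is hypothetical and the method is made generic so the sketch runs without the project's entity classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ListsIntersectSketch {
    // Two lists intersect exactly when they are not disjoint.
    // Collections.disjoint compares elements with equals().
    static <T> boolean listsIntersect(List<T> list1, List<T> list2) {
        return !Collections.disjoint(list1, list2);
    }

    public static void main(String[] args) {
        System.out.println(listsIntersect(Arrays.asList(1L, 2L), Arrays.asList(2L, 3L)));
        System.out.println(listsIntersect(Arrays.asList(1L, 2L), Arrays.asList(3L, 4L)));
    }
}
```

If `SubmittedVariantEntity` does not define `equals` and `hashCode` in the way the comparison needs, an explicit `anyMatch` on the relevant field would be the safer choice.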
List<Long> duplicateCVEAccessions = new ArrayList<>();

// find if a CVE accession is duplicate by processing its SVE docs grouped by assembly, contig and position
We should be more explicit about how we're defining an RS duplicate - at least link to the ticket with the definition.
    mongoTemplate.insert(Arrays.asList(rs11, rs21, rs31), ClusteredVariantEntity.class);
}

public static void populateTestDataNoDuplicate2(MongoTemplate mongoTemplate) {
Can you add a comment (or write in the test name if it's not too long) stating what noDuplicate1 and 2 are testing specifically?
No description provided.