EVA-3720 Create new job to QC duplicate RS Ids #465

Open
wants to merge 2 commits into base: master

Conversation

nitin-ebi
Contributor

No description provided.

@nitin-ebi nitin-ebi self-assigned this Jan 14, 2025
}

if (!isSingleConnectedComponent(graph)) {
duplicateCVEAccessions.add(cveAcc);
Member

It might be worth adding a logging statement here, at least at debug level.

Contributor

I agree, I think it makes more sense to have the warning log here instead of in appendToFile so it happens as early as possible.
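
To make the suggestion concrete, here is a minimal sketch of logging at the point where the connectivity check fails, rather than later when results are written out. It is not the PR's code: the class name, the `checkAndRecord` helper, the `Map<String, List<String>>` graph representation, and the use of `java.util.logging` (standing in for the project's own logger) are all assumptions made for illustration. Only `isSingleConnectedComponent`, `duplicateCVEAccessions`, and `cveAcc` come from the diff.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical sketch: names and the graph representation are illustrative only.
class ConnectivityCheckSketch {
    // java.util.logging stands in for whatever logger the project actually uses.
    private static final Logger logger =
            Logger.getLogger(ConnectivityCheckSketch.class.getName());

    // Breadth-first search from an arbitrary node; the graph counts as a single
    // connected component when every node is reached. The graph is assumed to be
    // undirected, i.e. each edge appears in both adjacency lists.
    static boolean isSingleConnectedComponent(Map<String, List<String>> graph) {
        if (graph.isEmpty()) {
            return true;
        }
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        String start = graph.keySet().iterator().next();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String neighbour : graph.getOrDefault(node, List.of())) {
                if (visited.add(neighbour)) {
                    queue.add(neighbour);
                }
            }
        }
        return visited.containsAll(graph.keySet());
    }

    // Logs as soon as the problem is detected, then records the accession.
    // Returns true when the accession was recorded.
    static boolean checkAndRecord(long cveAcc,
                                  Map<String, List<String>> graph,
                                  List<Long> duplicateCVEAccessions) {
        if (!isSingleConnectedComponent(graph)) {
            logger.warning("Possible duplicate RS: CVE accession " + cveAcc
                    + " does not form a single connected component");
            duplicateCVEAccessions.add(cveAcc);
            return true;
        }
        return false;
    }
}
```

Whether the message belongs at warning or debug level is the open question in this thread; the sketch uses warning only because that is what the reply above proposes.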


try {
while (clusteredVariantIds.size() < chunkSize && (line = reader.readLine()) != null) {
clusteredVariantIds.add(Long.parseLong(line.trim()));
Member

In some cases the line will contain both the RS ID and the hash. You should take everything up to the first whitespace.
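
A minimal sketch of what that could look like, assuming the RS ID is always the first whitespace-delimited token on the line. The class and method names are hypothetical, and skipping blank lines is an extra assumption that the diff does not show; only the loop shape is taken from the snippet above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the PR's reader, just the parsing idea.
class RsIdChunkReaderSketch {
    // Keeps everything up to the first whitespace and parses it as a long,
    // so both "12345" and "12345 SOMEHASH" yield 12345.
    static long parseRsId(String line) {
        String firstToken = line.trim().split("\\s+", 2)[0];
        return Long.parseLong(firstToken);
    }

    // Reads at most chunkSize IDs; returns an empty list once input is exhausted.
    static List<Long> readChunk(BufferedReader reader, int chunkSize) throws IOException {
        List<Long> clusteredVariantIds = new ArrayList<>();
        String line;
        while (clusteredVariantIds.size() < chunkSize && (line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines are ignored rather than treated as errors
            }
            clusteredVariantIds.add(parseRsId(line));
        }
        return clusteredVariantIds;
    }
}
```

A line whose first token is not numeric would still throw `NumberFormatException`, as `Long.parseLong(line.trim())` does in the current code.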

mongoTemplate.insert(Arrays.asList(ss11, ss12, ss21, ss22, ss31, ss32), SubmittedVariantEntity.class);
mongoTemplate.insert(Arrays.asList(rs11, rs21, rs31), ClusteredVariantEntity.class);
}

Member

To test for duplicates in the same assembly

Suggested change
public static void populateTestDataDuplicate(MongoTemplate mongoTemplate) {
    SubmittedVariantEntity ss11 = createSS("GCA_000000001.1", 60711, "hash" + 11, "study1", "chr1", 11L, 2L, 100L, "C", "T");
    SubmittedVariantEntity ss12 = createSS("GCA_000000001.1", 60711, "hash" + 12, "study2", "chr1", 12L, 2L, 101L, "A", "G");
    ClusteredVariantEntity rs11 = createRS(ss11);
    ClusteredVariantEntity rs12 = createRS(ss12);
    mongoTemplate.insert(Arrays.asList(ss11, ss12), SubmittedVariantEntity.class);
    mongoTemplate.insert(Arrays.asList(rs11, rs12), ClusteredVariantEntity.class);
}

Contributor

@apriltuesday left a comment

Love a good graph algorithm... great work, left some suggestions on top of Tim's.

public Step duplicateRSAccQCStep(StepBuilderFactory stepBuilderFactory,
SimpleCompletionPolicy chunkSizeCompletionPolicy) {
TaskletStep step = stepBuilderFactory.get(DUPLICATE_RS_ACC_QC_STEP)
.<List<Long>, List<Long>>chunk(1)
Contributor

This should work with larger chunk sizes too, right?

private List<SubmittedVariantEntity> getAllSubmittedVariantEntitiesForCVEAccs(Set<Long> cveAccs) {
Bson query = Filters.and(Filters.in(SVE_RS_FIELD, cveAccs));
logger.info("Issuing find in EVA collection for a bunch of SVE containing the given CVE accs : {}", query);
List<SubmittedVariantEntity> submittedVariantEntitiesList = mongoTemplate.getCollection(mongoTemplate.getCollectionName(SubmittedVariantEntity.class))
Contributor

Maybe this is just my ignorance, but is there a reason we can't use simpler querying code like this?

Contributor

Also, do we need to query both EVA and dbSNP collections? I think it might be needed to check for intersection on the full set of SS IDs.

}

private boolean listsIntersect(List<SubmittedVariantEntity> list1, List<SubmittedVariantEntity> list2) {
return list1.stream()
Contributor

I think this can be simplified a bit, see first two answers here
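
The linked answers are not reproduced here, so the following is only one common way to simplify an intersection check, not necessarily the one referred to. It assumes the elements implement `equals` in a way that matches the intended notion of "same SS"; the class name is hypothetical and the generic signature is used so the sketch stays self-contained.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a generic stand-in for the PR's listsIntersect.
class ListsIntersectSketch {
    // Collections.disjoint returns true when the two collections share no
    // element, so its negation answers "do these lists intersect?".
    static <T> boolean listsIntersect(List<T> list1, List<T> list2) {
        return !Collections.disjoint(list1, list2);
    }
}
```

For `SubmittedVariantEntity`, the result depends on how that class defines equality, so comparing by accession or hash explicitly may still be the safer choice.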


List<Long> duplicateCVEAccessions = new ArrayList<>();

// find if a CVE accession is duplicate by processing its SVE docs grouped by assembly, contig and position
Contributor

We should be more explicit about how we're defining an RS duplicate - at least link to the ticket with the definition.
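
This thread does not state the ticket's definition, so the sketch below illustrates only the grouping step that the code comment mentions (assembly, contig and position) and makes no claim about what counts as a duplicate. The `Sve` class is a hypothetical stand-in for `SubmittedVariantEntity` with just the fields needed here, and the `|`-joined string key is an assumption made for brevity.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: illustrates grouping only, not the duplicate definition.
class GroupingSketch {
    // Minimal stand-in for SubmittedVariantEntity.
    static final class Sve {
        final long ssAccession;
        final String assembly;
        final String contig;
        final long start;

        Sve(long ssAccession, String assembly, String contig, long start) {
            this.ssAccession = ssAccession;
            this.assembly = assembly;
            this.contig = contig;
            this.start = start;
        }
    }

    // Groups the SVEs of one RS by "assembly|contig|start".
    // Assumes '|' never occurs in assembly or contig names.
    static Map<String, List<Sve>> groupByLocus(List<Sve> sves) {
        return sves.stream()
                .collect(Collectors.groupingBy(
                        s -> s.assembly + "|" + s.contig + "|" + s.start));
    }
}
```

A comment next to the real grouping code could then say, in one or two sentences, which relationship between these groups makes an RS a duplicate, with a link to the ticket.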

mongoTemplate.insert(Arrays.asList(rs11, rs21, rs31), ClusteredVariantEntity.class);
}

public static void populateTestDataNoDuplicate2(MongoTemplate mongoTemplate) {
Contributor

Can you add a comment (or write in the test name if it's not too long) stating what noDuplicate1 and 2 are testing specifically?

3 participants