Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add start and end positions in output matrix #135

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

acesnik
Copy link

@acesnik acesnik commented May 4, 2024

I'm drafting a PR for #110

Copy link
Owner

@lazear lazear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, left a few specific comments.

Additional considerations not addressed:
During deduplication of identical peptide sequences post-modification, add protein-positions

keep.proteins.extend(remove.proteins.iter().cloned());

Ensure that positions and proteins identified are co-sorted:

.for_each(|peptide| peptide.proteins.sort_unstable());

Might be best to define a new struct for assigning proteins that includes both the identifier and positions, to preclude any potential bugs from the above.

struct ProteinAssignment {
   identifier: String,
   // do these need to be usize (8 bytes)? u32 (4 bytes) is probably sufficient...
   start: usize,
   end: usize,
}

@@ -28,6 +28,10 @@ pub struct Peptide {
pub position: Position,

pub proteins: Vec<Arc<String>>,
/// What residue does this peptide start at in the protein (1-based inclusive)?
pub start_position: Vec<Arc<usize>>,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for Arc - this is a smart pointer (atomic reference counted) allocated on the heap. It's used for proteins: String (which is already heap allocated) to prevent repeated clones of protein identifiers. Doesn't make sense to use an Arc for a simple usize in this case!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense!

@@ -76,6 +82,7 @@ impl std::hash::Hash for Digest {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.sequence.hash(state);
self.position.hash(state);
self.start_position.hash(state);
Copy link
Owner

@lazear lazear May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct behavior (won't necessarily cause a bug though). We use hash to deduplicate digests prior to creating peptides.

Only sequence and position are considered to ensure that we don't accidentally deduplicate peptide sequences that occur on protein termini, which might be assigned termini-specific modifications. Including start_position will lead to extra (duplicated) digests being created -> more duplicated peptides that must be generated (expensive) and then trashed during de-duplication.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense. I was thinking about maybe including a list of start indices for each protein in the proteins list, but that probably doesn't make sense from a peptide-centric search perspective, and other search engines just list the index for the leading protein.

"rank",
"label",
"expmass",
"calcmass",
"measured_mass",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for changing column names?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the rationale here, although I'm not wedded to them, are to 1) keep the underscore delimiting in the rest of the columns, 2) spell out calculated and experimental/measured, and 3) including number after scan doesn't feel necessary since every entry has "scan=[scan number]".

Copy link
Owner

@lazear lazear May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to it - but the column names have been stable for ~2.5 years so I don't see a huge need to rename them (and thus force people to retool pipelines downstream of Sage). They were originally named like this because sage results (used) to be passed directly to mokapot for rescoring.

@@ -373,6 +380,8 @@ impl TryFrom<Digest> for Peptide {
missed_cleavages: value.missed_cleavages,
semi_enzymatic: value.semi_enzymatic,
proteins: vec![value.protein],
start_position: value.start_position,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must be a Vec of positions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@acesnik
Copy link
Author

acesnik commented May 7, 2024

Thanks for the feedback on this! I'll continue working on the PR and mark it ready for review when ready or ask some follow-up questions along the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants