prevent un-necessary recalculation of RMSD matrix #20
tanmoy7989 wants to merge 1 commit into salilab:main from
shruthivis
left a comment
Looks fine to me, can merge.
benmwebb
left a comment
It seems dangerous to me to cache the matrix file here (see inline comments).
| print("Computing RMSD matrix...") | ||
| inner_data = rmsd_calculation.get_rmsds_matrix( | ||
| conforms, args.mode, args.align, args.cores, symm_groups) | ||
| conforms, args.mode, args.align, args.cores, symm_groups) |
This diff seems unnecessary since you don't change any logic, only whitespace. In fact, it may even cause flake8 to complain.
| # afterwards (so that we retain the original IMP orientation) | ||
| numpy.save("conforms", conforms) | ||
|
|
||
| if not os.path.isfile("Distances_Matrix.data.npy"): |
What happens if the user
- runs a simulation
- does analysis, creating Distances_Matrix.data.npy
- goes back and runs a new simulation
- does a 2nd round of analysis?
Won't this file then be out of date and result in a hard-to-diagnose issue (the matrix is for the first run but is used in the second run)? If so, one fix might be to stat the matrix/RMF/PDB files and only skip the matrix creation if the .npy file is newer than all RMF/PDB files. Another would be to never reuse your cached matrix unless the user explicitly requests it with a --use-cache option or similar. Always better to err on the side of caution with caching.
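A minimal sketch of the first suggestion (compare modification times before reusing the saved matrix). The helper name `matrix_is_fresh` and the idea of passing the RMF/PDB paths in as a list are assumptions for illustration, not code from this PR:

```python
import os


def matrix_is_fresh(matrix_path, input_paths):
    """Return True only if matrix_path exists and is newer than every input.

    Hypothetical helper: input_paths would be the RMF/PDB files that the
    saved matrix was computed from.
    """
    if not os.path.isfile(matrix_path):
        return False
    matrix_mtime = os.path.getmtime(matrix_path)
    # An input modified at or after the matrix was written makes it stale.
    return all(os.path.getmtime(p) < matrix_mtime for p in input_paths)
```

The caller would skip the RMSD computation only when this returns True. Modification times are a heuristic: they catch the "new simulation, old matrix" case described above, but not a change in options such as the alignment mode.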
The all-by-all RMSD distance matrix is calculated only when the sampling_precision calculation is requested. However, I have a use case where I don't want to calculate sampling precision, but rather have prior information about the cluster threshold. I want to just cluster the models based on this threshold and obtain the cluster precision. But model precision calculation without the sampling precision calculation is possible only when the Distances_Matrix.data.npy file has already been created and contains the RMSD distance matrix.
So, I changed this so that the RMSD distance matrix is always calculated and stored in Distances_Matrix.data.npy (irrespective of whether the sampling precision calculation is skipped or not), except when the script is being re-run and that file already exists.
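One way the reuse described here could be made opt-in, following the reviewer's --use-cache idea, is sketched below. The flag wiring and the function names `build_parser` and `should_compute_matrix` are hypothetical and are not part of this PR or of the existing script:

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical flag: reuse of a saved matrix is opt-in, never automatic.
    parser.add_argument("--use-cache", action="store_true",
                        help="reuse an existing Distances_Matrix.data.npy")
    return parser


def should_compute_matrix(use_cache,
                          matrix_path="Distances_Matrix.data.npy"):
    """Recompute unless the user opted in AND the saved matrix exists."""
    return not (use_cache and os.path.isfile(matrix_path))
```

With this shape, a plain re-run always recomputes and overwrites the matrix, and only an explicit --use-cache run loads the file written earlier.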