Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpy2/r-bio3d integration for testing/R interop; memory efficient implementation for DCC computation #14

Merged
merged 4 commits into from
Feb 25, 2024

Conversation

JHKru
Copy link
Member

@JHKru JHKru commented Jan 26, 2024

  1. rpy2/r-bio3d integration:
    Currently, tests with bio3d and BiophysConnectoR as references consist of .csv files generated separately by R.
    This offers limited flexibility when updating tests:
    R scripts have to be adjusted and .csv files regenerated manually for each change.
    This would quickly get unsustainable and confusing, in case tests are expanded, for example to a larger range of proteins.
    With rpy2, bio3d can directly be called from within Python scripts with (rudimentary) interconversion between bio3d-PDB objects and AtomArrays.
    Both packages can be readily installed via conda.
    Additional dependencies for springcraft-dev are the only downsides that come to mind.
    The obsolete .csv files were deleted for this pull request.

  2. Memory efficient DCC computation variant
    The previous approach to compute DCCs needs large amounts of memory for medium/large proteins.
    The memory efficient variant was reasonably fast in tests and could therefore replace the older variant.
    Both variants are still present for now (the older one can be toggled by mem_eff=False).

@JHKru JHKru changed the title rpy2/r-bio3d integration for testing/ R interop; Memory efficient implementation for DCC computation rpy2/r-bio3d integration for testing/ R interop; memory efficient implementation for DCC computation Jan 26, 2024
@padix-key
Copy link
Member

Hi, feel free to ask, if you would like to have a review. Since you propose the mem_eff switch: What is the advantage of setting mem_eff=False? Is it significantly slower?

@JHKru
Copy link
Member Author

JHKru commented Jan 31, 2024

Hi, sure; that would be great, if you find the time.
Regarding DCC mem_eff:
I haven't tested it systematically yet to be honest and mainly left it as fallback option or (possibly) for future tests.
The memory efficient variant should be slower for mode subsets (written in a for-loop), but it is not noticeable for the small protein we use in our tests.

@JHKru JHKru requested a review from padix-key January 31, 2024 21:48
@JHKru JHKru marked this pull request as ready for review January 31, 2024 21:49
@padix-key
Copy link
Member

Could you run a quick and dirty timing test? If it is not significantly slower and it passes our tests, but also provides a big memory improvement, I would say we enforce this new way:
This avoids maintencance from our side and does not require additional attention by the user.

@JHKru
Copy link
Member Author

JHKru commented Feb 16, 2024

image
-> Averages over five repetitions.
The new mem_eff method is faster for larger proteins (compare mem_eff with no_mem_eff).
Surprisingly, the loop-based variant for mem_eff considering only the first 100 modes is also faster than its no_mem_eff counterpart (mem_eff_subset_loop vs. no_mem_eff_subset).
Lastly, the loop-based, mem_eff variant (considering all modes) is also faster than no_mem_eff.

In summary, we can and should remove the mem_eff option and the older methods.

@JHKru
Copy link
Member Author

JHKru commented Feb 16, 2024

One additional annoyance to consider:
I tested the development environment on Windows today and rpy2 did not work properly.
There are some known problems with Windows specifically, both with conda and pip
(for example, www.github.com/rpy2/rpy2/issues/848#issuecomment-1059785984).
With WSL, everything works as expected.

@padix-key
Copy link
Member

One additional annoyance to consider: I tested the development environment on Windows today and rpy2 did not work properly. There are some known problems with Windows specifically, both with conda and pip (for example, www.github.com/rpy2/rpy2/issues/848#issuecomment-1059785984). With WSL, everything works as expected.

From my perspective this is not a large problem, as this happens only for the tests, which most users wouldn't run anyway.
If it works in the CI this is fine for me.

Does this problem appear in during the rpy2 installation or during test execution? In the latter case, I would suggest to use pytest.mark.skip() or pytest.skip(), so if anyone happens to run the tests, it is directly clear that there is not problem with the code.

Copy link
Member

@padix-key padix-key left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally the code looks good to me, I would only suggest some small style changes. I really like the new tests, it is much cleaner now!

Comment on lines 214 to 221
reference_fluc_subset = np.array(
bio3d.fluct_nma(enm_nma_bio3d,
mode_inds=r_seq(12,33)
))
reference_dcc = np.array(bio3d.dccm(enm_nma_bio3d))
reference_dcc_subset = np.array(
bio3d.dccm(enm_nma_bio3d, nmodes=30)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To much indentation

Comment on lines 192 to 195
reference_fluc = np.genfromtxt(
join(data_dir(), ref_fluc),
skip_header=1, delimiter=","
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much indentation

Comment on lines 330 to 332
reshaped = cov.reshape(
cov.shape[0]//3, 3, -1, 3
).swapaxes(1,2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much indentation. Furtermore I would use cov.shape[0]//3, 3, cov.shape[0]//3, 3 just to reflect that both values are the same N.

Comment on lines 344 to 346
modes_reshaped = np.reshape(
eig_vectors, (len(mode_subset), -1, num_dim)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much indentation

Comment on lines 291 to 292
from springcraft import GNM
from springcraft import ANM
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a reason not to use a relative import here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really; changed that.

@padix-key
Copy link
Member

I would propose to do a squash and merge as soon as you are ready to keep the git history a bit cleaner

Refinements; DCC

Updated environment.yml -> bio3d as Dev-Dependency

Updated environment.yml -> bio3d as Dev-Dependency

Revert "Updated environment.yml -> bio3d as Dev-Dependency"

This reverts commit a4bb8de.

Revert "Updated environment.yml -> bio3d as Dev-Dependency"

This reverts commit 8cc2017.

Add r2py/bio3d as dev dependencies

Adjusted tests; incorporated rpy2
Add rpy2 and r-bio3d to pyproject.toml

Added correct versions in .toml

Revert "Integrated rpy2 into tests -> fix"

This reverts commit 8133498.

Added rpy2/r-bio3d -> fix

Revert "Added correct versions in .toml"

This reverts commit 02562ee.

Revert "Add rpy2 and r-bio3d to pyproject.toml"

This reverts commit ecc420f.

Update .gitignore

Added rpy2/r-bio3d to GitHub-Workflow

Removed mem_eff argument for DCC
@JHKru
Copy link
Member Author

JHKru commented Feb 22, 2024

Not sure, if all indentations are correct now, but I think I'm ready.
Is the commit history fine this way?

@JHKru JHKru changed the title rpy2/r-bio3d integration for testing/ R interop; memory efficient implementation for DCC computation rpy2/r-bio3d integration for testing/R interop; memory efficient implementation for DCC computation Feb 22, 2024
Comment on lines 192 to 194
reference_fluc = np.genfromtxt(join(data_dir(), ref_fluc),
skip_header=1, delimiter=","
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places in the code base, and as far as I have seen, most commonly in Python projects, the indentation of a line with a closing bracket bracket is the same as the corresponding opening bracket.

Suggested change
reference_fluc = np.genfromtxt(join(data_dir(), ref_fluc),
skip_header=1, delimiter=","
)
reference_fluc = np.genfromtxt(
join(data_dir(), ref_fluc),
skip_header=1, delimiter=","
)

@padix-key
Copy link
Member

Hi, sorry for the late response. A added an example change to the review to show my preferred indentation style (see above). If you still prefer yours, that would be also fine for me, then I would merge.

@JHKru
Copy link
Member Author

JHKru commented Feb 23, 2024

No worries, indentations should be adjusted now.

@padix-key padix-key merged commit 914b997 into biotite-dev:master Feb 25, 2024
1 check passed
@JHKru JHKru mentioned this pull request Sep 14, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants