
Fix memory leak by removing custom __copy__ logic #1227

Merged
merged 4 commits into TorchIO-project:main on Jan 27, 2025

Conversation

nicoloesch
Contributor

Fixes #1222 .

Description

This change removes the custom __copy__ logic in torchio.Subject. As a result, calling copy.copy() on a torchio.Subject will no longer create a mixture of shallow and deep copies. Therefore, to

  • create a shallow copy, utilise copy.copy(subject)
  • create a deep copy, utilise copy.deepcopy(subject)

Removing this mix of deep and shallow copies from the shallow-copy method (__copy__) appears to benefit memory management in Python: the garbage collector can now correctly free torchio.Subject instances and their images once they are no longer referenced.
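
For illustration, a minimal sketch of the resulting semantics (my own example, not from the PR; it assumes a tensor-backed Subject with a hypothetical key image):

import copy

import torch
import torchio as tio

subject = tio.Subject(image=tio.ScalarImage(tensor=torch.rand(1, 8, 8, 8)))

shallow = copy.copy(subject)   # shallow copy: shares the underlying Image objects
deep = copy.deepcopy(subject)  # deep copy: fully independent Image objects

assert shallow['image'] is subject['image']   # same Image instance is shared
assert deep['image'] is not subject['image']  # deep copy holds its own Image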

This required changing copy.copy to copy.deepcopy in the torchio.Transform base class so that a transform does not modify the original subject but instead creates and modifies a copy of it (applied only if copy=True during instantiation of any transform). I assume this is the expected behaviour based on the provided documentation and code logic.
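
Roughly, the base class behaves like the following sketch after this change (paraphrased from my reading of the code, not the verbatim torchio source):

import copy

class Transform:
    def __init__(self, copy: bool = True):
        self.copy = copy

    def __call__(self, subject):
        if self.copy:
            # A deep copy is needed here: a shallow copy would share the
            # image tensors with the input, so applying the transform
            # would still mutate the original subject's data.
            subject = copy.deepcopy(subject)
        return self.apply_transform(subject)

    def apply_transform(self, subject):  # implemented by each concrete transform
        raise NotImplementedError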

Checklist

  • I have read the CONTRIBUTING docs and have a developer setup (especially important are pre-commit and pytest)
  • Non-breaking change (would not break existing functionality)
  • Breaking change (would cause existing functionality to change)
  • Tests added or modified to cover the changes
  • Integration tests passed locally by running pytest
  • In-line docstrings updated
  • Documentation updated, tested running make html inside the docs/ folder
  • This pull request is ready to be reviewed

Important Notes

The absence of the memory leak has so far only been verified on my local machine, as outlined in the comment in the original issue. The reviewer should verify it independently on their machine by following the instructions in the linked comment.


codecov bot commented Oct 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.51%. Comparing base (c7f3b82) to head (317fd09).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1227      +/-   ##
==========================================
- Coverage   84.53%   84.51%   -0.03%     
==========================================
  Files          92       92              
  Lines        5982     5966      -16     
==========================================
- Hits         5057     5042      -15     
+ Misses        925      924       -1     


@nicoloesch
Contributor Author

Update: I am able to run my models with roughly 6GB of memory usage instead of 70-100GB. It appears the change fixed the memory leak.

@romainVala
Contributor

Hi
I confirm that using this PR does indeed remove the memory increase observed in this example.

@romainVala
Contributor

Not sure this is related to this issue, but I now wonder what the copy argument of transforms is intended for?
I thought that copy=False would apply the transform directly to the input subject,
but:

import torchio as tio
suj = tio.datasets.Colin27()
tr = tio.RandomAffine(copy=False)
sujt = tr(suj)

# Even though copy=False, sujt is a new object: suj and sujt are
# different and have different data.

suj.history
# []

sujt.history
# [Affine(scales=(tensor(0.9252,  ...


@nicoloesch
Contributor Author

@romainVala I assume the copy argument is there to make a deep copy of the subject and operate only on that copy, while the initially loaded subject stays unchanged. One use case of copy=True could be a Subject containing Image elements that are initialised with a tensor as opposed to an image path (see the sketch below). However, I was also wondering whether we should default this argument to False instead, since we re-load the individual every time we iterate over the DataLoader if we provide a path argument. At least that is my understanding. What are your thoughts @fepegar?
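
To make the tensor-backed use case concrete, a hypothetical sketch (RandomNoise and the key image are my choices, not from this thread; it assumes the post-PR behaviour). An in-memory image has no file path to reload from, so without copying, an in-place transform would overwrite the only copy of the data:

import torch
import torchio as tio

# In-memory image: there is no file path, so the original data could not
# be recovered by re-loading from disk.
tensor = torch.rand(1, 16, 16, 16)
subject = tio.Subject(image=tio.ScalarImage(tensor=tensor))

transform = tio.RandomNoise(copy=True)  # protect the original subject
transformed = transform(subject)

# The original tensor is untouched; only the copy was modified.
assert torch.equal(subject['image'].data, tensor)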

@fepegar
Member

fepegar commented Oct 24, 2024

we would re-load the individual every time we iterate over the DataLoader if we provide a path argument

Is this not what you would expect? Otherwise, you might run out of memory if your dataset is not small.

@nicoloesch
Contributor Author

Is this not what you would expect? Otherwise, you might run out of memory if your dataset is not small.

This is exactly what I expect, but wouldn't the copy then be entirely obsolete? We re-load the subject from disk every time, so we always have the initial state available. What is the point of protecting the initial subject (through copying) if we never store it back, i.e. never change the image on disk?

@romainVala
Contributor

romainVala commented Oct 25, 2024

This was exactly my point; I do not see how the copy argument changes the behaviour.

If I test with a tensor, the result is the same:

import numpy as np
import torch
import torchio as tio

img = tio.ScalarImage(tensor=torch.rand((1, 10, 10, 10)), affine=np.eye(4))
suj = tio.Subject({'image1': img})
tr = tio.RandomAffine(copy=False)
sujt = tr(suj)

suj.image1.data[0, 0, 0, :4]
# tensor([0.7858, 0.4274, 0.7144, 0.6861])

sujt.image1.data[0, 0, 0, :4]
# tensor([0.7443, 0.5312, 0.6450, 0.3963])

So the original subject and the transformed one differ despite the copy=False argument.

But wait, I found out where it comes from:
RandomAffine creates another transform, Affine, but forgets to pass the copy argument,
so the actual transform here is done with copy=True!

If I change the line
https://github.com/fepegar/torchio/blob/cfc8aaff11f346cc3286257ab7903464e7d5cfdb/src/torchio/transforms/augmentation/spatial/random_affine.py#L187
to
transform = Affine(**self.add_include_exclude(arguments), copy=self.copy)

then
tr = tio.RandomAffine(copy=False)
now works as expected (i.e. suj and sujt are identical!).

so this is indeed related to this issue #1223

Note that it now also works for my first example (with subject Colin27), so the copy argument acts the same whether or not one uses a tensor as input.

@fepegar
Member

fepegar commented Nov 16, 2024

@nicoloesch I'm assuming #1228 needs to be merged before this?
@romainVala are you happy with the changes in this PR?

Sorry folks, I'm a bit busy these days.

@nicoloesch
Contributor Author

@fepegar please excuse my delayed response. I think the order of merging the two PRs should not matter, as they touch different parts of the code base. The only file they share is transform.py, where this PR just changes how __call__ behaves in terms of making a shallow/deep copy, whereas #1228 touches the arguments of specific transformations during apply_transform in the case of re-using an already existing transform.

@sh-shahrokhi

sh-shahrokhi commented Jan 9, 2025

Hello, is there any chance that this gets merged?

@nicoloesch
Contributor Author

@sh-shahrokhi I am not entirely sure when this is getting merged into the main branch. In the meantime, you can still utilise my feature branch, which alleviates the issue (also referenced in the PR). The only "issue" is that it is based on 0.20.1 instead of the newest version of torchio.

@fepegar
Member

fepegar commented Jan 27, 2025

Picking this up now.

@fepegar fepegar force-pushed the 1222-fix-memory-leak branch from 7245fb8 to f8335ef Compare January 27, 2025 20:48
@fepegar fepegar merged commit 7f4b5a8 into TorchIO-project:main Jan 27, 2025
20 of 21 checks passed
@fepegar
Member

fepegar commented Jan 27, 2025

Thanks everyone for the discussion and your patience, and of course thanks @nicoloesch for your great work!

@fepegar fepegar mentioned this pull request Jan 27, 2025