-
Notifications
You must be signed in to change notification settings - Fork 21
Improving read and write speed #101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…, and adding disable_gc kwarg to gdx.GdxSymbol.load, read_gdx.to_dataframes, and read_gdx.to_dataframe
elainethale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few initial comments.
| target_symbol = GdxSymbol(name, data_type, dims=dims, file=self, index=index) | ||
| self.append(target_symbol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is skipping this for symbols whose data you don't want to load costly enough in terms of time to warrant this separate function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In large GDX files, it is a very significant improvement in runtime. We encounter regularly 100-1000 Mb GDX files with hundreds of symbols. This method is sometimes up to 15x faster when reading a single symbol from a large file. There are of course other overheads, like opening the file, etc and thus the old method is significantly better to read all symbols. The increase in code maintenance and documentation is the biggest drawback. Small files are always extremely fast.
| if not self.empty: | ||
| raise Error("GdxFile.read_single_symbol can only be used if the GdxFile is .empty") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that you can't call this function multiple times to read in the metadata for multiple symbols? So you can either read in the meta data for all symbols or for just one symbol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That section should probably be improved. You can call that function multiple times, because it always closes the file and starts refreshed. That was a safeguard while developing and the code should not reach that section, but there might be some use case I can't think of and didn't remove that Error.
| # restore GC if changed | ||
| if disable_gc and gc_was_enabled and not gc.isenabled(): | ||
| gc.enable() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad this is here. It makes me more comfortable that disabling is default behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: "def load(self,load_set_text=False,disable_gc=True):" Carbage collector is Python's internal guard against memory leaks etc. and this disables it by default. "disable_gc=False" would be the python default behavior. The speedup with large GDX files is quite large and I didn't notice any issues while testing.
Improving read and write speed, adding gdx.GdxFile.read_single_symbol, and adding disable_gc kwarg to gdx.GdxSymbol.load, read_gdx.to_dataframes, and read_gdx.to_dataframe