Update/hackathon feedback #35
Conversation
…rk of creating fits, jpgs, and storing the output for you
There are a few places where the logic is starting to get complicated, and without tests that exercise these new functions directly, I'm starting to worry about the maintainability of this.
```diff
@@ -112,3 +109,18 @@ def scale_points(height_1: int, width_1: int, height_2: int, width_2: int, x_poi
     x_points = width_2 - x_points

     return x_points, y_points

+def create_output(cache_key, np_array=None, fits_file=None, large_jpg=None, small_jpg=None, index=None, comment=None):
```
I think we're doing a lot here for the programmer, and there's a ton of combinations of args that can result in errors that aren't being handled. What happens, for instance, if np_array is None and fits_file is None? Presumably save_fits_and_thumbnails will be called and it will error out. Nothing indicates to the developer that certain args are required in certain combinations, nor is it clear what exceptions will be raised.
I think we can create a class that does some input validation, throws errors if things are mal-defined, and has some nice helper methods to abstract some of the work away from the developer. Maybe we can whiteboard this out tomorrow?
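Not part of the PR, but a minimal sketch of what such a class could look like; the name OutputHandler and the delegation step are hypothetical, with the argument names taken from the create_output signature above:

```python
import numpy as np

class OutputHandler:
    """Hypothetical sketch: validate inputs up front instead of letting a
    bad combination of args fail deep inside save_fits_and_thumbnails."""

    def __init__(self, cache_key, np_array=None, fits_file=None, comment=None):
        if np_array is None and fits_file is None:
            raise ValueError('either np_array or fits_file is required')
        if np_array is not None and not isinstance(np_array, np.ndarray):
            raise TypeError('np_array must be a numpy.ndarray')
        self.cache_key = cache_key
        self.np_array = np_array
        self.fits_file = fits_file
        self.comment = comment

    def create_output(self):
        # Inputs are known-good by this point; delegate to the existing
        # helpers (FITS creation, jpg generation, bucket upload).
        ...
```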
```diff
 log = logging.getLogger()
 log.setLevel(logging.INFO)

-def get_hdu(basename: str, extension: str = 'SCI', source: str = 'archive') -> list[fits.HDUList]:
+def get_hdu(path: str, extension: str = 'SCI') -> list[fits.HDUList]:
```
Similarly, I think a lot of these (get_hdu, get_fits, etc.) could live in a class that abstracts them away into a very simple, testable interface. I'm having trouble following what a few of these functions do and how they should be called, so maybe we can meet about this tomorrow?
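As a rough illustration (the class and method names are made up), the per-file fetching and HDU access could hide behind one small object:

```python
from astropy.io import fits

class InputFitsFile:
    """Hypothetical sketch: wrap one input file so fetching happens once
    and HDU access has a single, easily tested entry point."""

    def __init__(self, path: str):
        self.path = path

    def hdu_data(self, extension: str = 'SCI'):
        # Copy the data out so the file handle can close safely.
        with fits.open(self.path) as hdu_list:
            return hdu_list[extension].data.copy()
```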
```python
@mock.patch('datalab.datalab_session.data_operations.median.save_fits_and_thumbnails')
@mock.patch('datalab.datalab_session.data_operations.median.create_jpgs')
@mock.patch('datalab.datalab_session.data_operations.data_operation.get_fits')
@mock.patch('datalab.datalab_session.file_utils.save_fits_and_thumbnails')
```
Something I had noticed but glossed over before is the amount of mocking we're having to do. It points to the fact that the code isn't easily testable (components are tightly coupled, for instance), which means we should consider refactoring a bit.
Aside from this, for test classes that mock a whole set of things to test a data operation, you can mock out the common things at the class level so you don't have to repeat them for each test in the class.
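For example, unittest.mock applies a class decorator to every test* method, so the patches above could be written once (the patch targets are copied from this diff; the test class name and bodies are made up):

```python
from unittest import TestCase, mock

@mock.patch('datalab.datalab_session.data_operations.median.save_fits_and_thumbnails')
@mock.patch('datalab.datalab_session.data_operations.median.create_jpgs')
class TestMedianOperation(TestCase):
    # Class-level patches cover every test* method; the mocks arrive as
    # extra arguments in bottom-up decorator order.
    def test_operate(self, mock_create_jpgs, mock_save_fits_and_thumbnails):
        mock_create_jpgs.return_value = ('large.jpg', 'small.jpg')
        mock_save_fits_and_thumbnails.return_value = 'cache_key/output.fits'
        ...
```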
Looks good, but I would like to keep updating progress while input files are downloading.
```diff
 image_data_list.append(sci_hdu.data)

-if percent is not None and cur_percent is not None:
-    self.set_percent_completion(cur_percent + index/total_files * percent)
```
A lot of the operation progress is used when downloading the input files to work on. It seems bad to remove any reporting of that progress; i.e. if you do an operation with 100 images, you will sit there with 0% progress for a long while before it finishes loading all 100 images. Can we retain the reporting of progress up to some maximum based on the number of images it's downloading?
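One way that could look inside operate (a sketch; the cap value and loop shape are assumptions, reusing get_fits and set_operation_progress from the diffs in this thread):

```python
DOWNLOAD_PROGRESS_CAP = 0.5  # assumption: downloads count for at most half the bar

fits_paths = []
for index, file in enumerate(input_files, start=1):
    fits_paths.append(get_fits(file.get('basename')))
    # Download progress scales into [0, cap] no matter how many files there are
    self.set_operation_progress(index / len(input_files) * DOWNLOAD_PROGRESS_CAP)
```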
Yeah, it's a lot harder to track the progress. Like you said, downloading is most of the work for many operations, and I was also dubious about removing the download percent. Next time I should call a meeting to go over the feedback and decide whether we should act on it. The customer isn't always right, haha.
```python
fits_paths = []
for file in rgb_input_list:
    fits_paths.append(get_fits(file.get('basename')))
    self.set_operation_progress(self.get_operation_progress() + 0.2)
```
This is a generic comment, and I know we had already been doing this before, but calling .get_operation_progress() every time we want to set the value causes an extra lookup into the remote cache, which is wasteful. We should probably maintain a local variable with the operation progress in the class and use it during the operate method so we don't waste time retrieving the value from the cache each time we set it.
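A sketch of that pattern, rewriting the loop above (rgb_input_list and the 0.2 step come from the snippet; the structure is an assumption); only set_operation_progress touches the cache now:

```python
def operate(self):
    progress = 0.0  # local mirror of progress: one cache write per update, zero cache reads
    fits_paths = []
    for file in rgb_input_list:
        fits_paths.append(get_fits(file.get('basename')))
        progress += 0.2
        self.set_operation_progress(progress)
```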
…on progress but boost to performance with not checking the cache so often
Based on developer feedback from the hackathon, several aspects of the experience stood to be improved:
- get_hdu now works on a FITS file path instead of doing the fetching from the archive for you. This way developers can fetch the file once and use the get_hdu utility to work with different headers without re-fetching the whole file.
- New create_output() that accepts a numpy array, then creates the FITS, creates the JPGs, and uploads the output to the bucket. Developers didn't want to do all of this manually once they had finished mutating the numpy array and were ready to send output. All the old functions remain for edge cases where you want more control over the process (see the usage sketch below).
- create_fits now takes a comment argument to better track what operations were performed and their inputs. This could help an instructor verify that a student ran the right operations, or help a developer debug a broken product.
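A hypothetical usage of the new helper (the array name and comment text are made up):

```python
# One call builds the FITS, generates both JPG sizes, and uploads to the bucket.
output = create_output(cache_key, np_array=stacked_median, comment='median of 3 input frames')
```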