Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated WorldCerealInferenceDataset #103

Merged
merged 27 commits into from
Sep 9, 2024
Merged

Conversation

kvantricht
Copy link
Contributor

Extractions of test patches are now more in line with actual inference on openEO. WorldCerealInferenceDataset has been updated to cope with these new files.

@kvantricht
Copy link
Contributor Author

Scattered discussion of the above in older PR (#65). Once we'd merge this one, we can close the other one.

presto/dataset.py Outdated Show resolved Hide resolved
months,
target,
lon,
lat,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need latlons and lon, lat here? If I understand correctly, lon == latlons[:, 1] and lat == latlons[:, 0], which means we don't need lon, lat?

Copy link
Contributor Author

@kvantricht kvantricht Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not so simple I'm afraid, I struggled with this a lot. After the meshgrid and flattening of latlons it becomes quite hard to get back to original lon, lat we need to properly reconstruct the DataArray.

latlons.shape
(2500, 2)
lon.shape
(50,)

So no, lon != latlons[:, 1]. I've been thinking about easier ways but haven't found them as of yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm okay. I think this is probably (?) easier, especially considering we additionally apply the transformation. The latlons take up quite a bit of RAM, and for large tiles this might become an issue.

Just in case its useful here is some code to go from the flat latlons back to the original ones (but without the transformation):

import numpy as np
from einops import rearrange

org_lat = np.array([1, 2, 3])
org_lon = np.array([4, 5, 6, 7])

def to_flat_latlons(lat, lon):
    lon, lat = np.meshgrid(lon, lat)
    latlons = rearrange(np.stack([lat, lon]), "c x y -> (x y) c")
    return latlons


def from_latlons(latlons):
    x = len(np.unique(latlons[:, 0]))
    y = len(np.unique(latlons[:, 1]))
    latlons = rearrange(latlons, "(x y) c -> c x y", x=x, y=y)
    lats, lons = latlons[0], latlons[1]
    return lats[:, 0], lons[0, ]

output_lat, output_lon = from_latlons(to_flat_latlons(org_lat, org_lon))
assert np.equal(output_lon, org_lon).all()
assert np.equal(output_lat, org_lat).all()

I wonder if its worth passing around the original lats and lons and only applying that transformation right before the model ingests the values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope i understood this well. Could you check if my latest commit address this the way you suggest it? Feel free to suggest to do it differently. Rest of the day, unfortunately I'm away. Will catch up tomorrow morning.

@kvantricht
Copy link
Contributor Author

@gabrieltseng can't get the checks to work on this one :-( flake complains (not locally for me) about missing white space but when I add it, black wants to split the line and this is going endless in loops. Not sure how to fix.

@gabrieltseng
Copy link
Collaborator

hmm strange. I will try fixing on my end and if it works merge it in

@gabrieltseng
Copy link
Collaborator

Hmm actually @kvantricht would you mind undoing 8ac58df ? I think i prefer the solution before.

Perhaps in dataset.py you could make the variable names more descriptive to make it clear why we need to pass lat lons twice?

perhaps flat_latlons and lat_for_reconstruction, lon_for_reconstruction (although thats quite a mouthful, at least its very explicit).

@kvantricht
Copy link
Contributor Author

kvantricht commented Sep 5, 2024

Hmm actually @kvantricht would you mind undoing 8ac58df ? I think i prefer the solution before.

Perhaps in dataset.py you could make the variable names more descriptive to make it clear why we need to pass lat lons twice?

perhaps flat_latlons and lat_for_reconstruction, lon_for_reconstruction (although thats quite a mouthful, at least its very explicit).

@gabrieltseng so this appears to be all a bit more nasty than anticipated. I reverted the commit and started implementing your suggestion of name changes, but it turns out that with all the changes that were done (had to be done), once we arrive at combine_predictions, we're actually not talking about lat and lon necessarily, not when the original CRS of the .nc file was different from WGS84. So we're reprojecting for Presto, but the resulting coordinates violate the typical x and y coordinates that only change in x and y, respectively. When transforming to lat/lon, every single pixel gets a unique lat/lon combo from which we can no longer infer 1D lat and lon arrays to use for reconstruction.

Not sure if I'm making myself clear, but the only way the code works is by using the original x and y coordinates of the DataArray in the reconstruction, which are not necessarily lat and lon. So for now I called them what they are, x_coord and y_coord. Only one test for the combine_predictions was checking if the final level names in the DataFrame correspond to lat and lon, which is now x and y.

Looking forward to what you make of all this ...

@gabrieltseng
Copy link
Collaborator

Cool this looks good to me - thanks @kvantricht

Copy link
Collaborator

@gabrieltseng gabrieltseng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kvantricht kvantricht merged commit 547c908 into main Sep 9, 2024
1 check passed
@kvantricht kvantricht deleted the updated-inferencedatasets branch September 9, 2024 09:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants