Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added geocoding function #50

Merged
merged 46 commits into from
Nov 20, 2023
Merged

added geocoding function #50

merged 46 commits into from
Nov 20, 2023

Conversation

mtravis
Copy link
Contributor

@mtravis mtravis commented Oct 16, 2023

I've added a new geocode function that takes a place name and provides a bbox well known text (WKT).

We could pass this to the sql function and add a geocode command negating the need to add any geojson.

@mtravis mtravis marked this pull request as draft October 16, 2023 13:17
@mtravis mtravis marked this pull request as ready for review October 16, 2023 13:18
@mtravis mtravis marked this pull request as draft October 16, 2023 13:18
@cholmes
Copy link
Collaborator

cholmes commented Oct 16, 2023

Thanks for the contribution @mtravis!

Looks like the build is failing - I think you need to add the osmnx library to the requirements.txt so it gets installed.

Any chance you could add some tests for it? We've now got a test framework, and it'd be good to start to use it.

Starting as a function makes sense. I'd love to understand a bit more how it works:

  • Does it call a remote service? Or does it download the data it needs?
  • How good is its fuzzy matching? Like mis-spellings, etc.
  • Is it possible to get the actual polygon bounds, not just the bbox? I'd love to be able to put in requests like 'santa clara county' and have it get the real bounds of the county.

I wouldn't see having get_buildings geocoding negate any need for a geojson, as people will often have some special area they want. But I think ideally we make it so users can pass in a name to geocode or a geojson file.

I think we can have the CLI handle the difference, and then have the 'download' function just take the output. We could shift download to require WKT instead of GeoJSON (and translate to wkt in the cli function), or else have your function output GeoJSON.

Ultimately I think we might want to allow different geocoding options (including letting users add their own api keys), since I know of no perfect free geocoder, though I've not tried this one yet.

But happy to merge this once the tests are passing, and ideally with a test so we get into that habit.

@felix-schott
Copy link
Collaborator

felix-schott commented Oct 16, 2023

Hi both,

First of all: off-topic, but I'm not entirely sure what the SQL CLI command is for that you've linked, @mtravis. There seems to be overlap with download(..., generate_sql=True) as it generates a SQL query to be run in DuckDB but doesn't actually run it. However, the SQL function only outputs to FlatGeobuf whilst download() has a bunch of different options. Maybe @cholmes can shed light on whether the SQL cli command is just a remnant of previous iterations of the project or a util unction without further connection to get_buildings? I can't find the sql CLI command mentioned in the main docs.

I agree with @cholmes that there is value in adding the geocoding functionality to ob get_buildings directly via an additional argument, and handling it on the CLI level with the Python function still accepting a GeoJSON (or WKT). I can see something like ob get_buildings --lookup-aoi london buildings.gpkmaking sense (just an example, there is most likely a better name for the param). In that case, you would have to add an option here, deal with the geocoding in the cli function and then pass the output of your function as GeoJSON or WKT (see also #51 which I just added) to the download function. If we ignore #51 for now, you can easily convert the shapely object that is the output of your function to a GeoJSON dict using mapping(obj).

@cholmes has some valid points - the package seems to be wrapping the Nominatim API which I've used before. It's a simple REST API that I think works quite well. It should also be possible to return exact geometry instead of bbox. It is worth having a look at the usage policy though - I don't think there is a risk that we're exceeding the rate limit but will be good to keep in mind that they're not providing a free service without any restrictions.

I know this is just a draft/conversation starter but eventually the osmnx package would have to be added to requests.txt, this is what currently breaks the tests. I'm adding some tests on the CLI level soon, that will help here - you can use that template to add a test then or I can do it.

Cheers
Felix

@mtravis
Copy link
Contributor Author

mtravis commented Oct 16, 2023

@cholmes thanks for the feedback. I've added my answers inline below.

Looks like the build is failing - I think you need to add the osmnx library to the requirements.txt so it gets installed.
Any chance you could add some tests for it? We've now got a test framework, and it'd be good to start to use it.

Sure, I'll add those next. May need some help setting up the tests.

  • Does it call a remote service? Or does it download the data it needs?

It calls the Nominatim API

  • How good is its fuzzy matching? Like mis-spellings, etc.

I'm not sure it has fuzzy searching. If you use this as an example on the main OSM site which uses nominatim then
https://www.openstreetmap.org/search?query=Santa%20clar%20county%2C%20california with the missing won't get you any results

  • Is it possible to get the actual polygon bounds, not just the bbox? I'd love to be able to put in requests like 'santa clara county' and have it get the real bounds of the county.

Sure that's definitely possible, in fact I was trying getting that to start with but was trying to get around having a complex geometry returned. Perhaps the default could be the actual polygon geometry and then we have a --bbox_only option?

I wouldn't see having get_buildings geocoding negate any need for a geojson, as people will often have some special area they want. But I think ideally we make it so users can pass in a name to geocode or a geojson file.

I shouldn't have used negate here as I didn't acutally mean not have geojson at all just that having geocode meant it wasn't always necessary

I think we can have the CLI handle the difference, and then have the 'download' function just take the output. We could shift download to require WKT instead of GeoJSON (and translate to wkt in the cli function), or else have your function output GeoJSON.

I'll look into this. It might be easiest to get the geocode function output to geojson for now.

Ultimately I think we might want to allow different geocoding options (including letting users add their own api keys), since I know of no perfect free geocoder, though I've not tried this one yet.

I can think of a couple of options we could e.g. Geocage.

@cholmes
Copy link
Collaborator

cholmes commented Oct 16, 2023

but I'm not entirely sure what the SQL CLI command is for that you've linked, @mtravis. There seems to be overlap with download(..., generate_sql=True) as it generates a SQL query to be run in DuckDB but doesn't actually run it. However, the SQL function only outputs to FlatGeobuf whilst download() has a bunch of different options. Maybe @cholmes can shed light on whether the SQL cli command is just a remnant of previous iterations of the project or a util unction without further connection to get_buildings?

Ah, right. I was thinking the sql command was called, but you're right, it's just a remnant of some previous debugging. My first attempt at 'download' revealed that I had an error in my quadkey production code, so I used the sql function to debug it (before I had the generate_sql option). Left it in as a util function, but seems like we should either actually use it as the util function for get_buildings or else kill it, so it's not in there confusing things.

@mtravis
Copy link
Contributor Author

mtravis commented Oct 20, 2023

I've attempted to add an option for geocode but I've struggled with that and now the tests are failing . I'm a bit lost on how to implement new arguments or options in click.

Should we create a completely new argument for geocode and get the wkt directly from there?

Copy link
Contributor Author

@mtravis mtravis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felix-schott Thanks for reviewing.

open_buildings/download_buildings.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@mtravis mtravis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felix-schott thanks

@mtravis
Copy link
Contributor Author

mtravis commented Nov 4, 2023

@felix-schott I've updated the geocode function so that it returns geojson and made some other changes. Plus I added a test for the geocode function and when I run that locally it works ok. Looks like the checks for the ubuntu machines are failing though.

@felix-schott
Copy link
Collaborator

@felix-schott I've updated the geocode function so that it returns geojson and made some other changes. Plus I added a test for the geocode function and when I run that locally it works ok. Looks like the checks for the ubuntu machines are failing though.

cheers, i'm having a look tomorrow!

@mtravis
Copy link
Contributor Author

mtravis commented Nov 5, 2023

@felix-schott I've updated the geocode function so that it returns geojson and made some other changes. Plus I added a test for the geocode function and when I run that locally it works ok. Looks like the checks for the ubuntu machines are failing though.

cheers, i'm having a look tomorrow!

Great. I was wondering how it's possible to test the whole app locally i.e. run ob get_buildings --local plymouth

@felix-schott
Copy link
Collaborator

@felix-schott I've updated the geocode function so that it returns geojson and made some other changes. Plus I added a test for the geocode function and when I run that locally it works ok. Looks like the checks for the ubuntu machines are failing though.

cheers, i'm having a look tomorrow!

Great. I was wondering how it's possible to test the whole app locally i.e. run ob get_buildings --local plymouth

hi @mtravis, i pushed some changes, most importantly a cli test for ob get_buildings --location .... it's failing because it doesn't return any data (it's late and i couldn't be bothered to investigate further). i also tried to rebase the upstream changes into your fork as since you forked a few changes have been added, e.g. the cli tests - ended up being quite messy, so this PR should definitely be merged with the squash option enabled ...

there a few things to be discussed here, i had to change dst from positional parameter to named option as we're introducing a case where the first positional parameter (input) is non-existent and the aoi is provided via --location option.

@felix-schott
Copy link
Collaborator

it's most likely failing because I provided county_code UK instead of GB! it's late... task for tomorrow :)

@mtravis
Copy link
Contributor Author

mtravis commented Nov 7, 2023

@felix-schott I've updated the test now so it uses GB.

@mtravis
Copy link
Contributor Author

mtravis commented Nov 7, 2023

Just tested by version locally and it works!

ob get_buildings --overwrite --location plymouth --country_iso=GB took 26 seconds

I'm going to do some more testing through out the day. Thanks for your fixes @felix-schott

@felix-schott
Copy link
Collaborator

great, @mtravis! only two things left imo: might be a good idea to update the README to reflect the changes. not sure if @cholmes would like to do this as part of a new release or if we want to include it in this PR. then, equally not quite sure if we came to a conclusion re bbox vs actual geometry. currently it uses bbox. i do think as a default, actual geometry or simplified geometry (https://shapely.readthedocs.io/en/stable/reference/shapely.simplify.html) would be better. depends on the performance benefit of bbox vs simplified geom vs full geom. might be worth doing some benchmarking?

finally @cholmes might have some thoughts on the changes proposed here before this gets merged?

summary:

  • adds --location option to get_buildings.
  • changes dst from positional parameter to named option --dst (necessary due to above)
  • adds unit test for geocoding function and integration test for the cli command

still to do:

  • bbox vs simplified vs full geometry, sensible default behaviour?
  • update README

@mtravis
Copy link
Contributor Author

mtravis commented Nov 8, 2023

Thanks @felix-schott I agree the simplified geometry will probably work best. I'll apply this tomorrow.

@mtravis
Copy link
Contributor Author

mtravis commented Nov 10, 2023

I've been testing the geocode function locally to see if the actual geometry can be used instead of the bbox and there's an issue with the gdf being returning having a multipolygon geometry. This is the case for many coastal cities with islands e.g New York.

Perhaps for now we stick with bbox and then look at other options in the future? Unless anyone has an idea now?

…n exact geometry, add test for multipolygon geocoding result
@felix-schott
Copy link
Collaborator

I've been testing the geocode function locally to see if the actual geometry can be used instead of the bbox and there's an issue with the gdf being returning having a multipolygon geometry. This is the case for many coastal cities with islands e.g New York.

Perhaps for now we stick with bbox and then look at other options in the future? Unless anyone has an idea now?

hi @mtravis, sorry for the slow response. thanks for rigorous testing - good catch! i had a look and noticed it's failing in the geojson_to_quadkey() function. that function however was quite convoluted and inflexible (looking at it i'm not even sure if it was working correctly), i simplified it by using the shapely bounds method. now everything is working.

if you're happy, i'll merge.

@cholmes
Copy link
Collaborator

cholmes commented Nov 16, 2023

finally @cholmes might have some thoughts on the changes proposed here before this gets merged?

Sorry for the slow response! I've been slammed lately, and I've never gotten a great system to properly filter important github messages. Don't have time to look in depth right now, but please do merge when ready! I'm excited to try it out.

And I definitely think this is worthy of a release, I'll see if I might be able to pull that off in the next couple of days.

@mtravis mtravis marked this pull request as ready for review November 16, 2023 22:47
@mtravis
Copy link
Contributor Author

mtravis commented Nov 16, 2023

Thanks both. @felix-schott I'm happy for you to merge so I've put into review. I'll take it for a test run tomorrow. Looking forward to getting this out there.

@cholmes
Copy link
Collaborator

cholmes commented Nov 20, 2023

Tried it out, looks good. I put in one bug #55 to make it a bit nicer, but I'll go ahead and merge and we can do that one later.

@cholmes cholmes merged commit 276386b into opengeos:main Nov 20, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants