Data fetcher "Coriolis" #47

Open
quai20 opened this issue Aug 24, 2020 · 14 comments
Labels
backends enhancement New feature or request performance stale No activity over the last 90 days

Comments

@quai20
Member

quai20 commented Aug 24, 2020

This fetcher will be based on the Cassandra/Elasticsearch API(s) developed for the Coriolis database at the Ifremer IT department.
Those API(s) are mostly meant for web portals, but we plan to build a new fetcher on top of them.

To start, I am opening this issue to collect our feedback and ideas on various aspects:

  • how to use it from python
  • how to access data
  • how to access metadata
  • how do we plan to integrate it into argopy fetching architecture
@quai20 quai20 added enhancement New feature or request backends labels Aug 24, 2020
@quai20
Member Author

quai20 commented Aug 26, 2020

First thoughts and tests on metadata fetching:

From Python, I access the API with a simple POST request:

import json
import requests
import pandas as pd

# Search criteria as a JSON string (truncated here)
StringJson = r'{"criteriaList":[{"field":"platformCode" [...] }'
DataJson = json.loads(StringJson)
url = 'http://blp_test.ifremer.fr/ea-data-selection/api/find-by-search-filtred'
# POST the criteria and flatten the JSON response into a DataFrame
x = requests.post(url, json=DataJson)
response = pd.json_normalize(x.json())

  • We can query one or several WMOs: "field":"platformCode","values":[...]
  • We can query a geographical area: "field":"globalGeoShapeField","values":[...]
  • We can query a time subset: "field":"startDate","values":[...]

Note that to query only one float, we can also use http://blp_test.ifremer.fr/ea-data-selection/api/trajectory/6902746 to retrieve the info.

And we can of course combine multiple fields in one request.

With that request we can retrieve:

  • station_id
  • cycle number
  • platformCode
  • positionQc
  • lat
  • lon

So the profile timestamp is clearly missing for our needs. One solution (not a good one) is to loop over another request per station_id ('http://blp_test.ifremer.fr/ea-data-selection/api/find-by-id/'+station_id) to retrieve the timestamp, the measured parameters, and other information. But that takes far too long. It would be preferable to add some fields (at least the timestamp) to the ES index.

Another note, about pagination. For now, I manage to retrieve data with "pagination":{"page":1,"size":10000,"isPaginated":true} in my request, and I get 10000 items from the index (10000 is the default maximum in the Elasticsearch configuration). So the result is incomplete, and I don't know yet how to access the full result of my request. For instance, "pagination":{"page":2,"size":10000,"isPaginated":true} returns an error (err 500, Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]).
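
The error on page 2 is probably related to Elasticsearch's default index.max_result_window of 10000, which caps the offset plus the page size. One possible workaround, until pagination is sorted out, is to split the time criterion into shorter windows and send one request per window. The sketch below only computes the windows; `split_time_range` and the 30-day default are illustrative assumptions, not part of the API, and whether a given window stays under the 10000-item limit depends on the query.

```python
from datetime import datetime, timedelta, timezone


def split_time_range(start, end, days=30):
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


windows = list(split_time_range(
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 8, 25, tzinfo=timezone.utc),
))
for lower, upper in windows:
    # Each pair would replace the two "startDate" values of one request.
    print(lower.isoformat(), "->", upper.isoformat())
```

The results of the individual requests would then need to be concatenated on the client side.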

Here is an example of a JSON request posted to the API:

{
    "criteriaList": [
        {
            "field": "startDate",
            "values": [
                {
                    "name": "2020-01-01T00:00:00.000+0100",
                    "code": "2020-01-01T00:00:00.000+0100",
                    "n": 0
                },
                {
                    "name": "2020-08-25T00:00:00.000+0200",
                    "code": "2020-08-25T00:00:00.000+0200",
                    "n": 0
                }
            ],
            "types": [
                "DATE"
            ]
        },
        {
            "field": "globalGeoShapeField",
            "values": [
                {
                    "code": "{\"type\":\"POLYGON\",\"coordinates\":[[[-180,-90.0],[-180,90.0],[180,90.0],[180,-90.0],[-180,-90.0]]]}\"",
                    "name": "",
                    "n": 0
                }
            ],
            "types": [
                "GEOGRAPHIC"
            ],
            "options": []
        },
        {
            "field": "deploymentYear",
            "values": [],
            "types": [
                "AUTOCOMPLETE",
                "FACET"
            ],
            "options": [
                "SORTED_VALTXT_DESC"
            ],
            "sortPriority": 0,
            "order": "DESC"
        }
    ],
    "pagination": {
        "page": 1,
        "size": 10000,
        "isPaginated": false
    },
    "bboxParams": {
        "latTopLeft": 90.0,
        "lonTopLeft": -180,
        "latBottomRight": -90.0,
        "lonBottomRight": 180,
        "zoom": 5
    },
    "languageEnum": "en"
}
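
A request body like the one above could be assembled from Python rather than written by hand. This is a minimal sketch: `date_value` and `build_request` are hypothetical helper names, the "deploymentYear" criterion is left out, and I have not checked whether the API accepts the body without it. Note also that the example above has a stray trailing quote at the end of the polygon "code" string, which this sketch does not reproduce.

```python
import json


def date_value(stamp):
    """One entry of a "values" list, shaped like the example above."""
    return {"name": stamp, "code": stamp, "n": 0}


def build_request(start, end, polygon, size=10000):
    """Assemble a request body with a time range and a geographic polygon."""
    shape = json.dumps({"type": "POLYGON", "coordinates": [polygon]},
                       separators=(",", ":"))
    return {
        "criteriaList": [
            {"field": "startDate",
             "values": [date_value(start), date_value(end)],
             "types": ["DATE"]},
            {"field": "globalGeoShapeField",
             "values": [{"code": shape, "name": "", "n": 0}],
             "types": ["GEOGRAPHIC"],
             "options": []},
        ],
        "pagination": {"page": 1, "size": size, "isPaginated": False},
        # Global bounding box, copied from the example above.
        "bboxParams": {"latTopLeft": 90.0, "lonTopLeft": -180,
                       "latBottomRight": -90.0, "lonBottomRight": 180,
                       "zoom": 5},
        "languageEnum": "en",
    }


body = build_request(
    "2020-01-01T00:00:00.000+0100",
    "2020-08-25T00:00:00.000+0200",
    [[-180, -90.0], [-180, 90.0], [180, 90.0], [180, -90.0], [-180, -90.0]],
)
print(json.dumps(body, indent=4))
```

The resulting dictionary can be passed directly as the `json=` argument of `requests.post`.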

@gmaze
Member

gmaze commented Aug 27, 2020

From python, I access the api with simple POST request :

I see a first difficulty here for argopy, since our internal file system, based on fsspec, does not support POST requests for the http store.

This means that we would need to build something on top of it to support POST requests.

When working on #28 I came across https://github.com/ross/requests-futures, which may be a solution.
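
To illustrate what "something on top" could look like, here is a minimal sketch of concurrent POST requests. It uses the standard library ThreadPoolExecutor rather than requests-futures, and `post_many` is a hypothetical helper, not an argopy function. The POST callable is passed in so that the sketch runs offline with a stand-in; with the requests library installed, `requests.post` could be passed instead.

```python
from concurrent.futures import ThreadPoolExecutor


def post_many(post, url, payloads, max_workers=10):
    """Send one POST per payload concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(post, url, json=payload) for payload in payloads]
        return [future.result() for future in futures]


def fake_post(url, json=None):
    """Stand-in for requests.post: echoes what it was given."""
    return {"url": url, "echo": json}


results = post_many(fake_post, "https://example.invalid/api",
                    [{"page": 1}, {"page": 2}])
print(results)
```

Any exception raised by a request is re-raised by `future.result()`, so error handling would still have to be added around it.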

@gmaze
Member

gmaze commented Aug 28, 2020

1st bench on coriolis-datacharts.ifremer.fr

Using the new parallel fetching capabilities of argopy from #28, I benchmarked fetching the P/T/S data of one float.

First we need to load the list of all URLs to fetch (910 in total, one URL per parameter per station), see the file urls_6902749.txt:

# Read the URL list, one entry per line, dropping list brackets and quotes
with open('urls_6902749.txt') as f:
    lines = f.readlines()
urls = []
for line in lines:
    url = line.replace("[", "").replace("]", "").replace("'", "").strip()
    urls.append(url)
print("Eg:", urls[0])
print("N=", len(urls))
Eg: https://coriolis-datacharts.ifremer.fr/api/profiles?platform=6902746&start=1499352540&end=1499352540&parameter=66&measuretype=1
N= 910

Then fetch them:

# Create argopy http store:
import time
from argopy.stores import httpstore
fs = httpstore(cache=False)

# Perform several fetches to bench performances:
t = []
for r in range(0, 5):
    try:
        print("run %i" % r)
        start_time = time.time()
        d = fs.open_mfjson(urls, max_workers=100, progress=1)
        t.append(time.time() - start_time)
    except Exception as e:
        # Report the failed run instead of silently skipping it
        print("run %i failed: %s" % (r, e))

this leads to fetching times of:

[6.23819899559021,
 6.561779975891113,
 6.459397077560425,
 6.022731781005859,
 6.367558240890503]

i.e. about 6 seconds to fetch the core data of a single float.

Let's compare to existing data sources:

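from argopy import DataFetcher as ArgoDataFetcher

# Time a full float fetch with each existing data source
reg = {}
for src in ['erddap', 'argovis']:
    start_time = time.time()
    ArgoDataFetcher(src=src, cache=False).float(6902749).to_xarray()
    reg[src] = time.time() - start_time
print(reg)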
{'erddap': 1.236616849899292, 'argovis': 2.6754958629608154}

So the Coriolis API is much slower than the existing ones (about 6 s versus 1.2 s for erddap and 2.7 s for argovis), and that is before accounting for any post-processing on the client side.

@gmaze gmaze changed the title Future fetcher "Coriolis" Data fetcher "Euro-Argo" / "Coriolis" Nov 9, 2020
@gmaze
Member

gmaze commented Nov 9, 2020

From Euro-Argo:

In the frame of Euro-Argo RISE (Task 7.2 Promotion and improvement of data access and usage) and ENVRI-FAIR projects, a new version of the Argo data selection tool has been released. This will replace the existing tool available from the ADMT website. The technical developments have been led by Ifremer.

The tool will be presented at next ADMT (2021). It is still under development and some bugs or needs for improvements have been already identified, but it is functional and already offers a great means to users to select, visualise and download scientific Argo data (i.e. from the profiles netCDF files). It is available at https://dataselection.euro-argo.eu/

If you are interested, I invite you to try and test the tool and report any feedback you may have on this living document; that will help us resolve any bugs and improve the tool. For ease I have divided the document in different sections where you may report your feedback.

The API we're talking about here (coriolis-datacharts.ifremer.fr) is the one powering this new web interface for data visualization after selection with dataselection.euro-argo.eu

we can provide feedback here

This new data selection API documentation can be found here: https://dataselection.euro-argo.eu/swagger-ui.html

@gmaze
Member

gmaze commented Nov 9, 2020

Example of usage of the new data selection API:

Retrieve the list of profile coordinates: https://dataselection.euro-argo.eu/api/find-by-search-filtred
Data posted:

curl -X POST \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
-d '{"criteriaList":[{"field":"startDate","values":[{"name":"2020-10-30T16:58:42.487+0100","code":"2020-10-30T16:58:42.487+0100","n":0}],"types":["DATE"]},{"field":"globalGeoShapeField","values":[{"code":"{\"type\":\"POLYGON\",\"coordinates\":[[[-66.26953125000001,31.353636941500987],[-66.26953125000001,37.30027528134433],[-60.46875000000001,37.30027528134433],[-60.46875000000001,31.353636941500987],[-66.26953125000001,31.353636941500987]]]}\"","name":"","n":0}],"types":["GEOGRAPHIC"],"options":[]},{"field":"cycleQcState","values":[{"name":"Good","code":"Good","n":0}],"types":["FACET"],"options":["SORTED_VALTXT_ASC"]}],"pagination":{"page":1,"size":9000,"isPaginated":false},"bboxParams":{"latBottomRight":-90,"latTopLeft":90,"lonBottomRight":180,"lonTopLeft":-180,"zoom":2},"languageEnum":"en"}' 'https://dataselection.euro-argo.eu/api/find-by-search-filtred'

This outputs a list of points like:

  {
    "id": 4890265,
    "cvNumber": 94,
    "coordinate": {
      "lat": 33.825,
      "lon": -64.758,
      "geohash": "dw1bqmssvjp2",
      "fragment": true
    },
    "platformCode": "3901654",
    "cycleQcState": "Good",
    "level": 0
  },

In the response above, the WMO is in "platformCode" and the cycle number is in "cvNumber".

Using the profile "id", we can then visualise the profile data at: https://dataselection.euro-argo.eu/cycle/4890265
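
As an illustration of the field mapping, the response could be reduced to a list of flat records like this. It is only a sketch: `summarize_points` and the output key names are my own choices, and the sample is the single point shown above.

```python
def summarize_points(points):
    """Flatten API points into records with WMO, cycle number and position."""
    rows = []
    for point in points:
        coordinate = point["coordinate"]
        rows.append({
            "id": point["id"],
            "wmo": int(point["platformCode"]),   # WMO is returned as a string
            "cycle_number": point["cvNumber"],
            "lat": coordinate["lat"],
            "lon": coordinate["lon"],
            # Page where the profile data can be visualised
            "viewer_url": "https://dataselection.euro-argo.eu/cycle/%i" % point["id"],
        })
    return rows


sample = [{
    "id": 4890265,
    "cvNumber": 94,
    "coordinate": {"lat": 33.825, "lon": -64.758,
                   "geohash": "dw1bqmssvjp2", "fragment": True},
    "platformCode": "3901654",
    "cycleQcState": "Good",
    "level": 0,
}]
rows = summarize_points(sample)
print(rows[0])
```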

@gmaze
Member

gmaze commented Nov 9, 2020

The data selection tool (https://dataselection.euro-argo.eu/) could be used by argopy to create a dashboard for a given fetcher.
One just has to translate the fetcher definition into a data selection criteria list.

@gmaze
Member

gmaze commented Nov 9, 2020

Note that:
http://blp_test.ifremer.fr/ea-data-selection

is the local dev version of this one:
https://dataselection.euro-argo.eu

@gmaze gmaze changed the title Data fetcher "Euro-Argo" / "Coriolis" Data fetcher "Coriolis" Nov 10, 2020
@gmaze
Member

gmaze commented Nov 10, 2020

Retrieve all data from a single float

Set the time stamps to values very far in the past and very far in the future:

Eg:
https://coriolis-datacharts.ifremer.fr/api/profiles?platform=3901654&start=-2208988800&end=4133894400&parameter=35&measuretype=1

This will retrieve temperature (code=35) data from 1900/01/01 to 2100/12/31 for float WMO 3901654.

ps: to get parameter code information: https://co-discovery-demo.ifremer.fr/coriolis/api/params/parametre?code=35
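
The URL above can be rebuilt from dates rather than hard-coded epoch values. This is a minimal sketch: `profiles_url` is a hypothetical helper, the start and end are taken as seconds since 1970-01-01 UTC (which matches the values above), and measuretype=1 is simply copied from the example since I don't know its other possible values.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "https://coriolis-datacharts.ifremer.fr/api/profiles"


def profiles_url(wmo, parameter,
                 start=datetime(1900, 1, 1, tzinfo=timezone.utc),
                 end=datetime(2100, 12, 31, tzinfo=timezone.utc),
                 measuretype=1):
    """Build a profiles URL for one float and one parameter code."""
    query = {
        "platform": wmo,
        "start": int(start.timestamp()),   # epoch seconds, negative before 1970
        "end": int(end.timestamp()),
        "parameter": parameter,
        "measuretype": measuretype,
    }
    return "%s?%s" % (BASE_URL, urlencode(query))


url = profiles_url(3901654, 35)
print(url)
```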

@gmaze
Member

gmaze commented Nov 18, 2020

The data selection API is being developed here:
https://gitlab.ifremer.fr/ea-data-selection/ea-data-selection/

This is an Ifremer private repo, just mentioned here for the record

@github-actions

github-actions bot commented Sep 1, 2021

Stale issue message

@github-actions github-actions bot added the stale No activity over the last 90 days label Sep 1, 2021
@github-actions github-actions bot closed this as completed Sep 8, 2021
@gmaze gmaze reopened this Sep 9, 2021
@gmaze gmaze removed the stale No activity over the last 90 days label Sep 23, 2022
@gmaze
Member

gmaze commented Nov 30, 2022

@gmaze gmaze pinned this issue Jan 18, 2023
@github-actions

github-actions bot commented Mar 1, 2023

This issue was marked as staled automatically because it has not seen any activity in 90 days

@github-actions github-actions bot added the stale No activity over the last 90 days label Mar 1, 2023
@gmaze gmaze unpinned this issue Jul 3, 2023

This issue was closed automatically because it has not seen any activity in 365 days

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 15, 2024
@gmaze gmaze reopened this Apr 15, 2024
@github-actions github-actions bot removed the stale No activity over the last 90 days label Apr 17, 2024

This issue was marked as staled automatically because it has not seen any activity in 90 days

@github-actions github-actions bot added the stale No activity over the last 90 days label Jul 16, 2024