Async functionality implemented. #56
Conversation
Hi @gabriel-trigo, thanks for the PR and the clear explanations! It touches some core parts, so I'll need a little bit of time to review it. #52
Hi @danielnsilva, I noticed some issues with the test cases that I hadn't caught before, so I'm fixing them and will probably submit one more commit tomorrow.
Hi @danielnsilva, I have corrected the testing problems (the async tests were not running properly before), and now they all seem to be passing. I committed the .yaml VCR cassette files that were generated when testing locally. I noticed that a few of the tests, particularly the ones iterating through PaginatedResults, often took a couple of tries to work out (and generate the correct .yaml file), because calls to next_page() would sometimes time out on the server side. I think everything checks out now!
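For context, a cassette-backed async test along these lines might look roughly like the sketch below. This is not the PR's actual test code: it assumes `pytest-asyncio` and `vcrpy` (which can record `httpx` traffic into `.yaml` cassettes), and the client class, method names, and cassette path are placeholders.

```python
import pytest
import vcr

from semanticscholar import AsyncSemanticScholar  # assumed async entry point


@pytest.mark.asyncio
async def test_search_paper_next_page():
    # On the first run vcrpy records the real HTTP traffic into the cassette
    # and replays it afterwards, so flaky server-side next_page() timeouts
    # only matter while a cassette is being (re)generated.
    with vcr.use_cassette('tests/data/test_search_paper_next_page.yaml'):
        sch = AsyncSemanticScholar()
        results = await sch.search_paper('software engineering')
        first_page = len(results.items)
        await results.next_page()
        assert len(results.items) > first_page
```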
Initially, I believe that if we're adopting async support we should be careful with duplication: duplicated code can make maintenance tricky, and that concerns me. Also, it's not my intention to introduce breaking changes. A possible solution is to migrate everything to async methods and create methods that make the async ones behave synchronously. In this way, we concentrate all application logic in the async methods, which will simplify maintenance, while still keeping methods with synchronous behavior. To illustrate, here's what the requester classes could look like:

```python
import asyncio
from typing import List, Union

import httpx
from tenacity import (retry, retry_if_exception_type, stop_after_attempt,
                      wait_fixed)

from semanticscholar.SemanticScholarException import \
    BadQueryParametersException, ObjectNotFoundException


class BaseRequester:

    def __init__(self, timeout) -> None:
        '''
        :param float timeout: an exception is raised \
            if the server has not issued a response for timeout seconds.
        '''
        self.timeout = timeout

    @property
    def timeout(self) -> int:
        '''
        :type: :class:`int`
        '''
        return self._timeout

    @timeout.setter
    def timeout(self, timeout: int) -> None:
        '''
        :param int timeout:
        '''
        self._timeout = timeout


class ApiRequester(BaseRequester):
    '''
    This class handles calls to the Semantic Scholar API.
    '''

    def __init__(self, timeout) -> None:
        super().__init__(timeout)
        self._async_requester = AsyncApiRequester(timeout)

    def get_data(
        self,
        url: str,
        parameters: str,
        headers: dict,
        payload: dict = None
    ) -> Union[dict, List[dict]]:
        # Run the async requester to completion so the call behaves
        # synchronously for existing users.
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(
            self._async_requester.get_data(url, parameters, headers, payload))


class AsyncApiRequester(BaseRequester):

    @retry(
        wait=wait_fixed(30),
        retry=retry_if_exception_type(ConnectionRefusedError),
        stop=stop_after_attempt(10)
    )
    async def get_data(
        self,
        url: str,
        parameters: str,
        headers: dict,
        payload: dict = None
    ) -> Union[dict, List[dict]]:
        '''
        Get data from the Semantic Scholar API.

        :param str url: absolute URL to API endpoint.
        :param str parameters: the parameters to add in the URL.
        :param dict headers: request headers.
        :param dict payload: data for POST requests.
        :returns: data or empty :class:`dict` if not found.
        :rtype: :class:`dict` or :class:`List` of :class:`dict`
        '''
        url = f'{url}?{parameters}'
        method = 'POST' if payload else 'GET'

        async with httpx.AsyncClient() as client:
            r = await client.request(
                method, url, timeout=self._timeout, headers=headers,
                json=payload)

        data = {}
        if r.status_code == 200:
            data = r.json()
            if len(data) == 1 and 'error' in data:
                data = {}
        elif r.status_code == 400:
            data = r.json()
            raise BadQueryParametersException(data['error'])
        elif r.status_code == 403:
            raise PermissionError('HTTP status 403 Forbidden.')
        elif r.status_code == 404:
            data = r.json()
            raise ObjectNotFoundException(data['error'])
        elif r.status_code == 429:
            raise ConnectionRefusedError('HTTP status 429 Too Many Requests.')
        elif r.status_code in [500, 504]:
            data = r.json()
            raise Exception(data['message'])

        return data
```

Notice how the synchronous `get_data` simply runs its async counterpart to completion on the event loop. I'm concerned about potentially introducing a poor design into the project, but I believe concentrating the logic in the async methods is worth it.

There's one more thing I've been pondering. Shouldn't paginated results always be sequential? When using asynchronous methods and iterating over all results, can we ensure they'll be gathered in the correct order? I'll need to take a closer look into this.
Hi @danielnsilva, these are very good points. In the approach that you suggested, would the SemanticScholar methods still be duplicated? I had difficulty not duplicating the SemanticScholar methods because, even if I could get the methods to behave synchronously, I still had to declare them as async to support the async functionality. This meant that even if users wanted to use a method synchronously, they still had to put the keyword await before calling it. In your suggestion, would the SemanticScholar methods be declared as async? Would users need to do something like `await method(async=False)`?

Regarding PaginatedResults, the async implementation actually behaves almost identically to the synchronous one. The only difference is that the constructor and next_page() methods are non-blocking, but the whole idea of iterating through pages sequentially is kept the same!

I'm more than happy to work on the changes you deem necessary based on this discussion! Best
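To make the ordering point concrete, here is a minimal toy sketch (illustrative only, not the PR's actual `PaginatedResults` code) of why sequential iteration stays in order even with async calls: each `await` on `next_page()` completes before the next request is issued.

```python
import asyncio


class ToyPaginatedResults:
    '''Toy stand-in for an async paginated result set (illustrative only).'''

    def __init__(self, pages):
        self._pages = pages        # pretend these pages live on the server
        self._cursor = 0
        self.items = []

    @property
    def has_next_page(self):
        return self._cursor < len(self._pages)

    async def next_page(self):
        await asyncio.sleep(0.01)  # simulate a network round trip
        self.items.extend(self._pages[self._cursor])
        self._cursor += 1


async def main():
    results = ToyPaginatedResults([[1, 2], [3, 4], [5, 6]])
    while results.has_next_page:
        # Each call is awaited before the next one starts, so items are
        # appended strictly in page order: [1, 2, 3, 4, 5, 6].
        await results.next_page()
    print(results.items)


asyncio.run(main())
```

Out-of-order results would only become possible if several `next_page()` calls were deliberately scheduled concurrently (for example with `asyncio.gather`), which the iteration above never does.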
Let me clarify this. I understand we might not be able to completely get rid of code duplication. But I think that if we can focus most of the application logic in the async version, that'd be a step in the right direction, and better than having the same logic in both sync and async methods. This means we'll always have two methods, sync and async, but the sync version will just be the async method called synchronously. I'm not sure that using a sync/async parameter would help achieve our goal.

Here's a quick overview: building on this idea, within the synchronous class each method would simply call its async counterpart and run it to completion, while the async class concentrates all the application logic. For clarity, here's how it'd look from the user's side:

```python
# For synchronous calls
sch.get_paper()

# For asynchronous calls
await async_sch.get_paper()
```

This approach aligns closely with what you had proposed, with the primary advantage being that the synchronous version doesn't replicate all the method's internal code; it simply calls it synchronously.
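A minimal sketch of that split, assuming the class names `SemanticScholar` and `AsyncSemanticScholar` (illustrative only; the real classes, event-loop handling, and method bodies would follow the project's existing code):

```python
import asyncio


class AsyncSemanticScholar:
    '''Holds the actual application logic as async methods (sketch).'''

    async def get_paper(self, paper_id: str) -> dict:
        # Placeholder for the real API call via the async requester.
        await asyncio.sleep(0)
        return {'paperId': paper_id}


class SemanticScholar:
    '''Thin synchronous facade: every method just runs its async twin.'''

    def __init__(self) -> None:
        self._async_sch = AsyncSemanticScholar()

    def get_paper(self, paper_id: str) -> dict:
        # No duplicated logic here, only a synchronous wrapper.
        return asyncio.run(self._async_sch.get_paper(paper_id))


sch = SemanticScholar()
print(sch.get_paper('10.1093/mind/lix.236.433'))  # synchronous call
```

With this layout, maintenance changes happen once, in the async class, and the sync facade stays a few lines per method.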
That's fine. My concern was related to iterating over the results, and whether there's a possibility that the next page could be requested even before the current page request completes, which could lead to out-of-order results. This is just a hypothetical scenario that crossed my mind, and I haven't looked into it in depth.
Hi @danielnsilva, I thought a lot about this and I think I found a good approach following the intuition you provided. I pushed a commit with these modifications. How does it sound to you?
@danielnsilva what do you think about the solution?
I'm back! 😄 I think you did a great job @gabriel-trigo. I've made some remarks, but I believe we have a good version to add the async option to the project.
Hi @danielnsilva, thank you for taking the time to review this. I'll revise the code based on the points you mentioned and update the pull request this week!
…od signatures, kept sync get_data method and added deprecation warning, kept ApiRequester Class name)
Hi @danielnsilva, thank you for the suggestions once again. I have implemented all of them and updated this PR 👍
Hi! I have a project that required many API calls, and being able to do them asynchronously saves a huge amount of time.
The changes I made are as follows:
Comments:
@danielnsilva would you be able to review this? I'm a Brazilian CS student and I'm trying to begin contributing to open source projects. I found this project very cool, and have already used it quite a bit, so I was hoping to be able to contribute to it.
Best,
Gabriel
Below: a picture of a simple use case demonstrating the advantage of async requests.
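Since the picture isn't reproduced here, a rough sketch of the kind of workload where async requests pay off (hypothetical; assumes an async client named `AsyncSemanticScholar` with an async `get_paper` method):

```python
import asyncio
import time

from semanticscholar import AsyncSemanticScholar  # assumed async client

PAPER_IDS = ['arXiv:1706.03762', 'arXiv:1810.04805', 'arXiv:2005.14165']


async def fetch_all():
    sch = AsyncSemanticScholar()
    # Issue all requests concurrently instead of one after another; with many
    # API calls, this is where the large time savings come from.
    return await asyncio.gather(*(sch.get_paper(pid) for pid in PAPER_IDS))


start = time.perf_counter()
papers = asyncio.run(fetch_all())
print(f'Fetched {len(papers)} papers in {time.perf_counter() - start:.1f}s')
```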