Context
After allowing certain types of low-code streams to be processed within the concurrent framework, we ran into an issue where streams that use the AsyncRetriever component resulted in errors during processing. One such example is:
{
"type": "TRACE",
"trace": {
"type": "ERROR",
"emitted_at": 1733873724427,
"error": {
"message": "Invalid state within AsyncJobRetriever. Please contact Airbyte Support",
"internal_message": "AsyncPartitionRepository is expected to be accessed only after `stream_slices`",
"stack_trace": "Traceback (most recent call last):\n File \"/Users/brian.lai/dev/airbyte-python-cdk/airbyte_cdk/sources/streams/concurrent/partition_reader.py\", line 40, in process_partition\n for record in partition.read():\n File \"/Users/brian.lai/dev/airbyte-python-cdk/airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py\", line 59, in read\n for stream_data in self._retriever.read_records(self._json_schema, self._stream_slice):\n File \"/Users/brian.lai/dev/airbyte-python-cdk/airbyte_cdk/sources/declarative/retrievers/async_retriever.py\", line 119, in read_records\n records: Iterable[Mapping[str, Any]] = self._job_orchestrator.fetch_records(partition)\n File \"/Users/brian.lai/dev/airbyte-python-cdk/airbyte_cdk/sources/declarative/retrievers/async_retriever.py\", line 60, in _job_orchestrator\n raise AirbyteTracedException(\nairbyte_cdk.utils.traced_exception.AirbyteTracedException: AsyncPartitionRepository is expected to be accessed only after `stream_slices`\n",
"failure_type": "system_error",
"stream_descriptor": {
"name": "contacts"
}
}
}
}
As a temporary fix, we changed concurrent_declarative_source.py to only run streams that use the SimpleRetriever concurrently.
Problem / Solution
The root of the issue is twofold.
Issue 1:
Within concurrent_declarative_source.py, when we instantiate our StreamSlicerPartitionGenerator, we pass in declarative_stream.retriever.stream_slicer. However, this does not work for an AsyncRetriever: in that case the stream_slicer is the AsyncRetriever's underlying partition router, which only supplies partitions when creating async jobs. In the current implementation, the AsyncRetriever generates its slices itself within stream_slices() instead of delegating to the stream_slicer, unlike our existing SimpleRetriever.
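To make the mismatch concrete, here is a minimal sketch using stand-in classes. It is not the CDK code: the Toy* class names and the slice shapes (including the job_id key) are assumptions made purely for illustration. It shows that reading slices from retriever.stream_slicer is equivalent to retriever.stream_slices() only when the retriever delegates.

```python
from typing import Any, Iterable, List, Mapping

Slice = Mapping[str, Any]


class ToyPartitionRouter:
    """Hypothetical stand-in for the partition router configured on a retriever."""

    def __init__(self, partitions: List[Slice]) -> None:
        self._partitions = partitions

    def stream_slices(self) -> Iterable[Slice]:
        yield from self._partitions


class ToySimpleRetriever:
    """Delegates slicing to its stream_slicer."""

    def __init__(self, stream_slicer: ToyPartitionRouter) -> None:
        self.stream_slicer = stream_slicer

    def stream_slices(self) -> Iterable[Slice]:
        return self.stream_slicer.stream_slices()


class ToyAsyncRetriever:
    """Generates its own slices: one per job created from the router's partitions."""

    def __init__(self, stream_slicer: ToyPartitionRouter) -> None:
        self.stream_slicer = stream_slicer

    def stream_slices(self) -> Iterable[Slice]:
        for index, partition in enumerate(self.stream_slicer.stream_slices()):
            # Assumed shape: the slice wraps a job handle, not the raw partition.
            yield {"job_id": f"job-{index}", "partition": partition}


router = ToyPartitionRouter([{"account": "a"}, {"account": "b"}])
simple = ToySimpleRetriever(router)
async_retriever = ToyAsyncRetriever(router)

# A partition generator that is handed retriever.stream_slicer sees the same
# slices as retriever.stream_slices() only for the delegating retriever.
simple_matches = list(simple.stream_slicer.stream_slices()) == list(simple.stream_slices())
async_matches = list(async_retriever.stream_slicer.stream_slices()) == list(
    async_retriever.stream_slices()
)
```

In this toy setup, a generator built from the async retriever's stream_slicer would produce raw partitions rather than the job-backed slices that read_records() expects.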
What we should do is:
- Create a new StreamSlicer low-code component called AsyncJobStreamSlicer that adheres to the StreamSlicer interface.
  - Its stream_slices() implementation should be the current AsyncRetriever.stream_slices() implementation. It should be instantiated with a parent stream slicer.
- AsyncRequester should have a stream slicer defined as a field and use it when stream_slices() is called.
- Within the model_to_component_factory, we instantiate the new async stream_slicer.
- There should be no impact on the low-code interface.
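The proposal above could look roughly like the following sketch. Only the names StreamSlicer and AsyncJobStreamSlicer come from this issue. The simplified interface, the ListPartitionRouter parent, the start_job hook, the ToyRetriever holder, and the slice shape are all assumptions for illustration, not the CDK's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable, List, Mapping

Slice = Mapping[str, Any]


class StreamSlicer(ABC):
    """Simplified stand-in for the StreamSlicer interface."""

    @abstractmethod
    def stream_slices(self) -> Iterable[Slice]:
        ...


class ListPartitionRouter(StreamSlicer):
    """Hypothetical parent slicer that yields a fixed list of partitions."""

    def __init__(self, partitions: List[Slice]) -> None:
        self._partitions = partitions

    def stream_slices(self) -> Iterable[Slice]:
        yield from self._partitions


class AsyncJobStreamSlicer(StreamSlicer):
    """Wraps a parent slicer and emits one slice per async job."""

    def __init__(self, parent_slicer: StreamSlicer, start_job: Callable[[Slice], str]) -> None:
        self._parent_slicer = parent_slicer
        self._start_job = start_job  # hypothetical hook that creates a job and returns its id

    def stream_slices(self) -> Iterable[Slice]:
        for partition in self._parent_slicer.stream_slices():
            yield {"job_id": self._start_job(partition), "partition": partition}


class ToyRetriever:
    """Holds the slicer as a field and delegates stream_slices() to it."""

    def __init__(self, stream_slicer: StreamSlicer) -> None:
        self.stream_slicer = stream_slicer

    def stream_slices(self) -> Iterable[Slice]:
        return self.stream_slicer.stream_slices()


parent = ListPartitionRouter([{"account": "a"}, {"account": "b"}])
slicer = AsyncJobStreamSlicer(parent, start_job=lambda p: f"job-{p['account']}")
retriever = ToyRetriever(slicer)

# With delegation in place, both access paths yield the same job-backed slices.
via_retriever = list(retriever.stream_slices())
via_slicer = list(retriever.stream_slicer.stream_slices())
```

The point of the design is that code which only knows about retriever.stream_slicer, such as a partition generator, now receives the same slices the retriever itself would produce.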
Issue 2:
We instantiate a new AsyncRetriever every time we create a new partition, because the factory method we supply to the StreamSlicerPartitionGenerator creates new instances of the declarative stream and its retriever. As a result, the partition invokes AsyncRetriever.read_records() on a new instance of the async retriever that has not been initialized properly. AsyncRetriever.stream_slices() is not stateless: it is responsible for setting up the retriever's state so that read_records() can be called afterward. Also, because we instantiate individual retrievers, they do not share an AsyncJobOrchestrator or AsyncJobRepository, which is needed to properly manage the internal state of the retriever.
As evidenced by this error:
{
"type": "TRACE",
"trace": {
"type": "ERROR",
"emitted_at": 1733873724427,
"error": {
"message": "Invalid state within AsyncJobRetriever. Please contact Airbyte Support",
"internal_message": "AsyncPartitionRepository is expected to be accessed only after `stream_slices`",
"stack_trace": "Traceback (most recent call last):\n File \"/Users/brian.lai/dev/airbyte-python-cdk/airbyte_cdk/sources/streams/concurrent/partition_reader.py\", line 40, in process_partition\n for record in partition.read():\n File \"/Users/brian.lai/dev/airbyte-python-cdk/airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py\", line 59, in read\n for stream_data in self._retriever.read_records(self._json_schema, self._stream_slice):\n File \"/Users/brian.lai/dev/airbyte-python-cdk/airbyte_cdk/sources/declarative/retrievers/async_retriever.py\", line 119, in read_records\n records: Iterable[Mapping[str, Any]] = self._job_orchestrator.fetch_records(partition)\n File \"/Users/brian.lai/dev/airbyte-python-cdk/airbyte_cdk/sources/declarative/retrievers/async_retriever.py\", line 60, in _job_orchestrator\n raise AirbyteTracedException(\nairbyte_cdk.utils.traced_exception.AirbyteTracedException: AsyncPartitionRepository is expected to be accessed only after `stream_slices`\n",
"failure_type": "system_error",
"stream_descriptor": {
"name": "contacts"
}
}
}
}
What we actually want is to reuse the same AsyncRetriever for every partition. This would let us use a properly initialized AsyncRetriever on which stream_slices() has already been called. From within concurrent_declarative_source.py, we can instead pass the original AsyncRetriever instance, which has the proper state as well as the shared orchestrator and job repository.
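The difference between a per-partition factory and a shared instance can be sketched as follows. ToyAsyncRetriever, retriever_factory, the _jobs dictionary, and the record contents are hypothetical stand-ins. The only behavior borrowed from the real component is that read_records() fails when stream_slices() has not run on the same instance.

```python
from typing import Any, Dict, List, Mapping, Optional

Record = Mapping[str, Any]


class ToyAsyncRetriever:
    """Stateful stand-in: stream_slices() must run before read_records()."""

    def __init__(self) -> None:
        self._jobs: Optional[Dict[str, List[Record]]] = None

    def stream_slices(self) -> List[Mapping[str, str]]:
        # Assumed data: pretend two async jobs were created and completed.
        self._jobs = {"job-0": [{"id": 1}], "job-1": [{"id": 2}]}
        return [{"job_id": job_id} for job_id in self._jobs]

    def read_records(self, stream_slice: Mapping[str, str]) -> List[Record]:
        if self._jobs is None:
            raise RuntimeError("expected to be accessed only after stream_slices")
        return self._jobs[stream_slice["job_id"]]


def retriever_factory() -> ToyAsyncRetriever:
    """Mimics a factory that builds a fresh retriever for every partition."""
    return ToyAsyncRetriever()


# Slices come from one instance, but each partition reads through a new one.
slices = retriever_factory().stream_slices()
try:
    retriever_factory().read_records(slices[0])
    fresh_instance_failed = False
except RuntimeError:
    fresh_instance_failed = True

# Sharing the instance that produced the slices keeps the state intact.
shared = ToyAsyncRetriever()
shared_records = [
    record for stream_slice in shared.stream_slices() for record in shared.read_records(stream_slice)
]
```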
However, there is one major problem with this approach: the AsyncRetriever is potentially not thread safe. The biggest concern is that the DefaultPaginator relies on internal state: the _token field is overwritten each time we read a page. If multiple partitions use the same AsyncRetriever, we can lose records between partitions. If we make this thread safe, we should be able to share the same retriever.
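The following sketch illustrates how a token stored on a shared instance can drop records, and one possible way to isolate it per thread. These classes are toy stand-ins, not the CDK's DefaultPaginator: the PAGES data, the integer token, and the choice of threading.local are assumptions. The interleaving is simulated with generators so that the lost record is reproducible.

```python
import threading
from typing import Dict, Iterator, List, Optional

# Assumed data: pages of records per partition.
PAGES: Dict[str, List[List[str]]] = {"a": [["a1", "a2"], ["a3"]], "b": [["b1"], ["b2"], ["b3"]]}


class SharedStatePaginator:
    """Keeps the next-page token on the instance, so every reader overwrites it."""

    def __init__(self) -> None:
        self._token: Optional[int] = None

    def reset(self) -> None:
        self._token = 0

    def next_page(self, partition: str) -> Optional[List[str]]:
        pages = PAGES[partition]
        if self._token is None or self._token >= len(pages):
            return None
        page = pages[self._token]
        self._token += 1
        return page


class ThreadLocalPaginator:
    """Same behavior, but each thread gets its own token."""

    def __init__(self) -> None:
        self._local = threading.local()

    def reset(self) -> None:
        self._local.token = 0

    def next_page(self, partition: str) -> Optional[List[str]]:
        pages = PAGES[partition]
        token = getattr(self._local, "token", None)
        if token is None or token >= len(pages):
            return None
        self._local.token = token + 1
        return pages[token]


def read_partition(paginator, partition: str) -> Iterator[List[str]]:
    paginator.reset()
    while (page := paginator.next_page(partition)) is not None:
        yield page


def interleave(paginator) -> Dict[str, List[str]]:
    """Alternates between partitions page by page to mimic concurrent readers."""
    readers = {p: read_partition(paginator, p) for p in PAGES}
    seen: Dict[str, List[str]] = {p: [] for p in PAGES}
    while readers:
        for partition in list(readers):
            try:
                seen[partition].extend(next(readers[partition]))
            except StopIteration:
                del readers[partition]
    return seen


def read_all(paginator, partition: str, out: Dict[str, List[str]]) -> None:
    out[partition] = [r for page in read_partition(paginator, partition) for r in page]


lost = interleave(SharedStatePaginator())  # "b2" is skipped

results: Dict[str, List[str]] = {}
safe = ThreadLocalPaginator()
threads = [threading.Thread(target=read_all, args=(safe, p, results)) for p in PAGES]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Thread-local storage is only one option. Passing the token explicitly instead of storing it on the paginator would also remove the shared state.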
Note:
There are potentially other places where we are not thread safe, and we could do additional analysis to find them. However, rather than drag the work out and try to fix everything, we are going to make a calculated bet that the blast radius of thread-safety issues in the async retriever is relatively low. We may at some point have to revisit making all of our low-code components thread safe.
Acceptance Criteria
[ ] There should be no breaking changes to the low-code interface
I have the last PR to get our async streams to concurrent here: #185
It's been tested on some of our existing connectors, but we didn't want to release this during the holidays and want to wait for @maxi297 to get back to do a final sign off since the PR affects quite a few things. We're still on track.
Slack thread with more context:
https://airbytehq-team.slack.com/archives/C063B9A434H/p1733264441227669?thread_ts=1733257895.550789&cid=C063B9A434H