
Instantiate one stream per thread #21

Closed
wants to merge 3 commits into from

Conversation

maxi297 (Contributor) commented Nov 11, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a factory method for generating unique stream instances in a multi-threaded environment, enhancing stream management.
  • Bug Fixes

    • Improved handling of stream instantiation to prevent concurrent modifications.
  • Documentation

    • Updated documentation for CursorPartitionGenerator to clarify changes related to stream instantiation.
  • Tests

    • Adjusted test cases to reflect the new factory pattern for stream creation while preserving functionality.

coderabbitai bot (Contributor) commented Nov 11, 2024

📝 Walkthrough

The changes introduced in this pull request enhance the ConcurrentDeclarativeSource and CursorPartitionGenerator classes by implementing a factory method for creating stream instances. A new method, _new_stream_instance_factory, is added to ensure unique stream instances per thread, preventing concurrent modifications. The CursorPartitionGenerator constructor is updated to accept a stream_factory callable instead of a direct stream instance, promoting better flexibility. Corresponding updates are made to the test cases to reflect these changes in variable naming and instantiation.
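As a rough illustration of the walkthrough above, the factory pattern can be sketched as follows. Class and parameter names mirror the summary; the real CDK classes carry far more state and dependencies, so this is a simplified, hypothetical model rather than the actual implementation:

```python
from typing import Callable, Iterable, List


class Stream:
    """Simplified stand-in for the CDK Stream; the real class carries more state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cursor_value = None  # mutable per-sync state that must not be shared


class CursorPartitionGenerator:
    """Accepts a factory instead of a shared Stream instance, so every
    generated partition works on its own copy."""

    def __init__(self, stream_factory: Callable[[], Stream]) -> None:
        self._stream_factory = stream_factory

    def generate(self, n_slices: int) -> Iterable[Stream]:
        for _ in range(n_slices):  # one partition per cursor slice in the real code
            yield self._stream_factory()  # fresh instance per partition


generator = CursorPartitionGenerator(lambda: Stream("users"))
partitions: List[Stream] = list(generator.generate(3))
assert len({id(p) for p in partitions}) == 3  # no partition shares a stream
```

The key design point is that the generator never holds a `Stream` itself — only a zero-argument callable — so instance creation is deferred to the moment each partition needs one.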

Changes

| File Path | Change Summary |
| --- | --- |
| `airbyte_cdk/sources/declarative/concurrent_declarative_source.py` | Added `_new_stream_instance_factory` to create unique stream instances per thread. Updated `_group_streams` to use the new factory method. Added import for `Callable`. |
| `airbyte_cdk/sources/streams/concurrent/adapters.py` | Updated `CursorPartitionGenerator` to take `stream_factory: Callable[[], Stream]` instead of `stream: Stream` in the constructor. Adjusted documentation and imports accordingly. |
| `unit_tests/sources/streams/concurrent/test_adapters.py` | Renamed the `stream` variable to `stream_factory` in `test_cursor_partition_generator` to match the `CursorPartitionGenerator` changes. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Thread
    participant ConcurrentDeclarativeSource
    participant CursorPartitionGenerator

    Thread->>ConcurrentDeclarativeSource: Request new stream instance
    ConcurrentDeclarativeSource->>ConcurrentDeclarativeSource: Call _new_stream_instance_factory()
    ConcurrentDeclarativeSource->>Thread: Return stream_factory
    Thread->>CursorPartitionGenerator: Instantiate with stream_factory
    CursorPartitionGenerator->>CursorPartitionGenerator: Generate stream using stream_factory()
```

🐇 "In the land of streams, where rabbits play,
A factory blooms, brightening the day.
No more shared states, each thread gets its own,
With unique little streams, our joy has grown!
Hopping through code, we dance with delight,
In the world of adapters, everything's right!" 🌼


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
unit_tests/sources/streams/concurrent/test_adapters.py (2)

Line range hint 471-477: Consider adding thread-safety verification

While the test verifies basic functionality, it could be enhanced to explicitly verify that each thread gets a unique stream instance.

Consider adding a test case that:

  1. Creates multiple partitions concurrently
  2. Verifies each partition has a unique stream instance
  3. Ensures no stream instance is shared between partitions

Example addition:

```python
def test_cursor_partition_generator_thread_safety():
    stream_instances = set()
    stream_factory = Mock(side_effect=lambda: Mock(id=len(stream_instances)))
    # ... setup other dependencies ...

    partitions = list(partition_generator.generate())

    # Verify each partition has a unique stream instance
    for partition in partitions:
        stream_instances.add(id(partition._stream))
    assert len(stream_instances) == len(partitions), "Partitions should have unique stream instances"
```

Line range hint 478-485: Enhance assertions for factory pattern verification

The current assertions verify partition types and slice content, but could be extended to verify the factory pattern implementation.

Add assertions to verify:

  1. The stream_factory is called for each partition
  2. Each partition receives a new stream instance
```python
# Add after existing assertions
stream_factory.assert_has_calls([call() for _ in partitions])
```
airbyte_cdk/sources/streams/concurrent/adapters.py (1)

403-407: Remove reference to unused parameter in docstring

The docstring still references the sync_mode parameter which is no longer part of the method signature.

```diff
-        :param sync_mode: The synchronization mode.
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 734d778 and 09fc5d7.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (3 hunks)
  • airbyte_cdk/sources/streams/concurrent/adapters.py (4 hunks)
  • unit_tests/sources/streams/concurrent/test_adapters.py (2 hunks)
🔇 Additional comments (5)
unit_tests/sources/streams/concurrent/test_adapters.py (1)

451-451: LGTM: Variable renamed to reflect factory pattern

The renaming from stream to stream_factory accurately reflects the architectural change to use a factory pattern for creating stream instances per thread.

airbyte_cdk/sources/streams/concurrent/adapters.py (3)

9-9: LGTM: Import changes are appropriate

The addition of Callable to the typing imports is necessary for type-hinting the new stream_factory parameter.


393-393: LGTM: Stream factory parameter change

The change from stream to stream_factory: Callable[[], Stream] aligns with the PR objective of instantiating one stream per thread. The instance variable is properly stored.

Also applies to: 408-408


448-448: Verify thread safety of stream factory

The change to use self._stream_factory() correctly ensures each partition gets its own stream instance. However, we should verify that:

  1. The stream factory itself is thread-safe
  2. The stream creation process doesn't have any shared state that could cause thread safety issues
✅ Verification successful

Let me analyze the review comment and the provided code context.

The review comment suggests verifying that the stream factory is thread-safe when using self._stream_factory() instead of self._stream in the generate method.

Looking at the code, I can see that:

  1. The stream factory is created in ConcurrentDeclarativeSource._new_stream_instance_factory() which creates a new stream instance per thread to avoid threads updating the same field for a specific instance.

  2. The factory is used in CursorPartitionGenerator to create new stream instances for each partition.

  3. The factory implementation ensures thread safety by:

    • Creating a new stream instance for each call
    • Using the source's streams() method which creates fresh instances
    • Not sharing any mutable state between instances
  4. The code has been designed specifically to handle concurrent access by creating new stream instances rather than sharing them.

Based on this analysis, I can conclude that the stream factory implementation is thread-safe by design.

Stream factory is thread-safe

The change from self._stream to self._stream_factory() correctly ensures thread safety by:

  1. Creating a new stream instance for each partition
  2. Not sharing any mutable state between stream instances
  3. Using a factory pattern that creates fresh instances per thread
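The "fresh instance per thread" property described above can be exercised in isolation with a small standalone sketch. `StatefulStream` and `make_stream` are hypothetical names for illustration, not CDK code:

```python
import threading


class StatefulStream:
    """Hypothetical stand-in for a stateful declarative stream."""

    def __init__(self) -> None:
        self.current_cursor = None  # field that would be racy if instances were shared


def make_stream() -> StatefulStream:
    # The factory builds a brand-new instance on every call; nothing is cached.
    return StatefulStream()


instances = []
lock = threading.Lock()


def worker() -> None:
    stream = make_stream()  # each thread asks the factory for its own stream
    with lock:
        instances.append(stream)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No two threads received the same object, so there is no shared mutable state.
assert len({id(s) for s in instances}) == 4
```

If `make_stream` instead returned a cached instance, the final set would collapse to a single `id` and every thread would be mutating the same `current_cursor` field — which is exactly the hazard the factory pattern avoids.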
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for any shared state or synchronization mechanisms in stream factory implementations
rg -l "(?:threading\.|Lock|RLock|Semaphore|Event|Condition|@synchronized|@lock)" | \
  xargs -I {} sh -c 'echo "=== {} ==="; cat {}'

# Search for any static/class variables in stream implementations that could be shared
ast-grep --pattern 'class $CLASS {
  $$$
  @classmethod
  def $METHOD($$$) {
    $$$
  }
  $$$
}'
```

Length of output: 605931

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

6-6: Approved: Import Callable for type annotations

The addition of Callable to the typing imports is appropriate for type hinting the new method that returns a factory function.

Comment on lines +348 to +358
```python
def _new_stream_instance_factory(self, stream_to_copy: Stream, config: Mapping[str, Any]) -> Callable[[], Stream]:
    """
    Some of the declarative components are stateful. Therefore, we create one stream per thread in order to avoid threads updating
    the same field for a specific instance.
    """
    def _create_new_stream() -> Stream:
        streams_with_same_name = list(filter(lambda stream: stream.name == stream_to_copy.name, self.streams(config)))
        if len(streams_with_same_name) == 1:
            return streams_with_same_name[0]
        raise ValueError(f"Expected one stream with name `{stream_to_copy.name}` but got {len(streams_with_same_name)}")
    return _create_new_stream
```
coderabbitai bot (Contributor) commented Nov 11, 2024

⚠️ Potential issue

Critical Issue: _new_stream_instance_factory returns existing instances instead of creating new ones

The _new_stream_instance_factory method is intended to generate a factory function that creates a new stream instance per thread, preventing concurrent modifications to shared state. However, the current implementation returns existing stream instances from self.streams(config), which may lead to threads sharing the same instance and thus not resolving the concurrency issues.

To fix this issue, the factory function _create_new_stream should instantiate a new stream instance instead of returning an existing one. This can be achieved by recreating the stream using the stream definition and the component factory.

Consider modifying the _create_new_stream function as follows:

```diff
 def _create_new_stream() -> Stream:
-    streams_with_same_name = list(filter(lambda stream: stream.name == stream_to_copy.name, self.streams(config)))
-    if len(streams_with_same_name) == 1:
-        return streams_with_same_name[0]
-    raise ValueError(f"Expected one stream with name `{stream_to_copy.name}` but got {len(streams_with_same_name)}")
+    stream_definitions = [stream_def for stream_def in self._source_config.get('streams', []) if stream_def.get('name') == stream_to_copy.name]
+    if len(stream_definitions) == 1:
+        return self._constructor.create_component(
+            model_type=DeclarativeStream,
+            component_definition=stream_definitions[0],
+            config=config,
+        )
+    raise ValueError(f"Expected one stream definition with name `{stream_to_copy.name}` but got {len(stream_definitions)}")
```

This modification ensures that a new instance of the stream is created each time the factory function is called, thus avoiding shared state between threads.
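The failure mode described here is easy to reproduce in isolation: if the "factory" merely looks up an instance from a cached list, every call hands back the same object. A minimal, hypothetical repro (plain dicts stand in for stream instances; this is not the CDK code itself):

```python
class CachingSource:
    """Hypothetical source that builds its stream list once and then reuses it."""

    def __init__(self) -> None:
        self._streams = [{"name": "users"}]  # pre-built instances, cached

    def streams(self):
        return self._streams  # hands back the cached instances, not copies


source = CachingSource()


# Flawed "factory": filtering an existing list returns the same object on
# every call, so concurrent callers would still share one stream's state.
def broken_factory():
    return next(s for s in source.streams() if s["name"] == "users")


assert broken_factory() is broken_factory()  # same instance both times


# True factory: constructs a new object per call.
def real_factory():
    return {"name": "users"}


assert real_factory() is not real_factory()  # distinct instances each call
```

The `is` check is the quickest way to tell the two apart: a lookup that passes it is not a factory in the sense this PR needs.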

Committable suggestion skipped: line range outside the PR's diff.

A contributor commented:
Feel free to just dismiss this if not applicable - or if out of scope, you can suggest the AI to move to a new issue for later follow-up.

This comment was marked as off-topic.

aaronsteers (Contributor) commented:

@maxi297 - I pulled in the latest from main, fyi.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/streams/concurrent/adapters.py (2)

Line range hint 408-414: Should we add error handling for stream factory calls?

What do you think about adding error handling for potential factory failures? Something like:

```diff
 def generate(self) -> Iterable[Partition]:
     for slice_start, slice_end in self._cursor.generate_slices():
         stream_slice = StreamSlice(...)
+        try:
+            stream = self._stream_factory()
+        except Exception as e:
+            raise RuntimeError(f"Failed to create stream instance: {e}") from e

         yield StreamPartition(
-            self._stream_factory(),
+            stream,
             copy.deepcopy(stream_slice),
             ...
         )
```

This would provide clearer error messages if stream instantiation fails. WDYT?


Line range hint 448-455: Consider reusing the stream slice copy?

I notice we're doing a deep copy of the stream slice before passing it to StreamPartition. Since StreamPartition also does a deep copy internally, we might be duplicating effort. What do you think about this optimization:

```diff
+        slice_copy = copy.deepcopy(stream_slice)
         yield StreamPartition(
             self._stream_factory(),
-            copy.deepcopy(stream_slice),
+            slice_copy,
             self.message_repository,
             self._sync_mode,
             self._cursor_field,
             self._state,
             self._cursor,
         )
```

This would prevent double copying of the slice data. WDYT?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 09fc5d7 and a3d9732.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (3 hunks)
  • airbyte_cdk/sources/streams/concurrent/adapters.py (4 hunks)
  • unit_tests/sources/streams/concurrent/test_adapters.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/streams/concurrent/test_adapters.py
🔇 Additional comments (5)
airbyte_cdk/sources/streams/concurrent/adapters.py (2)

9-9: LGTM! Clean import addition.

The addition of Callable to the typing imports aligns well with the new stream factory pattern implementation.


Line range hint 393-403: Great use of the factory pattern for thread safety! 👍

The change from direct stream instance to a factory pattern is a solid improvement for concurrent execution. Each thread getting its own stream instance prevents potential race conditions.

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (3)

6-6: LGTM!

Clean addition of Callable to typing imports.


232-232: LGTM!

Good use of the factory pattern in CursorPartitionGenerator instantiation.


348-358: ⚠️ Potential issue

Should we create new stream instances instead of reusing existing ones?

The current implementation returns existing stream instances from self.streams(config), which could lead to thread safety issues as multiple threads would share the same stateful components. The docstring indicates we want "one stream per thread", but the implementation doesn't achieve this.

Would you consider creating new stream instances instead? Here's a suggestion that aligns better with the docstring's intent, wdyt?

```diff
 def _new_stream_instance_factory(self, stream_to_copy: Stream, config: Mapping[str, Any]) -> Callable[[], Stream]:
     """
     Some of the declarative components are stateful. Therefore, we create one stream per thread in order to avoid threads updating
     the same field for a specific instance.
     """
     def _create_new_stream() -> Stream:
-        streams_with_same_name = list(filter(lambda stream: stream.name == stream_to_copy.name, self.streams(config)))
-        if len(streams_with_same_name) == 1:
-            return streams_with_same_name[0]
-        raise ValueError(f"Expected one stream with name `{stream_to_copy.name}` but got {len(streams_with_same_name)}")
+        stream_definitions = [
+            stream_def for stream_def in self._source_config.get('streams', [])
+            if stream_def.get('name') == stream_to_copy.name
+        ]
+        if len(stream_definitions) == 1:
+            return self._constructor.create_component(
+                model_type=DeclarativeStream,
+                component_definition=stream_definitions[0],
+                config=config,
+            )
+        raise ValueError(f"Expected one stream definition with name `{stream_to_copy.name}` but got {len(stream_definitions)}")
     return _create_new_stream
```

This approach:

  1. Uses stream definitions to create fresh instances
  2. Ensures thread safety by avoiding shared state
  3. Better aligns with the docstring's intent

Let's verify the stream definition structure:

✅ Verification successful

Verification Successful: Stream definitions align with the expected structure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Check if stream definitions contain the expected structure
# Expected: Stream definitions should contain 'name' field and be compatible with DeclarativeStream

rg -A 5 '"streams":\s*\[' .
```

Length of output: 191835

maxi297 (Contributor, Author) commented Nov 12, 2024

Implemented as part of ddd299e

@maxi297 maxi297 closed this Nov 12, 2024