
Instantiate one stream per thread #21

Closed
wants to merge 3 commits into from

Conversation

maxi297 (Contributor) commented Nov 11, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a factory method for generating unique stream instances in a multi-threaded environment, enhancing stream management.
  • Bug Fixes

    • Improved handling of stream instantiation to prevent concurrent modifications.
  • Documentation

    • Updated documentation for CursorPartitionGenerator to clarify changes related to stream instantiation.
  • Tests

    • Adjusted test cases to reflect the new factory pattern for stream creation while preserving functionality.

coderabbitai bot (Contributor) commented Nov 11, 2024

📝 Walkthrough

The changes introduced in this pull request enhance the ConcurrentDeclarativeSource and CursorPartitionGenerator classes by implementing a factory method for creating stream instances. A new method, _new_stream_instance_factory, is added to ensure unique stream instances per thread, preventing concurrent modifications. The CursorPartitionGenerator constructor is updated to accept a stream_factory callable instead of a direct stream instance, promoting better flexibility. Corresponding updates are made to the test cases to reflect these changes in variable naming and instantiation.
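As a rough illustration of the walkthrough above, the factory pattern can be sketched as follows. Class and parameter names mirror the summary; the real CDK classes carry far more state and dependencies, so this is a simplified, hypothetical model rather than the actual implementation:

```python
from typing import Callable, Iterable, List


class Stream:
    """Simplified stand-in for the CDK Stream; the real class carries more state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cursor_value = None  # mutable per-sync state that must not be shared


class CursorPartitionGenerator:
    """Accepts a factory instead of a shared Stream instance, so every
    generated partition works on its own copy."""

    def __init__(self, stream_factory: Callable[[], Stream]) -> None:
        self._stream_factory = stream_factory

    def generate(self, n_slices: int) -> Iterable[Stream]:
        for _ in range(n_slices):  # one partition per cursor slice in the real code
            yield self._stream_factory()  # fresh instance per partition


generator = CursorPartitionGenerator(lambda: Stream("users"))
partitions: List[Stream] = list(generator.generate(3))
assert len({id(p) for p in partitions}) == 3  # no partition shares a stream
```

The key design point is that the generator never holds a `Stream` itself — only a zero-argument callable — so instance creation is deferred to the moment each partition needs one.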

Changes

| File Path | Change Summary |
| --- | --- |
| `airbyte_cdk/sources/declarative/concurrent_declarative_source.py` | Added `_new_stream_instance_factory` to create unique stream instances per thread. Updated `_group_streams` to use the new factory method. Added import for `Callable`. |
| `airbyte_cdk/sources/streams/concurrent/adapters.py` | Updated `CursorPartitionGenerator` to take `stream_factory: Callable[[], Stream]` instead of `stream: Stream` in the constructor. Adjusted documentation and imports accordingly. |
| `unit_tests/sources/streams/concurrent/test_adapters.py` | Renamed the `stream` variable to `stream_factory` in `test_cursor_partition_generator` to match the `CursorPartitionGenerator` changes. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Thread
    participant ConcurrentDeclarativeSource
    participant CursorPartitionGenerator

    Thread->>ConcurrentDeclarativeSource: Request new stream instance
    ConcurrentDeclarativeSource->>ConcurrentDeclarativeSource: Call _new_stream_instance_factory()
    ConcurrentDeclarativeSource->>Thread: Return stream_factory
    Thread->>CursorPartitionGenerator: Instantiate with stream_factory
    CursorPartitionGenerator->>CursorPartitionGenerator: Generate stream using stream_factory()
```

🐇 "In the land of streams, where rabbits play,
A factory blooms, brightening the day.
No more shared states, each thread gets its own,
With unique little streams, our joy has grown!
Hopping through code, we dance with delight,
In the world of adapters, everything's right!" 🌼


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
unit_tests/sources/streams/concurrent/test_adapters.py (2)

Line range hint 471-477: Consider adding thread-safety verification

While the test verifies basic functionality, it could be enhanced to explicitly verify that each thread gets a unique stream instance.

Consider adding a test case that:

  1. Creates multiple partitions concurrently
  2. Verifies each partition has a unique stream instance
  3. Ensures no stream instance is shared between partitions

Example addition:

```python
def test_cursor_partition_generator_thread_safety():
    stream_instances = set()
    stream_factory = Mock(side_effect=lambda: Mock(id=len(stream_instances)))
    # ... setup other dependencies ...

    partitions = list(partition_generator.generate())

    # Verify each partition has a unique stream instance
    for partition in partitions:
        stream_instances.add(id(partition._stream))
    assert len(stream_instances) == len(partitions), "Partitions should have unique stream instances"
```

Line range hint 478-485: Enhance assertions for factory pattern verification

The current assertions verify partition types and slice content, but could be extended to verify the factory pattern implementation.

Add assertions to verify:

  1. The stream_factory is called for each partition
  2. Each partition receives a new stream instance
```python
# Add after existing assertions
stream_factory.assert_has_calls([call() for _ in partitions])
```
airbyte_cdk/sources/streams/concurrent/adapters.py (1)

403-407: Remove reference to unused parameter in docstring

The docstring still references the sync_mode parameter which is no longer part of the method signature.

```diff
-        :param sync_mode: The synchronization mode.
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 734d778 and 09fc5d7.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (3 hunks)
  • airbyte_cdk/sources/streams/concurrent/adapters.py (4 hunks)
  • unit_tests/sources/streams/concurrent/test_adapters.py (2 hunks)
🔇 Additional comments (5)
unit_tests/sources/streams/concurrent/test_adapters.py (1)

451-451: LGTM: Variable renamed to reflect factory pattern

The renaming from stream to stream_factory accurately reflects the architectural change to use a factory pattern for creating stream instances per thread.

airbyte_cdk/sources/streams/concurrent/adapters.py (3)

9-9: LGTM: Import changes are appropriate

The addition of Callable to the typing imports is necessary for type-hinting the new stream_factory parameter.


393-393: LGTM: Stream factory parameter change

The change from stream to stream_factory: Callable[[], Stream] aligns with the PR objective of instantiating one stream per thread. The instance variable is properly stored.

Also applies to: 408-408


448-448: Verify thread safety of stream factory

The change to use self._stream_factory() correctly ensures each partition gets its own stream instance. However, we should verify that:

  1. The stream factory itself is thread-safe
  2. The stream creation process doesn't have any shared state that could cause thread safety issues
✅ Verification successful

Let me analyze the review comment and the provided code context.

The review comment suggests verifying that the stream factory is thread-safe when using self._stream_factory() instead of self._stream in the generate method.

Looking at the code, I can see that:

  1. The stream factory is created in ConcurrentDeclarativeSource._new_stream_instance_factory() which creates a new stream instance per thread to avoid threads updating the same field for a specific instance.

  2. The factory is used in CursorPartitionGenerator to create new stream instances for each partition.

  3. The factory implementation ensures thread safety by:

    • Creating a new stream instance for each call
    • Using the source's streams() method which creates fresh instances
    • Not sharing any mutable state between instances
  4. The code has been designed specifically to handle concurrent access by creating new stream instances rather than sharing them.

Based on this analysis, I can conclude that the stream factory implementation is thread-safe by design.

Stream factory is thread-safe

The change from self._stream to self._stream_factory() correctly ensures thread safety by:

  1. Creating a new stream instance for each partition
  2. Not sharing any mutable state between stream instances
  3. Using a factory pattern that creates fresh instances per thread
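The "fresh instance per thread" property described above can be exercised in isolation with a small standalone sketch. `StatefulStream` and `make_stream` are hypothetical names for illustration, not CDK code:

```python
import threading


class StatefulStream:
    """Hypothetical stand-in for a stateful declarative stream."""

    def __init__(self) -> None:
        self.current_cursor = None  # field that would be racy if instances were shared


def make_stream() -> StatefulStream:
    # The factory builds a brand-new instance on every call; nothing is cached.
    return StatefulStream()


instances = []
lock = threading.Lock()


def worker() -> None:
    stream = make_stream()  # each thread asks the factory for its own stream
    with lock:
        instances.append(stream)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No two threads received the same object, so there is no shared mutable state.
assert len({id(s) for s in instances}) == 4
```

If `make_stream` instead returned a cached instance, the final set would collapse to a single `id` and every thread would be mutating the same `current_cursor` field — which is exactly the hazard the factory pattern avoids.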
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for any shared state or synchronization mechanisms in stream factory implementations
rg -l "(?:threading\.|Lock|RLock|Semaphore|Event|Condition|@synchronized|@lock)" | \
  xargs -I {} sh -c 'echo "=== {} ==="; cat {}'

# Search for any static/class variables in stream implementations that could be shared
ast-grep --pattern 'class $CLASS {
  $$$
  @classmethod
  def $METHOD($$$) {
    $$$
  }
  $$$
}'
```

Length of output: 605931

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

6-6: Approved: Import Callable for type annotations

The addition of Callable to the typing imports is appropriate for type hinting the new method that returns a factory function.

Comment on lines +348 to +358
```python
def _new_stream_instance_factory(self, stream_to_copy: Stream, config: Mapping[str, Any]) -> Callable[[], Stream]:
    """
    Some of the declarative components are stateful. Therefore, we create one stream per thread in order to avoid threads updating
    the same field for a specific instance.
    """
    def _create_new_stream() -> Stream:
        streams_with_same_name = list(filter(lambda stream: stream.name == stream_to_copy.name, self.streams(config)))
        if len(streams_with_same_name) == 1:
            return streams_with_same_name[0]
        raise ValueError(f"Expected one stream with name `{stream_to_copy.name}` but got {len(streams_with_same_name)}")
    return _create_new_stream
```
coderabbitai bot (Contributor) commented Nov 11, 2024

⚠️ Potential issue

Critical Issue: _new_stream_instance_factory returns existing instances instead of creating new ones

The _new_stream_instance_factory method is intended to generate a factory function that creates a new stream instance per thread, preventing concurrent modifications to shared state. However, the current implementation returns existing stream instances from self.streams(config), which may lead to threads sharing the same instance and thus not resolving the concurrency issues.

To fix this issue, the factory function _create_new_stream should instantiate a new stream instance instead of returning an existing one. This can be achieved by recreating the stream using the stream definition and the component factory.

Consider modifying the _create_new_stream function as follows:

```diff
 def _create_new_stream() -> Stream:
-    streams_with_same_name = list(filter(lambda stream: stream.name == stream_to_copy.name, self.streams(config)))
-    if len(streams_with_same_name) == 1:
-        return streams_with_same_name[0]
-    raise ValueError(f"Expected one stream with name `{stream_to_copy.name}` but got {len(streams_with_same_name)}")
+    stream_definitions = [stream_def for stream_def in self._source_config.get('streams', []) if stream_def.get('name') == stream_to_copy.name]
+    if len(stream_definitions) == 1:
+        return self._constructor.create_component(
+            model_type=DeclarativeStream,
+            component_definition=stream_definitions[0],
+            config=config,
+        )
+    raise ValueError(f"Expected one stream definition with name `{stream_to_copy.name}` but got {len(stream_definitions)}")
```

This modification ensures that a new instance of the stream is created each time the factory function is called, thus avoiding shared state between threads.
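The failure mode described here is easy to reproduce in isolation: if the "factory" merely looks up an instance from a cached list, every call hands back the same object. A minimal, hypothetical repro (plain dicts stand in for stream instances; this is not the CDK code itself):

```python
class CachingSource:
    """Hypothetical source that builds its stream list once and then reuses it."""

    def __init__(self) -> None:
        self._streams = [{"name": "users"}]  # pre-built instances, cached

    def streams(self):
        return self._streams  # hands back the cached instances, not copies


source = CachingSource()


# Flawed "factory": filtering an existing list returns the same object on
# every call, so concurrent callers would still share one stream's state.
def broken_factory():
    return next(s for s in source.streams() if s["name"] == "users")


assert broken_factory() is broken_factory()  # same instance both times


# True factory: constructs a new object per call.
def real_factory():
    return {"name": "users"}


assert real_factory() is not real_factory()  # distinct instances each call
```

The `is` check is the quickest way to tell the two apart: a lookup that passes it is not a factory in the sense this PR needs.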

Committable suggestion skipped: line range outside the PR's diff.

A contributor commented:
Feel free to just dismiss this if not applicable - or if out of scope, you can suggest the AI to move to a new issue for later follow-up.

This comment was marked as off-topic.

aaronsteers (Contributor) commented:

@maxi297 - I pulled in the latest from main, fyi.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/streams/concurrent/adapters.py (2)

Line range hint 408-414: Should we add error handling for stream factory calls?

What do you think about adding error handling for potential factory failures? Something like:

```diff
 def generate(self) -> Iterable[Partition]:
     for slice_start, slice_end in self._cursor.generate_slices():
         stream_slice = StreamSlice(...)
+        try:
+            stream = self._stream_factory()
+        except Exception as e:
+            raise RuntimeError(f"Failed to create stream instance: {e}") from e

         yield StreamPartition(
-            self._stream_factory(),
+            stream,
             copy.deepcopy(stream_slice),
             ...
         )
```

This would provide clearer error messages if stream instantiation fails. WDYT?


Line range hint 448-455: Consider reusing the stream slice copy?

I notice we're doing a deep copy of the stream slice before passing it to StreamPartition. Since StreamPartition also does a deep copy internally, we might be duplicating effort. What do you think about this optimization:

```diff
+        slice_copy = copy.deepcopy(stream_slice)
         yield StreamPartition(
             self._stream_factory(),
-            copy.deepcopy(stream_slice),
+            slice_copy,
             self.message_repository,
             self._sync_mode,
             self._cursor_field,
             self._state,
             self._cursor,
         )
```

This would prevent double copying of the slice data. WDYT?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 09fc5d7 and a3d9732.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (3 hunks)
  • airbyte_cdk/sources/streams/concurrent/adapters.py (4 hunks)
  • unit_tests/sources/streams/concurrent/test_adapters.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/streams/concurrent/test_adapters.py
🔇 Additional comments (5)
airbyte_cdk/sources/streams/concurrent/adapters.py (2)

9-9: LGTM! Clean import addition.

The addition of Callable to the typing imports aligns well with the new stream factory pattern implementation.


Line range hint 393-403: Great use of the factory pattern for thread safety! 👍

The change from direct stream instance to a factory pattern is a solid improvement for concurrent execution. Each thread getting its own stream instance prevents potential race conditions.

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (3)

6-6: LGTM!

Clean addition of Callable to typing imports.


232-232: LGTM!

Good use of the factory pattern in CursorPartitionGenerator instantiation.


348-358: ⚠️ Potential issue

Should we create new stream instances instead of reusing existing ones?

The current implementation returns existing stream instances from self.streams(config), which could lead to thread safety issues as multiple threads would share the same stateful components. The docstring indicates we want "one stream per thread", but the implementation doesn't achieve this.

Would you consider creating new stream instances instead? Here's a suggestion that aligns better with the docstring's intent, wdyt?

```diff
 def _new_stream_instance_factory(self, stream_to_copy: Stream, config: Mapping[str, Any]) -> Callable[[], Stream]:
     """
     Some of the declarative components are stateful. Therefore, we create one stream per thread in order to avoid threads updating
     the same field for a specific instance.
     """
     def _create_new_stream() -> Stream:
-        streams_with_same_name = list(filter(lambda stream: stream.name == stream_to_copy.name, self.streams(config)))
-        if len(streams_with_same_name) == 1:
-            return streams_with_same_name[0]
-        raise ValueError(f"Expected one stream with name `{stream_to_copy.name}` but got {len(streams_with_same_name)}")
+        stream_definitions = [
+            stream_def for stream_def in self._source_config.get('streams', [])
+            if stream_def.get('name') == stream_to_copy.name
+        ]
+        if len(stream_definitions) == 1:
+            return self._constructor.create_component(
+                model_type=DeclarativeStream,
+                component_definition=stream_definitions[0],
+                config=config,
+            )
+        raise ValueError(f"Expected one stream definition with name `{stream_to_copy.name}` but got {len(stream_definitions)}")
     return _create_new_stream
```

This approach:

  1. Uses stream definitions to create fresh instances
  2. Ensures thread safety by avoiding shared state
  3. Better aligns with the docstring's intent

Let's verify the stream definition structure:

✅ Verification successful

Verification Successful: Stream definitions align with the expected structure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Check if stream definitions contain the expected structure
# Expected: Stream definitions should contain 'name' field and be compatible with DeclarativeStream

rg -A 5 '"streams":\s*\[' .
```

Length of output: 191835

maxi297 (Contributor, Author) commented Nov 12, 2024

Implemented as part of ddd299e

@maxi297 maxi297 closed this Nov 12, 2024