Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(source-declarative-manifest): add support for custom Python components from dynamic text input #174

Merged
merged 43 commits into from
Jan 22, 2025

Conversation

aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Dec 14, 2024

Summary

Adds ability to parse components.py text passed in the config. The feature is disabled-by-default in all of our current runtimes, but it unblocks the next stage of development and testing of the Platform. See below security section for more info.

New feature

The source-declarative-manifest connector will now accept these config args:

  • __injected_manifest: (already existing config) this is how we receive the manifest yaml.
  • __injected_components_py: (new) the Python components.py text passed from the Builder
  • __injected_components_py_checksums: (new) one or more checksum to ensure code is not tampered with in flight:
    • md5
    • sha256

Notes on security

  • The connector will not parse or accept any custom code unless the AIRBYTE_ALLOW_CUSTOM_CODE env var is set to true.
  • This feature will only be turned on when we know we are in a sandboxed/untrusted execution context (or when the user is ourselves).
  • For now, at least one checksum is required for tamper-detection. Note that this check doesn't attempt to check if the code is "safe" or "approved", only that it hasn't been modified in flight.

Summary by CodeRabbit

  • New Features

    • Enhanced validation for configuration and checksum handling.
    • Implemented a new pagination strategy for API interactions.
    • Added a new configuration for integrating with The Guardian API.
    • New start_date configuration option added for relevant time periods.
    • Introduced a custom exception for testing error handling scenarios.
    • Added support for dynamic module loading and registration based on configuration.
    • Added a new task for locking dependencies without updates.
  • Bug Fixes

    • Improved error handling for custom code execution.
    • Enhanced module loading and component registration mechanisms.
  • Documentation

    • Added README for Guardian API test resources.
    • Updated docstrings for improved clarity.
  • Chores

    • Updated type hinting in various modules.
    • Updated .gitignore to exclude sensitive files.

@aaronsteers aaronsteers changed the title [draft] skeleton: components module from dynamic text input feat: add support for custom Python components from dynamic text input Jan 13, 2025
@aaronsteers aaronsteers changed the title feat: add support for custom Python components from dynamic text input feat(source-declarative-manifest): add support for custom Python components from dynamic text input Jan 13, 2025
@github-actions github-actions bot added the enhancement New feature or request label Jan 13, 2025
@aaronsteers aaronsteers marked this pull request as ready for review January 13, 2025 23:33

This comment was marked as off-topic.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (7)
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py (1)

21-24: Consider using snake_case for variable names to follow Python conventions

Variable names currPage and totalPages are using camelCase, which is less common in Python. Could we rename them to current_page and total_pages to match PEP 8 naming conventions? WDYT?

unit_tests/source_declarative_manifest/conftest.py (1)

13-20: Should we handle invalid hash_type inputs gracefully?

Currently, if an invalid hash_type is provided to hash_text, it raises a KeyError. Would it be better to handle this case and raise a more informative exception or provide a default behavior? WDYT?

airbyte_cdk/cli/source_declarative_manifest/_run.py (1)

174-178: LGTM! Consider enhancing the error message with the actual value?

The type validation is a great addition! Would you consider adding the actual value to the error message to make debugging easier? Something like:

-                f"but got type: {type(config['__injected_declarative_manifest'])}"
+                f"but got type: {type(config['__injected_declarative_manifest'])} with value: {config['__injected_declarative_manifest']!r}"

wdyt?

unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (2)

42-60: Consider adding edge cases to the test?

The basic functionality test looks good! Would you consider adding some edge cases like:

  • Empty string input
  • Invalid Python syntax
  • Module with syntax errors
    wdyt?

89-134: Comprehensive integration test, but might need error cases

The happy path testing is thorough! Would you consider adding test cases for:

  1. Missing or invalid checksums
  2. Invalid manifest structure
  3. Error handling during source creation

Also, the start_date truncation comment could be more descriptive about why 2 days is chosen:

-    # Truncate the start_date to speed up tests
+    # Truncate the start_date to 2 days ago to reduce API calls while still testing recent data
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml (2)

63-156: Consider reducing duplication in stream definitions?

I notice that base_stream and content_stream have identical configurations for incremental_sync, retriever, and their sub-components. Would it make sense to use YAML anchors and aliases to reduce this duplication? For example:

  base_stream: &base_stream_config
    incremental_sync:
      # ... existing config ...
    retriever:
      # ... existing config ...

  content_stream:
    <<: *base_stream_config
    schema_loader:
      # ... specific schema config ...

This would make the configuration more maintainable and less prone to drift. wdyt?


327-327: Consider tightening the date pattern regex?

The current pattern ^([1-9][0-9]{3})\-(0?[1-9]|1[012])\-(0?[1-9]|[12][0-9]|3[01])$ allows some invalid dates (like 2023-02-31). Would it be helpful to use a more restrictive pattern that validates actual calendar dates? wdyt?

Also applies to: 374-374

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 216cd43 and c837745.

📒 Files selected for processing (11)
  • airbyte_cdk/cli/source_declarative_manifest/_run.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
  • airbyte_cdk/test/utils/manifest_only_fixtures.py (3 hunks)
  • pyproject.toml (1 hunks)
  • unit_tests/source_declarative_manifest/conftest.py (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/.gitignore (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/README.md (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/valid_config.yaml (1 hunks)
  • unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/.gitignore
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/valid_config.yaml
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/README.md
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (9)
unit_tests/source_declarative_manifest/conftest.py (1)

23-23: Adding return type annotation looks good!

The addition of the return type annotation -> str improves code readability and type checking.

airbyte_cdk/test/utils/manifest_only_fixtures.py (1)

54-68: ⚠️ Potential issue

Security consideration: Using exec() on user input

The implementation looks clean, but executing arbitrary Python code could be dangerous. Should we add a warning in the docstring about the security implications? Also, consider adding input validation or sandboxing if this will be used with untrusted input.

Let's check if this is used with user input:

unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1)

28-35: LGTM! Simple and clear test fixture.

The sample components are well-structured and demonstrate both function and class implementations.

pyproject.toml (1)

129-129: LGTM! Useful addition for dependency management.

The new lock task provides a clean way to update the lockfile without updating dependencies.

unit_tests/source_declarative_manifest/resources/source_the_guardian_api/manifest.yaml (5)

312-312: Consider restricting additional properties?

additionalProperties: true allows any unknown properties in the configuration. Given this is an API connector, would it be safer to set this to false to catch potential typos in property names? wdyt?


297-300: LGTM! Simple and effective check configuration.

The check configuration properly validates the essential "content" stream.


304-304: LGTM! Well-documented with proper security considerations.

The documentation URLs and descriptions are clear, and the API key is properly marked as airbyte_secret: true.

Also applies to: 317-319


1-1: Consider using a more specific version format?

The version "4.3.2" seems to be using semantic versioning, but it might be helpful to align it with the actual Guardian API version if one exists. wdyt?

Let me check if there's a specific version format used by The Guardian API:


45-45: Verify the availability of the custom pagination class?

The CustomPageIncrement class is referenced in multiple places. Let's verify its existence and implementation:

Also applies to: 91-92, 139-139, 231-231

✅ Verification successful

The CustomPageIncrement class is properly implemented and available! 👍

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the custom pagination class exists and examine its implementation
rg -l "class CustomPageIncrement" 
ast-grep --pattern 'class CustomPageIncrement { $$$ }'

Length of output: 179


Script:

#!/bin/bash
# Examine the implementation of CustomPageIncrement class
rg -A 10 "class CustomPageIncrement" unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py

Length of output: 577

coderabbitai bot added a commit that referenced this pull request Jan 13, 2025
Docstrings generation was requested by @aaronsteers.

* #174 (comment)

The following files were modified:

* `airbyte_cdk/cli/source_declarative_manifest/_run.py`
* `airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`
* `airbyte_cdk/test/utils/manifest_only_fixtures.py`
* `unit_tests/source_declarative_manifest/conftest.py`
* `unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py`
* `unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py`
Copy link
Contributor

coderabbitai bot commented Jan 13, 2025

Note

We have generated docstrings for this pull request, at #218

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

990-993: Consider adding error handling for missing components module.

The code assumes the components module will be available. Should we add more descriptive error messages for common failure scenarios? WDYT?

     custom_component_class = self._get_class_from_fully_qualified_class_name(
         full_qualified_class_name=model.class_name,
-        components_module=self._get_components_module_object(config=config),
+        components_module=self._get_components_module_object(config=config),
+    ) if hasattr(model, 'class_name') else None
+    
+    if not custom_component_class:
+        raise ValueError(f"Could not create custom component. Missing class_name in {model}")

1084-1107: Consider documenting security implications of executing code from configuration.

Based on the retrieved learnings, custom Python components are intentionally unrestricted to support unpredictable use cases. Should we add a docstring warning about the security implications? WDYT?

     def _get_components_module_object(
         config: Config,
     ) -> types.ModuleType:
-        """Get a components module object based on the provided config.
+        """Get a components module object based on the provided config and execute the code.
 
         If custom python components is provided, this will be loaded. Otherwise, we will
         attempt to load from the `components` module already imported.
+        
+        Warning:
+            This method executes Python code from configuration without restrictions.
+            Ensure that the configuration is from a trusted source as it has full access
+            to Python's capabilities including file operations and network access.
         """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c837745 and c54a73d.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
🧰 Additional context used
📓 Learnings (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1063-1067
Timestamp: 2025-01-13T23:49:48.658Z
Learning: When implementing code execution from configuration, use a combination of security measures:
1. AST validation to detect dangerous operations
2. RestrictedPython or custom restricted execution environment
3. Limited builtins to prevent access to dangerous functions
4. Import hooks and global restrictions to prevent file/network access
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1063-1067
Timestamp: 2025-01-13T23:38:28.436Z
Learning: When executing Python code from configuration, implement security measures like AST validation, sandboxing, or restricted execution environments to prevent security vulnerabilities.
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py:1093-1102
Timestamp: 2025-01-14T00:20:32.310Z
Learning: In the `airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` file, the strict module name checks in `_get_class_from_fully_qualified_class_name` (requiring `module_name` to be "components" and `module_name_full` to be "source_declarative_manifest.components") are intentionally designed to provide early, clear feedback when class declarations won't be found later in execution. These restrictions may be loosened in the future if the requirements for class definition locations change.
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

10-11: LGTM! The new imports support the custom components functionality.

The addition of sys and types modules is appropriate for dynamic module creation and management.


1046-1083: The module name validation aligns with the documented requirements.

Based on the retrieved learnings, the strict module name checks are intentionally designed to provide early, clear feedback when class declarations won't be found later in execution.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceab6fd and c8de81a.

📒 Files selected for processing (6)
  • airbyte_cdk/cli/source_declarative_manifest/_run.py (1 hunks)
  • airbyte_cdk/connector_builder/connector_builder_handler.py (1 hunks)
  • airbyte_cdk/sources/declarative/manifest_declarative_source.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/custom_code_compiler.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
  • airbyte_cdk/test/utils/manifest_only_fixtures.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • airbyte_cdk/test/utils/manifest_only_fixtures.py
  • airbyte_cdk/cli/source_declarative_manifest/_run.py
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
  • airbyte_cdk/sources/declarative/parsers/custom_code_compiler.py
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/connector_builder/connector_builder_handler.py

[error] 55-55: Argument "config" to "ManifestDeclarativeSource" has incompatible type "Mapping[str, Any]"; expected "dict[str, Any] | None"

airbyte_cdk/sources/declarative/manifest_declarative_source.py

[error] 86-86: Argument "config" to "get_registered_components_module" has incompatible type "dict[str, Any] | None"; expected "dict[str, Any]"

⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)

71-76: Great job on the docstring improvements!

The docstring follows a clear format with the Args section and provides detailed descriptions for each parameter.

Copy link
Contributor

@bnchrch bnchrch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides there being one very minor place we could use a unit test, this was so well written it brought a tear to my eye.

Well structured, with simple single purpose functions, and great comments when necessary.

This was a joy to read. Thanks @aaronsteers .

a-man-in-a-suit-and-tie-is-looking-at-the-camera-with-a-serious-expression-on-his-face

Copy link
Contributor

@lmossman lmossman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes make sense to me, though I did not test it locally.

Thank you for the thorough documentation throughout the code!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
airbyte_cdk/sources/declarative/parsers/custom_code_compiler.py (1)

131-136: 🛠️ Refactor suggestion

Consider adding security measures around exec(), wdyt?

The use of exec() without restrictions could be dangerous. We should consider:

  1. Adding a restricted globals dict
  2. Running the code in a separate namespace
     # Create a new module object
     components_module = ModuleType(name=COMPONENTS_MODULE_NAME)
 
+    # Create a restricted globals dict
+    restricted_globals = {
+        '__builtins__': {
+            name: getattr(__builtins__, name)
+            for name in ['True', 'False', 'None', 'len', 'dict', 'list', 'str', 'int', 'float']
+        }
+    }
+
     # Execute the module text in the module's namespace
-    exec(components_py_text, components_module.__dict__)
+    exec(components_py_text, restricted_globals, components_module.__dict__)
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/custom_code_compiler.py (2)

12-16: Consider using a more restrictive type for ChecksumType, wdyt?

The type alias could be more specific by using the exact string literals that are used in CHECKSUM_FUNCTIONS.

-ChecksumType = Literal["md5", "sha256"]
+ChecksumType = Literal["md5", "sha256"]  # type: ignore
 CHECKSUM_FUNCTIONS: dict[ChecksumType, Any] = {
     "md5": hashlib.md5,
     "sha256": hashlib.sha256,
 }

25-30: Consider adding more details to the error message, wdyt?

The docstring explains the purpose well, but the error message could include more specific details about what to do when this error occurs.

 class AirbyteCodeTamperedError(Exception):
     """Raised when the connector's components module does not match its checksum.
 
     This is a fatal error, as it can be a sign of code tampering.
     """
+    def __init__(self, message: str) -> None:
+        super().__init__(
+            f"{message} This could indicate code tampering. Please verify the source "
+            "of your connector code and ensure it hasn't been modified in transit."
+        )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8de81a and bff4dc4.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/manifest_declarative_source.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/custom_code_compiler.py (1 hunks)
  • unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/declarative/manifest_declarative_source.py
  • unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/custom_code_compiler.py (2)

63-82: Add tests for checksum validation, wdyt?

We should add unit tests to verify that:

  1. Checksums are correctly validated
  2. Tampering attempts are detected
  3. Different hash types work as expected
#!/bin/bash
# Check if tests exist for checksum validation
rg -l "test.*checksum" "test_custom_code_compiler.py"

63-82: 🛠️ Refactor suggestion

Consider adding additional security validations for the code text, wdyt?

The function validates checksums but could benefit from additional security measures:

  1. Basic syntax validation before execution
  2. Size limits for the code text
  3. Pattern matching for potentially dangerous imports or function calls
 def validate_python_code(
     code_text: str,
     checksums: dict[str, str] | None,
 ) -> None:
     """Validate the provided Python code text against the provided checksums.
 
     Currently we fail if no checksums are provided, although this may change in the future.
     """
+    # Add size limit
+    if len(code_text) > 1_000_000:  # 1MB
+        raise ValueError("Code text is too large.")
+
+    # Basic syntax validation
+    try:
+        compile(code_text, '<string>', 'exec')
+    except SyntaxError as e:
+        raise ValueError(f"Invalid Python syntax: {e}")
+
+    # Check for potentially dangerous patterns
+    dangerous_patterns = ['os.system', 'subprocess', 'eval(', 'exec(']
+    for pattern in dangerous_patterns:
+        if pattern in code_text:
+            raise ValueError(f"Potentially dangerous code pattern found: {pattern}")
+
     if not checksums:
         raise ValueError(f"A checksum is required to validate the code. Received: {checksums}")

Likely invalid or redundant comment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (4)

102-107: Add error handling for YAML parsing.

The YAML loading could fail with malformed input. Should we add try-except blocks for better error messages? Wdyt?

-    manifest_dict = yaml.safe_load(manifest_yml_path.read_text())
-    assert manifest_dict, "Failed to load the manifest file."
+    try:
+        manifest_dict = yaml.safe_load(manifest_yml_path.read_text())
+        if not manifest_dict:
+            raise ValueError("Manifest file is empty")
+    except yaml.YAMLError as e:
+        raise ValueError(f"Failed to parse manifest YAML: {e}")

298-301: Specify the expected exception type.

The bare except could mask important errors like KeyboardInterrupt. Should we specify the expected exception type? Wdyt?

-            with pytest.raises(Exception):
+            with pytest.raises((ValueError, RuntimeError)):  # specify expected exceptions

267-273: Ensure proper cleanup of temporary files.

The delete=False flag means temporary files aren't automatically cleaned up. Should we use a context manager to ensure cleanup? Wdyt?

-    with NamedTemporaryFile(delete=False, suffix=".json") as temp_config_file:
+    temp_config_file = NamedTemporaryFile(delete=False, suffix=".json")
+    try:
         json_str = json.dumps(py_components_config_dict)
         Path(temp_config_file.name).write_text(json_str)
         temp_config_file.flush()
         source = create_declarative_source(
             ["check", "--config", temp_config_file.name],
         )
+    finally:
+        os.unlink(temp_config_file.name)

237-241: Consider using a mock for credentials.

The test is skipped if secrets.yaml is missing. Should we consider using a mock instead of requiring real credentials? This would make the tests more reliable and faster. Wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bff4dc4 and 3921341.

📒 Files selected for processing (5)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1 hunks)
  • airbyte_cdk/sources/declarative/manifest_declarative_source.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
  • pyproject.toml (1 hunks)
  • unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • pyproject.toml
  • airbyte_cdk/sources/declarative/manifest_declarative_source.py
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1)

1-45: LGTM! Well-organized imports and clear test fixture.

The imports are comprehensive and properly typed. The SAMPLE_COMPONENTS_PY_TEXT provides a clear and simple test fixture.

@aaronsteers aaronsteers enabled auto-merge (squash) January 21, 2025 16:38
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components_failing.py (3)

15-17: Consider inheriting from a more specific exception class?

The exception class looks good! Though I was wondering if inheriting from RuntimeError might be more appropriate since this is an intentional runtime failure, wdyt? 🤔

-class IntentionalException(Exception):
+class IntentionalException(RuntimeError):
     """This exception is raised intentionally in order to test error handling."""

31-39: Could we make the page_size validation more robust?

The page size validation looks good, but I have a few suggestions that might make it even better:

  1. Should we add a check for negative page sizes?
  2. The error message could be more descriptive, maybe include the actual value?

What do you think about something like this? 🤔

     def __post_init__(self, parameters: Mapping[str, Any]) -> None:
         if isinstance(self.page_size, int) or (self.page_size is None):
+            if self.page_size is not None and self.page_size <= 0:
+                raise ValueError(f"Page size must be positive, got: {self.page_size}")
             self._page_size = self.page_size
         else:
             page_size = InterpolatedString(self.page_size, parameters=parameters).eval(self.config)
             if not isinstance(page_size, int):
-                raise Exception(f"{page_size} is of type {type(page_size)}. Expected {int}")
+                raise ValueError(f"Page size must be an integer, got: {page_size} of type {type(page_size).__name__}")
             self._page_size = page_size

40-52: Add context to intentional failures?

Both initial_token and next_page_token raise IntentionalException, but would it be helpful to add some context about which method failed? This could make debugging test failures easier, wdyt? 🤔

     @property
     def initial_token(self) -> Optional[Any]:
-        raise IntentionalException()
+        raise IntentionalException("Intentional failure in initial_token")

     def next_page_token(
         self,
         response: requests.Response,
         last_page_size: int,
         last_record: Optional[Record],
         last_page_token_value: Optional[Any],
     ) -> Optional[Any]:
-        raise IntentionalException()
+        raise IntentionalException("Intentional failure in next_page_token")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c4e01f and 1c35577.

📒 Files selected for processing (5)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1 hunks)
  • airbyte_cdk/sources/declarative/manifest_declarative_source.py (3 hunks)
  • unit_tests/connector_builder/test_connector_builder_handler.py (3 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py (1 hunks)
  • unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components_failing.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/declarative/manifest_declarative_source.py
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py
🧰 Additional context used
📓 Learnings (1)
unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py (1)
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#174
File: unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py:21-29
Timestamp: 2025-01-13T23:39:15.457Z
Learning: The CustomPageIncrement class in unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py is imported from another connector definition and should not be modified in this context.
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (7)
unit_tests/connector_builder/test_connector_builder_handler.py (3)

347-347: LGTM! Nice update to use keyword argument.

The change from positional to keyword argument source_config improves code clarity and maintainability.


508-508: LGTM! Consistent with the previous change.

The update maintains consistency with the earlier change to use the source_config keyword argument.


595-595: LGTM! Completes the consistent updates.

The change aligns with the pattern of using the source_config keyword argument across all test cases.

unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components_failing.py (3)

1-13: Clean imports and proper copyright notice!

The imports are well-organized and properly separated between standard library, third-party, and local modules. Nice work! 👍


19-23: Docstring seems inconsistent with implementation

The docstring mentions starting from page 1, but start_from_page defaults to 0. Should we align these, or am I missing something? 🤔


53-54: LGTM!

The get_page_size method is simple and does exactly what it should. 👍

unit_tests/source_declarative_manifest/resources/source_the_guardian_api/components.py (1)

1-61: Code looks good!

The CustomPageIncrement class is implemented correctly, and all methods align with the intended functionality.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/declarative/test_manifest_declarative_source.py (2)

19-19: Would you consider expanding the comment to be more specific about why it's needed?

The current comment indicates it's needed for dynamic imports, but it might be helpful to explain which specific dynamic imports rely on this module, wdyt?

-import unit_tests.sources.declarative.external_component  # Needed for dynamic imports to work
+import unit_tests.sources.declarative.external_component  # Required for custom component dynamic imports in test_valid_manifest

268-271: LGTM! Consider extracting these assertions into a helper method?

The assertions are well-structured and verify the module loading hierarchy. To improve maintainability and reusability, would you consider extracting them into a helper method, wdyt?

+def assert_modules_loaded():
+    """Verify that all required modules are properly loaded."""
+    assert "unit_tests" in sys.modules
+    assert "unit_tests.sources" in sys.modules
+    assert "unit_tests.sources.declarative" in sys.modules
+    assert "unit_tests.sources.declarative.external_component" in sys.modules
+
 def test_valid_manifest():
     manifest = {
         # ... manifest configuration ...
     }
+    assert_modules_loaded()
-    assert "unit_tests" in sys.modules
-    assert "unit_tests.sources" in sys.modules
-    assert "unit_tests.sources.declarative" in sys.modules
-    assert "unit_tests.sources.declarative.external_component" in sys.modules
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c35577 and e6b28b6.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/test_manifest_declarative_source.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
unit_tests/sources/declarative/test_manifest_declarative_source.py (1)

Line range hint 272-1207: No changes in this section.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (3)

140-148: Consider using context manager for temporary file cleanup.

The temporary files created with delete=False aren't automatically cleaned up. Should we ensure cleanup using a try-finally block? Wdyt?

-    with NamedTemporaryFile(delete=False, suffix=".json") as temp_config_file:
+    temp_config_file = NamedTemporaryFile(delete=False, suffix=".json")
+    try:
         json_str = json.dumps(py_components_config_dict)
         Path(temp_config_file.name).write_text(json_str)
         temp_config_file.flush()
         source = create_declarative_source(
             ["check", "--config", temp_config_file.name],
         )
+    finally:
+        os.unlink(temp_config_file.name)

Also applies to: 174-182, 267-274


298-301: Specify exception type in except block.

The bare except could catch and mask important exceptions like KeyboardInterrupt. Should we specify the expected exception type? Wdyt?

-            with pytest.raises(Exception):
+            with pytest.raises((ValueError, RuntimeError)):  # specify expected exceptions
                 for msg in msg_iterator:
                     assert msg
             return

237-241: Consider creating a mock source for testing.

The TODO comment suggests creating a test source that doesn't require credentials. This would make the tests more reliable and faster. Would you like me to help create a mock source implementation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6b28b6 and f29f616.

📒 Files selected for processing (1)
  • unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
unit_tests/source_declarative_manifest/test_source_declarative_w_custom_components.py (2)

1-45: LGTM! Well-organized imports and clear test data.

The imports are comprehensive and properly typed, and the sample component code provides a good test case.


102-120: Consider adding error handling for YAML parsing.

The YAML parsing could fail with malformed files. Should we add try-except blocks to provide more informative error messages? Wdyt?

-    manifest_dict = yaml.safe_load(manifest_yml_path.read_text())
+    try:
+        manifest_dict = yaml.safe_load(manifest_yml_path.read_text())
+    except yaml.YAMLError as e:
+        raise ValueError(f"Failed to parse manifest file: {e}")

@aaronsteers aaronsteers merged commit 3a9ab87 into main Jan 22, 2025
19 of 20 checks passed
@aaronsteers aaronsteers deleted the aj/feat/accept-components-text-input branch January 22, 2025 02:35
rpopov pushed a commit to rpopov/airbyte-python-cdk that referenced this pull request Jan 23, 2025
* remotes/airbyte/main:
  fix(airbyte-cdk): Fix RequestOptionsProvider for PerPartitionWithGlobalCursor (airbytehq#254)
  feat(low-code): add profile assertion flow to oauth authenticator component (airbytehq#236)
  feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor (airbytehq#111)
  fix: don't mypy unit_tests (airbytehq#241)
  fix: handle backoff_strategies in CompositeErrorHandler (airbytehq#225)
  feat(concurrent cursor): attempt at clamping datetime (airbytehq#234)
  ci: use `ubuntu-24.04` explicitly (resolves CI warnings) (airbytehq#244)
  Fix(sdm): module ref issue in python components import (airbytehq#243)
  feat(source-declarative-manifest): add support for custom Python components from dynamic text input (airbytehq#174)
  chore(deps): bump avro from 1.11.3 to 1.12.0 (airbytehq#133)
  docs: comments on what the `Dockerfile` is for (airbytehq#240)
  chore: move ruff configuration to dedicated ruff.toml file (airbytehq#237)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants