Ducklake support in DuckDB scripts #6035

Draft: wants to merge 10 commits into main

diegoimbert
Contributor

@diegoimbert diegoimbert commented Jun 24, 2025

Ducklake works with these syntaxes:

ATTACH 'ducklake:windmill' AS mylake (DATA_PATH 's3:///datalake');
USE mylake;
ATTACH 'ducklake:postgres:$res:u/my/db_resource' AS mylake (DATA_PATH 's3:/storage2//datalake');
USE mylake;

Everything works great with Postgres and S3
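
As a rough illustration of the two forms above, the sketch below classifies the quoted URI that follows ATTACH. The enum and function names are hypothetical, and this is not the PR's implementation (the review comments indicate the PR uses a regex in duckdb_executor.rs); it only shows which shapes are accepted.

```rust
/// Catalog kinds accepted after the `ducklake:` prefix, per the PR description.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, PartialEq)]
enum DucklakeCatalog {
    /// `ducklake:windmill`: use the Windmill instance database.
    Windmill,
    /// `ducklake:postgres:$res:<path>`: use a Postgres resource.
    PostgresResource(String),
}

/// Classify the string literal that follows `ATTACH`.
fn classify_ducklake_uri(uri: &str) -> Result<DucklakeCatalog, String> {
    let rest = uri
        .strip_prefix("ducklake:")
        .ok_or_else(|| format!("not a ducklake URI: {uri}"))?;
    if rest == "windmill" {
        return Ok(DucklakeCatalog::Windmill);
    }
    if let Some(res) = rest.strip_prefix("postgres:") {
        return match res.strip_prefix("$res:") {
            // A non-empty resource path is required after `$res:`.
            Some(path) if !path.is_empty() => {
                Ok(DucklakeCatalog::PostgresResource(path.to_string()))
            }
            _ => Err(format!("expected a $res:<path> reference, got: {res}")),
        };
    }
    Err(format!("unsupported ducklake catalog: {rest}"))
}
```

Calling `classify_ducklake_uri("ducklake:postgres:$res:u/my/db_resource")` yields the `PostgresResource` variant carrying `u/my/db_resource`; anything else after the prefix is rejected with an error.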

Important points:

  • I rely on DATABASE_URL being set in the workers, to avoid exposing the Windmill db credentials through an endpoint
  • Azure does not work as storage, but that is out of my control: AzureBlobStorageFileSystem: DirectoryExists is not implemented!
  • MySQL does not work as a catalog either, for a reason similar to "MySQL catalog fails the second time it is attached" (duckdb/ducklake#214)
  • Ducklake creates quite a few tables (see below)
    [Screenshot 2025-06-24 at 10 06 29]

Important

Add Ducklake support in DuckDB scripts with new syntax, API endpoints, and UI settings.

  • Behavior:
    • Support for Ducklake syntax in DuckDB scripts, allowing ATTACH 'ducklake:windmill' and ATTACH 'ducklake:postgres:$res:u/my/db_resource'.
    • New endpoints ducklake_catalog_exists and init_ducklake_catalog_db in configs.rs to manage Ducklake catalog.
    • Azure storage not supported due to DirectoryExists not implemented.
  • Dependencies:
    • Update duckdb version in Cargo.toml from 1.2.2 to 1.3.1.
  • Frontend:
    • Add DucklakeSettings.svelte for managing Ducklake settings in the UI.
    • Update InstanceSettings.svelte and instanceSettings.ts to include Ducklake settings.
  • Misc:
    • Add parse_postgres_url() function in lib.rs to parse PostgreSQL URLs.
    • Add POWERSHELL_INSTALL_CODE in bash_executor.rs for PowerShell module management.

This description was created by Ellipsis for bb3ca52. You can customize this summary. It will automatically update as commits are pushed.

cloudflare-workers-and-pages bot commented Jun 24, 2025

Deploying windmill with Cloudflare Pages

Latest commit: eeb78e6
Status: ✅  Deploy successful!
Preview URL: https://fbcade08.windmill.pages.dev
Branch Preview URL: https://di-ducklake.windmill.pages.dev

@diegoimbert diegoimbert marked this pull request as ready for review June 24, 2025 08:46
@rubenfiszel
Contributor

/ai review this PR

Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Caution

Changes requested ❌

Reviewed everything up to 1389172 in 2 minutes and 28 seconds.
  • Reviewed 306 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 draft comments (listed below).
1. backend/Cargo.toml:247
  • Draft comment:
    Upgrade to duckdb 1.3.1 looks intentional. Please double-check that the new version is fully backwards‐compatible with existing queries and features.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to double-check compatibility, which violates the rule against asking for confirmation or verification of intentions. It also relates to a dependency change, which should not be commented on unless it's a recognized issue.
2. backend/windmill-worker/src/duckdb_executor.rs:597
  • Draft comment:
    The regex used in transform_attach_ducklake is hard-coded. Validate that it robustly matches all supported ducklake attachment syntaxes and consider documenting the expected format.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The regex pattern is indeed hard-coded and parses a specific format. However, this appears to be an internal implementation detail for parsing a specific SQL syntax. The code handles the parsing results appropriately with error messages. The comment is suggesting documentation but doesn't point out any actual issues with the implementation. The regex could potentially miss some valid ducklake attachment syntaxes, and documentation would help future maintainers understand the expected format. The code already has clear error handling and the format appears to be an internal implementation detail rather than a public API that needs documentation. The regex pattern itself is relatively simple and its purpose is clear from the surrounding code. The comment should be deleted as it suggests documentation for an internal implementation detail without identifying any actual issues. The code handles parsing and errors appropriately.
3. backend/windmill-worker/src/duckdb_executor.rs:605
  • Draft comment:
    When handling the 'windmill' db_type, defaulting to 'NO_USER'/'NO_PWD' may hide missing credential issues. Consider failing fast or logging a warning if credentials are absent.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% This is a new code block that handles Postgres URL parsing for DuckDB attachments. The concern about masking credential issues is valid - using placeholder values could hide real authentication problems. However, this appears to be an intentional design choice to handle optional credentials in a Postgres URL. The code is properly handling Option types and providing fallback values that would be clear to debug. The comment raises a legitimate security concern about error visibility. However, I may be overlooking whether these default values are actually problematic in practice - perhaps they're expected to work in some valid use cases. Given that this is handling optional fields in a URL parser, and the defaults are clearly marked as missing values rather than empty strings, this appears to be a reasonable implementation that won't hide real issues. The comment should be deleted. While it raises a valid concern, the current implementation appears to be an intentional and reasonable design choice for handling optional credentials in URLs.
4. backend/windmill-worker/src/duckdb_executor.rs:619
  • Draft comment:
    Enforcing that resource paths start with '$res:' is clear; consider adding a function-level doc-comment describing this requirement for maintainability.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% The comment is about code documentation rather than a bug or issue. While documentation is good, the requirement is already very clear from the error message and validation check in the code. The comment doesn't point out any actual problems that need fixing. Adding a doc comment would be nice-to-have but not essential since the code is self-documenting through its validation and error message. The requirement for '$res:' prefix could be considered an important API contract that should be documented at the function level. Future maintainers might benefit from seeing this requirement up front before reading the implementation. While documentation is valuable, the code already clearly communicates this requirement through explicit validation and a descriptive error message. The comment doesn't identify a real issue that needs fixing. This comment should be deleted as it suggests purely informational documentation without identifying any actual problems that need to be fixed. The code is already clear about the requirement.
5. backend/windmill-worker/src/duckdb_executor.rs:633
  • Draft comment:
    Using query.replacen with a formatted literal to replace the ducklake attachment string might be brittle if spacing or formatting vary. Ensure the input strictly follows the expected syntax.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_edJ9eEs9JDwldR2e

}

pub fn parse_postgres_url(url: &str) -> Result<PostgresUrlComponents, Error> {
let parsed_url = url::Url::parse(url).map_err(|_| Error::BadConfig("Invalid PostgreSQL URL".to_string()))?;
Contributor

In parse_postgres_url(), consider including more context from the underlying URL parse error to improve debugging, rather than using a generic message.

Suggested change
- let parsed_url = url::Url::parse(url).map_err(|_| Error::BadConfig("Invalid PostgreSQL URL".to_string()))?;
+ let parsed_url = url::Url::parse(url).map_err(|e| Error::BadConfig(format!("Invalid PostgreSQL URL: {e}")))?;

Contributor

claude bot commented Jun 24, 2025

Claude finished @rubenfiszel's task


PR Review: Ducklake Support in DuckDB Scripts

Reviewing the code changes for Ducklake integration...

Todo List:

  • Read and analyze changed files
  • Review implementation for bugs and security issues
  • Check adherence to coding standards
  • Provide comprehensive feedback

Overall Assessment

This PR adds solid support for Ducklake syntax in DuckDB scripts. The implementation follows existing patterns well and integrates cleanly into the current query processing pipeline. However, there are several areas that need attention before merging.

Detailed Review

Strengths

  1. Good Architecture: The feature integrates well with the existing transform_attach_db_resource_query pattern
  2. Comprehensive Error Handling: Most edge cases are covered with appropriate error messages
  3. Follows Existing Patterns: Reuses existing resource resolution and database connection logic
  4. Clear Documentation: The PR description clearly explains supported syntax

⚠️ Issues & Concerns

High Priority

  1. Credential Security Issue (duckdb_executor.rs:610-612)

    o.username.as_ref().map(String::as_str).unwrap_or("NO_USER"),
    o.password.as_ref().map(String::as_str).unwrap_or("NO_PWD"),

    Problem: Using placeholder values like "NO_USER"/"NO_PWD" could mask authentication failures and potentially create confusing error messages downstream.

    Recommendation: Consider failing fast if required credentials are missing, or at minimum log a warning when using placeholders.

  2. Error Context Loss (lib.rs:287)

    let parsed_url = url::Url::parse(url).map_err(|_| Error::BadConfig("Invalid PostgreSQL URL".to_string()))?;

    Problem: The underlying URL parsing error details are lost, making debugging harder.

    Recommendation: Include the original error in the message:

    let parsed_url = url::Url::parse(url).map_err(|e| Error::BadConfig(format!("Invalid PostgreSQL URL: {}", e)))?;
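
    A minimal sketch of the fail-fast option from issue 1, assuming the parsed URL exposes optional username and password fields as in the snippet quoted there. `PgCredentials` and `require_credentials` are hypothetical names, not the PR's types. Treating a missing password as an error is also an assumption, since some Postgres setups authenticate without one.

```rust
/// Hypothetical stand-in for the parsed URL components used in the PR.
struct PgCredentials {
    username: Option<String>,
    password: Option<String>,
}

/// Return (user, password), or a descriptive error instead of a placeholder
/// such as "NO_USER" / "NO_PWD".
fn require_credentials(c: &PgCredentials) -> Result<(&str, &str), String> {
    let user = c
        .username
        .as_deref()
        .filter(|u| !u.is_empty())
        .ok_or_else(|| "database URL has no username".to_string())?;
    // Assumption: a password is mandatory in this deployment.
    let password = c
        .password
        .as_deref()
        .ok_or_else(|| "database URL has no password".to_string())?;
    Ok((user, password))
}
```

    With this shape, a misconfigured URL surfaces as an error at the point where the ATTACH statement is built rather than as an authentication failure for a user literally named NO_USER.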

Medium Priority

  1. Regex Robustness (duckdb_executor.rs:597)

    static ref RE: regex::Regex = regex::Regex::new(r"ATTACH 'ducklake:([^':]+):?([^']*)'").unwrap();

    Problem: The regex is fairly simple and might not handle all variations of SQL syntax (whitespace, case variations, etc.).

    Recommendation: Consider more robust parsing or add validation that the input strictly follows expected syntax.

  2. String Replacement Fragility (duckdb_executor.rs:617, 632-636)

    query.replacen("ATTACH 'ducklake:windmill'", &attach_str, 1)

    Problem: Using string replacement could fail if there are multiple similar patterns or slight formatting differences.

    Recommendation: Consider using the regex capture groups to replace the exact matched text.

  3. Code Duplication
    The get_attach_db_conn_str function is called from both old and new code paths with slightly different parameter structures. Consider refactoring for better reusability.
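
    Medium-priority items 1 and 2 can be addressed together by locating the statement once and then replacing exactly that span. The sketch below does this with the standard library only, as a stand-in for the PR's regex, so that keyword case and extra whitespace are tolerated. Both function names are hypothetical. It does not check for a word boundary before ATTACH and does not skip SQL comments or other string literals.

```rust
use std::ops::Range;

/// Find the first `ATTACH 'ducklake:...'` occurrence, tolerating keyword case
/// and any run of whitespace between `ATTACH` and the literal.
/// Returns the byte range of the whole match and the text inside the quotes.
fn find_ducklake_attach(query: &str) -> Option<(Range<usize>, &str)> {
    // ASCII uppercasing keeps byte offsets identical to `query`.
    let upper = query.to_ascii_uppercase();
    let mut from = 0;
    while let Some(pos) = upper[from..].find("ATTACH") {
        let start = from + pos;
        let after_kw = start + "ATTACH".len();
        let rest = &query[after_kw..];
        let trimmed = rest.trim_start();
        let ws = rest.len() - trimmed.len();
        if ws > 0 {
            if let Some(body) = trimmed.strip_prefix("'ducklake:") {
                if let Some(close) = body.find('\'') {
                    let uri_start = after_kw + ws + 1; // skip the opening quote
                    let uri_end = uri_start + "ducklake:".len() + close;
                    // The match span includes the closing quote.
                    return Some((start..uri_end + 1, &query[uri_start..uri_end]));
                }
            }
        }
        from = after_kw;
    }
    None
}

/// Replace exactly the matched span, leaving the rest of the query untouched.
fn replace_match(query: &str, span: Range<usize>, replacement: &str) -> String {
    let mut out = String::with_capacity(query.len() + replacement.len());
    out.push_str(&query[..span.start]);
    out.push_str(replacement);
    out.push_str(&query[span.end..]);
    out
}
```

    Because the replacement uses the span returned by the matcher, it cannot drift from what was actually matched, which is the concern raised about `replacen` with a re-formatted literal.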

Low Priority

  1. Missing Documentation

    • The transform_attach_ducklake function could benefit from documentation explaining the expected URI format
    • The PostgresUrlComponents struct is public but lacks documentation
  2. Error Message Consistency
    Error messages should be consistent in format and detail level across the codebase.

Code Quality & Standards

  • ✅ Follows Rust best practices with proper error handling using Result<T, Error>
  • ✅ Uses appropriate async/await patterns
  • ✅ Leverages existing error types from windmill_common::error
  • ✅ Good use of lazy_static for regex compilation
  • ⚠️ Could use more comprehensive testing (no test changes in PR)

Security Considerations

  • ✅ Validates resource path requirements ($res: prefix)
  • ✅ Uses existing resource resolution mechanisms
  • ⚠️ The credential placeholder approach needs review (see High Priority issue 1 above)
  • ✅ No apparent SQL injection risks (uses parameterized queries)

Performance Impact

  • ✅ Minimal performance impact - regex compilation is cached
  • ✅ Processing is only done when ducklake syntax is detected
  • ✅ Reuses existing connection and resource caching

Recommendations

  1. Fix credential handling - Address the "NO_USER"/"NO_PWD" placeholder issue
  2. Improve error messages - Include original error context in URL parsing
  3. Add validation - Ensure input strictly follows expected syntax format
  4. Consider adding tests - Unit tests for the new parsing functions would increase confidence
  5. Document the feature - Add function-level documentation for public APIs

Conclusion

This is a solid implementation that adds valuable functionality. The main concerns are around credential handling and error messaging, which should be addressed before merging. The architectural approach is sound and follows existing patterns well.

Recommendation: Request changes to address the credential security issue and error context loss, then approve.


Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Caution

Changes requested ❌

Reviewed bb3ca52 in 2 minutes and 21 seconds.
  • Reviewed 185 lines of code in 6 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 draft comments (listed below).
1. backend/windmill-api/openapi.yaml:11075
  • Draft comment:
    New ducklake endpoints added. Ensure that their authentication and documentation clearly state any permission requirements.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
2. backend/windmill-worker/src/duckdb_executor.rs:607
  • Draft comment:
    Hardcoding 'dbname=ducklake_catalog' in the ATTACH statement seems intentional; ensure that the default port fallback (5432) is acceptable for all deployments.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
3. backend/windmill-worker/src/duckdb_executor.rs:617
  • Draft comment:
    When processing the resource path in transform_attach_ducklake, consider trimming whitespace before checking that it starts with '$res:'.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
4. frontend/src/lib/components/InstanceSettings.svelte:359
  • Draft comment:
    New 'Ducklake' tab added to the settings UI. Verify that its user experience and integration are consistent with the rest of the settings.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
5. frontend/src/lib/components/instanceSettings.ts:404
  • Draft comment:
    A new empty settings category 'Ducklake' is added. Ensure that any future Ducklake-specific settings are populated as needed.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
6. frontend/src/lib/components/settings/DucklakeSettings.svelte:10
  • Draft comment:
    The DucklakeSettings UI for initializing the ducklake_catalog database looks good. Consider if additional persistent error handling or user guidance is needed for repeated failures.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
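
Draft comment 2 above mentions a hard-coded dbname=ducklake_catalog and a default port of 5432. A sketch of how such an attach target might be assembled is shown below. `PgParams` and `ducklake_attach_target` are hypothetical, the key=value layout follows the libpq-style form used for Postgres catalogs, and values containing spaces or quotes are not escaped here.

```rust
/// Hypothetical connection parameters; not the PR's actual types.
struct PgParams<'a> {
    host: &'a str,
    port: Option<u16>,
    user: &'a str,
    password: &'a str,
}

/// Build the string that goes inside the quotes of
/// `ATTACH '<target>' AS mylake (...)` for a Postgres-backed catalog.
fn ducklake_attach_target(p: &PgParams) -> String {
    // Fall back to the Postgres default port when the URL omits one.
    let port = p.port.unwrap_or(5432);
    format!(
        "ducklake:postgres:dbname=ducklake_catalog host={} port={} user={} password={}",
        p.host, port, p.user, p.password
    )
}
```

Whether 5432 is an acceptable fallback for every deployment is exactly the open question in that draft comment; the sketch only makes the fallback explicit.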

Workflow ID: wflow_V5QU1BVoSJojGXGL

@@ -245,3 +247,25 @@ async fn list_configs() -> error::JsonResult<String> {
"Config listing available only in the enterprise version".to_string(),
))
}

async fn ducklake_catalog_exists(
_authed: ApiAuthed,
Contributor

The ducklake_catalog_exists endpoint accepts an '_authed' parameter but does not enforce any role checks. Consider adding an auth check (or document why it's left open).

_authed: ApiAuthed,
Extension(db): Extension<DB>,
) -> error::Result<()> {
sqlx::query!("CREATE DATABASE ducklake_catalog")
Contributor

The init_ducklake_catalog_db function executes 'CREATE DATABASE ducklake_catalog' without checking if it exists. Consider graceful handling for repeated initialization.
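
Postgres has no CREATE DATABASE IF NOT EXISTS, so one option is to treat the duplicate_database error (SQLSTATE 42P04) as success. The sketch below shows only that decision logic. How the SQLSTATE is obtained from the database driver is left out, and `InitOutcome` and `interpret_create_result` are hypothetical names.

```rust
/// SQLSTATE that Postgres reports when the database already exists.
const DUPLICATE_DATABASE: &str = "42P04";

#[derive(Debug, PartialEq)]
enum InitOutcome {
    Created,
    AlreadyExisted,
}

/// Interpret the result of `CREATE DATABASE ducklake_catalog`.
/// `Err` carries the SQLSTATE code of the failed statement, if one was reported.
fn interpret_create_result(
    result: Result<(), Option<String>>,
) -> Result<InitOutcome, String> {
    match result {
        Ok(()) => Ok(InitOutcome::Created),
        // Repeated initialization is not an error.
        Err(Some(code)) if code == DUPLICATE_DATABASE => Ok(InitOutcome::AlreadyExisted),
        Err(Some(code)) => Err(format!("CREATE DATABASE failed with SQLSTATE {code}")),
        Err(None) => Err("CREATE DATABASE failed without a SQLSTATE".to_string()),
    }
}
```

An alternative with the same effect is to query pg_database for the name first, which is presumably what the ducklake_catalog_exists endpoint is for; the error-code approach avoids a race between the check and the create.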

@diegoimbert diegoimbert marked this pull request as draft June 24, 2025 17:45
3 participants