Skip to content

Conversation

@tonyalaribe
Copy link
Contributor

Closes #

How to test

Checklist

  • Make sure you have described your changes and added all relevant screenshots or data.
  • Make sure your changes are tested (stories and/or unit, integration, or end-to-end tests).
  • Make sure to add/update documentation regarding your changes (or request one from the team).
  • You are NOT deprecating/removing a feature.

@claude
Copy link

claude bot commented Dec 24, 2025

Code Review: Improve CI Behavior

Summary

This PR makes sensible improvements to the CI/CD workflow by separating slow integration tests and removing deployment triggers from pull requests. Overall, this is a positive change that should improve CI feedback time and workflow clarity.


🟢 Strengths

1. Better CI Performance (.github/workflows/ci.yml:96-102)

Separating regular tests from ignored tests is a good practice:

  • Fast tests provide quick feedback to developers
  • Slow integration tests (marked with #[ignore]) run separately and don't block the main test suite
  • The continue-on-error: true for ignored tests is pragmatic, though see concerns below

2. Logical Workflow Separation (.github/workflows/deploy.yml:3-5)

Removing pull_request trigger from the deploy workflow makes sense:

  • Deployments should only happen on master branch pushes
  • PRs shouldn't trigger deployments (waste of resources, potential security concern)
  • This follows CI/CD best practices

🟡 Concerns & Suggestions

1. Ignored Tests Being Optional (.github/workflows/ci.yml:99-102)

- name: Run ignored tests (optional)
  continue-on-error: true
  timeout-minutes: 10
  run: cargo test --all-features -- --ignored

Concern: The continue-on-error: true means these tests can fail silently. Looking at the codebase, the ignored tests include:

  • test_postgres_integration (tests/integration_test.rs:102)
  • test_concurrent_postgres_requests (tests/integration_test.rs:174)
  • test_update_operations (tests/integration_test.rs:276)
  • test_delete_operations (tests/integration_test.rs:356)
  • run_sqllogictest (tests/sqllogictest.rs:221)

These are critical integration tests for core functionality (CRUD operations, concurrency).

Recommendations:

  1. Consider making ignored tests required on master/main branches:

    - name: Run ignored tests (optional)
      continue-on-error: ${{ github.event_name == 'pull_request' }}
      timeout-minutes: 10
      run: cargo test --all-features -- --ignored

    This way PRs get fast feedback, but master branch ensures full test coverage.

  2. Alternative: Run ignored tests on a schedule (nightly) to catch regressions:

    on:
      schedule:
        - cron: '0 2 * * *'  # 2 AM daily
  3. Add visibility: Consider adding a step to report test results even when they fail:

    - name: Upload test results
      if: always()
      uses: actions/upload-artifact@v3
      with:
        name: ignored-test-results
        path: target/debug/test-results/

2. Timeout Value (.github/workflows/ci.yml:101)

The 10-minute timeout seems reasonable for integration tests, but verify it's sufficient:

  • Check historical CI runs to ensure tests complete within this window
  • If tests are timing out, investigate whether they're hanging or legitimately slow

3. Missing Context in PR Description

The PR description is incomplete:

  • No description of what changes were made
  • Checklist items are unchecked
  • No linked issue (Closes #)
  • No test instructions

Recommendation: Update the PR description to explain:

  • Why this change is needed (CI was too slow? Tests were flaky?)
  • What the expected behavior is now
  • Whether the ignored tests will be monitored elsewhere

🟢 Best Practices Observed

  1. ✅ Maintains all existing test coverage - just reorganizes when tests run
  2. ✅ Uses timeout-minutes to prevent hung tests from blocking CI indefinitely
  3. ✅ Preserves the --all-features flag for comprehensive testing
  4. ✅ Follows GitHub Actions best practices with clear step names

🔒 Security & Performance

  • Security: ✅ No security concerns. Removing deployment from PRs is actually a security improvement
  • Performance: ✅ Should significantly improve CI feedback time by not running slow integration tests on every commit
  • Reliability: ⚠️ Potential concern if ignored tests start failing silently (see recommendations above)

📊 Test Coverage Impact

Current state:

  • 6 tests are marked with #[ignore] across 3 test files
  • These tests cover critical functionality: Postgres integration, concurrent requests, CRUD operations
  • With this PR, these tests become optional rather than required

Risk: Core integration tests could break without immediate visibility.

Mitigation: Implement one of the recommendations above to ensure these tests still run and are monitored, even if not on every PR commit.


✅ Verdict

Approve with recommendations. The changes are sound and improve CI efficiency. However, I strongly recommend addressing the test coverage visibility concern to ensure critical integration tests don't silently fail.

Priority: Consider implementing conditional continue-on-error or scheduled test runs before merging.

- Fix delta-rs write hang in PGWire context by using block_in_place
  with Handle::block_on to avoid tokio runtime conflicts
- Apply same fix to DML operations (UPDATE/DELETE) in dml.rs
- Fix to_json function to properly parse JSON strings into objects
  instead of double-escaping them
- Refactor at_time_zone and to_char UDFs to use flexible type
  signatures (ScalarUDFImpl with Signature::any) instead of strict
  type matching that caused runtime failures
- Fix at_time_zone return type mismatch (was returning None but
  promising UTC timezone annotation)
- Update sqllogictest expected values for proper JSON output format
- Fix custom_functions.slt query column count (query T -> query TT)
- Clean up debug eprintln! statements from production code
- Add test for sequential SQL inserts matching integration pattern
@claude
Copy link

claude bot commented Dec 24, 2025

Code Review: PR #11 - Improve CI Behavior

Overview

This PR addresses critical runtime issues with PGWire query handling and improves CI workflow efficiency. The changes include fixing tokio runtime conflicts, refactoring UDF implementations for DataFusion compatibility, and reorganizing the CI test execution strategy.

Verdict: ✅ Approve with Minor Recommendations


🔍 Detailed Analysis

1. Critical Fix: Tokio Runtime Conflicts (src/database.rs, src/dml.rs)

The Problem

The most important change addresses a deadlock/hang issue when handling PGWire queries that involve delta-rs operations. The issue occurs when:

  1. PGWire spawns a task to handle an incoming query
  2. That task uses the tokio runtime to execute async operations
  3. Delta-rs internally tries to spawn its own tokio runtime
  4. Nested runtime conflict → hang/deadlock

The Solution (src/database.rs:1240-1253)

The fix uses tokio::task::block_in_place to tell tokio "I'm going to block here, spawn other tasks elsewhere" which prevents the current worker thread from being used for other tasks and allows delta-rs to safely use block_on without deadlocking. This is the correct pattern for calling blocking code from async context.

Applied consistently:

  • insert_records_batch() in database.rs:1242
  • perform_delta_operation() in dml.rs:456
  • ✅ Both UPDATE and DELETE operations use the same pattern

Assessment:Excellent fix - This is a proper solution to a complex concurrency issue.


2. UDF Refactoring (src/functions.rs)

Changes Summary

Migrated from deprecated create_udf() to the new ScalarUDFImpl trait:

  • to_char()ToCharUDF struct (lines 56-126)
  • at_time_zone()AtTimeZoneUDF struct (lines 189-267)

Improvements:

  1. Better type safety: Signature validation at compile time
  2. DataFusion compatibility: Uses newer API that's actively maintained
  3. More flexible: Signature::any(2, ...) allows runtime type checking
  4. Added array argument support in addition to scalar values (lines 107-118)

Potential Issue: ⚠️ Using Signature::any(2, ...) accepts any two arguments, then validates at runtime. While this works, it prevents early query planning errors.

Recommendation: Consider using Signature::one_of() to specify acceptable type combinations for better error messages at query planning time.

Assessment: ✅ Good modernization, with room for minor improvement.


3. Logging Cleanup (src/database.rs)

Removed excessive debug logging:

  • Per-batch debug logs
  • Schema validation debug logs
  • Version tracking debug logs

Why this is good:

  • ✅ Reduces log noise in production
  • ✅ Tracing spans still provide observability where needed
  • ✅ Keeps critical error logs

New diagnostic logging added (lines 974-976, 1161-1163)

Assessment: ⚠️ The new info! logs should probably be debug! or trace!. They'll be noisy in production when creating many tables.


4. CI/CD Workflow Changes

.github/workflows/ci.yml

Pros:

  • ✅ Fast feedback on PRs (regular tests run quickly)
  • ✅ Timeout prevents hung tests from blocking CI forever
  • ✅ Separation of concerns (fast vs. slow tests)

Cons:

  • ⚠️ Integration tests can fail silently (continue-on-error: true)
  • ⚠️ No mechanism to track which ignored tests are failing

Ignored tests include critical functionality:

  • test_postgres_integration - Core PGWire integration
  • test_concurrent_postgres_requests - Concurrency handling
  • test_update_operations / test_delete_operations - CRUD operations
  • run_sqllogictest - SQL compatibility tests

Recommendation: Make ignored tests required on master branch or run them on a schedule.

.github/workflows/deploy.yml

Removed pull_request trigger - deployments now only on master pushes.

Assessment: ✅ Correct - PRs shouldn't trigger deployments.


5. Test Coverage (tests/integration_test.rs)

Added 3 new comprehensive test scenarios (+91 lines):

  1. test_sequential_sql_inserts() - Simulates integration test pattern of sequential SQL inserts
  2. test_spawned_database_insert() - Tests the exact pattern that was causing hangs
  3. test_cross_spawn_insert_creates_table() - Most comprehensive, simulates the exact PGWire pattern

Assessment:Excellent test coverage - These tests directly validate the fix and prevent regressions.


🔒 Security Analysis

No security vulnerabilities introduced:

  • ✅ No SQL injection risks (uses parameterized queries)
  • ✅ No authentication/authorization changes
  • ✅ No exposed secrets or credentials
  • block_in_place is safe (standard tokio pattern)

🚀 Performance Considerations

Positive impacts:

  • ✅ Fixing runtime conflicts improves throughput (no more hangs)
  • ✅ Reduced logging overhead
  • ✅ CI runs faster for developers

Potential concerns:

  • ⚠️ block_in_place temporarily blocks a worker thread
    • Mitigation: This is acceptable for I/O-bound delta operations
  • ⚠️ Cloning data before block_in_place (lines 1235-1238)
    • Impact: Minor - batches are Arc<> internally, so cheap to clone

Assessment: ✅ Performance impact is net positive.


🎯 Recommendations

High Priority

  1. Update PR description - Add summary of what was fixed (runtime deadlock), testing performed, and link to related issue
  2. Change info! to debug! for diagnostic logs (database.rs:977, 983, 987, 1161, 1163)
  3. Make ignored tests required on master (or schedule them)

Medium Priority

  1. Improve error handling - Don't silently drop update_state() errors
  2. Consider stricter UDF signatures - Use Signature::one_of() instead of any()

Low Priority

  1. Add execution time assertions to new tests (not just timeouts)
  2. Consider extracting block_in_place pattern to a helper function to reduce duplication

✅ Summary

This PR solves a critical production issue (PGWire query hangs) with a well-understood, correct solution. The test coverage is excellent, and the CI improvements are sensible.

The main concern is making ignored tests optional - these cover critical functionality and shouldn't fail silently.

Recommendation: Approve and merge after addressing the logging level issue and CI test strategy.


Files reviewed:

  • ✅ .github/workflows/ci.yml (CI improvements)
  • ✅ .github/workflows/deploy.yml (Deployment trigger fix)
  • ✅ src/database.rs (Runtime fix + logging cleanup)
  • ✅ src/dml.rs (Runtime fix for UPDATE/DELETE)
  • ✅ src/functions.rs (UDF modernization)
  • ✅ src/object_store_cache.rs (Minor logging)
  • ✅ tests/integration_test.rs (Excellent new test coverage)
  • ✅ tests/slt/*.slt (Test updates)

block_in_place requires multi-threaded tokio runtime to work.
@claude
Copy link

claude bot commented Dec 25, 2025

Pull Request Review: Improve CI Behavior

Summary

This PR makes significant changes to address concurrency and runtime issues in the database layer, particularly around Delta Lake operations in PGWire contexts. The changes primarily focus on using tokio::task::block_in_place to avoid nested runtime conflicts and improving CI test handling.


Critical Issues

1. Potential Deadlock in perform_delta_operation (src/dml.rs:451-464)

Location: src/dml.rs:451-464

Issue: The code acquires a write lock, then drops it, then immediately acquires it again. Between dropping the first lock and acquiring the second, another task could acquire the lock and modify the table, leading to lost updates or inconsistent state.

Recommendation: Hold the write lock for the entire operation to prevent race conditions.


2. Silently Ignoring update_state() Errors (src/database.rs:1232)

Location: src/database.rs:1232

Issue: let _ = table.update_state().await; completely ignores errors, which could lead to writing to a stale table version, potentially causing data loss or corruption in concurrent scenarios.

Recommendation: At minimum, log the error with warn! or debug! level.


3. Incomplete PR Description

The PR description template is unfilled with no explanation of what CI issue is being fixed, why block_in_place is needed, or what problem the extensive test additions are addressing.

Recommendation: Fill out the PR description with context about the nested runtime issue being solved and the integration test hanging problem.


Major Concerns

4. Missing Error Context in block_in_place Usage

Locations: src/database.rs:1242-1253, src/dml.rs:456-459

Using block_in_place can panic if called from a single-threaded runtime or outside a runtime context. No error handling or documentation explains the runtime requirements.

Recommendation: Add documentation explaining why block_in_place is required and document that this requires a multi-threaded tokio runtime.


5. Excessive Debug Logging Removal

Locations: Throughout src/database.rs

Many useful debug logs were removed, making debugging production issues significantly harder, especially for the complex concurrency scenarios this PR is trying to fix.

Recommendation: Keep debug-level logs but perhaps reduce info-level logs. Debug logs have minimal performance impact and are invaluable for troubleshooting.


6. Function Signature Changes (src/functions.rs)

Complete rewrite from simple function implementation to full ScalarUDFImpl trait implementation, making the code much more complex (from ~30 lines to ~100+ lines per function).

Questions: What specific bug does this fix? Why is the array format string handling needed? Is there a DataFusion version requirement for this change?


Moderate Issues

7. CI Test Separation (.github/workflows/ci.yml:96-101)

Splitting ignored tests into a separate step with continue-on-error: true hides test failures rather than fixing them. The timeout is good, but continuing on error may hide real regressions.


8. Deploy Workflow Change (.github/workflows/deploy.yml:5-6)

Removed PR trigger from deployment workflow. Was this intentional? If the workflow was running on PRs for testing deployment, this removes that safety check.


Positive Aspects

  1. Good test coverage expansion - New integration tests directly test the problematic patterns (spawned database insert, cross-spawn query execution, on-demand table creation)

  2. Proper use of block_in_place - This is the correct solution for avoiding nested runtime conflicts with delta-rs

  3. Informative logging during investigation - Shows good troubleshooting methodology


Security Considerations

No critical security issues identified. The changes are primarily internal concurrency fixes. The unsafe block in tests for setting environment variables is appropriate for test code.


Test Coverage

Good: New tests directly address the integration test hanging issue with three specific scenarios.

Gap: Missing tests for the deadlock scenario in perform_delta_operation, concurrent writes with retry logic, and error cases in update_state().


Recommendations Summary

Must Fix Before Merge:

  1. Fix the deadlock potential in perform_delta_operation (dml.rs:451-464)
  2. Handle or log update_state() errors (database.rs:1232)
  3. Fill out the PR description with context and rationale

Should Fix:

  1. Restore debug logging
  2. Document why block_in_place is necessary and runtime requirements
  3. Document or fix the ignored test behavior in CI

Nice to Have:

  1. Add documentation for the UDF pattern change reasoning
  2. Add more comprehensive assertions to integration tests
  3. Consider test timeout standardization

Overall Assessment

This PR addresses a real and important issue with nested tokio runtimes in the PGWire/delta-rs interaction. The block_in_place solution is appropriate. However, the implementation has a critical deadlock bug in dml.rs and removes valuable debugging capabilities. The PR description needs significant improvement.

Recommendation: Request changes to address the critical deadlock issue and improve documentation before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants