Add Airbyte-Couchbase Integration Tutorial #76
base: main
Add detailed tutorial covering:
- Couchbase as source and destination
- All sync modes and configuration options
- Common integration patterns
- Performance optimization and troubleshooting
- Capella-focused with 35-minute read time
Summary of Changes

Hello @teetangh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, in-depth tutorial designed to guide users through the process of integrating Airbyte with Couchbase. The tutorial provides step-by-step instructions for configuring Couchbase as both a data source and a destination, covering essential aspects from initial setup and sync mode selection to advanced integration patterns and troubleshooting. Its purpose is to empower users to build robust data pipelines leveraging Airbyte's capabilities with Couchbase, ensuring efficient data flow and optimal performance.
…enhanced integration coverage
Code Review
This PR adds a very comprehensive tutorial for Airbyte and Couchbase integration. The tutorial is well-structured and covers a lot of ground, from basic setup to advanced patterns and troubleshooting. I've found a few issues, mostly related to correctness of code snippets and consistency in examples. The most critical ones are incorrect SQL queries for data extraction and a bug in a timestamp conversion query, which would lead to incorrect data. I've also suggested improvements for formatting and consistency to improve readability. Overall, great work on creating such a detailed guide.
    SELECT META().id as _id,
    TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
    *
    FROM `bucket`.`scope`.`collection`
The example SQL query for Full Refresh mode uses SELECT *, which will produce a flat JSON structure. This contradicts the nested stream schema described previously, which expects the original document to be nested under a key representing the bucket name. The query should be adjusted to create this nested structure.
Suggested change:

    - SELECT META().id as _id,
    - TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
    - *
    - FROM `bucket`.`scope`.`collection`
    + SELECT META().id as _id,
    + TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
    + c AS `bucket`
    + FROM `bucket`.`scope`.`collection` AS c
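To make the two record shapes discussed in this thread concrete, here is a minimal Python sketch. It only illustrates a flat row versus a row nested under a bucket key; it does not model how the query service actually shapes `SELECT *` results, which should be checked against a real cluster. The function name `nest_under_bucket` and the sample row are hypothetical.

```python
from typing import Any, Dict

# Metadata columns produced by the query under review.
META_KEYS = {"_id", "_ab_cdc_updated_at"}


def nest_under_bucket(flat_row: Dict[str, Any], bucket: str) -> Dict[str, Any]:
    """Move every non-metadata field of a flat row under a single bucket key."""
    nested = {k: v for k, v in flat_row.items() if k in META_KEYS}
    nested[bucket] = {k: v for k, v in flat_row.items() if k not in META_KEYS}
    return nested


# Hypothetical flat row: document fields sit next to the metadata columns.
flat = {"_id": "airline_10", "_ab_cdc_updated_at": 1700000000, "name": "40-Mile Air"}
nested = nest_under_bucket(flat, "bucket")
print(nested)
```

If the stream schema expects the nested form, the document fields would be read from `nested["bucket"]` rather than from the top level of the row.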
    SELECT META().id as _id,
    TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
    *
    FROM `bucket`.`scope`.`collection`
    WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
    ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
    ```
Similar to the Full Refresh query, the example SQL query for Incremental (Append) mode uses SELECT *, which produces a flat JSON structure. This contradicts the documented nested stream schema. The query should be adjusted to create the nested structure.
Suggested change:

    - SELECT META().id as _id,
    - TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
    - *
    - FROM `bucket`.`scope`.`collection`
    - WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
    - ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
    - ```
    + SELECT META().id as _id,
    + TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
    + c AS `bucket`
    + FROM `bucket`.`scope`.`collection` AS c
    + WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
    + ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
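The cursor logic expressed by the `WHERE ... > {last_cursor_value}` and `ORDER BY ... ASC` clauses can be sketched in plain Python. This is an illustration of the incremental pattern only, not the connector's actual implementation; `incremental_batch` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List, Tuple

CURSOR_FIELD = "_ab_cdc_updated_at"


def incremental_batch(
    rows: List[Dict[str, Any]], last_cursor: int
) -> Tuple[List[Dict[str, Any]], int]:
    """Return rows newer than last_cursor in ascending order, plus the new cursor."""
    fresh = sorted(
        (row for row in rows if row[CURSOR_FIELD] > last_cursor),
        key=lambda row: row[CURSOR_FIELD],
    )
    # Keep the old cursor when nothing new arrived, so the next sync starts from the same point.
    new_cursor = fresh[-1][CURSOR_FIELD] if fresh else last_cursor
    return fresh, new_cursor


rows = [
    {"_id": "a", CURSOR_FIELD: 5},
    {"_id": "b", CURSOR_FIELD: 12},
    {"_id": "c", CURSOR_FIELD: 9},
]
batch, cursor = incremental_batch(rows, last_cursor=8)
print([row["_id"] for row in batch], cursor)
```

Because the comparison is strictly greater-than, a document whose cursor value equals the saved cursor is not emitted again on the next sync.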
    SELECT
    data.*,
    emitted_at,
    TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000000) as modified_at
The query divides data._ab_cdc_updated_at by 1,000,000,000. The _ab_cdc_updated_at field is in nanoseconds, and dividing it by 1,000,000,000 converts it to seconds. However, the TO_TIMESTAMP function in N1QL expects a number representing milliseconds since the Unix epoch. This will result in an incorrect timestamp. To convert nanoseconds to milliseconds, you should divide by 1,000,000.
Suggested change:

    - TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000000) as modified_at
    + TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000) as modified_at
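The unit arithmetic behind this comment can be checked with a short Python sketch. It works under the comment's assumption that `_ab_cdc_updated_at` holds nanoseconds since the Unix epoch; which unit a given `TO_TIMESTAMP` implementation expects depends on the engine running the query and should be confirmed in its documentation. The helper names and the sample value are hypothetical.

```python
from datetime import datetime, timezone

NS_PER_MS = 1_000_000      # nanoseconds in one millisecond
NS_PER_S = 1_000_000_000   # nanoseconds in one second


def ns_to_millis(ns: int) -> int:
    """Nanoseconds since the epoch to whole milliseconds since the epoch."""
    return ns // NS_PER_MS


def ns_to_seconds(ns: int) -> int:
    """Nanoseconds since the epoch to whole seconds since the epoch."""
    return ns // NS_PER_S


def ns_to_datetime(ns: int) -> datetime:
    """Nanoseconds since the epoch to a timezone-aware UTC datetime (second precision)."""
    return datetime.fromtimestamp(ns_to_seconds(ns), tz=timezone.utc)


sample_ns = 1_700_000_000_000_000_000  # hypothetical cursor value
print(ns_to_millis(sample_ns), ns_to_seconds(sample_ns), ns_to_datetime(sample_ns).isoformat())
```

Dividing by the wrong constant shifts the result by a factor of 1,000, so a function that expects milliseconds but receives seconds would place the timestamp within weeks of 1970.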
    - Airbyte
    - Data Integration
    - ETL
    - Connector
    "travel-sample": { // Collection name
    // Original document fields
    }
There are a couple of inconsistencies in the stream schema example:
- The comment on line 179 identifies `travel-sample` as the "Collection name", but it's the bucket name.
- The schema uses a specific bucket name (`travel-sample`), while the rest of the section uses generic placeholders like `bucket`. For consistency, the schema example should also be generic.
Suggested change:

    - "travel-sample": { // Collection name
    - // Original document fields
    - }
    + "bucket": { // Bucket name
    + // Original document fields
    + }
    ```
    Consider using Airbyte's transformation capabilities or custom dbt models
    to mask PII before syncing to non-production environments
    ```
This security note about data masking is formatted as a code block, which is typically used for code examples. For better readability and semantic correctness, it would be more appropriate to format this as a blockquote or regular text.
Suggested change:

    - ```
    - Consider using Airbyte's transformation capabilities or custom dbt models
    - to mask PII before syncing to non-production environments
    - ```
    + > Consider using Airbyte's transformation capabilities or custom dbt models
    + > to mask PII before syncing to non-production environments
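For readers unsure what masking PII (personally identifiable information) could look like in practice, here is a minimal, generic Python sketch. It does not use any Airbyte or dbt API; `mask_pii`, the field names, and the sample record are hypothetical, and an unsalted hash is pseudonymization rather than strong anonymization.

```python
import hashlib
from typing import Any, Dict, Iterable


def mask_pii(record: Dict[str, Any], pii_fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of record with the named fields replaced by a short SHA-256 digest."""
    masked = dict(record)  # shallow copy so the caller's record is left untouched
    for field in pii_fields:
        if masked.get(field) is not None:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
            masked[field] = digest[:12]
    return masked


# Hypothetical record with two sensitive fields and one non-sensitive field.
record = {"id": 42, "email": "jane@example.com", "phone": "555-0100", "plan": "pro"}
safe = mask_pii(record, pii_fields=["email", "phone"])
print(safe)
```

Hashing keeps equal inputs equal after masking, so joins on the masked column still work in a non-production environment.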
    ClusterTimeoutOptions(
    kv_timeout=5 seconds,
    query_timeout=10 seconds
    )
The Python code snippet for ClusterTimeoutOptions is not valid. Expressions like 5 seconds are not valid in Python. The Couchbase Python SDK typically expects integers or floats representing seconds for timeout values. Please correct the example to be valid Python code.
Suggested change:

    - ClusterTimeoutOptions(
    - kv_timeout=5 seconds,
    - query_timeout=10 seconds
    - )
    + ClusterTimeoutOptions(
    + kv_timeout=5, # in seconds
    + query_timeout=10 # in seconds
    + )
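Another way to express these durations in valid Python is `datetime.timedelta`, which is the form the Couchbase Python SDK documentation examples use for timeouts. The sketch below builds the values with the standard library only; passing them to `ClusterTimeoutOptions` is shown as a comment because the import path and accepted types are assumptions that should be checked against the installed SDK version.

```python
from datetime import timedelta

# Timeout values expressed as explicit durations instead of bare numbers.
timeouts = {
    "kv_timeout": timedelta(seconds=5),
    "query_timeout": timedelta(seconds=10),
}

# Assumption (not executed here): with the SDK installed, these could be passed as
#   from couchbase.options import ClusterTimeoutOptions
#   timeout_options = ClusterTimeoutOptions(**timeouts)

for name, value in timeouts.items():
    print(f"{name} = {value.total_seconds()}s")
```

Using `timedelta` keeps the unit explicit at the call site, which avoids the seconds-versus-milliseconds ambiguity that a bare `5` carries.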
    Check existing indexes:
    SELECT * FROM system:indexes
    WHERE keyspace_id = 'collection_name'
shyam-cb left a comment
There is a lot of redundant information; please consolidate it. Also, where metrics are mentioned, are they verified or were they generated by AI?
    Airbyte is an open-source data integration platform that enables you to move data between various sources and destinations. With Airbyte's Couchbase connectors, you can use Couchbase as both a data source and destination, enabling powerful data integration scenarios including:

    - **Cross-bucket replication**: Sync data between buckets within the same or different Couchbase clusters
    - **Analytics pipelines**: Extract data from Couchbase to data warehouses or analytics platforms
verify this statement
    - When tracking changes over time

    **Requirements**:
    - Couchbase automatically maintains the `last_modified` xattr on all documents
Is this statement true?
    **Performance**: Consider creating custom secondary indexes for better query performance on large collections.

    **Change Detection**: The connector uses Couchbase's built-in `last_modified` xattr, which is automatically maintained by the server.
Does Couchbase add `last_modified` by default when a document is created or updated?
    - **Permissions**: Assign "Data Reader", "Data Writer", and "Query Manager" roles
    4. Save the credentials

    **Note**: Query Manager role is required for automatic collection and index creation.
Cluster credentials and Capella credentials are different; please verify this.
    ## Part 1: Configuring Couchbase as a Source

    The Couchbase source connector allows Airbyte to extract data from your Couchbase buckets. It automatically discovers all collections within a bucket and creates individual streams for each.
Explain what a stream is, or link to a reference that explains streams.
    **Primary Key**: `[["data", "id"]]` (from PostgreSQL primary key)
    **Schedule**: Every 15 minutes

    **Benefits**:
The use case in the Benefits section is completely generic and not compelling.
    AND type = 'airbyte_record'
    ```

    ### Pattern 4: Real-Time Change Tracking
Think of a different example for Real-Time Change Tracking; `financial_transactions` and `user_accounts` are completely different entities when you compare them in terms of audit logging.
    ORDER BY emitted_at DESC
    ```

    **Storage Consideration**: This pattern will grow continuously. Plan for data lifecycle management:
Also check whether Airbyte is actually used for real-time tracking, and understand its real-world use cases.
    **Security Note**: Implement data masking for sensitive fields:
    ```
    Consider using Airbyte's transformation capabilities or custom dbt models
    to mask PII before syncing to non-production environments
Airbyte transformations for PII: what is PII here?
Also, what does data masking denote here?
    **Performance Comparison**:
    ```
    Full Refresh (100k docs): ~5-10 minutes
Are you sure about the metrics mentioned here?
Summary
This PR adds a comprehensive tutorial for integrating Airbyte with Couchbase as both source and destination.
Tutorial Details
tutorial/markdown/connectors/airbyte/airbyte-couchbase-integration.md
Content Overview
Part 1: Couchbase as Source
Part 2: Couchbase as Destination
Part 3: Creating Connections
Part 4: Common Integration Patterns
Additional Sections
Technical Details
Test Plan
npm run test:frontmatter)
This tutorial will be automatically published to the Developer Portal on the next weekly build when merged to main.