
Conversation

@teetangh
Contributor

Summary

This PR adds a comprehensive tutorial for integrating Airbyte with Couchbase as both source and destination.

Tutorial Details

  • Location: tutorial/markdown/connectors/airbyte/airbyte-couchbase-integration.md
  • Length: ~35 minutes reading time (1,659 lines)
  • Focus: Couchbase Capella (cloud) with references to self-hosted
  • Skill Level: Intermediate

Content Overview

Part 1: Couchbase as Source

  • Connection configuration and authentication
  • Stream discovery and schema inference
  • Full Refresh and Incremental sync modes
  • Understanding cursor fields and CDC

Part 2: Couchbase as Destination

  • Destination setup and permissions
  • Sync modes (Overwrite, Append, Append Dedup)
  • Document structure and metadata
  • Collection management

Part 3: Creating Connections

  • Step-by-step connection setup
  • Stream configuration and sync mode selection
  • Primary key configuration
  • Scheduling options
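The Part 3 settings above can be sketched as a plain configuration object. This is an illustrative sketch only — the field names below are hypothetical and do not reproduce the exact Airbyte API schema:

```python
# Hypothetical sketch of a connection configuration as a plain dict.
# Field names are illustrative and NOT the exact Airbyte API schema.
connection_config = {
    "name": "couchbase-to-warehouse",
    "streams": [
        {
            "name": "travel-sample.inventory.airline",
            "sync_mode": "incremental",             # or "full_refresh"
            "destination_sync_mode": "append_dedup",
            "cursor_field": ["_ab_cdc_updated_at"],
            "primary_key": [["_id"]],
        }
    ],
    "schedule": {"type": "basic", "units": 15, "time_unit": "minutes"},
}

def incremental_streams(config):
    """Names of streams configured for incremental sync."""
    return [s["name"] for s in config["streams"]
            if s["sync_mode"] == "incremental"]
```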

Part 4: Common Integration Patterns

  • Cross-bucket replication (Couchbase to Couchbase)
  • Analytics pipelines (Couchbase to Data Warehouse)
  • Data ingestion (SaaS/Database to Couchbase)
  • Real-time change tracking
  • Multi-environment sync

Additional Sections

  • Performance optimization and best practices
  • Security considerations
  • Comprehensive troubleshooting guide
  • Monitoring strategies
  • Resource links and next steps

Technical Details

  • Based on analysis of Couchbase source connector v0.1.8 and destination connector v0.1.9
  • Includes real-world configuration examples
  • SQL++ query patterns for data verification
  • Network and security configuration guidance
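One verification pattern the tutorial covers is comparing record counts between source and destination after a sync. A minimal sketch of that check — in practice the counts would come from SQL++ `SELECT COUNT(*)` queries, here they are plain integers:

```python
def counts_match(source_count, dest_count, tolerance=0):
    """Compare record counts from source and destination after a sync.
    The counts would really come from queries like
    SELECT COUNT(*) AS cnt FROM `bucket`.`scope`.`collection`."""
    return abs(source_count - dest_count) <= tolerance

# e.g. counts returned by a verification query on each side
ok = counts_match(1000, 1000)
drifted = counts_match(1000, 990)
```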

Test Plan

  • Review tutorial content for accuracy
  • Verify all code examples and configuration snippets
  • Check links to external documentation
  • Validate frontmatter schema (run npm run test:frontmatter)
  • Test tutorial flow with actual Airbyte and Couchbase setup
  • Review for consistency with other connector tutorials

This tutorial will be automatically published to the Developer Portal on the next weekly build when merged to main.

Add detailed tutorial covering:
- Couchbase as source and destination
- All sync modes and configuration options
- Common integration patterns
- Performance optimization and troubleshooting
- Capella-focused with 35-minute read time
@teetangh teetangh requested a review from a team as a code owner November 20, 2025 09:31
@gemini-code-assist
Contributor

Summary of Changes

Hello @teetangh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, in-depth tutorial designed to guide users through the process of integrating Airbyte with Couchbase. The tutorial provides step-by-step instructions for configuring Couchbase as both a data source and a destination, covering essential aspects from initial setup and sync mode selection to advanced integration patterns and troubleshooting. Its purpose is to empower users to build robust data pipelines leveraging Airbyte's capabilities with Couchbase, ensuring efficient data flow and optimal performance.

Highlights

  • New Comprehensive Tutorial: Introduces a new, in-depth tutorial for integrating Airbyte with Couchbase, covering its use as both a data source and destination.
  • Detailed Configuration Guide: Provides step-by-step instructions for configuring Couchbase Capella, including user permissions, network access, and connection string setup for Airbyte connectors.
  • Sync Mode Explanations: Explains various sync modes (Full Refresh, Incremental, Overwrite, Append, Append Dedup) and their appropriate use cases for efficient data replication.
  • Common Integration Patterns: Outlines practical data integration patterns such as cross-bucket replication, analytics pipelines, SaaS/database ingestion to Couchbase, real-time change tracking, and multi-environment synchronization.
  • Performance, Security, and Troubleshooting: Includes extensive guidance on performance optimization, security best practices, data quality guidelines, and a comprehensive troubleshooting section for Airbyte-Couchbase integrations.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This PR adds a very comprehensive tutorial for Airbyte and Couchbase integration. The tutorial is well-structured and covers a lot of ground, from basic setup to advanced patterns and troubleshooting. I've found a few issues, mostly related to correctness of code snippets and consistency in examples. The most critical ones are incorrect SQL queries for data extraction and a bug in a timestamp conversion query, which would lead to incorrect data. I've also suggested improvements for formatting and consistency to improve readability. Overall, great work on creating such a detailed guide.

Comment on lines +195 to +198

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
*
FROM `bucket`.`scope`.`collection`
```
Contributor

high

The example SQL query for Full Refresh mode uses SELECT *, which will produce a flat JSON structure. This contradicts the nested stream schema described previously, which expects the original document to be nested under a key representing the bucket name. The query should be adjusted to create this nested structure.

Suggested change:

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
*
FROM `bucket`.`scope`.`collection`
```

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
c AS `bucket`
FROM `bucket`.`scope`.`collection` AS c
```
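To make the intended shape concrete, here is a sketch in Python of the record the suggested projection produces — the original document nested under the bucket name rather than spread flat (this is an illustrative model, not connector code; names are hypothetical):

```python
def to_nested_record(doc_id, last_modified_ns, doc, bucket_name):
    """Shape a record the way the suggested projection does: the original
    document is nested under the bucket name instead of flattened."""
    return {
        "_id": doc_id,
        "_ab_cdc_updated_at": last_modified_ns,
        bucket_name: doc,  # nested, matching the documented stream schema
    }

record = to_nested_record("airline_10", 1700000000000000000,
                          {"name": "40-Mile Air"}, "travel-sample")
```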

Comment on lines +218 to +224

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
*
FROM `bucket`.`scope`.`collection`
WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
```
Contributor

high

Similar to the Full Refresh query, the example SQL query for Incremental (Append) mode uses SELECT *, which produces a flat JSON structure. This contradicts the documented nested stream schema. The query should be adjusted to create the nested structure.

Suggested change:

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
*
FROM `bucket`.`scope`.`collection`
WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
```

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
c AS `bucket`
FROM `bucket`.`scope`.`collection` AS c
WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
```
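The cursor logic behind this query can be sketched in Python; this is an illustrative model of incremental extraction under the assumption of a numeric cursor field, not connector code:

```python
def incremental_batch(docs, last_cursor):
    """Mirror the query's WHERE/ORDER BY clauses: keep documents whose
    cursor value is strictly greater than the saved cursor, sorted
    ascending, and advance the cursor to the last value seen."""
    changed = sorted(
        (d for d in docs if d["_ab_cdc_updated_at"] > last_cursor),
        key=lambda d: d["_ab_cdc_updated_at"],
    )
    new_cursor = changed[-1]["_ab_cdc_updated_at"] if changed else last_cursor
    return changed, new_cursor

docs = [
    {"_id": "a", "_ab_cdc_updated_at": 100},
    {"_id": "b", "_ab_cdc_updated_at": 300},
    {"_id": "c", "_ab_cdc_updated_at": 200},
]
batch, cursor = incremental_batch(docs, 100)  # "a" is excluded (not > 100)
```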

```sql
SELECT
data.*,
emitted_at,
TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000000) as modified_at
```
Contributor

high

The query divides data._ab_cdc_updated_at by 1,000,000,000. The _ab_cdc_updated_at field is in nanoseconds, and dividing it by 1,000,000,000 converts it to seconds. However, the TO_TIMESTAMP function in N1QL expects a number representing milliseconds since the Unix epoch. This will result in an incorrect timestamp. To convert nanoseconds to milliseconds, you should divide by 1,000,000.

Suggested change:

```sql
TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000000) as modified_at
```

```sql
TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000) as modified_at
```
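The unit conversion is easy to sanity-check in Python; integer division by 1,000,000 turns a nanosecond epoch into milliseconds, and dividing milliseconds by 1,000 recovers the familiar seconds epoch:

```python
from datetime import datetime, timezone

NS_PER_MS = 1_000_000

def ns_to_ms(ns):
    """Nanoseconds -> milliseconds, the unit TO_TIMESTAMP expects.
    Integer division avoids float drift on large epoch values."""
    return ns // NS_PER_MS

ns = 1_609_459_200_000_000_000   # 2021-01-01T00:00:00Z in nanoseconds
ms = ns_to_ms(ns)
dt = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```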

Comment on lines +15 to +18

```yaml
- Airbyte
- Data Integration
- ETL
- Connector
```
Contributor

medium

The tags Airbyte, Data Integration, and ETL are not present in the list of allowed tags defined in test/test-markdown-frontmatter.js. Using invalid tags will cause the frontmatter validation test to fail. Please replace them with valid tags from the allowed list.

Comment on lines +179 to +181
"travel-sample": { // Collection name
// Original document fields
}
Contributor

medium

There are a couple of inconsistencies in the stream schema example:

  1. The comment on line 179 identifies travel-sample as the "Collection name", but it's the bucket name.
  2. The schema uses a specific bucket name (travel-sample), while the rest of the section uses generic placeholders like bucket. For consistency, the schema example should also be generic.
Suggested change:

```
"travel-sample": { // Collection name
  // Original document fields
}
```

```
"bucket": { // Bucket name
  // Original document fields
}
```

Comment on lines +752 to +755
```
Consider using Airbyte's transformation capabilities or custom dbt models
to mask PII before syncing to non-production environments
```
Contributor

medium

This security note about data masking is formatted as a code block, which is typically used for code examples. For better readability and semantic correctness, it would be more appropriate to format this as a blockquote or regular text.

Suggested change
```
Consider using Airbyte's transformation capabilities or custom dbt models
to mask PII before syncing to non-production environments
```
> Consider using Airbyte's transformation capabilities or custom dbt models
> to mask PII before syncing to non-production environments

Comment on lines +879 to +882
ClusterTimeoutOptions(
kv_timeout=5 seconds,
query_timeout=10 seconds
)
Contributor

medium

The Python code snippet for ClusterTimeoutOptions is not valid. Expressions like 5 seconds are not valid in Python. The Couchbase Python SDK typically expects integers or floats representing seconds for timeout values. Please correct the example to be valid Python code.

Suggested change:

```python
ClusterTimeoutOptions(
    kv_timeout=5 seconds,
    query_timeout=10 seconds
)
```

```python
ClusterTimeoutOptions(
    kv_timeout=5,  # in seconds
    query_timeout=10  # in seconds
)
```
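Note that recent Couchbase Python SDK (4.x) releases document `datetime.timedelta` values for `ClusterTimeoutOptions`, so plain integers should also be verified against the SDK version in use. A standalone sketch of building such values (the helper name is hypothetical, and the SDK itself is deliberately not imported so the sketch stays self-contained):

```python
from datetime import timedelta

def timeout_options(kv_seconds=5, query_seconds=10):
    """Build timeout values as timedelta objects, the type documented by
    Couchbase Python SDK 4.x for ClusterTimeoutOptions. Verify against
    your installed SDK version before relying on this."""
    return {
        "kv_timeout": timedelta(seconds=kv_seconds),
        "query_timeout": timedelta(seconds=query_seconds),
    }

opts = timeout_options()
```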


Check existing indexes:

```sql
SELECT * FROM system:indexes
WHERE keyspace_id = 'collection_name'
```
Contributor

medium

The placeholder collection_name is used here, which is inconsistent with collection used in other query examples throughout the document (e.g., line 198). For consistency, it would be better to use collection.

Suggested change:

```sql
WHERE keyspace_id = 'collection_name'
```

```sql
WHERE keyspace_id = 'collection'
```


@shyam-cb left a comment


There is a lot of redundant information here; please consolidate it. Also, where metrics are mentioned, are they verified numbers or AI-generated?

Airbyte is an open-source data integration platform that enables you to move data between various sources and destinations. With Airbyte's Couchbase connectors, you can use Couchbase as both a data source and destination, enabling powerful data integration scenarios including:

- **Cross-bucket replication**: Sync data between buckets within the same or different Couchbase clusters
- **Analytics pipelines**: Extract data from Couchbase to data warehouses or analytics platforms


Please verify this statement.

- When tracking changes over time

**Requirements**:
- Couchbase automatically maintains the `last_modified` xattr on all documents


Is this statement true?


**Performance**: Consider creating custom secondary indexes for better query performance on large collections.

**Change Detection**: The connector uses Couchbase's built-in `last_modified` xattr, which is automatically maintained by the server.


Does Couchbase add the `last_modified` xattr by default when a document is created or updated?

- **Permissions**: Assign "Data Reader", "Data Writer", and "Query Manager" roles
4. Save the credentials

**Note**: Query Manager role is required for automatic collection and index creation.


Cluster credentials and Capella credentials are different; please verify this.


## Part 1: Configuring Couchbase as a Source

The Couchbase source connector allows Airbyte to extract data from your Couchbase buckets. It automatically discovers all collections within a bucket and creates individual streams for each.


Please explain what a stream is, or link to documentation that explains it.

**Primary Key**: `[["data", "id"]]` (from PostgreSQL primary key)
**Schedule**: Every 15 minutes

**Benefits**:


The Benefits section is completely generic and not compelling; tie it to a concrete use case.

AND type = 'airbyte_record'
```

### Pattern 4: Real-Time Change Tracking


Consider a different example for real-time change tracking: `financial_transactions` and `user_accounts` are completely different entities when compared in terms of an audit log.

ORDER BY emitted_at DESC
```

**Storage Consideration**: This pattern will grow continuously. Plan for data lifecycle management:


Also check whether Airbyte is actually used for real-time tracking; understand its real-world use case.

**Security Note**: Implement data masking for sensitive fields:
```
Consider using Airbyte's transformation capabilities or custom dbt models
to mask PII before syncing to non-production environments
```


Regarding Airbyte transformations for PII: what does PII refer to here? Also, what does data masking denote in this context?
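For illustration, one plausible shape of such a masking transformation — a sketch only, with hypothetical field names; real pipelines would do this in an Airbyte transformation or a dbt model:

```python
def mask_pii(doc, pii_fields=("email", "ssn", "phone")):
    """Replace configured PII fields with a redaction marker before a
    sync to a non-production environment. Field names are illustrative."""
    return {k: ("***REDACTED***" if k in pii_fields else v)
            for k, v in doc.items()}

masked = mask_pii({"id": 1, "email": "a@example.com", "city": "Austin"})
```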


**Performance Comparison**:
```
Full Refresh (100k docs): ~5-10 minutes
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about the metrics mentioned here?
