
Conversation

@teetangh
Contributor

Summary

This PR adds a comprehensive tutorial for integrating Airbyte with Couchbase as both source and destination.

Tutorial Details

  • Location: tutorial/markdown/connectors/airbyte/airbyte-couchbase-integration.md
  • Length: ~35 minutes reading time (1,659 lines)
  • Focus: Couchbase Capella (cloud) with references to self-hosted
  • Skill Level: Intermediate

Content Overview

Part 1: Couchbase as Source

  • Connection configuration and authentication
  • Stream discovery and schema inference
  • Full Refresh and Incremental sync modes
  • Understanding cursor fields and CDC

Part 2: Couchbase as Destination

  • Destination setup and permissions
  • Sync modes (Overwrite, Append, Append Dedup)
  • Document structure and metadata
  • Collection management

Part 3: Creating Connections

  • Step-by-step connection setup
  • Stream configuration and sync mode selection
  • Primary key configuration
  • Scheduling options
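The Part 3 settings above can be sketched as a plain configuration object. This is an illustrative sketch only — the field names below are hypothetical and do not reproduce the exact Airbyte API schema:

```python
# Hypothetical sketch of a connection configuration as a plain dict.
# Field names are illustrative and NOT the exact Airbyte API schema.
connection_config = {
    "name": "couchbase-to-warehouse",
    "streams": [
        {
            "name": "travel-sample.inventory.airline",
            "sync_mode": "incremental",             # or "full_refresh"
            "destination_sync_mode": "append_dedup",
            "cursor_field": ["_ab_cdc_updated_at"],
            "primary_key": [["_id"]],
        }
    ],
    "schedule": {"type": "basic", "units": 15, "time_unit": "minutes"},
}

def incremental_streams(config):
    """Names of streams configured for incremental sync."""
    return [s["name"] for s in config["streams"]
            if s["sync_mode"] == "incremental"]
```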

Part 4: Common Integration Patterns

  • Cross-bucket replication (Couchbase to Couchbase)
  • Analytics pipelines (Couchbase to Data Warehouse)
  • Data ingestion (SaaS/Database to Couchbase)
  • Real-time change tracking
  • Multi-environment sync

Additional Sections

  • Performance optimization and best practices
  • Security considerations
  • Comprehensive troubleshooting guide
  • Monitoring strategies
  • Resource links and next steps

Technical Details

  • Based on analysis of Couchbase source connector v0.1.8 and destination connector v0.1.9
  • Includes real-world configuration examples
  • SQL++ query patterns for data verification
  • Network and security configuration guidance
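One verification pattern the tutorial covers is comparing record counts between source and destination after a sync. A minimal sketch of that check — in practice the counts would come from SQL++ `SELECT COUNT(*)` queries, here they are plain integers:

```python
def counts_match(source_count, dest_count, tolerance=0):
    """Compare record counts from source and destination after a sync.
    The counts would really come from queries like
    SELECT COUNT(*) AS cnt FROM `bucket`.`scope`.`collection`."""
    return abs(source_count - dest_count) <= tolerance

# e.g. counts returned by a verification query on each side
ok = counts_match(1000, 1000)
drifted = counts_match(1000, 990)
```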

Test Plan

  • Review tutorial content for accuracy
  • Verify all code examples and configuration snippets
  • Check links to external documentation
  • Validate frontmatter schema (run npm run test:frontmatter)
  • Test tutorial flow with actual Airbyte and Couchbase setup
  • Review for consistency with other connector tutorials

This tutorial will be automatically published to the Developer Portal on the next weekly build when merged to main.

Add detailed tutorial covering:
- Couchbase as source and destination
- All sync modes and configuration options
- Common integration patterns
- Performance optimization and troubleshooting
- Capella-focused with 35-minute read time
@teetangh teetangh requested a review from a team as a code owner November 20, 2025 09:31
@gemini-code-assist
Contributor

Summary of Changes

Hello @teetangh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, in-depth tutorial designed to guide users through the process of integrating Airbyte with Couchbase. The tutorial provides step-by-step instructions for configuring Couchbase as both a data source and a destination, covering essential aspects from initial setup and sync mode selection to advanced integration patterns and troubleshooting. Its purpose is to empower users to build robust data pipelines leveraging Airbyte's capabilities with Couchbase, ensuring efficient data flow and optimal performance.

Highlights

  • New Comprehensive Tutorial: Introduces a new, in-depth tutorial for integrating Airbyte with Couchbase, covering its use as both a data source and destination.
  • Detailed Configuration Guide: Provides step-by-step instructions for configuring Couchbase Capella, including user permissions, network access, and connection string setup for Airbyte connectors.
  • Sync Mode Explanations: Explains various sync modes (Full Refresh, Incremental, Overwrite, Append, Append Dedup) and their appropriate use cases for efficient data replication.
  • Common Integration Patterns: Outlines practical data integration patterns such as cross-bucket replication, analytics pipelines, SaaS/database ingestion to Couchbase, real-time change tracking, and multi-environment synchronization.
  • Performance, Security, and Troubleshooting: Includes extensive guidance on performance optimization, security best practices, data quality guidelines, and a comprehensive troubleshooting section for Airbyte-Couchbase integrations.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This PR adds a very comprehensive tutorial for Airbyte and Couchbase integration. The tutorial is well-structured and covers a lot of ground, from basic setup to advanced patterns and troubleshooting. I've found a few issues, mostly related to correctness of code snippets and consistency in examples. The most critical ones are incorrect SQL queries for data extraction and a bug in a timestamp conversion query, which would lead to incorrect data. I've also suggested improvements for formatting and consistency to improve readability. Overall, great work on creating such a detailed guide.

Comment on lines +195 to +198

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
*
FROM `bucket`.`scope`.`collection`
```
Contributor

high

The example SQL query for Full Refresh mode uses SELECT *, which will produce a flat JSON structure. This contradicts the nested stream schema described previously, which expects the original document to be nested under a key representing the bucket name. The query should be adjusted to create this nested structure.

Suggested change:

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
*
FROM `bucket`.`scope`.`collection`
```

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
c AS `bucket`
FROM `bucket`.`scope`.`collection` AS c
```
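To make the intended shape concrete, here is a sketch in Python of the record the suggested projection produces — the original document nested under the bucket name rather than spread flat (this is an illustrative model, not connector code; names are hypothetical):

```python
def to_nested_record(doc_id, last_modified_ns, doc, bucket_name):
    """Shape a record the way the suggested projection does: the original
    document is nested under the bucket name instead of flattened."""
    return {
        "_id": doc_id,
        "_ab_cdc_updated_at": last_modified_ns,
        bucket_name: doc,  # nested, matching the documented stream schema
    }

record = to_nested_record("airline_10", 1700000000000000000,
                          {"name": "40-Mile Air"}, "travel-sample")
```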

Comment on lines +218 to +224

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
*
FROM `bucket`.`scope`.`collection`
WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
```
Contributor

high

Similar to the Full Refresh query, the example SQL query for Incremental (Append) mode uses SELECT *, which produces a flat JSON structure. This contradicts the documented nested stream schema. The query should be adjusted to create the nested structure.

Suggested change:

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
*
FROM `bucket`.`scope`.`collection`
WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
```

```sql
SELECT META().id as _id,
TO_NUMBER(meta().xattrs.$document.last_modified) as _ab_cdc_updated_at,
c AS `bucket`
FROM `bucket`.`scope`.`collection` AS c
WHERE TO_NUMBER(meta().xattrs.$document.last_modified) > {last_cursor_value}
ORDER BY TO_NUMBER(meta().xattrs.$document.last_modified) ASC
```
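The cursor logic behind this query can be sketched in Python; this is an illustrative model of incremental extraction under the assumption of a numeric cursor field, not connector code:

```python
def incremental_batch(docs, last_cursor):
    """Mirror the query's WHERE/ORDER BY clauses: keep documents whose
    cursor value is strictly greater than the saved cursor, sorted
    ascending, and advance the cursor to the last value seen."""
    changed = sorted(
        (d for d in docs if d["_ab_cdc_updated_at"] > last_cursor),
        key=lambda d: d["_ab_cdc_updated_at"],
    )
    new_cursor = changed[-1]["_ab_cdc_updated_at"] if changed else last_cursor
    return changed, new_cursor

docs = [
    {"_id": "a", "_ab_cdc_updated_at": 100},
    {"_id": "b", "_ab_cdc_updated_at": 300},
    {"_id": "c", "_ab_cdc_updated_at": 200},
]
batch, cursor = incremental_batch(docs, 100)  # "a" is excluded (not > 100)
```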

```sql
SELECT
data.*,
emitted_at,
TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000000) as modified_at
```
Contributor

high

The query divides data._ab_cdc_updated_at by 1,000,000,000. The _ab_cdc_updated_at field is in nanoseconds, and dividing it by 1,000,000,000 converts it to seconds. However, the TO_TIMESTAMP function in N1QL expects a number representing milliseconds since the Unix epoch. This will result in an incorrect timestamp. To convert nanoseconds to milliseconds, you should divide by 1,000,000.

Suggested change:

```sql
TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000000) as modified_at
```

```sql
TO_TIMESTAMP(data._ab_cdc_updated_at / 1000000) as modified_at
```
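The unit conversion is easy to sanity-check in Python; integer division by 1,000,000 turns a nanosecond epoch into milliseconds, and dividing milliseconds by 1,000 recovers the familiar seconds epoch:

```python
from datetime import datetime, timezone

NS_PER_MS = 1_000_000

def ns_to_ms(ns):
    """Nanoseconds -> milliseconds, the unit TO_TIMESTAMP expects.
    Integer division avoids float drift on large epoch values."""
    return ns // NS_PER_MS

ns = 1_609_459_200_000_000_000   # 2021-01-01T00:00:00Z in nanoseconds
ms = ns_to_ms(ns)
dt = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```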

Comment on lines +15 to +18

```yaml
- Airbyte
- Data Integration
- ETL
- Connector
```
Contributor

medium

The tags Airbyte, Data Integration, and ETL are not present in the list of allowed tags defined in test/test-markdown-frontmatter.js. Using invalid tags will cause the frontmatter validation test to fail. Please replace them with valid tags from the allowed list.

Comment on lines +179 to +181
"travel-sample": { // Collection name
// Original document fields
}
Contributor

medium

There are a couple of inconsistencies in the stream schema example:

  1. The comment on line 179 identifies travel-sample as the "Collection name", but it's the bucket name.
  2. The schema uses a specific bucket name (travel-sample), while the rest of the section uses generic placeholders like bucket. For consistency, the schema example should also be generic.
Suggested change:

```
"travel-sample": { // Collection name
  // Original document fields
}
```

```
"bucket": { // Bucket name
  // Original document fields
}
```

Comment on lines +752 to +755
```
Consider using Airbyte's transformation capabilities or custom dbt models
to mask PII before syncing to non-production environments
```
Contributor

medium

This security note about data masking is formatted as a code block, which is typically used for code examples. For better readability and semantic correctness, it would be more appropriate to format this as a blockquote or regular text.

Suggested change
```
Consider using Airbyte's transformation capabilities or custom dbt models
to mask PII before syncing to non-production environments
```
> Consider using Airbyte's transformation capabilities or custom dbt models
> to mask PII before syncing to non-production environments

Comment on lines +879 to +882
ClusterTimeoutOptions(
kv_timeout=5 seconds,
query_timeout=10 seconds
)
Contributor

medium

The Python code snippet for ClusterTimeoutOptions is not valid. Expressions like 5 seconds are not valid in Python. The Couchbase Python SDK typically expects integers or floats representing seconds for timeout values. Please correct the example to be valid Python code.

Suggested change:

```python
ClusterTimeoutOptions(
    kv_timeout=5 seconds,
    query_timeout=10 seconds
)
```

```python
ClusterTimeoutOptions(
    kv_timeout=5,  # in seconds
    query_timeout=10  # in seconds
)
```
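Note that recent Couchbase Python SDK (4.x) releases document `datetime.timedelta` values for `ClusterTimeoutOptions`, so plain integers should also be verified against the SDK version in use. A standalone sketch of building such values (the helper name is hypothetical, and the SDK itself is deliberately not imported so the sketch stays self-contained):

```python
from datetime import timedelta

def timeout_options(kv_seconds=5, query_seconds=10):
    """Build timeout values as timedelta objects, the type documented by
    Couchbase Python SDK 4.x for ClusterTimeoutOptions. Verify against
    your installed SDK version before relying on this."""
    return {
        "kv_timeout": timedelta(seconds=kv_seconds),
        "query_timeout": timedelta(seconds=query_seconds),
    }

opts = timeout_options()
```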


Check existing indexes:

```sql
SELECT * FROM system:indexes
WHERE keyspace_id = 'collection_name'
```
Contributor

medium

The placeholder collection_name is used here, which is inconsistent with collection used in other query examples throughout the document (e.g., line 198). For consistency, it would be better to use collection.

Suggested change:

```sql
WHERE keyspace_id = 'collection_name'
```

```sql
WHERE keyspace_id = 'collection'
```


@shyam-cb left a comment


There is a lot of redundant information here; please consolidate it. Also, where metrics are mentioned, are they verified numbers or AI-generated?

Airbyte is an open-source data integration platform that enables you to move data between various sources and destinations. With Airbyte's Couchbase connectors, you can use Couchbase as both a data source and destination, enabling powerful data integration scenarios including:

- **Cross-bucket replication**: Sync data between buckets within the same or different Couchbase clusters
- **Analytics pipelines**: Extract data from Couchbase to data warehouses or analytics platforms


Please verify this statement.

- When tracking changes over time

**Requirements**:
- Couchbase automatically maintains the `last_modified` xattr on all documents


Is this statement true?


**Performance**: Consider creating custom secondary indexes for better query performance on large collections.

**Change Detection**: The connector uses Couchbase's built-in `last_modified` xattr, which is automatically maintained by the server.


Does Couchbase add the `last_modified` xattr by default when a document is created or updated?

- **Permissions**: Assign "Data Reader", "Data Writer", and "Query Manager" roles
4. Save the credentials

**Note**: Query Manager role is required for automatic collection and index creation.


Cluster credentials and Capella credentials are different; please verify this.


## Part 1: Configuring Couchbase as a Source

The Couchbase source connector allows Airbyte to extract data from your Couchbase buckets. It automatically discovers all collections within a bucket and creates individual streams for each.


Please explain what a stream is, or link to documentation that explains it.

**Primary Key**: `[["data", "id"]]` (from PostgreSQL primary key)
**Schedule**: Every 15 minutes

**Benefits**:


The Benefits section is completely generic and not compelling; tie it to a concrete use case.

AND type = 'airbyte_record'
```

### Pattern 4: Real-Time Change Tracking


Consider a different example for real-time change tracking: `financial_transactions` and `user_accounts` are completely different entities when compared in terms of an audit log.

ORDER BY emitted_at DESC
```

**Storage Consideration**: This pattern will grow continuously. Plan for data lifecycle management:


Also check whether Airbyte is actually used for real-time tracking; understand its real-world use case.

**Security Note**: Implement data masking for sensitive fields:
```
Consider using Airbyte's transformation capabilities or custom dbt models
to mask PII before syncing to non-production environments
```


Regarding Airbyte transformations for PII: what does PII refer to here? Also, what does data masking denote in this context?
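For illustration, one plausible shape of such a masking transformation — a sketch only, with hypothetical field names; real pipelines would do this in an Airbyte transformation or a dbt model:

```python
def mask_pii(doc, pii_fields=("email", "ssn", "phone")):
    """Replace configured PII fields with a redaction marker before a
    sync to a non-production environment. Field names are illustrative."""
    return {k: ("***REDACTED***" if k in pii_fields else v)
            for k, v in doc.items()}

masked = mask_pii({"id": 1, "email": "a@example.com", "city": "Austin"})
```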


**Performance Comparison**:
```
Full Refresh (100k docs): ~5-10 minutes
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about the metrics mentioned here?
