From 73f040121d67277dc81739c2334e45a11a158f11 Mon Sep 17 00:00:00 2001
From: Shiva
Date: Thu, 4 Dec 2025 18:26:22 +0530
Subject: [PATCH 1/2] add README with detailed usage and import instructions for dgraph-import

---
 dgraph-import/README.md | 117 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100644 dgraph-import/README.md

diff --git a/dgraph-import/README.md b/dgraph-import/README.md
new file mode 100644
index 0000000..408a90a
--- /dev/null
+++ b/dgraph-import/README.md
@@ -0,0 +1,117 @@
# Dgraph Import

## Overview

The `dgraph import` command bulk loads RDF/JSON data into a Dgraph cluster via snapshot-based import. It supports two workflows: generating a snapshot from data files or streaming an existing snapshot to a running cluster.

## Command Syntax

```
dgraph import [flags]
```

### Essential Flags

| Flag | Description |
|------|-------------|
| `--files, -f` | Path to RDF/JSON data files (e.g., `data.rdf`, `data.json`) |
| `--schema, -s` | Path to DQL schema file |
| `--graphql-schema, -g` | Path to GraphQL schema file |
| `--format` | File format: `rdf` or `json` |
| `--snapshot-dir, -p` | Path to an existing snapshot output directory for direct import |
| `--drop-all` | Drop all existing cluster data before import (enables the bulk loader) |
| `--drop-all-confirm` | Confirmation flag for the `--drop-all` operation |
| `--conn-str, -c` | Dgraph connection string (e.g., `dgraph://localhost:9080`) |

## Quick Start

### Bulk Import with Data and Schema

```
dgraph import --files data.rdf --schema schema.dql \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

Loads data from `data.rdf`, drops existing cluster data, generates a snapshot, and streams it to the cluster.

### Import from Existing Snapshot

```
dgraph import --snapshot-dir ./out --conn-str dgraph://localhost:9080
```

Directly streams snapshot data without the bulk loading phase.

## Snapshot Directory Structure

The bulk loader generates an `out` directory with per-group subdirectories:

```
out/
├── 0/
│   └── p/   # BadgerDB files for group 0
├── 1/
│   └── p/   # BadgerDB files for group 1
└── N/
    └── p/   # BadgerDB files for group N
```

When using `--snapshot-dir`, provide the `out` directory path. The import tool automatically locates the `p` directories within each group folder.

**Important:** Do not specify a `p` directory directly.

## How It Works

1. **Drop-All Mode**: With `--drop-all` and `--drop-all-confirm`, the bulk loader generates a snapshot from the provided data and schema files.
2. **Snapshot Streaming**: The snapshot is streamed to the cluster via gRPC.
3. **Consistency**: The cluster enters drain mode during import. On error, all data is dropped for safety.

## Import Examples

**RDF with DQL schema:**
```
dgraph import --files data.rdf --schema schema.dql \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

**JSON with GraphQL schema:**
```
dgraph import --files data.json --schema schema.dql \
  --graphql-schema schema.graphql --format json \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

**Existing snapshot:**
```
dgraph import --snapshot-dir ./out --conn-str dgraph://localhost:9080
```
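**Smoke test with inline sample data:**

For a quick end-to-end check before loading real data, the flow can be exercised with two tiny hand-written files. This is a minimal sketch: the file names, the sample triples, and the one-predicate schema are illustrative, not part of the tool; it assumes a local Alpha at `localhost:9080`.

```
# Tiny RDF dataset (illustrative content)
cat > sample.rdf <<'EOF'
_:alice <name> "Alice" .
_:bob   <name> "Bob" .
_:alice <knows> _:bob .
EOF

# Matching DQL schema (illustrative)
cat > sample.schema <<'EOF'
name: string @index(exact) .
knows: [uid] .
EOF

# Drops any existing data, bulk loads the files, and streams the snapshot
dgraph import --files sample.rdf --schema sample.schema \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```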
## Benchmark Import

For testing with large datasets, Dgraph provides a sample dataset of one million RDF triples.

**Download benchmark files:**

```
wget -O 1million.rdf.gz "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/data/1million.rdf.gz?raw=true"
wget -O 1million.schema "https://github.com/dgraph-io/dgraph-benchmarks/blob/main/data/1million.schema?raw=true"
```

**Run benchmark import:**

```
dgraph import --files 1million.rdf.gz --schema 1million.schema \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

## Important Notes

- When the `--drop-all` and `--drop-all-confirm` flags are set, **all existing data in the cluster will be dropped** before the import begins.
- Both the `--drop-all` and `--drop-all-confirm` flags are required for bulk loading; the command aborts without them.
- Live loader mode is not supported; only snapshot/bulk import is available.
- Ensure sufficient disk space for snapshot generation.
- The connection string must use the gRPC format: `dgraph://localhost:9080`.
\ No newline at end of file

From 64a5d5c67bdf51b4e1000addf61ee76d27bc87e2 Mon Sep 17 00:00:00 2001
From: Shiva
Date: Mon, 8 Dec 2025 11:55:12 +0530
Subject: [PATCH 2/2] resolve review comments

---
 dgraph-import/README.md | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/dgraph-import/README.md b/dgraph-import/README.md
index 408a90a..bf4de39 100644
--- a/dgraph-import/README.md
+++ b/dgraph-import/README.md
@@ -2,7 +2,19 @@
 
 ## Overview
 
-The `dgraph import` command bulk loads RDF/JSON data into a Dgraph cluster via snapshot-based import. It supports two workflows: generating a snapshot from data files or streaming an existing snapshot to a running cluster.
+The `dgraph import` command, introduced in **v25.0.0**, is designed to unify and simplify bulk and live data loading into Dgraph. Previously, users had to choose between `dgraph bulk` and `dgraph live`. With `dgraph import`, you now have a single command for both workflows, eliminating manual steps and reducing operational complexity.
+
+> **Note:**
+> The original intent was to support both bulk and live loading, but **live loader mode is not yet supported**. Only bulk/snapshot import is available.
+
+## How Data Is Imported
+
+When you run `dgraph import`, the tool first runs the bulk loader using the RDF/JSON and schema files you provide. This generates the snapshot data in the form of `p` directories (BadgerDB files) for each group.
+After the bulk loader completes, `dgraph import` connects to the Alpha endpoint, puts the cluster into drain mode, and **streams the contents of the generated `p` directories directly to the running cluster using gRPC bidirectional streaming**. Once the import is complete, the cluster exits drain mode and resumes normal operation.
+
+If you already have a snapshot directory (from a previous bulk load), you can use the `--snapshot-dir` flag to skip the bulk loading phase and stream the snapshot data directly to the cluster.
+
+This means you no longer need to stop Alpha nodes or manually manage files; `dgraph import` handles everything automatically.
 
 ## Command Syntax
 
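The "How Data Is Imported" section added in the hunk above condenses what used to be a manual, multi-step procedure. For comparison, a rough sketch of the pre-v25 workflow follows; the Zero address, the group-0 layout, and the Alpha data path are illustrative assumptions, not details from this patch.

```
# Before v25: run the standalone bulk loader against a Zero node...
dgraph bulk -f data.rdf -s schema.dql --zero localhost:5080

# ...then, with each Alpha stopped, copy its group's p directory into
# the Alpha's data directory by hand and restart (illustrative path)
cp -r out/0/p /data/alpha1/p

# With dgraph import, one command covers both steps:
dgraph import --files data.rdf --schema schema.dql \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080
```

The copy step had to be repeated for every group, which is roughly the bookkeeping `dgraph import` now performs over gRPC.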
@@ -33,7 +45,7 @@ dgraph import --files data.rdf --schema schema.dql \
   --conn-str dgraph://localhost:9080
 ```
 
-Loads data from `data.rdf`, drops existing cluster data, generates a snapshot, and streams it to the cluster.
+Loads data from `data.rdf`, drops existing cluster data, runs the bulk loader to generate a snapshot, and streams it to the cluster.
 
 ### Import from Existing Snapshot
 
@@ -41,7 +53,7 @@ Loads data from `data.rdf`, drops existing cluster data, generates a snapshot, a
 dgraph import --snapshot-dir ./out --conn-str dgraph://localhost:9080
 ```
 
-Directly streams snapshot data without the bulk loading phase.
+Directly streams snapshot data (the output of a previous bulk load) into the cluster, without running the bulk loader again.
 
 ## Snapshot Directory Structure
 
@@ -64,7 +76,7 @@ When using `--snapshot-dir`, provide the `out` directory path. The import tool a
 ## How It Works
 
 1. **Drop-All Mode**: With `--drop-all` and `--drop-all-confirm`, the bulk loader generates a snapshot from the provided data and schema files.
-2. **Snapshot Streaming**: The snapshot is streamed to the cluster via gRPC.
+2. **Snapshot Streaming**: The snapshot (the contents of the `p` directories) is streamed to the cluster via gRPC, copying all data directly into the running cluster.
 3. **Consistency**: The cluster enters drain mode during import. On error, all data is dropped for safety.
 
 ## Import Examples
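Taken together, the two workflows this patch documents form a pipeline: the snapshot produced by one bulk-loading run can seed further clusters without re-running the bulk loader. A short sketch, assuming the bulk loader writes its `out` directory to the working directory (as the Snapshot Directory Structure section describes) and using a hypothetical second Alpha endpoint:

```
# First cluster: bulk load and stream; this also leaves the ./out
# snapshot directory behind for later reuse
dgraph import --files 1million.rdf.gz --schema 1million.schema \
  --drop-all --drop-all-confirm \
  --conn-str dgraph://localhost:9080

# Second cluster: reuse the same snapshot, skipping the bulk phase.
# "dgraph://other-host:9080" is an illustrative placeholder address.
dgraph import --snapshot-dir ./out \
  --conn-str dgraph://other-host:9080
```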