diff --git a/pages/data-migration.mdx b/pages/data-migration.mdx
index 370bd8ee1..321b9c1a0 100644
--- a/pages/data-migration.mdx
+++ b/pages/data-migration.mdx
@@ -15,7 +15,7 @@ instance.
 Whether your data is structured in files, relational databases, or other graph
 databases, Memgraph provides the flexibility to integrate and analyze your data
 efficiently.
-Memgraph supports file system imports like Parquet and CSV files, offering efficient and
+Memgraph supports file system imports like Parquet, CSV and JSONL files, offering efficient and
 structured data ingestion. **However, if you want to migrate directly from
 another data source, you can use the [`migrate`
 module](/advanced-algorithms/available-algorithms/migrate)** from Memgraph MAGE
@@ -58,6 +58,9 @@ semi-structured data to be efficiently loaded, using the
 [`json_util` module](/advanced-algorithms/available-algorithms/json_util) and
 [`import_util` module](/advanced-algorithms/available-algorithms/import_util).

+Memgraph also supports JSONL files, in which every line is formatted as a separate JSON document. Such JSONL
+files can be efficiently imported using the [LOAD JSONL clause](/querying/clauses/load-jsonl).
+
 Check out the [JSON import guide](/data-migration/json).

 ### Cypherl file

diff --git a/pages/data-migration/csv.mdx b/pages/data-migration/csv.mdx
index efe163e1b..b024de40b 100644
--- a/pages/data-migration/csv.mdx
+++ b/pages/data-migration/csv.mdx
@@ -59,18 +59,30 @@ LOAD CSV FROM "https://example.com/path/to/your-data.csv" WITH HEADER AS row
 The syntax of the `LOAD CSV` clause is:

 ```cypher
-LOAD CSV FROM <csv-location> ( WITH | NO ) HEADER [IGNORE BAD] [DELIMITER <delimiter-string>] [QUOTE <quote-string>] [NULLIF <nullif-string>] AS <variable-name>
+LOAD CSV FROM <csv-location> ( WITH CONFIG <config-map> )? ( WITH | NO ) HEADER [IGNORE BAD] [DELIMITER <delimiter-string>] [QUOTE <quote-string>] [NULLIF <nullif-string>] AS <variable-name>
 ```

- `<csv-location>` is a string of the location of the CSV file.
  Without a URL protocol, it refers to a file path. There are no restrictions on
  where in your file system the file can be located, as long as the path is valid (i.e.,
-  the file exists). If you are using Docker to run Memgraph, you will need to
+  the file exists). CSV files can also be imported from S3. In that case, you need
+  to set the AWS authentication config options.
+
+  If you are using Docker to run Memgraph, you will need to
  [copy the files from your local directory into
  Docker](/getting-started/first-steps-with-docker#copy-files-from-and-to-a-docker-container)
  container where Memgraph can access them.
  If using `http://`, `https://`, or `ftp://`, the CSV file will be fetched over the network.
+- `<config-map>` represents an optional configuration map through which you can
+  specify the following configuration options: `aws_region`, `aws_access_key`,
+  `aws_secret_key` and `aws_endpoint_url`.
+  - `aws_region`: The region in which your S3 service is located.
+  - `aws_access_key`: Access key used to connect to the S3 service.
+  - `aws_secret_key`: Secret key used to connect to the S3 service.
+  - `aws_endpoint_url`: Optional configuration parameter. Can be used to set
+    the URL of an S3-compatible storage service.
+
 - `( WITH | NO ) HEADER` flag specifies whether the CSV file has a header, in
  which case it will be parsed as a map, or it doesn't have a header, in which
  case it will be parsed as a list.

@@ -160,6 +172,112 @@ When using the `LOAD CSV` clause please keep in mind:
   CREATE (n:A {p1 : x, p2 : y});
   ```

+### Loading from HTTP and S3
+
+The `LOAD CSV` clause supports loading files from HTTP/HTTPS/FTP URLs and S3 buckets.
+
+#### Loading from HTTP/HTTPS/FTP
+
+When loading from HTTP, HTTPS, or FTP URLs, the file will be downloaded to the `/tmp` directory before being imported:
+
+```cypher
+LOAD CSV FROM "https://public-assets.memgraph.com/import-data/load-csv-cypher/one-type-nodes/people_nodes_wh.csv" WITH HEADER AS row
+CREATE (n:Person {id: row.id, name: row.name});
+```
+
+You can also use FTP URLs:
+
+```cypher
+LOAD CSV FROM "ftp://example.com/data/nodes.csv" WITH HEADER AS row
+CREATE (n:Node) SET n += row;
+```
+
+#### Loading from S3
+
+To load files from S3, you can provide AWS credentials in three ways:
+
+1. Using the `WITH CONFIG` clause (recommended for query-specific credentials)
+
+```cypher
+LOAD CSV FROM "s3://my-bucket/path/to/file.csv"
+WITH CONFIG {
+  aws_region: "us-east-1",
+  aws_access_key: "YOUR_ACCESS_KEY",
+  aws_secret_key: "YOUR_SECRET_KEY"
+}
+WITH HEADER AS row
+CREATE (n:Node) SET n += row;
+```
+
+For S3-compatible services (like MinIO), you can also specify the endpoint URL:
+
+```cypher
+LOAD CSV FROM "s3://my-bucket/data/nodes.csv"
+WITH CONFIG {
+  aws_region: "us-east-1",
+  aws_access_key: "YOUR_ACCESS_KEY",
+  aws_secret_key: "YOUR_SECRET_KEY",
+  aws_endpoint_url: "https://s3-compatible-service.example.com"
+}
+WITH HEADER AS row
+CREATE (n:Node) SET n += row;
+```
+
+2. Using environment variables
+
+Set environment variables before starting Memgraph:
+
+```
+export AWS_REGION="us-east-1"
+export AWS_ACCESS_KEY="YOUR_ACCESS_KEY"
+export AWS_SECRET_KEY="YOUR_SECRET_KEY"
+export AWS_ENDPOINT_URL="https://s3-compatible-service.example.com" # Optional
+```
+
+Then you can load files without specifying credentials in the query:
+
+```cypher
+LOAD CSV FROM "s3://my-bucket/path/to/file.csv" WITH HEADER AS row
+CREATE (n:Node) SET n += row;
+```
+
+3. Using database settings
+
+Set database-level AWS credentials:
+
+```cypher
+SET DATABASE SETTING 'aws.region' TO 'us-east-1';
+SET DATABASE SETTING 'aws.access_key' TO 'YOUR_ACCESS_KEY';
+SET DATABASE SETTING 'aws.secret_key' TO 'YOUR_SECRET_KEY';
+SET DATABASE SETTING 'aws.endpoint_url' TO 'https://s3-compatible-service.example.com'; -- Optional
+```
+
+Then load files without credentials in the query:
+
+```cypher
+LOAD CSV FROM "s3://my-bucket/path/to/file.csv" WITH HEADER AS row
+CREATE (n:Node) SET n += row;
+```
+
+Credential precedence: If credentials are provided in multiple ways, the order of precedence is:
+1. The `WITH CONFIG` clause in the query (highest priority)
+2. Environment variables
+3. Database settings (lowest priority)
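+
+To check which database-level values are currently in effect (the lowest-precedence
+source), you can list all runtime settings. This is the general settings query, not
+an S3-specific one:
+
+```cypher
+SHOW DATABASE SETTINGS;
+```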
+
+When loading files from remote locations (HTTP, FTP, or S3), the file is first downloaded to `/tmp` before being loaded into memory. Ensure you have sufficient disk space for large files.
+The download can be interrupted using `TERMINATE TRANSACTIONS <transaction-id>` without waiting for the full download to complete.
+
+Use the `file.download_conn_timeout_sec` run-time configuration option to specify the connection timeout when establishing a connection to the remote server.
+
+| Option | Required | Description |
+|------------------|----------|--------------------------------------------------------------------|
+| aws_region | Yes | The AWS region where your S3 bucket is located (e.g., "us-east-1") |
+| aws_access_key | Yes | Your AWS access key ID |
+| aws_secret_key | Yes | Your AWS secret access key |
+| aws_endpoint_url | No | Custom endpoint URL for S3-compatible services |
+
 ### Increase import speed

 The `LOAD CSV` clause will create relationships much faster and consequently

diff --git a/pages/data-migration/json.mdx b/pages/data-migration/json.mdx
index 7bcdd5671..08f348baf 100644
--- a/pages/data-migration/json.mdx
+++ b/pages/data-migration/json.mdx
@@ -1,9 +1,352 @@
 ---
-title: Import data from JSON files
-description: Integrate JSON effortlessly with Memgraph. Detailed documentation guiding you every step of the way towards graph use cases.
+title: Import data from JSON(L) files
+description: Integrate JSON(L) effortlessly with Memgraph. Detailed documentation guiding you every step of the way towards graph use cases.
 ---

 import { Callout } from 'nextra/components'
+import { Steps } from 'nextra/components'
+
+# Import data from JSONL files
+
+A JSONL file is a file in which every line is a separate JSON document. Each line is parsed as a node
+or an edge, and each key in the JSON document is used as a node's or edge's property. The data from JSONL files
+can be imported using the `LOAD JSONL` clause.
+
+## `LOAD JSONL` Cypher clause
+
+The `LOAD JSONL` clause uses the [simdjson library](https://github.com/simdjson/simdjson) to parse JSON documents as
+fast as possible. It can be used to load JSONL documents from the local disk, or from HTTP, HTTPS, FTP and S3 servers.
+
+### `LOAD JSONL` clause syntax
+
+The syntax of the `LOAD JSONL` clause is:
+
+```cypher
+LOAD JSONL FROM <jsonl-location> ( WITH CONFIG <config-map> )? AS <variable-name>
+```
+
+- `<jsonl-location>` is a string representing the path from which the JSONL file should be loaded. There are no restrictions on where in
+  your file system the file can be located, as long as the path is valid (i.e.,
+  the file exists). Files can also be imported directly from a public URL or S3.
+  If you are using Docker to run Memgraph, you will need to
+  [copy the files from your local directory into
+  Docker](/getting-started/first-steps-with-docker#copy-files-from-and-to-a-docker-container)
+  container where Memgraph can access them.
+- `<config-map>` represents an optional configuration map through which you can
+  specify the following configuration options: `aws_region`, `aws_access_key`,
+  `aws_secret_key` and `aws_endpoint_url`.
+  - `aws_region`: The region in which your S3 service is located.
+  - `aws_access_key`: Access key used to connect to the S3 service.
+  - `aws_secret_key`: Secret key used to connect to the S3 service.
+  - `aws_endpoint_url`: Optional configuration parameter. Can be used to set
+    the URL of an S3-compatible storage service.
+- `<variable-name>` is a symbolic name representing the variable to which the
+  contents of the parsed row will be bound, enabling access to the row
+  contents later in the query. The variable doesn't have to be used in any
+  subsequent clause.
+
+### `LOAD JSONL` clause specificities
+
+When using the `LOAD JSONL` clause please keep in mind:
+
+- The JSONL parser parses values into their appropriate types, so you should get the same property type in Memgraph as in the JSONL file. Memgraph supports the following
+JSON types:
+  - `string`: The property in Memgraph will be of type string.
+  - `uint64_t`: The property in Memgraph will be cast to int64_t because the Cypher standard doesn't support uint64_t.
+  - `int64_t`: The property in Memgraph will be saved as int64_t.
+  - `double`: The property in Memgraph will be saved as a floating-point number.
+  - `boolean`: The property in Memgraph will be saved as bool.
+  - `array`: The property in Memgraph will be saved as a list.
+  - `object`: The property in Memgraph will be saved as a map.
+
+- **The `LOAD JSONL` clause is not a standalone clause**, meaning a valid query must contain at least one more clause, for example:
+
+```cypher
+LOAD JSONL FROM "./people.jsonl" AS row CREATE (p:Person) SET p += row;
+```
+
+In this regard, the following query will throw an exception:
+
+```cypher
+LOAD JSONL FROM "./file.jsonl" AS row;
+```
+
+**Adding a `MATCH` or `MERGE` clause before `LOAD JSONL`** allows you to match certain entities in the graph before running `LOAD JSONL`, optimizing the process as
+matched entities do not need to be searched for every row in the JSONL file.
+
+However, a `MATCH` or `MERGE` clause can be used prior to the `LOAD JSONL` clause only
+if it returns a single row. Returning multiple rows before calling the
+`LOAD JSONL` clause will cause a Memgraph runtime error.
+
+- **The `LOAD JSONL` clause can be used at most once per query**, so queries like
+the one below will throw an exception:
+
+```cypher
+LOAD JSONL FROM "/x.jsonl" AS x
+LOAD JSONL FROM "/y.jsonl" AS y
+CREATE (n:A {p1 : x, p2 : y});
+```
+
+### Loading from HTTP and S3
+
+The `LOAD JSONL` clause supports loading files from HTTP/HTTPS/FTP URLs and S3 buckets.
+
+#### Loading from HTTP/HTTPS/FTP
+
+When loading from HTTP, HTTPS, or FTP URLs, the file will be downloaded to the `/tmp` directory before being imported:
+
+```cypher
+LOAD JSONL FROM "https://download.memgraph.com/asset/docs/people_nodes.jsonl" AS row
+CREATE (n:Person {id: row.id, name: row.name, age: row.age, city: row.city});
+```
+
+You can also use FTP URLs:
+
+```cypher
+LOAD JSONL FROM "ftp://example.com/data/nodes.jsonl" AS row
+CREATE (n:Node) SET n += row;
+```
+
+#### Loading from S3
+
+To load files from S3, you can provide AWS credentials in three ways:
+1. Using the `WITH CONFIG` clause (recommended for query-specific credentials)
+
+```cypher
+LOAD JSONL FROM "s3://my-bucket/path/to/file.jsonl"
+WITH CONFIG {
+  aws_region: "us-east-1",
+  aws_access_key: "YOUR_ACCESS_KEY",
+  aws_secret_key: "YOUR_SECRET_KEY"
+}
+AS row
+CREATE (n:Node) SET n += row;
+```
+
+For S3-compatible services (like MinIO), you can also specify the endpoint URL:
+
+```cypher
+LOAD JSONL FROM "s3://my-bucket/data/nodes.jsonl"
+WITH CONFIG {
+  aws_region: "us-east-1",
+  aws_access_key: "YOUR_ACCESS_KEY",
+  aws_secret_key: "YOUR_SECRET_KEY",
+  aws_endpoint_url: "https://s3-compatible-service.example.com"
+}
+AS row
+CREATE (n:Node) SET n += row;
+```
+
+2. Using environment variables
+
+Set environment variables before starting Memgraph:
+
+```
+export AWS_REGION="us-east-1"
+export AWS_ACCESS_KEY="YOUR_ACCESS_KEY"
+export AWS_SECRET_KEY="YOUR_SECRET_KEY"
+export AWS_ENDPOINT_URL="https://s3-compatible-service.example.com" # Optional
+```
+
+Then you can load files without specifying credentials in the query:
+
+```cypher
+LOAD JSONL FROM "s3://my-bucket/path/to/file.jsonl" AS row
+CREATE (n:Node) SET n += row;
+```
+
+3. Using database settings
+
+Set database-level AWS credentials:
+
+```cypher
+SET DATABASE SETTING 'aws.region' TO 'us-east-1';
+SET DATABASE SETTING 'aws.access_key' TO 'YOUR_ACCESS_KEY';
+SET DATABASE SETTING 'aws.secret_key' TO 'YOUR_SECRET_KEY';
+SET DATABASE SETTING 'aws.endpoint_url' TO 'https://s3-compatible-service.example.com'; -- Optional
+```
+
+Then load files without credentials in the query:
+
+```cypher
+LOAD JSONL FROM "s3://my-bucket/path/to/file.jsonl" AS row
+CREATE (n:Node) SET n += row;
+```
+
+Credential precedence: If credentials are provided in multiple ways, the order of precedence is:
+1. The `WITH CONFIG` clause in the query (highest priority)
+2. Environment variables
+3. Database settings (lowest priority)
+
+When loading files from remote locations (HTTP, FTP, or S3), the file is first downloaded to `/tmp` before being loaded into memory. Ensure you have sufficient disk space for large files.
+The download can be interrupted using `TERMINATE TRANSACTIONS <transaction-id>` without waiting for the full download to complete.
+
+| Option | Required | Description |
+|------------------|----------|--------------------------------------------------------------------|
+| aws_region | Yes | The AWS region where your S3 bucket is located (e.g., "us-east-1") |
+| aws_access_key | Yes | Your AWS access key ID |
+| aws_secret_key | Yes | Your AWS secret access key |
+| aws_endpoint_url | No | Custom endpoint URL for S3-compatible services |
+
+### Increase import speed
+
+The `LOAD JSONL` clause will create relationships much faster and consequently
+speed up data import if you [create indexes](/fundamentals/indexes) on nodes or
+node properties once you import them:
+
+```cypher
+CREATE INDEX ON :Node(id);
+```
+
+If the `LOAD JSONL` clause is merging data instead of creating it, create indexes
+before running the `LOAD JSONL` clause.
+
+The construct `USING PERIODIC COMMIT <number-of-rows>` also improves the import speed because
+it optimizes memory allocation patterns. In our benchmarks, periodic commit
+speeds up the execution by 25% to 35%.
+
+```cypher
+USING PERIODIC COMMIT 1024 LOAD JSONL FROM "/x.jsonl" AS x
+CREATE (n:A {p1 : x.p1, p2 : x.p2});
+```
+
+You can also speed up the import if you switch Memgraph to [**analytical storage
+mode**](/fundamentals/storage-memory-usage#storage-modes). In the analytical
+storage mode there are no ACID guarantees besides manually created snapshots.
+After import, you can switch the storage mode back to
+transactional and re-enable ACID guarantees.
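+
+If you want to confirm which storage mode the instance is currently running in,
+you can check the storage information; `SHOW STORAGE INFO` is a general Memgraph
+query, and one of its reported fields is the storage mode:
+
+```cypher
+SHOW STORAGE INFO;
+```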
+You can switch between modes within the session using the following query:
+
+```cypher
+STORAGE MODE IN_MEMORY_{TRANSACTIONAL|ANALYTICAL};
+```
+
+If you use `IN_MEMORY_ANALYTICAL` mode and have nodes and relationships stored in
+separate JSONL files, you can run multiple concurrent `LOAD JSONL` queries to import data even faster.
+To achieve the best import performance, split your nodes and relationships
+files into smaller files and run multiple `LOAD JSONL` queries in parallel.
+The key is to run all `LOAD JSONL` queries which create nodes first. After that, run
+all `LOAD JSONL` queries that create relationships, as sketched below.
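+
+For example, with node files split per label and relationship files split per type
+(the file names below are hypothetical), the node imports can run concurrently from
+separate sessions, and the relationship import starts only once all of them have
+finished:
+
+```cypher
+-- Session 1:
+LOAD JSONL FROM "/import/people_nodes_part_1.jsonl" AS row
+CREATE (n:Person) SET n += row;
+
+-- Session 2, running at the same time:
+LOAD JSONL FROM "/import/people_nodes_part_2.jsonl" AS row
+CREATE (n:Person) SET n += row;
+
+-- Any session, only after all node imports have finished:
+LOAD JSONL FROM "/import/people_relationships.jsonl" AS row
+MATCH (p1:Person {id: row.first_person}), (p2:Person {id: row.second_person})
+CREATE (p1)-[:IS_FRIENDS_WITH]->(p2);
+```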

+### Import multiple JSONL files with distinct graph objects
+
+In this example, the data is split across four files; each file contains nodes
+of a single label or relationships of a single type. Files are downloaded from public URLs.
+The same approach works for S3 files when proper credentials are provided.
+
+<Steps>
+
+{<h3 className="custom-header">JSONL files</h3>}
+
+- [`people_nodes.jsonl`](https://download.memgraph.com/asset/docs/people_nodes.jsonl) is used to create nodes labeled `:Person`.
+  The file contains the following data:
+  ```jsonl
+  {"id": 100, "name": "Daniel", "age": 30, "city": "London"}
+  {"id": 101, "name": "Alex", "age": 15, "city": "Paris"}
+  {"id": 102, "name": "Sarah", "age": 17, "city": "London"}
+  {"id": 103, "name": "Mia", "age": 25, "city": "Zagreb"}
+  {"id": 104, "name": "Lucy", "age": 21, "city": "Paris"}
+  ```
+- [`restaurants_nodes.jsonl`](https://download.memgraph.com/asset/docs/restaurants_nodes.jsonl) is used to create nodes labeled `:Restaurant`.
+  The file contains the following data:
+  ```jsonl
+  {"id": 200, "name": "Mc Donalds", "menu": "Fries;BigMac;McChicken;Apple Pie"}
+  {"id": 201, "name": "KFC", "menu": "Fried Chicken;Fries;Chicken Bucket"}
+  {"id": 202, "name": "Subway", "menu": "Ham Sandwich;Turkey Sandwich;Foot-long"}
+  {"id": 203, "name": "Dominos", "menu": "Pepperoni Pizza;Double Dish Pizza;Cheese filled Crust"}
+  ```
+- [`people_relationships.jsonl`](https://download.memgraph.com/asset/docs/people_relationships.jsonl) is used to connect people with the `:IS_FRIENDS_WITH` relationship.
+  The file contains the following data:
+  ```jsonl
+  {"first_person": 100, "second_person": 102, "met_in": 2014}
+  {"first_person": 103, "second_person": 101, "met_in": 2021}
+  {"first_person": 102, "second_person": 103, "met_in": 2005}
+  {"first_person": 101, "second_person": 104, "met_in": 2005}
+  {"first_person": 104, "second_person": 100, "met_in": 2018}
+  {"first_person": 101, "second_person": 102, "met_in": 2017}
+  {"first_person": 100, "second_person": 103, "met_in": 2001}
+  ```
+- [`restaurants_relationships.jsonl`](https://download.memgraph.com/asset/docs/restaurants_relationships.jsonl) is used to connect people with restaurants using the `:ATE_AT` relationship.
+  The file contains the following data:
+  ```jsonl
+  {"PERSON_ID": 100, "REST_ID": 200, "liked": true}
+  {"PERSON_ID": 103, "REST_ID": 201, "liked": false}
+  {"PERSON_ID": 104, "REST_ID": 200, "liked": true}
+  {"PERSON_ID": 101, "REST_ID": 202, "liked": false}
+  {"PERSON_ID": 101, "REST_ID": 203, "liked": false}
+  {"PERSON_ID": 101, "REST_ID": 200, "liked": true}
+  {"PERSON_ID": 102, "REST_ID": 201, "liked": true}
+  ```
+
+{<h3 className="custom-header">Import nodes</h3>}
+
+Each row will be parsed as a map, and the
+fields can be accessed using the property lookup syntax (e.g. `id: row.id`).
+The files below are loaded directly from their public URLs, but you can also
+download them first and load them from the local disk.
+
+The following query will load row by row from the file, and create a new node
+for each row with properties based on the parsed row values:
+
+```cypher
+LOAD JSONL FROM "https://download.memgraph.com/asset/docs/people_nodes.jsonl"
+AS row
+CREATE (n:Person {id: row.id, name: row.name, age: row.age, city: row.city});
+```
+
+In the same manner, the following query will create a new node for each restaurant:
+
+```cypher
+LOAD JSONL FROM "https://download.memgraph.com/asset/docs/restaurants_nodes.jsonl" AS row
+CREATE (n:Restaurant {id: row.id, name: row.name, menu: row.menu});
+```

+{<h3 className="custom-header">Create indexes</h3>}
+
+Creating an [index](/fundamentals/indexes) on a property used to connect nodes
+with relationships, in this case, the `id` property of the `:Person` nodes,
+will speed up the import of relationships, especially with large datasets:
+
+```cypher
+CREATE INDEX ON :Person(id);
+```

+{<h3 className="custom-header">Import relationships</h3>}
+
+The following query will create relationships between the people nodes:
+
+```cypher
+LOAD JSONL FROM "https://download.memgraph.com/asset/docs/people_relationships.jsonl" AS row
+MATCH (p1:Person {id: row.first_person})
+MATCH (p2:Person {id: row.second_person})
+CREATE (p1)-[f:IS_FRIENDS_WITH]->(p2)
+SET f.met_in = row.met_in;
+```
+
+The following query will create relationships between people and restaurants where they ate.
+Since JSON booleans are parsed into the bool type directly, no cast is needed:
+
+```cypher
+LOAD JSONL FROM "https://download.memgraph.com/asset/docs/restaurants_relationships.jsonl" AS row
+MATCH (p1:Person {id: row.PERSON_ID})
+MATCH (re:Restaurant {id: row.REST_ID})
+CREATE (p1)-[ate:ATE_AT]->(re)
+SET ate.liked = row.liked;
+```

+{<h3 className="custom-header">Final result</h3>}
+
+Run the following query to see how the imported data looks as a graph:
+
+```cypher
+MATCH p=()-[]-() RETURN p;
+```
+
+![](/pages/data-migration/csv/load_csv_restaurants_relationships.png)
+
+</Steps>
+
 # Import data from JSON files

diff --git a/pages/data-migration/parquet.mdx b/pages/data-migration/parquet.mdx
index 781cf3e62..b413844fb 100644
--- a/pages/data-migration/parquet.mdx
+++ b/pages/data-migration/parquet.mdx
@@ -10,8 +10,8 @@ import {CommunityLinks} from '/components/social-card/CommunityLinks'

 # Import data from Parquet file

-The data from Parquet files can be imported using the [`LOAD PARQUET` Cypher clause](#load-parquet-cypher-clause) from the local disk
-and from the s3.
+The data from Parquet files can be imported using the [`LOAD PARQUET` Cypher clause](#load-parquet-cypher-clause) from the local disk, or from HTTP, HTTPS, FTP
+and S3 servers.

 ## `LOAD PARQUET` Cypher clause

@@ -20,7 +20,7 @@ in column batches, assembles them into row batches of 64K rows and places those
 batches into a queue. The main thread then pulls each batch from the queue and
 processes it row by row. For every row, it binds the parsed values to the
 specified variables and either populates the database (if it is empty) or
-appends the new rows to an existing dataset. 
+appends the new rows to an existing dataset.

 ### `LOAD PARQUET` clause syntax

@@ -116,6 +116,113 @@ When using the `LOAD PARQUET` clause please keep in mind:
   CREATE (n:A {p1 : x, p2 : y});
   ```

+### Loading from HTTP and S3
+
+The `LOAD PARQUET` clause supports loading files from HTTP/HTTPS/FTP URLs and S3 buckets.
+
+#### Loading from HTTP/HTTPS/FTP
+
+When loading from HTTP, HTTPS, or FTP URLs, the file will be downloaded to the `/tmp` directory before being imported:
+
+```cypher
+LOAD PARQUET FROM "https://download.memgraph.com/asset/docs/people_nodes.parquet" AS row
+CREATE (n:Person {id: row.id, name: row.name, age: row.age, city: row.city});
+```
+
+You can also use FTP URLs:
+
+```cypher
+LOAD PARQUET FROM "ftp://example.com/data/nodes.parquet" AS row
+CREATE (n:Node) SET n += row;
+```
+
+#### Loading from S3
+
+To load files from S3, you can provide AWS credentials in three ways:
+
+1. Using the `WITH CONFIG` clause (recommended for query-specific credentials)
+
+```cypher
+LOAD PARQUET FROM "s3://my-bucket/path/to/file.parquet"
+WITH CONFIG {
+  aws_region: "us-east-1",
+  aws_access_key: "YOUR_ACCESS_KEY",
+  aws_secret_key: "YOUR_SECRET_KEY"
+}
+AS row
+CREATE (n:Node) SET n += row;
+```
+
+For S3-compatible services (like MinIO), you can also specify the endpoint URL:
+
+```cypher
+LOAD PARQUET FROM "s3://my-bucket/data/nodes.parquet"
+WITH CONFIG {
+  aws_region: "us-east-1",
+  aws_access_key: "YOUR_ACCESS_KEY",
+  aws_secret_key: "YOUR_SECRET_KEY",
+  aws_endpoint_url: "https://s3-compatible-service.example.com"
+}
+AS row
+CREATE (n:Node) SET n += row;
+```
+
+2. Using environment variables
+
+Set environment variables before starting Memgraph:
+
+```
+export AWS_REGION="us-east-1"
+export AWS_ACCESS_KEY="YOUR_ACCESS_KEY"
+export AWS_SECRET_KEY="YOUR_SECRET_KEY"
+export AWS_ENDPOINT_URL="https://s3-compatible-service.example.com" # Optional
+```
+
+Then you can load files without specifying credentials in the query:
+
+```cypher
+LOAD PARQUET FROM "s3://my-bucket/path/to/file.parquet" AS row
+CREATE (n:Node) SET n += row;
+```
+3. Using database settings
+
+Set database-level AWS credentials:
+
+```cypher
+SET DATABASE SETTING 'aws.region' TO 'us-east-1';
+SET DATABASE SETTING 'aws.access_key' TO 'YOUR_ACCESS_KEY';
+SET DATABASE SETTING 'aws.secret_key' TO 'YOUR_SECRET_KEY';
+SET DATABASE SETTING 'aws.endpoint_url' TO 'https://s3-compatible-service.example.com'; -- Optional
+```
+
+Then load files without credentials in the query:
+
+```cypher
+LOAD PARQUET FROM "s3://my-bucket/path/to/file.parquet" AS row
+CREATE (n:Node) SET n += row;
+```
+
+Credential precedence: If credentials are provided in multiple ways, the order of precedence is:
+1. The `WITH CONFIG` clause in the query (highest priority)
+2. Environment variables
+3. Database settings (lowest priority)
+
+When loading files from remote locations (HTTP, FTP, or S3), the file is first downloaded to `/tmp` before being loaded into memory. Ensure you have sufficient disk space for large files.
+The download can be interrupted using `TERMINATE TRANSACTIONS <transaction-id>` without
+waiting for the full download to complete.
+
+Use the `file.download_conn_timeout_sec` run-time configuration option to specify the connection timeout when establishing a connection to the remote server.
+
+| Option | Required | Description |
+|------------------|----------|--------------------------------------------------------------------|
+| aws_region | Yes | The AWS region where your S3 bucket is located (e.g., "us-east-1") |
+| aws_access_key | Yes | Your AWS access key ID |
+| aws_secret_key | Yes | Your AWS secret access key |
+| aws_endpoint_url | No | Custom endpoint URL for S3-compatible services |
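+
+For example, a download that hangs can be cancelled from another session, and the
+connection timeout can be tuned at runtime. The transaction ID below is illustrative;
+look up the real one with `SHOW TRANSACTIONS`:
+
+```cypher
+-- Find the ID of the transaction running the import:
+SHOW TRANSACTIONS;
+-- Terminate it without waiting for the download to finish (illustrative ID):
+TERMINATE TRANSACTIONS "812";
+-- Raise the connection timeout for remote downloads to 60 seconds:
+SET DATABASE SETTING 'file.download_conn_timeout_sec' TO '60';
+```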

 ### Increase import speed

 You can significantly increase data-import speed when using the `LOAD PARQUET`

@@ -181,7 +288,7 @@ single label or relationships of a single type.

 {<h3 className="custom-header">Parquet files</h3>}

- - [`people_nodes.parquet`](s3://download.memgraph.com/asset/docs/people_nodes.parquet) is used to create nodes labeled `:Person`.
-   The file contains the following data:
+ - [`people_nodes.parquet`](https://download.memgraph.com/asset/docs/people_nodes.parquet) is used to create nodes labeled `:Person`.
+   The file contains the following data:
   ```parquet
   id,name,age,city
   100,Daniel,30,London
   101,Alex,15,Paris
   102,Sarah,17,London
   103,Mia,25,Zagreb
   104,Lucy,21,Paris
   ```
-- [`restaurants_nodes.parquet`](s3://download.memgraph.com/asset/docs/restaurants_nodes.parquet) is used to create nodes labeled `:Restaurants`.
-  The file contains the following data:
+- [`restaurants_nodes.parquet`](https://download.memgraph.com/asset/docs/restaurants_nodes.parquet) is used to create nodes labeled `:Restaurant`.
+  The file contains the following data:
   ```parquet
   id,name,menu
   200,Mc Donalds,Fries;BigMac;McChicken;Apple Pie
   201,KFC,Fried Chicken;Fries;Chicken Bucket
   202,Subway,Ham Sandwich;Turkey Sandwich;Foot-long
   203,Dominos,Pepperoni Pizza;Double Dish Pizza;Cheese filled Crust
   ```
-- [`people_relationships.parquet`](s3://download.memgraph.com/asset/docs/people_relationships.parquet) is used to connect people with the `:IS_FRIENDS_WITH` relationship.
-  The file contains the following data:
+- [`people_relationships.parquet`](https://download.memgraph.com/asset/docs/people_relationships.parquet) is used to connect people with the `:IS_FRIENDS_WITH` relationship.
+  The file contains the following data:
   ```parquet
   first_person,second_person,met_in
   100,102,2014
   103,101,2021
   102,103,2005
   101,104,2005
   104,100,2018
   101,102,2017
   100,103,2001
   ```
-- [`restaurants_relationships.parquet`](s3://download.memgraph.com/asset/docs/restaurants_relationships.parquet) is used to connect people with restaurants using the `:ATE_AT` relationship.
-  The file contains the following data:
+- [`restaurants_relationships.parquet`](https://download.memgraph.com/asset/docs/restaurants_relationships.parquet) is used to connect people with restaurants using the `:ATE_AT` relationship.
+  The file contains the following data:
   ```parquet
   PERSON_ID,REST_ID,liked
   100,200,true
   103,201,false
   104,200,true
   101,202,false
   101,203,false
   101,200,true
   102,201,true
   ```

@@ -231,14 +338,14 @@ single label or relationships of a single type.
   for each row with properties based on the parsed row values:

   ```cypher
-  LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/people_nodes.parquet" AS row
+  LOAD PARQUET FROM "https://download.memgraph.com/asset/docs/people_nodes.parquet" AS row
   CREATE (n:Person {id: row.id, name: row.name, age: row.age, city: row.city});
   ```

   In the same manner, the following query will create new nodes for each restaurant:

   ```cypher
-  LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/restaurants_nodes.parquet" AS row
+  LOAD PARQUET FROM "https://download.memgraph.com/asset/docs/restaurants_nodes.parquet" AS row
   CREATE (n:Restaurant {id: row.id, name: row.name, menu: row.menu});
   ```

@@ -256,7 +363,7 @@ single label or relationships of a single type.
   The following query will create relationships between the people nodes:

   ```cypher
-  LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/people_relationships.parquet" AS row
+  LOAD PARQUET FROM "https://download.memgraph.com/asset/docs/people_relationships.parquet" AS row
   MATCH (p1:Person {id: row.first_person})
   MATCH (p2:Person {id: row.second_person})
   CREATE (p1)-[f:IS_FRIENDS_WITH]->(p2)

@@ -266,7 +373,7 @@ single label or relationships of a single type.
   The following query will create relationships between people and restaurants where they ate:

   ```cypher
-  LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/restaurants_relationships.parquet" AS row
+  LOAD PARQUET FROM "https://download.memgraph.com/asset/docs/restaurants_relationships.parquet" AS row
   MATCH (p1:Person {id: row.PERSON_ID})
   MATCH (re:Restaurant {id: row.REST_ID})
   CREATE (p1)-[ate:ATE_AT]->(re)

diff --git a/pages/database-management/authentication-and-authorization/role-based-access-control.mdx b/pages/database-management/authentication-and-authorization/role-based-access-control.mdx
index dd7059cf3..326fad5da 100644
--- a/pages/database-management/authentication-and-authorization/role-based-access-control.mdx
+++ b/pages/database-management/authentication-and-authorization/role-based-access-control.mdx
@@ -161,7 +161,7 @@ of the following commands:
 | Privilege to enforce [constraints](/fundamentals/constraints). | `CONSTRAINT` |
 | Privilege to [dump the database](/configuration/data-durability-and-backup#database-dump).| `DUMP` |
 | Privilege to use [replication](/clustering/replication) queries. | `REPLICATION` |
-| Privilege to access files in queries, for example, when using `LOAD CSV` and `LOAD PARQUET` clauses. | `READ_FILE` |
+| Privilege to access files in queries, for example, when using `LOAD CSV`, `LOAD JSONL`, and `LOAD PARQUET` clauses. | `READ_FILE` |
 | Privilege to manage [durability files](/configuration/data-durability-and-backup#database-dump). | `DURABILITY` |
 | Privilege to try and [free memory](/fundamentals/storage-memory-usage#deallocating-memory). | `FREE_MEMORY` |
 | Privilege to use [trigger queries](/fundamentals/triggers). | `TRIGGER` |

diff --git a/pages/database-management/configuration.mdx b/pages/database-management/configuration.mdx
index e89ff1f3f..53b5d2a29 100644
--- a/pages/database-management/configuration.mdx
+++ b/pages/database-management/configuration.mdx
@@ -306,22 +306,23 @@ Memgraph contains settings that can be modified during runtime using a Cypher query.
 Some runtime settings are persisted between multiple runs, while others will fall back to the value of the command-line argument.
-| Setting name | Description | Persistent between runs |
-|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
-| organization.name | Name of the organization using the instance of Memgraph (used for verifying the license key). | yes |
-| enterprise.license | License key for Memgraph Enterprise. | yes |
-| server.name | Bolt server name. | yes |
-| query.timeout | Maximum allowed query execution time. Value of 0 means no limit. | yes |
-| log.level | Minimum log level. Allowed values: TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL. | no |
-| log.to_stderr | Log messages go to `stderr` in addition to `logfiles`. | no |
-| cartesian-product-enabled | Enforces cartesian product operator during query matching. | no |
-| hops_limit_partial_results | If set to `true`, partial results are returned when the hops limit is reached. If set to `false`, an exception is thrown when the hops limit is reached. The default value is `true`. | yes |
-| timezone | IANA timezone identifier string setting the instance's timezone. | yes |
-| storage.snapshot.interval | Define periodic snapshot schedule via cron expression ([crontab](https://crontab.guru/) format, an [Enterprise feature](/database-management/enabling-memgraph-enterprise)) or as a period in seconds. Set to empty string to disable. | no |
-| aws.region | AWS region in which your S3 service is located. | yes |
-| aws.access_key | Access key used to READ the file from S3. | yes |
-| aws.secret_key | Secret key used to READ the file from S3. | yes |
-| aws.endpoint_url | URL on which S3 can be accessed (if using some other S3-compatible storage). | yes |
+| Setting name | Description | Persistent between runs |
+|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
+| organization.name | Name of the organization using the instance of Memgraph (used for verifying the license key). | yes |
+| enterprise.license | License key for Memgraph Enterprise. | yes |
+| server.name | Bolt server name. | yes |
+| query.timeout | Maximum allowed query execution time. Value of 0 means no limit. | yes |
+| log.level | Minimum log level. Allowed values: TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL. | no |
+| log.to_stderr | Log messages go to `stderr` in addition to `logfiles`. | no |
+| cartesian-product-enabled | Enforces cartesian product operator during query matching. | no |
+| hops_limit_partial_results | If set to `true`, partial results are returned when the hops limit is reached. If set to `false`, an exception is thrown when the hops limit is reached. The default value is `true`. | yes |
+| timezone | IANA timezone identifier string setting the instance's timezone. | yes |
+| storage.snapshot.interval | Define periodic snapshot schedule via cron expression ([crontab](https://crontab.guru/) format, an [Enterprise feature](/database-management/enabling-memgraph-enterprise)) or as a period in seconds. Set to empty string to disable. | no |
+| aws.region | AWS region in which your S3 service is located. | yes |
+| aws.access_key | Access key used to READ the file from S3. | yes |
+| aws.secret_key | Secret key used to READ the file from S3. | yes |
+| aws.endpoint_url | URL on which S3 can be accessed (if using some other S3-compatible storage). | yes |
+| file.download_conn_timeout_sec | The timeout for establishing a connection to the remote server when downloading a file. | yes |

 All settings can be fetched by calling the following query:

@@ -503,21 +504,22 @@ This section contains the list of flags that are used when connecting to S3-compatible storage.

 This section contains the list of all other relevant flags used within Memgraph.

-| Flag | Description | Type |
-| --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- |
+| Flag | Description | Type |
+| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------- |
 | `--allow-load-csv=true` | Controls whether LOAD CSV clause is allowed in queries. | `[bool]` |
 | `--also-log-to-stderr=false` | Log messages go to stderr in addition to logfiles. | `[bool]` |
 | `--data-directory=/var/lib/memgraph` | Path to directory in which to save all permanent data. | `[string]` |
 | `--data-recovery-on-startup=true` | Facilitates recovery of one or more individual databases and their contents during startup. Replaces `--storage-recover-on-startup` | `[bool]` |
-| `--debug-query-plans=false` | Enable DEBUG logging of potential query plans. | `[string]` |
+| `--debug-query-plans=false` | Enable DEBUG logging of potential query plans. | `[string]` |
 | `--delta-chain-cache-threshold=128` | The minimum number of deltas worth caching when rebuilding a certain object's state. Useful when executing parallel transactions dependent on changes of a frequently changed graph object, to lower CPU usage. Must be a positive non-zero integer. | `[uint64]` |
+| `--file-download-conn-timeout-sec` | The timeout for establishing a connection to the remote server when downloading a file. | `[uint64]` |
 | `--flag-file` | Path to the additional configuration file, overrides the default configuration settings. | `[string]` |
 | `--help` | Show help on all flags and exit. The default value is `false`. | `[bool]` |
 | `--help-xml` | Produce an XML version of help and exit. The default value is `false`. | `[bool]` |
 | `--init-file` | Path to the CYPHERL file which contains queries that need to be executed before the Bolt server starts, such as creating users. | `[string]` |
 | `--init-data-file` | Path to the CYPHERL file, which contains queries that need to be executed after the Bolt server starts. | `[string]` |
 | `--isolation-level=SNAPSHOT_ISOLATION` | Isolation level used for the transactions. Allowed values: SNAPSHOT_ISOLATION, READ_COMMITTED, READ_UNCOMMITTED. | `[string]` |
-| `--log-file=/var/log/memgraph/memgraph.log` | Path to where the log should be stored. If set to an empty string (`--log-file=`), no logs will be saved. | `[string]` |
+| `--log-file=/var/log/memgraph/memgraph.log` | Path to where the log should be stored. If set to an empty string (`--log-file=`), no logs will be saved. | `[string]` |
 | `--log-level=WARNING` | Minimum log level. Allowed values: TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL. | `[string]` |
 | `--memory-limit=0` | Total memory limit in MiB. Set to 0 to use the default values which are 100% of the physical memory if the swap is enabled and 90% of the physical memory otherwise. | `[uint64]` |
 | `--metrics-address` | Host for HTTP server for exposing metrics. | `[string]` |

diff --git a/pages/help-center/faq.mdx b/pages/help-center/faq.mdx
index a7674c480..5d42f9c64 100644
--- a/pages/help-center/faq.mdx
+++ b/pages/help-center/faq.mdx
@@ -216,7 +216,7 @@ Currently, the fastest way to import data is from a Parquet file with a [LOAD PARQUET
 clause](/data-migration/parquet). Check out the [best practices for importing data](/data-migration/best-practices).

-[Other import methods](/data-migration) include importing data from CSV, JSON and CYPHERL files,
+[Other import methods](/data-migration) include importing data from CSV, JSON, JSONL and CYPHERL files,
 migrating from relational databases, or connecting to a data stream.

 ### How to import data from MySQL or PostgreSQL?

@@ -227,10 +227,10 @@ You can migrate from [MySQL](/data-migration/migrate-from-rdbms) or

 ### What file formats does Memgraph support for import?

 You can import data from [CSV](/data-migration/csv), [PARQUET](/data-migration/parquet)
-[JSON](/data-migration/json) or [CYPHERL](/data-migration/cypherl) files.
+[JSON and JSONL](/data-migration/json) or [CYPHERL](/data-migration/cypherl) files.

 CSV files can be imported in on-premise instances using the [LOAD CSV
-clause](/data-migration/csv), PARQUET files can be imported using the [LOAD PARQUET](/data-migration/parquet) and JSON files can be imported using a
+clause](/data-migration/csv), PARQUET files can be imported using the [LOAD PARQUET](/data-migration/parquet), and JSON(L) files can be imported using a
 [json_util](/advanced-algorithms/available-algorithms/json_util) module from the MAGE library.
 On a Cloud instance, data from CSV and JSON files can be imported only from a remote address.

diff --git a/pages/querying/query-plan.mdx b/pages/querying/query-plan.mdx
index 9ad3ae3a2..081b60fc2 100644
--- a/pages/querying/query-plan.mdx
+++ b/pages/querying/query-plan.mdx
@@ -241,6 +241,7 @@ The following table lists all the operators currently supported by Memgraph:
 | `IndexedJoin` | Performs an indexed join of the input from its two input branches. |
 | `Limit` | Limits certain rows from the pull chain. |
 | `LoadCsv` | Loads CSV file in order to import files into the database. |
+| `LoadJsonl` | Loads JSONL file in order to import files into the database. |
 | `LoadParquet` | Loads Parquet file in order to import files into the database. |
 | `Merge` | Applies merge on the input it received. |
 | `Once` | Forms the beginning of an operator chain with "only once" semantics. The operator will return false on subsequent pulls. |
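+
+For instance, you can confirm that a `LOAD JSONL` query is planned with the new
+`LoadJsonl` operator by inspecting its plan with `EXPLAIN` (the file path below is
+illustrative):
+
+```cypher
+EXPLAIN
+LOAD JSONL FROM "/import/people.jsonl" AS row
+CREATE (n:Person) SET n += row;
+```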